## Machine Learning
The solution lets users predict molecular properties of novel molecules, such as bioactivity against a target protein and toxicity, which can accelerate drug discovery and lead optimization. Machine learning is used to analyze large amounts of molecular data and uncover patterns and relationships between molecular properties and bioactivity. The Dataiku [LAB analysis](analysis:tmLswnKG) provides a user-friendly interface where data scientists and computational chemists can experiment and select the best-performing models.

1. Regression model for **pIC50**:
  - Input features: [molecular descriptors](article:19) and [fingerprints](article:20) from [train_dataset](dataset:train_dataset). 
  - Target column: [pIC50](article:16).

2. Classification model for **molecular toxicity**:
  - Input features: [fingerprints](article:20) from [train_dataset](dataset:train_dataset). 
  - Target column: CT_TOX from the [ClinTox Dataset](article:23).
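
As a point of reference for the regression target, pIC50 is conventionally the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes IC50 values are stored in nanomolar; that unit, and the function name `pic50_from_ic50_nm`, are illustrative assumptions and not taken from this project's recipes.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar (assumed unit) to pIC50.

    pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be strictly positive")
    return 9.0 - math.log10(ic50_nm)


# A more potent compound (lower IC50) gets a higher pIC50.
potent = pic50_from_ic50_nm(10.0)      # 10 nM
weak = pic50_from_ic50_nm(10_000.0)    # 10 µM
```

Working on the log scale compresses the wide range of raw IC50 values, which is generally easier for a regression model to fit.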

For both cases:
- In the [LAB](analysis:tmLswnKG), select appropriate machine learning algorithms (e.g., Support Vector Machine, Gradient Boosting, Ridge Regression, XGBoost) and hyperparameters based on your data and expertise.
- Train each model to learn the relationship between molecular properties and the target columns. 
- [Deploy](saved_model:zqa8kTkx) each model to the flow. 
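
The select, train, and compare loop that the LAB performs visually can be illustrated outside Dataiku with a small, dependency-free sketch. It fits a ridge regression (one of the algorithm families listed above) for several values of the regularization strength and keeps the one with the lowest validation error. The toy fingerprint bits, pIC50 values, and all function names are hypothetical; this is not Dataiku's API or the solution's actual training code.

```python
import math


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression: solve (X^T X + alpha * I) w = X^T y."""
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (b[i] - s) / A[i][i]
    return w


def predict(X, w):
    return [sum(wi * xi for wi, xi in zip(w, row)) for row in X]


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical toy data: a constant 1 (intercept) followed by three fingerprint bits.
# The intercept is regularized too, which is a simplification of this sketch.
X_train = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1], [1, 1, 1, 0]]
y_train = [5.0, 6.0, 5.5, 4.5, 6.5]   # made-up pIC50 values
X_valid = [[1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 1]]
y_valid = [5.5, 5.0, 6.0]

# Compare candidate hyperparameters on the held-out molecules.
scores = {}
for alpha in (1e-6, 1.0, 10.0):
    weights = fit_ridge(X_train, y_train, alpha)
    scores[alpha] = rmse(y_valid, predict(X_valid, weights))

best_alpha = min(scores, key=scores.get)
best_weights = fit_ridge(X_train, y_train, best_alpha)
```

Because the toy targets are noise-free, the weakest regularization wins here; on real assay data a stronger penalty often generalizes better, which is why the comparison is made on held-out molecules.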

## Clustering
To visualize and explore the underlying relationships between molecular properties and bioactivity, the solution uses dimensionality reduction and clustering techniques.

### t-SNE (t-Distributed Stochastic Neighbor Embedding)
This technique projects high-dimensional molecular data (fingerprints) into a 2D space, allowing complex relationships to be visualized. The [t-SNE coordinates](dataset:molecular_properties) are then used to group molecules by the similarity of their properties, providing a structured view of the chemical space. Users can select different [clustering algorithms](analysis:o462HF12) (e.g., K-Means, Hierarchical) to discover distinct groups of molecules with potentially shared bioactivity profiles. Clustering can also reveal potential outliers that might require further investigation. A clustering model is [deployed](saved_model:8QR5xk5y) on the flow to score the train_dataset.
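
To make the clustering step concrete, here is a minimal K-Means sketch (Lloyd's algorithm) applied to 2D points that stand in for precomputed t-SNE coordinates. The coordinates are invented for illustration, the initialization simply takes the first `k` points, and the function name is hypothetical; the solution itself relies on Dataiku's visual clustering, not on this code.

```python
import math


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm on 2D points with a naive deterministic init."""
    centroids = [points[i] for i in range(k)]  # assumption: first k points as seeds
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical t-SNE coordinates (x, y) for six molecules.
tsne_coords = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]

labels, centroids = kmeans(tsne_coords, k=2)

# Distance to the assigned centroid can serve as a simple outlier score:
# molecules far from their cluster centre are candidates for a closer look.
outlier_scores = [math.dist(p, centroids[lab]) for p, lab in zip(tsne_coords, labels)]
```

Note that distances in a t-SNE embedding mostly reflect local neighborhoods, so clusters found this way are best treated as exploratory groupings and not as definitive chemical classes.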

