## Case Study: Accelerating Drug Discovery with Dataiku's Molecular Property Prediction Solution

### Background & Business Challenge
PharmaTech Innovations, a mid-sized pharmaceutical research company, specializes in developing new small-molecule inhibitors for cancer treatment. Their R&D team is working on identifying promising drug candidates targeting Aromatase (CYP19A1), a key enzyme in estrogen biosynthesis linked to hormone-sensitive cancers.

### Initial Situation
Dr. Lisa Chen, a computational chemist at PharmaTech, works in a lab focused on early-stage drug discovery. The process involves experimentally screening thousands of molecular compounds, which is both time-consuming and expensive.

### Goal 
To accelerate this process, PharmaTech Innovations seeks a data-driven approach that can predict molecular bioactivity, evaluate toxicity risks, and analyze compound similarity before committing to costly experiments. The approach should support:
- Prioritizing which molecules to synthesize and test in wet-lab experiments.
- Predicting bioactivity (pIC50) before experimental validation.
- Ensuring low toxicity to prevent failure in later clinical stages.
- Assessing chemical similarity to known active compounds for validation.

**Step 0: Data Preparation and Experiment Setup**
Drawing on PharmaTech's previously studied molecules and historical data records, Dr. Lisa uploads a test dataset of novel molecules to evaluate. The [test_data](dataset:test_data) dataset has the following structure:

| Column Name      | Column Type |
|------------------|-------------|
| molecule_id      | STRING      |
| canonical_SMILES | STRING      |

She also replaces the [clintox_datasetch](dataset:clintox_datasetch) dataset with PharmaTech's own records of previously studied molecules that failed trials because of toxicity (see the instructions in [Input Data](article:23)):

| Column Name      | Column Type |
|------------------|-------------|
| canonical_simles | STRING      |
| CT_TOX           | BINARY      |
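
Before uploading, it can help to confirm that each file carries the expected columns. The following is a minimal sketch using only the Python standard library; the helper `missing_columns`, the sample rows, and the molecule IDs are hypothetical, and the column names are copied as spelled in the two tables above.

```python
import csv
import io

# Expected columns, copied from the two tables above.
TEST_DATA_COLUMNS = {"molecule_id", "canonical_SMILES"}
TOX_DATA_COLUMNS = {"canonical_simles", "CT_TOX"}


def missing_columns(csv_text, required):
    """Return the required column names absent from the CSV header, sorted."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return sorted(required - present)


# Hypothetical two-row sample; the molecule IDs are invented for illustration.
sample = "molecule_id,canonical_SMILES\nMOL-001,CCO\nMOL-002,c1ccccc1\n"

print(missing_columns(sample, TEST_DATA_COLUMNS))  # []
print(missing_columns(sample, TOX_DATA_COLUMNS))   # ['CT_TOX', 'canonical_simles']
```

An empty result means the header matches the expected structure. The check looks only at column names, not at whether the SMILES strings themselves are chemically valid.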


**Step 1: Querying & Feature Extraction**
After preparing the datasets, Lisa proceeds with the first stage: extracting relevant molecular features to identify potential Aromatase inhibitors. Using the Dataiku Application included in this solution (see [Dataiku Application](article:12)), she begins by:
 1. Querying ChEMBL for compounds with experimentally measured `pIC50` values against Aromatase, identified by the UniProt accession `P11511`.
 1. Generating molecular embeddings with the ChemBERTa featurizer and computing 1D molecular descriptors (e.g., molecular weight, logP, functional groups).
 1. Specifying the parameters required for data preparation, such as the bioactivity threshold and the ID of a novel molecule from test_data, which initiates the similarity comparison with previously studied molecules.
 ![dkuapp.png](Thva56CPDiRS)
 
✅ Benefit: This automated extraction process saves weeks of manual literature searches and pre-processing.
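
To make the bioactivity threshold concrete: pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L, so a higher pIC50 means a more potent compound. The sketch below assumes IC50 values in nanomolar and uses an illustrative threshold of 6.0 (IC50 = 1 µM). The function names and the threshold are assumptions for illustration, not values taken from the solution.

```python
import math


def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm).
    return 9.0 - math.log10(ic50_nm)


def label_activity(pic50, threshold=6.0):
    """Label a compound 'active' if its pIC50 reaches the threshold.

    The default of 6.0 is an illustrative assumption.
    """
    return "active" if pic50 >= threshold else "inactive"


# Hypothetical IC50 values in nM, invented for illustration.
for ic50_nm in (10.0, 1000.0, 50000.0):
    p = pic50_from_ic50_nm(ic50_nm)
    print(f"IC50 = {ic50_nm:>8.1f} nM -> pIC50 = {p:.2f} ({label_activity(p)})")
```

Raising the threshold keeps only the more potent compounds in the active class, at the cost of fewer positive training examples.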

**Step 2: Training & Evaluating ML Models**
Using Dataiku’s built-in AutoML, Lisa selects two models to train:
1. Regression model (pIC50 Prediction): Estimates bioactivity potency of new molecules.
2. Classification model (Toxicity Prediction): Identifies molecules that might fail due to toxicity.

Both models are trained on historical compounds. Features include Morgan and MACCS keys fingerprints, ChemBERTa embeddings, and other molecular descriptors.

✅ Benefit: PharmaTech can now predict key molecular properties instead of relying solely on expensive lab experiments.

**Step 3: Exploring Studied & Novel Molecules**
Dr. Lisa now tests new candidate molecules that PharmaTech’s bench chemists have designed.
She explores the Chemical Space Analysis tab to assess descriptor distributions and their correlation with bioactivity.
 ![dashboard.png](K1nDygIrohJj)
She then uses the Novel Molecule Search tab to rank molecules by high predicted pIC50 (potency), low toxicity risk (safety), and drug-likeness score (QED) for clinical viability.
![Screenshot 2025-02-09 at 5.19.56 PM.png](kOL4PgjYJW8r)
📌 Outcome: Lisa narrows the list to 18 top-ranked molecules for synthesis—reducing experimental workload by 97%!
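
The three ranking criteria can be sketched as a filter followed by a sort. This is a simplified illustration, not the dashboard's actual logic: the `Candidate` class, the cut-off values, and the sample predictions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    molecule_id: str
    predicted_pic50: float  # higher means more potent
    toxicity_prob: float    # predicted probability of toxicity, 0 to 1
    qed: float              # drug-likeness score, 0 to 1


def rank_candidates(candidates, max_toxicity=0.5, min_qed=0.5):
    """Keep candidates passing both cut-offs, sorted by predicted pIC50 (descending).

    Both cut-offs are illustrative assumptions.
    """
    kept = [
        c for c in candidates
        if c.toxicity_prob <= max_toxicity and c.qed >= min_qed
    ]
    return sorted(kept, key=lambda c: c.predicted_pic50, reverse=True)


# Hypothetical predictions, invented for illustration.
candidates = [
    Candidate("MOL-001", 7.9, 0.10, 0.72),
    Candidate("MOL-002", 8.4, 0.81, 0.65),  # potent but likely toxic
    Candidate("MOL-003", 6.2, 0.05, 0.80),
    Candidate("MOL-004", 8.1, 0.20, 0.31),  # potent but poor drug-likeness
]

for c in rank_candidates(candidates):
    print(c.molecule_id, c.predicted_pic50)
```

Treating toxicity and QED as hard filters, and potency as the sort key, is one design choice among several. A weighted score combining all three would be an alternative when no candidate passes every cut-off.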

✅ Benefit: PharmaTech accelerates its drug discovery pipeline, focusing only on high-potential candidates.

**Step 4: Validating with Molecular Similarity Analysis**
Before finalizing, Lisa uses the Molecular Similarity tab to:
- Compare each novel molecule to known successful inhibitors.
- Visualize structurally related compounds using a Tanimoto coefficient-based similarity graph.
- Ensure the selected molecules share key functional motifs with validated drugs.
![dashboard3.png](4KWuPbpaBoAz)
🔬 Insight: The top 3 prioritized molecules show ≥85% similarity to previously approved Aromatase inhibitors—boosting confidence in the selection.
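
For reference, the Tanimoto coefficient between two fingerprints is the number of bits set in both, divided by the number of bits set in either. The sketch below represents each fingerprint as a plain set of "on" bit indices. In practice the fingerprints would come from a cheminformatics toolkit, and the bit sets shown here are invented for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Hypothetical on-bit indices for a known inhibitor and a novel molecule.
known = {1, 4, 7, 9, 12, 15, 20}
novel = {1, 4, 7, 9, 12, 15, 33}

# 6 shared bits out of 8 distinct bits.
print(tanimoto(known, novel))  # 0.75
```

A value of 1.0 means the two fingerprints are identical and 0.0 means they share no bits, so a similarity of 85% corresponds to a coefficient of 0.85.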

✅ Benefit: This step prevents wasted resources on molecules with low structural relevance.

### Final Step: Decision-Making & Dashboard Review
- Dr. Lisa presents her findings to the R&D team using the Molecular Prediction Dashboard, which includes:
  - Target Protein Analysis – metadata, correlation matrices, parallel coordinate plots.
  - Chemical Space Analysis – t-SNE embeddings for compound distributions.
  - Novel Molecule Search – ranked molecule predictions with filtering options.
  - Molecular Similarity – visualization of known vs. novel compounds.

🚀 PharmaTech’s leadership approves 3 molecules that qualify for the Lead Identification stage, cutting initial experimental costs by 80%!

### Business Impact & Key Benefits
✅ Faster Drug Discovery – Reduces candidate selection time from months to days.
✅ Cost Savings – Avoids unnecessary synthesis of low-potential molecules.
✅ Higher Success Rate – Increases likelihood of selecting potent, non-toxic compounds.
✅ Data-Driven Decision Making – Ensures scientific confidence before committing resources.

### Conclusion
This data-driven approach exemplifies how machine learning revolutionizes early-stage drug discovery, optimizing resources and increasing success rates.

Would your team benefit from a similar solution? Explore how Dataiku can transform your drug discovery pipeline today! 🚀
