In this section, we outline each aspect of the data preparation and analytics flow.

![Screenshot 2025-03-17 at 1.53.21 PM.png](iEE3nYVfOILl)

[Target Search](flow_zone:q3b2nta) runs a [Python script](recipe:compute_database_query) that queries a public database for the parameters of previously studied molecules for a target protein. The user defines the database and the target in [Solution Variables](article:17) through the variables ```database``` and ```accession_protein_code```.
 
[Data Preparation/Featurization](flow_zone:default) applies data preprocessing and feature generation to the previously studied molecules.
 - Recipe [compute_molecules_bioactivity](recipe:compute_molecules_bioactivity) removes negative values from the ```standard_value``` (IC50) column and converts each value to the bioactivity feature ```pIC50```. It also labels the molecules as _active_, _intermediate_ or _inactive_ based on the user-defined thresholds ```bioactivity_class_active``` and ```bioactivity_class_inactive``` in [Solution Variables](article:17) (see [article](article:16)).
 - Python recipe [compute_molecular_features](recipe:compute_molecular_features) is one of the key components of this project. From the canonical SMILES notation it generates [Molecular Descriptors](article:19) and, based on the featurizer the user sets in the ```transformer_type``` variable in [Solution Variables](article:17), calculates the [Fingerprint Descriptors](article:20). It also applies t-SNE dimensionality reduction to the fingerprint features for visualisation and clustering. Both output datasets, [train_dataset](dataset:train_dataset) and [molecular_properties](dataset:molecular_properties), feed the machine learning tasks in the next flow zones.
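
The bioactivity step can be illustrated with a minimal sketch. It is not the recipe's actual code: it assumes ```standard_value``` holds IC50 in nanomolar and that the two thresholds are expressed on that same scale. The function names and the default threshold values are hypothetical, chosen only for illustration. Dropping zero values alongside negative ones is also an assumption, made here because the logarithm is undefined at zero.

```python
import math


def ic50_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be strictly positive")
    # 1 nM = 1e-9 M, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9.0 - math.log10(ic50_nm)


def bioactivity_class(ic50_nm, active_max_nm=1000.0, inactive_min_nm=10000.0):
    """Label a molecule from its IC50 (nM).

    Hypothetical thresholds: at or below active_max_nm is "active",
    at or above inactive_min_nm is "inactive", anything else "intermediate".
    """
    if ic50_nm <= active_max_nm:
        return "active"
    if ic50_nm >= inactive_min_nm:
        return "inactive"
    return "intermediate"


def add_bioactivity(rows):
    """Drop rows with non-positive standard_value, then add pIC50 and a label."""
    prepared = []
    for row in rows:
        value = row["standard_value"]
        if value <= 0:
            continue  # negative (and, by assumption, zero) values are discarded
        prepared.append({
            **row,
            "pIC50": ic50_to_pic50(value),
            "bioactivity_class": bioactivity_class(value),
        })
    return prepared


if __name__ == "__main__":
    sample = [
        {"molecule": "m1", "standard_value": 50.0},
        {"molecule": "m2", "standard_value": -3.0},
        {"molecule": "m3", "standard_value": 25000.0},
    ]
    for row in add_bioactivity(sample):
        print(row)
```

Because pIC50 is a negative logarithm, a lower IC50 gives a higher pIC50, so the most potent molecules have the largest values.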
 
[Potential Drug Candidates Preprocessing](flow_zone:SjUREsm) 
 - The user is required to replace [test_data](dataset:test_data) with their own data. These may be known compounds or drug repurposing candidates that the user wants to test and verify against the protein of interest.
 - A set of automatic preprocessing steps is applied to the data to align it with the training data in the Data Preparation/Featurization flow zone.
   - Cheminformatics Python libraries such as RDKit compute molecular descriptors (volume, surface area, atomic structure).
   - The pretrained LLM ChemBERTa (a HuggingFace model) generates fingerprint descriptors that indicate the presence or absence of specific chemical features. (Model details are in the code environment if needed.)


[Data Analytics/Statistics](flow_zone:cyB1Ebi) applies [Clustering](article:21) to the t-SNE coordinates in [molecular_properties](dataset:molecular_properties).

[Molecule Toxicity](flow_zone:7uEgjAZ)
 - The ClinTox dataset (described in [Input Data](article:23)) is used to train a classification model that predicts whether a molecule might fail clinical trials due to toxicity concerns.

[Molecule Bioactivity/Similarity Scoring](flow_zone:4Xmv0l5)
 - The [deployed regression model](saved_model:zqa8kTkx) is then applied to the test data to score the novel molecules and predict their bioactivity against the target protein.
 - [The recipe](recipe:compute_new_molecules_filtered) filters the data down to the novel molecule whose similarity to the studied molecules the user wants to explore, as defined by ```molecule_filter``` in [Solution Variables](article:17).
 - The Python recipe [Compute_molecular_similarity](recipe:compute_molecular_similarity) selects the 200 studied molecules with the highest pIC50 bioactivity values and computes the [Tanimoto similarity](article:22) between the novel molecule and each of them. It exports a table of similarity scores and a graph insight on the Dashboard that showcases the 6 molecules most similar to the selected one. The process is further automated through a scenario on the dashboard, where the user can explore the remaining novel molecules and their similarities.
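
The similarity ranking can be sketched in plain Python as follows. This is an illustration under assumptions rather than the recipe's code: each fingerprint is represented as a set of "on" bit positions, and the record keys (`id`, `pIC50`, `bits`) and function names are hypothetical. The defaults of 200 and 6 mirror the numbers quoted in this section.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices.

    Defined as |A intersect B| / |A union B|; returns 0.0 when both are empty.
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union


def most_similar(query_bits, studied, top_n=200, k=6):
    """Rank the top_n most bioactive studied molecules by similarity to the query.

    studied is a list of dicts with hypothetical keys "id", "pIC50" and "bits".
    Returns up to k (id, similarity) pairs, most similar first.
    """
    # Keep only the studied molecules with the highest pIC50 values.
    strongest = sorted(studied, key=lambda m: m["pIC50"], reverse=True)[:top_n]
    # Score each of them against the selected novel molecule.
    scored = [(m["id"], tanimoto(query_bits, m["bits"])) for m in strongest]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    studied = [
        {"id": "A", "pIC50": 7.2, "bits": {1, 2, 3}},
        {"id": "B", "pIC50": 6.1, "bits": {2, 3, 4}},
        {"id": "C", "pIC50": 8.0, "bits": {9}},
    ]
    print(most_similar({1, 2, 3}, studied, top_n=2, k=1))
```

A score of 1.0 means the two fingerprints set exactly the same bits, and 0.0 means they share none. In a real pipeline the bit sets would come from the fingerprint featurizer rather than being written by hand.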
 
[Visualizations](flow_zone:fP2POpp) applies statistical analyses such as principal component analysis and statistical testing to visualize the results and create the final template story on the [Molecular Property Prediction Dashboard](dashboard:XvuFy57).