Getting Started
In short
This tutorial gives an overview of the basic concepts in the `auroris` library.
On the nuances of curation
How to best curate a dataset is highly situation-dependent. The `auroris` library includes some useful tools, but blindly applying them won't necessarily lead to good datasets. To learn more, visit the Polaris Hub for extensive resources and documentation on dataset curation and more.
Data curation is concerned with analyzing and processing an existing dataset to maximize its quality. Within drug discovery, this can imply many things, such as filtering out outliers or flagging activity-cliffs. High-quality, well-curated datasets are the foundation upon which we can build realistic, impactful benchmarks for drug discovery. This notebook demonstrates how to curate your dataset with the Polaris data curation API for small molecules.
Curating a toy dataset
Let's learn about the basic concepts of the `auroris` library by curating a toy dataset. For simplicity, we will use the solubility dataset from Datamol. Note that this dataset is meant only as a toy dataset for pedagogic and testing purposes; it is not a dataset for benchmarking, analysis, or model training. Curation can only take us so far: impactful benchmarks rely on high-quality data sources to begin with.
import datamol as dm
# Load your data set
# See more details of the dataset at https://docs.datamol.io/stable/api/datamol.data.html
data = dm.data.solubility()
data.head(5)
| | mol | ID | NAME | SOL | SOL_classification | smiles | split |
|---|---|---|---|---|---|---|---|
| 0 | <rdkit.Chem.rdchem.Mol object at 0x173b7c2e0> | 1 | n-pentane | -3.18 | (A) low | CCCCC | train |
| 1 | <rdkit.Chem.rdchem.Mol object at 0x173b7c430> | 2 | cyclopentane | -2.64 | (B) medium | C1CCCC1 | train |
| 2 | <rdkit.Chem.rdchem.Mol object at 0x173b7c4a0> | 3 | n-hexane | -3.84 | (A) low | CCCCCC | train |
| 3 | <rdkit.Chem.rdchem.Mol object at 0x173b7c510> | 4 | 2-methylpentane | -3.74 | (A) low | CCCC(C)C | train |
| 4 | <rdkit.Chem.rdchem.Mol object at 0x173b7c580> | 6 | 2,2-dimethylbutane | -3.55 | (A) low | CCC(C)(C)C | train |
Using the `Curator` API
The recommended way to specify curation workflows is through the `Curator` API:
- A `Curator` object defines a number of curation steps.
- Each step should inherit from `auroris.curation.actions.BaseAction`.
- The `Curator` object is serializable. You can thus easily save and load it from JSON, which makes it easy to reproduce a curation workflow.
- Finally, the `Curator` produces a `CurationReport` which summarizes the changes made to a dataset.
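The pattern described in the list above (a sequence of steps, a JSON round-trip, and a summary of changes) can be sketched with the standard library alone. This is a toy illustration of the idea, not the `auroris` implementation: `ToyCurator`, `DropMissing`, `FlagAbove`, and `STEP_TYPES` are hypothetical names invented for this example, and the real `Curator` and `BaseAction` classes may work differently.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DropMissing:
    """Hypothetical step: drop rows where `column` is None."""

    column: str
    name: str = "drop_missing"

    def transform(self, rows, log):
        kept = [r for r in rows if r.get(self.column) is not None]
        log.append(f"{self.name}: removed {len(rows) - len(kept)} rows")
        return kept


@dataclass
class FlagAbove:
    """Hypothetical step: add a boolean column flagging values above `limit`."""

    column: str
    limit: float
    name: str = "flag_above"

    def transform(self, rows, log):
        flag = f"FLAG_{self.column}"
        out = [{**r, flag: r[self.column] > self.limit} for r in rows]
        log.append(f"{self.name}: new column added: {flag}")
        return out


# Registry used to rebuild steps from their serialized form (hypothetical).
STEP_TYPES = {"drop_missing": DropMissing, "flag_above": FlagAbove}


@dataclass
class ToyCurator:
    steps: list

    def __call__(self, rows):
        # Apply each step in order and collect a plain-text log of changes.
        log = []
        for step in self.steps:
            rows = step.transform(rows, log)
        return rows, log

    def to_json(self):
        # Every step is a dataclass, so its configuration is a plain dict.
        return json.dumps([asdict(s) for s in self.steps])

    @classmethod
    def from_json(cls, text):
        steps = [STEP_TYPES[cfg["name"]](**cfg) for cfg in json.loads(text)]
        return cls(steps)


curator = ToyCurator([DropMissing("SOL"), FlagAbove("SOL", limit=0.0)])
rows = [{"SOL": -3.18}, {"SOL": None}, {"SOL": 1.5}]
curated, log = curator(rows)

# A curator restored from JSON reproduces the same workflow.
restored = ToyCurator.from_json(curator.to_json())
assert restored(rows) == (curated, log)
```

Because each step stores only its configuration, the serialized workflow stays small and can be versioned alongside the data it was applied to.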
Let's define a simple workflow with three steps:
- Curate the chemical structures
- Detect outliers
- Bin the regression column
from auroris.curation import Curator
from auroris.curation.actions import MoleculeCuration, OutlierDetection, Discretization
# Define the curation workflow
curator = Curator(
    steps=[
        MoleculeCuration(input_column="smiles"),
        OutlierDetection(method="zscore", columns=["SOL"]),
        Discretization(input_column="SOL", thresholds=[-3]),
    ],
    parallelized_kwargs={"n_jobs": -1},
)
# Run the curation
dataset, report = curator(data)
2024-08-02 12:26:54.316 | INFO | auroris.curation._curator:transform:106 - Performing step: mol_curation
2024-08-02 12:27:12.343 | INFO | auroris.curation._curator:transform:106 - Performing step: outlier_detection
2024-08-02 12:27:12.400 | INFO | auroris.curation._curator:transform:106 - Performing step: discretize
The report can be exported ("broadcast") to a variety of formats. For now, let's simply log it to the console.
from auroris.report.broadcaster import LoggerBroadcaster
broadcaster = LoggerBroadcaster(report)
broadcaster.broadcast()
===== Curation Report =====
Time: 2024-08-02 12:26:54
Version: 0.1.4.dev0+g7127343.d20240707
===== mol_curation =====
[LOG]: Couldn't preprocess 18 / 1282 molecules.
[LOG]: New column added: MOL_smiles
[LOG]: New column added: MOL_molhash_id
[LOG]: New column added: MOL_molhash_id_no_stereo
[LOG]: New column added: MOL_num_stereoisomers
[LOG]: New column added: MOL_num_undefined_stereoisomers
[LOG]: New column added: MOL_num_defined_stereo_center
[LOG]: New column added: MOL_num_undefined_stereo_center
[LOG]: New column added: MOL_num_stereo_center
[LOG]: New column added: MOL_undefined_E_D
[LOG]: New column added: MOL_undefined_E/Z
[LOG]: Default `ecfp` fingerprint is used to visualize the chemical space.
[LOG]: Molecules with undefined stereocenter detected: 253.
[IMG]: Dimensions 1200 x 600
[IMG]: Dimensions 1200 x 2400
===== outlier_detection =====
[LOG]: New column added: OUTLIER_SOL
[LOG]: Found 7 potential outliers with respect to the SOL column for review.
[IMG]: Dimensions 1200 x 600
===== discretize =====
[LOG]: New column added: CLS_SOL
[IMG]: Dimensions 1200 x 600
===== Curation Report END =====
We can see that there are also images in the report! More advanced broadcasters, such as the `HTMLBroadcaster`, will display these.
from auroris.report.broadcaster import HTMLBroadcaster
import tempfile
temp_dir = tempfile.TemporaryDirectory().name
broadcaster = HTMLBroadcaster(report=report, destination=temp_dir, embed_images=True)
broadcaster.broadcast()
'/var/folders/_7/ffxc1f251dbb5msn977xl4sm0000gr/T/tmps2tt3jrb/index.html'
One can review the above HTML report with embedded visualizations and share it with collaborators.
Let's also look at a single row of the newly curated dataset!
dataset.iloc[0]
mol                                 <rdkit.Chem.rdchem.Mol object at 0x173b7c2e0>
ID                                  1
NAME                                n-pentane
SOL                                 -3.18
SOL_classification                  (A) low
smiles                              CCCCC
split                               train
MOL_smiles                          CCCCC
MOL_molhash_id                      3cb2e0cf1b50d8f954891abc5dcce90d543cd3d7
MOL_molhash_id_no_stereo            36551d628217a351e720cdbe676fca3067730a91
MOL_num_stereoisomers               1.0
MOL_num_undefined_stereoisomers     1.0
MOL_num_defined_stereo_center       0.0
MOL_num_undefined_stereo_center     0.0
MOL_num_stereo_center               0.0
MOL_undefined_E_D                   False
MOL_undefined_E/Z                   0
OUTLIER_SOL                         False
CLS_SOL                             0.0
Name: 0, dtype: object
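To make the `CLS_SOL` value in the row above more concrete, here is a minimal, library-independent sketch of threshold binning. It is consistent with the row shown (a `SOL` of -3.18 with a threshold of -3 lands in class 0), but the helper name `discretize` and the choice to place a value exactly equal to a threshold in the upper bin are assumptions of this example, not documented `auroris` behavior.

```python
from bisect import bisect_right


def discretize(values, thresholds):
    """Map each value to the index of the bin it falls into.

    Assumption: a value exactly equal to a threshold goes to the upper bin.
    """
    edges = sorted(thresholds)
    return [bisect_right(edges, v) for v in values]


# SOL values of the first five rows of the toy dataset shown earlier.
sol = [-3.18, -2.64, -3.84, -3.74, -3.55]
labels = discretize(sol, thresholds=[-3])
```

With a single threshold this yields a binary label; passing several thresholds would produce one class per interval.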
Using the functional API
`auroris` provides a functional API to easily and quickly run some curation steps. Let's look at an outlier detection example.
from auroris.curation.functional import detect_outliers
from auroris.visualization import visualize_distribution_with_outliers
y = dataset["SOL"].values
is_outlier = detect_outliers(y, method="zscore")
visualize_distribution_with_outliers(y, is_outlier);
Depending on the type of bioactivity and its distribution, the above plot helps to highlight data points that are potential outliers (data outside the acceptable range) or strong signals.
Reviewing these data points, and removing them if they are truly outliers, can be beneficial for QSAR modeling.
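As a rough illustration of what a z-score criterion does, the sketch below flags values lying more than a chosen number of standard deviations from the mean. It uses only the Python standard library. The function name `zscore_outliers`, the cutoff of 3, and the use of the population standard deviation are assumptions made for this example and are not taken from the `auroris` implementation. The input values are made up for demonstration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return [False] * len(values)
    return [abs((v - mu) / sigma) > threshold for v in values]


# Eleven similar made-up measurements and one extreme value.
values = [-3.2, -3.0, -2.9, -3.1, -3.3, -2.8, -3.0, -3.1, -2.9, -3.2, -3.0, 4.0]
flags = zscore_outliers(values)
suspects = [v for v, flagged in zip(values, flags) if flagged]
```

Note that a single extreme value inflates the standard deviation itself, so on small samples a z-score rule can miss outliers; flagged points are candidates for review, not automatic removals.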
The End.