Getting Started

In short

This tutorial gives an overview of the basic concepts in the `auroris` library.

On the nuances of curation

How to best curate a dataset is highly situation-dependent. The `auroris` library includes some useful tools, but blindly applying them won't necessarily lead to good datasets. To learn more, visit the Polaris Hub for extensive resources and documentation on dataset curation and more.

Data curation is concerned with analyzing and processing an existing dataset to maximize its quality. Within drug discovery, this can mean many things, such as filtering out outliers or flagging activity cliffs. High-quality, well-curated datasets are the foundation upon which we can build realistic, impactful benchmarks for drug discovery. This notebook demonstrates how to curate your dataset with the Polaris data curation API for small molecules.

Curating a toy dataset

Let's learn about the basic concepts of the auroris library by curating a toy dataset. For the sake of simplicity, we will use the solubility dataset from Datamol. It is worth noting that this dataset is only meant to be used as a toy dataset for pedagogic and testing purposes. It is not a dataset for benchmarking, analysis or model training. Curation can only take us so far. For impactful benchmarks, we rely on high-quality data sources to begin with.

In [3]:

import datamol as dm

In [4]:

# Load your data set
# See more details of the dataset at https://docs.datamol.io/stable/api/datamol.data.html
data = dm.data.solubility()
data.head(5)

Out[4]:

	mol	ID	NAME	SOL	SOL_classification	smiles	split
0	<rdkit.Chem.rdchem.Mol object at 0x173b7c2e0>	1	n-pentane	-3.18	(A) low	CCCCC	train
1	<rdkit.Chem.rdchem.Mol object at 0x173b7c430>	2	cyclopentane	-2.64	(B) medium	C1CCCC1	train
2	<rdkit.Chem.rdchem.Mol object at 0x173b7c4a0>	3	n-hexane	-3.84	(A) low	CCCCCC	train
3	<rdkit.Chem.rdchem.Mol object at 0x173b7c510>	4	2-methylpentane	-3.74	(A) low	CCCC(C)C	train
4	<rdkit.Chem.rdchem.Mol object at 0x173b7c580>	6	2,2-dimethylbutane	-3.55	(A) low	CCC(C)(C)C	train

Using the `Curator` API

The recommended way to specify curation workflows is through the Curator API:

- A Curator object defines a number of curation steps.
- Each step should inherit from auroris.curation.actions.BaseAction.
- The Curator object is serializable. You can thus easily save and load it from JSON, which makes it easy to reproduce a curation workflow.
- Finally, the Curator produces a CurationReport which summarizes the changes made to a dataset.
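
To make the serialization point concrete, here is a minimal pure-Python sketch of a workflow whose configuration round-trips through JSON. It does not use auroris and is not how the library is implemented; `ToyStep` and `ToyWorkflow` are hypothetical names introduced only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToyStep:
    """Hypothetical stand-in for a curation step: a name plus its parameters."""

    name: str
    params: dict = field(default_factory=dict)


@dataclass
class ToyWorkflow:
    """Hypothetical stand-in for a workflow: an ordered list of steps."""

    steps: list

    def to_json(self) -> str:
        # Serialize every step's configuration, preserving the step order
        return json.dumps({"steps": [asdict(s) for s in self.steps]}, indent=2)

    @classmethod
    def from_json(cls, text: str) -> "ToyWorkflow":
        # Rebuild the steps from their stored configuration
        payload = json.loads(text)
        return cls(steps=[ToyStep(**s) for s in payload["steps"]])


workflow = ToyWorkflow(
    steps=[
        ToyStep("outlier_detection", {"method": "zscore", "columns": ["SOL"]}),
        ToyStep("discretize", {"input_column": "SOL", "thresholds": [-3]}),
    ]
)

# Saving and reloading yields an equivalent workflow definition
restored = ToyWorkflow.from_json(workflow.to_json())
print(restored == workflow)
```

Because the whole workflow is described by plain data, the JSON text can be stored next to a dataset and reloaded later to rerun the same steps.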

Let's define a simple workflow with three steps:

1. Curate the chemical structures
2. Detect outliers
3. Bin the regression column

In [5]:

from auroris.curation import Curator
from auroris.curation.actions import MoleculeCuration, OutlierDetection, Discretization

# Define the curation workflow
curator = Curator(
    steps=[
        MoleculeCuration(input_column="smiles"),
        OutlierDetection(method="zscore", columns=["SOL"]),
        Discretization(input_column="SOL", thresholds=[-3]),
    ],
    parallelized_kwargs={"n_jobs": -1},
)

# Run the curation
dataset, report = curator(data)

2024-08-02 12:26:54.316 | INFO     | auroris.curation._curator:transform:106 - Performing step: mol_curation
2024-08-02 12:27:12.343 | INFO     | auroris.curation._curator:transform:106 - Performing step: outlier_detection
2024-08-02 12:27:12.400 | INFO     | auroris.curation._curator:transform:106 - Performing step: discretize
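
As a rough illustration of what binning a regression column with `thresholds=[-3]` means, the sketch below assigns each value the index of the interval it falls into. It is a pure-Python approximation, not the auroris implementation; the function name `discretize_values` is hypothetical, and sending a value exactly equal to a threshold to the upper bin is an assumption.

```python
from bisect import bisect_right


def discretize_values(values, thresholds):
    """Map each value to the index of the bin it falls into.

    Assumption: a value exactly equal to a threshold goes to the upper bin.
    """
    ordered = sorted(thresholds)
    return [bisect_right(ordered, v) for v in values]


# SOL values of the first five rows shown in the table above
sol = [-3.18, -2.64, -3.84, -3.74, -3.55]
labels = discretize_values(sol, thresholds=[-3])
print(labels)  # [0, 1, 0, 0, 0]
```

With a single threshold this yields a binary label: class 0 for values below -3 and class 1 otherwise.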

The report can be exported ("broadcast") to a variety of formats. Let's simply log it to the CLI for now.

In [6]:

from auroris.report.broadcaster import LoggerBroadcaster

broadcaster = LoggerBroadcaster(report)
broadcaster.broadcast()

===== Curation Report =====
Time: 2024-08-02 12:26:54
Version: 0.1.4.dev0+g7127343.d20240707
===== mol_curation =====
[LOG]: Couldn't preprocess 18 / 1282 molecules.
[LOG]: New column added: MOL_smiles
[LOG]: New column added: MOL_molhash_id
[LOG]: New column added: MOL_molhash_id_no_stereo
[LOG]: New column added: MOL_num_stereoisomers
[LOG]: New column added: MOL_num_undefined_stereoisomers
[LOG]: New column added: MOL_num_defined_stereo_center
[LOG]: New column added: MOL_num_undefined_stereo_center
[LOG]: New column added: MOL_num_stereo_center
[LOG]: New column added: MOL_undefined_E_D
[LOG]: New column added: MOL_undefined_E/Z
[LOG]: Default `ecfp` fingerprint is used to visualize the chemical space.
[LOG]: Molecules with undefined stereocenter detected: 253.
[IMG]: Dimensions 1200 x 600
[IMG]: Dimensions 1200 x 2400
===== outlier_detection =====
[LOG]: New column added: OUTLIER_SOL
[LOG]: Found 7 potential outliers with respect to the SOL column for review.
[IMG]: Dimensions 1200 x 600
===== discretize =====
[LOG]: New column added: CLS_SOL
[IMG]: Dimensions 1200 x 600
===== Curation Report END =====

We can see that there are also images in the report! More advanced broadcasters, such as the HTMLBroadcaster, will display these.

In [7]:

from auroris.report.broadcaster import HTMLBroadcaster
import tempfile

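# mkdtemp() creates a directory that persists until it is removed explicitly,
# whereas an unreferenced TemporaryDirectory object may be cleaned up right away
temp_dir = tempfile.mkdtemp()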

broadcaster = HTMLBroadcaster(report=report, destination=temp_dir, embed_images=True)
broadcaster.broadcast()

Out[7]:

'/var/folders/_7/ffxc1f251dbb5msn977xl4sm0000gr/T/tmps2tt3jrb/index.html'

One can review the above HTML report with embedded visualizations and share it with collaborators.

Let's also look at a single row of the newly curated dataset!

In [8]:

dataset.iloc[0]

Out[8]:

mol                                <rdkit.Chem.rdchem.Mol object at 0x173b7c2e0>
ID                                                                             1
NAME                                                                   n-pentane
SOL                                                                        -3.18
SOL_classification                                                       (A) low
smiles                                                                     CCCCC
split                                                                      train
MOL_smiles                                                                 CCCCC
MOL_molhash_id                          3cb2e0cf1b50d8f954891abc5dcce90d543cd3d7
MOL_molhash_id_no_stereo                36551d628217a351e720cdbe676fca3067730a91
MOL_num_stereoisomers                                                        1.0
MOL_num_undefined_stereoisomers                                              1.0
MOL_num_defined_stereo_center                                                0.0
MOL_num_undefined_stereo_center                                              0.0
MOL_num_stereo_center                                                        0.0
MOL_undefined_E_D                                                          False
MOL_undefined_E/Z                                                              0
OUTLIER_SOL                                                                False
CLS_SOL                                                                      0.0
Name: 0, dtype: object

Using the functional API

auroris provides a functional API to quickly run individual curation steps. Let's look at an outlier detection example.

In [9]:

from auroris.curation.functional import detect_outliers
from auroris.visualization import visualize_distribution_with_outliers

y = dataset["SOL"].values
is_outlier = detect_outliers(y, method="zscore")
visualize_distribution_with_outliers(y, is_outlier);

[Figure: distribution of the SOL values with the detected outliers highlighted]

Depending on the type of bioactivity and its distribution, the above plot helps to highlight data points that are potential outliers (data outside the acceptable range) or strong signals.

Reviewing these data points, and removing them if they are truly outliers, can be beneficial for QSAR modeling.
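
For intuition about what a z-score based check does, here is a minimal pure-Python sketch. It is not the auroris implementation: the function name `zscore_flags`, the cutoff of 3 standard deviations, and the use of the population standard deviation are all assumptions, and the input values are made up for illustration.

```python
from statistics import fmean, pstdev


def zscore_flags(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold.

    Assumptions: population standard deviation and a cutoff of 3.
    """
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # All values are identical, so nothing can stand out
        return [False] * len(values)
    return [abs((v - mean) / std) > threshold for v in values]


# Made-up toy values: twenty typical measurements and one extreme reading
values = [-3.0, -3.2] * 10 + [5.0]
flags = zscore_flags(values)
print(sum(flags))  # 1
```

Note that z-scores depend on the mean and standard deviation of the same data, so a single extreme value in a very small sample can inflate the standard deviation enough to avoid being flagged. This is one reason flagged points are candidates for review rather than automatic removal.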

The End.
