Impacts of stereochemistry and activity cliffs in quantitative structure-activity relationships (QSAR) models

Activity cliffs¶

Activity Cliffs (ACs) were generally defined as pairs of structurally similar compounds showing a marked difference in potency. The existence of ACs directly defies the intuitive idea that chemical compounds with similar structures should have similar activities, often referred to as the molecular similarity principle, and highlights the remarkable specificity of their interaction with biological systems (e.g., proteins). As such, ACs received strong attention in computational chemistry and drug design Stumpfe et al. 2019, Tilborg et al. 2022.

Activity cliff can be defined based on the substructure of molecules or the similarity computed by molecular representations.

The figure below shows selected examples of activity cliffs from Stumpfe et al. 2019.

Stereoisomerism¶

Stereoisomerism refers to the phenomenon where molecules with the same molecular formula and connectivity (i.e., the same number and type of atoms bound in the same sequence) differ in the spatial arrangement of their atoms, leading to the existence of stereoisomers. Stereoisomers are a type of isomer where the differences arise not from the sequence of bonds but from the orientation of the atoms in three-dimensional space.

🔔¶

In this notebook, we will explore different types of stereoisomerism and how the stereochemical information can be captured through molecular representations. We will also examine the impact of stereoisomers that exhibit significant activity cliffs (ACs) in the context of QSAR modeling.

In [1]:

Copied!





# Note: Cell is tagged to not show up in the mkdocs build
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
from copy import copy
import nglview as nv

import datamol as dm
from molfeat.trans.fp import FPVecTransformer

from utils import add_3d_coord
# Note: Cell is tagged to not show up in the mkdocs build
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
from copy import copy
import nglview as nv

import datamol as dm
from molfeat.trans.fp import FPVecTransformer

from utils import add_3d_coord

Types of stereoisomerisms¶

Optical isomerism¶

Let's look at molecule Ibuprofen CC(C)CC1=CC=C(C=C1)C(C)C(=O)O, a common and well known compound for treating fever and mild pain.

In [2]:

Copied!

ibu = dm.to_mol("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O")
dm.to_image(mols=[ibu], highlight_atom=[10])
ibu = dm.to_mol("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O")
dm.to_image(mols=[ibu], highlight_atom=[10])

Out[2]:

No description has been provided for this image

The molecule is chiral with one stereocenter at atom 10, meaning that that atom is bound to 4 different groups: a methyl, a carboxylic acid, a substituted aromatic ring, and a hydrogen atom (the latter usually not shown in 2D representations).

Chirality is the property of compounds that possess the same molecular formula and atom connectivity, but exist in different 3D configurations that are a non-superimposable mirror image of each other, similarly to your left and right hand! The different stereoisomers of chiral compounds interact differently with beams of polarized light in a phenomenon called optical isomerism.

Let's have a look at all the potential stereoisomers for Ibuprofen.

In [3]:

Copied!





ibu_isomers = dm.isomers.enumerate_stereoisomers(ibu)

dm.to_image(
    ibu_isomers,
    highlight_atom=[[10], [10]],
    legends=["(S)-Ibuprofen", "(R)-Ibuprofen"],
    align=True,
)
ibu_isomers = dm.isomers.enumerate_stereoisomers(ibu)

dm.to_image(
    ibu_isomers,
    highlight_atom=[[10], [10]],
    legends=["(S)-Ibuprofen", "(R)-Ibuprofen"],
    align=True,
)

Out[3]:

If we 'flat' the aromatic and carboxylic acid groups in the plane of your screen, the methyl group attached to the chiral carbon atom can be pointing up towards you or down away from you (remember that C atoms have a tetrahedral disposition of the four groups bound to it; hydrogen atoms are usually omitted in 2D representations).

The molecule possesses a single chiral center that originates two possible stereoisomers. These two stereoisomers are the non-superimposable mirror image of each other thus they are commonly referred as enantiomers.

Let's look at the same enantiomers in 3D:

In [7]:

Copied!

# view (S)-Ibuprofen in 3D
view1 = nv.show_rdkit(ibu_isomers[0])
# view1
# view (S)-Ibuprofen in 3D
view1 = nv.show_rdkit(ibu_isomers[0])
# view1

In [11]:

Copied!

# view (R)-Ibuprofen in 3D
view2 = nv.show_rdkit(ibu_isomers[1])
# view2
# view (R)-Ibuprofen in 3D
view2 = nv.show_rdkit(ibu_isomers[1])
# view2

Although 2D and 3D representations often make for nice images, in cheminformatics we often represent molecules as SMILES strings.

The disposition of the groups on a chiral center in a SMILES string is specified using a single or double @ symbol. In our case the chiral carbon atom would be written either C@H or C@@H.

In the case of the two enantiomers of Ibuprofen, the SMILES for each enantiomers are:

In [12]:

Copied!

[dm.to_smiles(isomer) for isomer in ibu_isomers]
[dm.to_smiles(isomer) for isomer in ibu_isomers]

Out[12]:

['CC(C)Cc1ccc([C@H](C)C(=O)O)cc1', 'CC(C)Cc1ccc([C@@H](C)C(=O)O)cc1']

Stereochemistry information can be also embedded in other type of molecular representation other than SMILES.

For example, if we write the two stereoisomers of Ibuprofen in Inchi strings, we see that there is a small difference to identify the chirality of the carbon atom:

In [13]:

Copied!

[dm.to_inchi(isomer) for isomer in ibu_isomers]
[dm.to_inchi(isomer) for isomer in ibu_isomers]

Out[13]:

['InChI=1S/C13H18O2/c1-9(2)8-11-4-6-12(7-5-11)10(3)13(14)15/h4-7,9-10H,8H2,1-3H3,(H,14,15)/t10-/m0/s1',
 'InChI=1S/C13H18O2/c1-9(2)8-11-4-6-12(7-5-11)10(3)13(14)15/h4-7,9-10H,8H2,1-3H3,(H,14,15)/t10-/m1/s1']

Molecules are often converted to rdkit.mol objects before being transformed other advanced features vectors. Different stereoisomers are detected as different rdkit.mol object (as it should be!):

In [14]:

Copied!

if not dm.same_mol(ibu_isomers[0], ibu_isomers[1]):
    print("The two molecules are not identical!")
if not dm.same_mol(ibu_isomers[0], ibu_isomers[1]):
    print("The two molecules are not identical!")

The two molecules are not identical!

E/Z Stereoisomerism¶

We have briefly explored stereoisomers generated by the presence of a chiral atom, but that is not the only case in which a compound can exist as different stereoisomers.

Another common case of stereoisomerism is the E/Z isomerism. This form of stereoisomerism arises from the different arrangement of groups bound via a double bond. The most common cases are compounds containing a C=C double bond, but also C=N and N=N can be commonly found in nature. The E/Z nomenclature derives from German for Entegen (opposite) and Zusammen (together). In the past another nomenclature for double bond stereochemistry was utilized where E and Z were replaced by 'trans' and 'cis', respectively, but this type of notation is now deprecated and utilized for other scopes.

Due to the nature of double bonds, the four substituents (two for each carbon atom) are all placed in the same plane but they can point in different directions such as in the case of 1-bromo-2-chloropropene:

In [15]:

Copied!





mol = dm.to_mol("CC(=CBr)Cl")
isomers = dm.isomers.enumerate_stereoisomers(mol)

dm.to_image(
    isomers,
    highlight_bond=[[1], [1]],
    legends=["(E)-1-bromo-2-chloroprop-1-ene", "(S)-1-bromo-2-chloroprop-1-ene"],
    align=True,
)
mol = dm.to_mol("CC(=CBr)Cl")
isomers = dm.isomers.enumerate_stereoisomers(mol)

dm.to_image(
    isomers,
    highlight_bond=[[1], [1]],
    legends=["(E)-1-bromo-2-chloroprop-1-ene", "(S)-1-bromo-2-chloroprop-1-ene"],
    align=True,
)

Out[15]:

As before, we can see that the two possible stereoisomers have the same type of atoms and the same connectivity, but quite evidently the two halogen atoms (bromo and chloro) can either be placed on the same or the opposite 'side' of the C=C double bond'

Similarly to chiral centers, SMILES strings have a special symbol to identify E/Z isomers: \ or /:

In [16]:

Copied!

[dm.to_smiles(isomer) for isomer in isomers]
[dm.to_smiles(isomer) for isomer in isomers]

Out[16]:

['C/C(Cl)=C\\Br', 'C/C(Cl)=C/Br']

And as for two enantiomers, E and Z isomers are not the same molecule!

In [17]:

Copied!

if not dm.same_mol(isomers[0], isomers[1]):
    print("The two molecules are not identical!")
if not dm.same_mol(isomers[0], isomers[1]):
    print("The two molecules are not identical!")

The two molecules are not identical!

In [18]:

Copied!

assert dm.to_inchi(ibu_isomers[0]) != dm.to_inchi(ibu_isomers[1])
assert dm.to_inchi(ibu_isomers[0]) != dm.to_inchi(ibu_isomers[1])

We have seen molecules that contain a single chiral center or a double bond and how we define their corresponding stereoisomers. Nature is very often a bit more creative, and it is very common to encounter compounds that contains multiple chiral atoms and/or one or more double bonds!

Please, refer to the diagram at the beginning of this tutorial for a complete overview of all the possibilities but here a couple of examples of compounds containing multiple types of stereochemical information:

In [19]:

Copied!

# Multiple stereocenters (chiral atoms) in one molecule
mols_1 = dm.to_mol("CC(C(=O)O)C(N)C")
dm.to_image(dm.isomers.enumerate_stereoisomers(mols_1), align=True)
# Multiple stereocenters (chiral atoms) in one molecule
mols_1 = dm.to_mol("CC(C(=O)O)C(N)C")
dm.to_image(dm.isomers.enumerate_stereoisomers(mols_1), align=True)

Out[19]:

In [20]:

Copied!

# Optical isomerism and E/Z isomerism in one molecule
mols_2 = dm.to_mol("O=C1OC(=CI)CCC1c1cccc2ccccc12")
dm.to_image(dm.isomers.enumerate_stereoisomers(mols_2), align=True)
# Optical isomerism and E/Z isomerism in one molecule
mols_2 = dm.to_mol("O=C1OC(=CI)CCC1c1cccc2ccccc12")
dm.to_image(dm.isomers.enumerate_stereoisomers(mols_2), align=True)

Out[20]:

Is stereochemistry always important in QSAR?¶

Short answer: YES! The 3D arrangement of groups in a molecule has a profound impact on how different stereoisomers interact with biological targets. Remember that every single protein is nothing more that a very big molecule that contains a large amount of chiral centers, therefore proteins and biological entities are stereoisomers themselves. Different stereoisomers (e.g., enantiomers, diastereomers) will interact differently with their biological targets showing vastly different pharmacology, binding affinity, activity and potency, etc...

Examples: The binding site of the β2-adrenergic receptor has a specific 3D shape where (R)-albuterol fits into more precisely than the (S)-albuterol. (R)-albuterol provides more effective treatment with fewer side effects targeting β2-adrenergic receptors in comparison to (S)-albuterol. Another example is Penicillamine, where the (S) enantiomer shows the desired antiarthritic activity, whereas the (R) enantiomer is extremely toxic!

In the rare cases where stereochemistry is found not to affect the biological activity of a series of compounds, the stereochemical information may be unnecessary in handling these datasets (always check before removing or ignoring stereochemistry).

Furthermore, stereochemistry may have also an impact on the physical (e.g., solubility, melting point) and physicochemical (e.g., logP, TPSA) properties of a molecule, but not always: - Enantiomers always show the same physical and physicochemical properties; - Diastereomers often show different physical properties but they have the same physicochemical properties.

In conclusion, it is always a good (and safe!) choice to include stereochemical information when possible, but in certain cases and for certain type of analyses it may not be necessary.

Stereochemistry in cheminformatics

As mentioned above, when studying the interaction between small molecules and proteins, it is very important to define ALL the stereochemical information of a molecule, if possible. Missing/unknown information or limitations in the chemical synthesis and purification of chemical compounds may result in the impossibility to fully define the stereochemistry of these molecules.

Undefined or partially defined stereochemical information is unfortunately quite common, especially in the early stages of a drug development program, and it usually challenges efforts in QSAR modeling.

It is worth mentioning that, in early-stage drug discovery programs, where high-throughput screening (HTS) is utilized to identify hit compounds from very large libraries, the focus is more oriented towards broad activity trends rather than precise stereochemical interactions. As a result, at this stage, stereochemistry may be ignored by routinely testing racemic mixtures (mixtures containing equal amounts of the possible enantiomers), and mixtures of diastereomers. Therefore, when utilizing models trained on such data lacking accurate stereochemical information, it is important to specify the limitations of these models, particularly regarding their applicability to other molecular databases.

Stereoisomers with activity cliff and the impact on QSAR¶

Stereoisomers might exhibit activity cliffs: different stereoisomers of the same molecule showing large differences in biological activity. Stereoisomers associated to activity cliffs can pose challenges to QSAR modeling by creating sharp contrasts in biological activity vs chemical connectivity that are not easily captured by standard models.

Precise representation of stereochemistry and careful model design are crucial for addressing these challenges and improving the predictive accuracy and interpretability of QSAR models.

Choice of molecular representations for stereoisomerism¶

Here we use featurizers(molecular transformers) from molfeat which is a hub of molecular featurizers. It supports a wide variety of out-of-the-box molecular featurizers from Fingerprints, pretrained-featurizers, deep embeddings (proteins+small mol), to molecular descriptors.

In [21]:

Copied!





# generate conformers/3d coordinates
mol1_3d = dm.conformers.generate(ibu_isomers[0])
mol2_3d = dm.conformers.generate(ibu_isomers[1])
mol3_3d = dm.conformers.generate(isomers[0])
mol4_3d = dm.conformers.generate(isomers[1])
# generate conformers/3d coordinates
mol1_3d = dm.conformers.generate(ibu_isomers[0])
mol2_3d = dm.conformers.generate(ibu_isomers[1])
mol3_3d = dm.conformers.generate(isomers[0])
mol4_3d = dm.conformers.generate(isomers[1])

Featurizer 1: Let’s use the ECFP featurizer, which encodes the presence of specific substructures or features in a molecule. It is one of the most commonly used molecular representations in QSAR modeling.

In [22]:

Copied!





transfomer_1 = FPVecTransformer(kind="ecfp:4", length=2042)
fp1 = transfomer_1(mol1_3d)
fp2 = transfomer_1(mol2_3d)
if np.array_equal(fp1, fp2):
    print(f"This featurizer '{transfomer_1.kind}' can't differentiate the two enantiomers!")
transfomer_1 = FPVecTransformer(kind="ecfp:4", length=2042)
fp1 = transfomer_1(mol1_3d)
fp2 = transfomer_1(mol2_3d)
if np.array_equal(fp1, fp2):
    print(f"This featurizer '{transfomer_1.kind}' can't differentiate the two enantiomers!")

This featurizer 'ecfp:4' can't differentiate the two enantiomers!

In [23]:

Copied!





fp3 = transfomer_1(mol3_3d)
fp4 = transfomer_1(mol4_3d)
if np.array_equal(fp3, fp4):
    print(f"This featurizer '{transfomer_1.kind}' can't differentiate the two diastereomers!")
fp3 = transfomer_1(mol3_3d)
fp4 = transfomer_1(mol4_3d)
if np.array_equal(fp3, fp4):
    print(f"This featurizer '{transfomer_1.kind}' can't differentiate the two diastereomers!")

This featurizer 'ecfp:4' can't differentiate the two diastereomers!

Now let’s explore more sophisticated featurizers that captures the 3D information of molecules.

Featurizer 2: 3D version of the CATS pharmacophores computed with the Pharm2D module in RDKit.

In [24]:

Copied!





from molfeat.trans.base import MoleculeTransformer
from molfeat.calc.pharmacophore import Pharmacophore3D

# define transformer
transformer_2 = MoleculeTransformer(featurizer=Pharmacophore3D(factory="cats"), dtype=float)
feat1 = transformer_2(mol1_3d)
feat2 = transformer_2(mol2_3d)
if not np.array_equal(feat1, feat2):
    print(f"This featurizer '{transformer_2.featurizer.factory}' can differentiate the enantiomers!")
from molfeat.trans.base import MoleculeTransformer
from molfeat.calc.pharmacophore import Pharmacophore3D

# define transformer
transformer_2 = MoleculeTransformer(featurizer=Pharmacophore3D(factory="cats"), dtype=float)
feat1 = transformer_2(mol1_3d)
feat2 = transformer_2(mol2_3d)
if not np.array_equal(feat1, feat2):
    print(f"This featurizer '{transformer_2.featurizer.factory}' can differentiate the enantiomers!")

This featurizer 'cats' can differentiate the enantiomers!

In [25]:

Copied!





feat3 = transformer_2(mol3_3d)
feat4 = transformer_2(mol4_3d)
if not np.array_equal(feat1, feat2):
    print(f"This featurizer '{transformer_2.featurizer.factory}' can differentiate the diastereomers!")
feat3 = transformer_2(mol3_3d)
feat4 = transformer_2(mol4_3d)
if not np.array_equal(feat1, feat2):
    print(f"This featurizer '{transformer_2.featurizer.factory}' can differentiate the diastereomers!")

This featurizer 'cats' can differentiate the diastereomers!

Please find more details about featurizer pharm3D-cats and explore other available featurizers on molfeat page.

In [ ]:

QSAR modeling with stereoisomers exhibiting activity cliffs¶

In this section, we’ll take a closer look at how activity cliffs between stereoisomers can impact the predictive performance of QSAR modeling.

Example case: Dopamine receptor D2¶

The dopamine D2 receptor, like many G-protein-coupled receptors (GPCRs), exhibits stereoselectivity, meaning that stereoisomers A of a small molecule may bind more effectively than stereoisomer B, C, of the same molecule. In these common cases, stereoselectivity plays a crucial role in the pharmacological effects of drugs targeting the D2 receptor Xiao et al. 2014.

Here, our goal is to train predictive models that perform well on both molecules that exhibit activity cliffs and those that do not.

Note: The dataset used in this notebook is heterogeneous and extracted from CHEMBL for demonstration purpose.

In [20]:

Copied!

# Load data set of small molecule inhibitors of dopamine receptor D2
data = pd.read_csv("data/dopamine_d2_data.csv")
# Load data set of small molecule inhibitors of dopamine receptor D2
data = pd.read_csv("data/dopamine_d2_data.csv")

In [3]:

Copied!





mol_col = "MOL_smiles"  # molecule smiles strings
val_col = "CLS_pChEMBL Value"  # class label for binding affinity

mols = dm.utils.parallelized(
    fn=lambda mol: add_3d_coord(mol),
    inputs_list=data[mol_col],
    progress=True,
    n_jobs=10,
)
data["mol"] = mols
mol_col = "MOL_smiles"  # molecule smiles strings
val_col = "CLS_pChEMBL Value"  # class label for binding affinity

mols = dm.utils.parallelized(
    fn=lambda mol: add_3d_coord(mol),
    inputs_list=data[mol_col],
    progress=True,
    n_jobs=10,
)
data["mol"] = mols

In [22]:

Copied!





## `dm.conformers.generate` is a better choice for 3d coordinates and conformer generation.
## Note: this function might take longer time
# mols = dm.utils.parallelized(
#     fn=lambda mol: dm.conformers.generate(dm.to_mol(mol)),
#     inputs_list=data[mol_col],
#     progress=True,
#     n_jobs=10,
# )
# data["mol"] = mols
## `dm.conformers.generate` is a better choice for 3d coordinates and conformer generation.
## Note: this function might take longer time
# mols = dm.utils.parallelized(
#     fn=lambda mol: dm.conformers.generate(dm.to_mol(mol)),
#     inputs_list=data[mol_col],
#     progress=True,
#     n_jobs=10,
# )
# data["mol"] = mols

In [23]:

Copied!

# the dataset was pre-split to train/test based on the presence of activity cliffs
data_train = data.query("split == 'train'")
data_test = data.query("split == 'test'")
# the dataset was pre-split to train/test based on the presence of activity cliffs
data_train = data.query("split == 'train'")
data_test = data.query("split == 'test'")

In [24]:

Copied!

col_AC = "AC_CLS_pChEMBL Value"  # boolean: whether the molecule show activity cliff in the dataset
data.groupby("split")[col_AC].value_counts()
col_AC = "AC_CLS_pChEMBL Value"  # boolean: whether the molecule show activity cliff in the dataset
data.groupby("split")[col_AC].value_counts()

Out[24]:

split  AC_CLS_pChEMBL Value
test   False                   1361
       True                     103
train  False                   5451
       True                     102
Name: count, dtype: int64

Training with simple RandomForest model¶

To evaluate the impact of including stereochemistry information on prediction performance, we will train two QSAR models using the same machine learning algorithm and parameters.

One model will use the ECFP:4 featurizer, which does not capture stereochemistry, while the other will use the pharm3D-cats featurizer, which includes 3D stereochemical information.

In [25]:

Copied!





# Define the model with specified parameters

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    bootstrap=True,
    criterion="gini",
    verbose=True,
    n_jobs=-1,
)
# Define the model with specified parameters

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    bootstrap=True,
    criterion="gini",
    verbose=True,
    n_jobs=-1,
)

Model 1: RandomForestClassifier + ECFP:4

In [26]:

Copied!





# Transform with featurizer which can't differe stereoisomers
transformer_1 = FPVecTransformer(kind="ecfp:4", dtype=float)
X_train = transformer_1(data_train["mol"].values)

# Define the labels
y_train = data_train[val_col].values

model_1 = copy(model)
# Fit the model
model_1.fit(X_train, y_train)
# Transform with featurizer which can't differe stereoisomers
transformer_1 = FPVecTransformer(kind="ecfp:4", dtype=float)
X_train = transformer_1(data_train["mol"].values)

# Define the labels
y_train = data_train[val_col].values

model_1 = copy(model)
# Fit the model
model_1.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.3s finished

Out[26]:

RandomForestClassifier(n_jobs=-1, verbose=True)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Model 2: RandomForestClassifier + pharm3D-cats

In [27]:

Copied!





X_train = transformer_2(data_train["mol"].values)

# Define the labels
y_train = data_train[val_col].values
model_2 = copy(model)
# Fit the model
model_2.fit(X_train, y_train)
X_train = transformer_2(data_train["mol"].values)

# Define the labels
y_train = data_train[val_col].values
model_2 = copy(model)
# Fit the model
model_2.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.3s finished

Out[27]:

RandomForestClassifier(n_jobs=-1, verbose=True)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Compare the predictive performance on test set¶

In [28]:

Copied!

results = {"overall": {}, "AC": {}}
results = {"overall": {}, "AC": {}}

Overall performance on entire test set¶

In [29]:

Copied!

from utils import evaluate

results["overall"]["rf+ecfp"] = evaluate(data_test, transformer_1, model_1, mol_col="mol", val_col=val_col)

results["overall"]["rf+pharm3Dcats"] = evaluate(
    data_test, transformer_2, model_2, mol_col="mol", val_col=val_col
)
from utils import evaluate

results["overall"]["rf+ecfp"] = evaluate(data_test, transformer_1, model_1, mol_col="mol", val_col=val_col)

results["overall"]["rf+pharm3Dcats"] = evaluate(
    data_test, transformer_2, model_2, mol_col="mol", val_col=val_col
)

[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:    0.0s
[Parallel(n_jobs=10)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:    0.0s
[Parallel(n_jobs=10)]: Done 100 out of 100 | elapsed:    0.0s finished

In [30]:

Copied!

pd.DataFrame(results["overall"])
pd.DataFrame(results["overall"])

Out[30]:

	rf+ecfp	rf+pharm3Dcats
precision	0.807947	0.749466
recall	0.798116	0.705505
f1-score	0.802038	0.711946
support	1464.000000	1464.000000

Performance on stereoisomers which show activity cliff¶

In [31]:

Copied!





results["AC"]["rf+ecfp"] = evaluate(
    data_test.query("`AC_CLS_pChEMBL Value` == True"),
    transformer_1,
    model_1,
    mol_col="mol",
    val_col=val_col,
)
results["AC"]["rf+pharm3Dcats"] = evaluate(
    data_test.query("`AC_CLS_pChEMBL Value` == True"),
    transformer_2,
    model_2,
    mol_col="mol",
    val_col=val_col,
)
results["AC"]["rf+ecfp"] = evaluate(
    data_test.query("`AC_CLS_pChEMBL Value` == True"),
    transformer_1,
    model_1,
    mol_col="mol",
    val_col=val_col,
)
results["AC"]["rf+pharm3Dcats"] = evaluate(
    data_test.query("`AC_CLS_pChEMBL Value` == True"),
    transformer_2,
    model_2,
    mol_col="mol",
    val_col=val_col,
)

[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:    0.0s
[Parallel(n_jobs=10)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:    0.0s
[Parallel(n_jobs=10)]: Done 100 out of 100 | elapsed:    0.0s finished

In [32]:

Copied!

pd.DataFrame(results["AC"])
pd.DataFrame(results["AC"])

Out[32]:

	rf+ecfp	rf+pharm3Dcats
precision	0.290427	0.480769
recall	0.292642	0.481887
f1-score	0.290192	0.475751
support	103.000000	103.000000

QSAR modeling involving stereoisomers and activity cliffs is always challenging. The results above demonstrate that, while the model trained with pharm3D-cats underperformed in overall prediction accuracy, it significantly outperformed when predicting the activity of stereoisomers with activity cliffs. This highlights the importance of using proper molecular representations that effectively capture stereochemical information.

When applying predictive models to a different set of molecules, it is crucial to carefully select the appropriate model for deployment. In this example, if we aim to make predictions on a lead series of molecules that contains stereochemical features that are essential for binding to the D2 target, the model rf+pharm3Dcats should be chosen. Conversely, for datasets consisting of non-chiral molecules or compounds that do not contain any stereogenic feature, the rf+ecfp model would be sufficient.

Final notes¶

Careful data curation and auditing is essential to determine whether the observed bioactivity of a series of compounds is highly dependent on their stereochemistry or not. Auroris simplifies and streamlines this process by providing the StereoIsomerACDetection module and the corresponding detect_stereoisomer_activity_cliff function to easily assess this instances.

For guidance on how to combine different Auroris modules for small molecule dataset curation, please checkout this tutorial.

The end.