Data Models
In short
This tutorial walks you through the dataset and benchmark data-structures. After creating our own custom dataset and benchmark, we will learn how to upload them to the Hub!
We have already seen how easy it is to load a benchmark or dataset from the Polaris Hub. Let's now learn a bit more about the underlying data model by creating our own dataset and benchmark!
Create the dataset
A dataset in Polaris is at its core a tabular data-structure in which each row stores a single datapoint. For this example, we will process a multi-task DMPK dataset from Fang et al. For the sake of simplicity, we don't do any curation and will just download the dataset as-is from their GitHub.
The importance of curation
While we do not address it in this tutorial, data curation is essential to an impactful benchmark. Because of this, we have not just made several high-quality benchmarks readily available on the Polaris Hub, but also open-sourced some of the tools we've built to curate these datasets.
import pandas as pd
PATH = (
"https://raw.githubusercontent.com/molecularinformatics/Computational-ADME/main/ADME_public_set_3521.csv"
)
table = pd.read_csv(PATH)
table.head(5)
| | Internal ID | Vendor ID | SMILES | CollectionName | LOG HLM_CLint (mL/min/kg) | LOG MDR1-MDCK ER (B-A/A-B) | LOG SOLUBILITY PH 6.8 (ug/mL) | LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound) | LOG PLASMA PROTEIN BINDING (RAT) (% unbound) | LOG RLM_CLint (mL/min/kg) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mol1 | 317714313 | CNc1cc(Nc2cccn(-c3ccccn3)c2=O)nn2c(C(=O)N[C@@H... | emolecules | 0.675687 | 1.493167 | 0.089905 | 0.991226 | 0.518514 | 1.392169 |
| 1 | Mol2 | 324056965 | CCOc1cc2nn(CCC(C)(C)O)cc2cc1NC(=O)c1cccc(C(F)F)n1 | emolecules | 0.675687 | 1.040780 | 0.550228 | 0.099681 | 0.268344 | 1.027920 |
| 2 | Mol3 | 304005766 | CN(c1ncc(F)cn1)[C@H]1CCCNC1 | emolecules | 0.675687 | -0.358806 | NaN | 2.000000 | 2.000000 | 1.027920 |
| 3 | Mol4 | 194963090 | CC(C)(Oc1ccc(-c2cnc(N)c(-c3ccc(Cl)cc3)c2)cc1)C... | emolecules | 0.675687 | 1.026662 | 1.657056 | -1.158015 | -1.403403 | 1.027920 |
| 4 | Mol5 | 324059015 | CC(C)(O)CCn1cc2cc(NC(=O)c3cccc(C(F)(F)F)n3)c(C... | emolecules | 0.996380 | 1.010597 | NaN | 1.015611 | 1.092264 | 1.629093 |
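Note the NaN entries in the solubility column: not every endpoint is measured for every molecule, which is typical for multi-task ADME data. As an optional, quick check, we can count how many measurements each endpoint column contains:
# Count non-missing measurements per endpoint column.
# All endpoint columns in this table happen to start with "LOG".
endpoint_cols = [c for c in table.columns if c.startswith("LOG")]
table[endpoint_cols].notna().sum()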
While not required, a good dataset will specify additional meta-data to further explain the data it contains. This can be done both on the column level and on the dataset level.
from polaris.dataset import ColumnAnnotation
# Additional meta-data on the column level
# Of course, for a real dataset we should annotate all columns.
annotations = {
"LOG HLM_CLint (mL/min/kg)": ColumnAnnotation(
desription="Microsomal stability",
user_attributes={"unit": "mL/min/kg"},
),
"SMILES": ColumnAnnotation(desription="Molecule SMILES string", modality="molecule"),
}
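For a real dataset we would annotate every column. Below is a minimal sketch of how the remaining endpoint columns could be annotated in one go; the description strings are illustrative placeholders, not curated annotations:
# Sketch: annotate the remaining endpoint columns programmatically.
# The descriptions below are placeholders for illustration only.
endpoint_cols = [c for c in table.columns if c.startswith("LOG")]
annotations.update(
    {
        col: ColumnAnnotation(description=f"Experimental ADME endpoint: {col}")
        for col in endpoint_cols
        if col not in annotations
    }
)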
from polaris.dataset import Dataset
from polaris.utils.types import HubOwner
dataset = Dataset(
# The table is the core data-structure required to construct a dataset
table=table,
# Additional meta-data on the dataset level.
name="Fang_2023_DMPK",
description="120 prospective data sets, collected over 20 months across six ADME in vitro endpoints",
source="https://doi.org/10.1021/acs.jcim.3c00160",
annotations=annotations,
tags=["DMPK", "ADME"],
owner=HubOwner(user_id="cwognum", slug="cwognum"),
license="CC-BY-4.0",
user_attributes={"year": "2023"},
)
Save and load the dataset
We can now save the dataset either to a local path or directly to the hub!
import tempfile
temp_dir = tempfile.mkdtemp()  # a temporary directory that persists for the rest of this tutorial
import datamol as dm
save_dir = dm.fs.join(temp_dir, "dataset")
path = dataset.to_json(save_dir)
Looking at the save destination, we see this created two files: a JSON file with all the meta-data and a .parquet file with the tabular data.
fs = dm.fs.get_mapper(save_dir).fs
fs.ls(save_dir)
['/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/dataset/table.parquet', '/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/dataset/dataset.json']
Loading the dataset can be done through this JSON file.
import polaris as po
dataset = po.load_dataset(path)
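As a quick sanity check, the reloaded dataset should contain the same table we saved. A small check, assuming the table is exposed as the dataset's table attribute (mirroring the constructor argument):
# Sanity-check the round trip (the `table` attribute is assumed to mirror the constructor argument).
assert len(dataset.table) == len(table)
assert set(dataset.table.columns) == set(table.columns)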
We can also upload the dataset to the hub!
# from polaris.hub.client import PolarisHubClient
# NOTE: Commented out to not flood the DB
# with PolarisHubClient() as client:
# client.upload_dataset(dataset=dataset)
Create the benchmark specification
A benchmark is represented by the BenchmarkSpecification, which wraps a Dataset with additional data to produce a benchmark.
It specifies:
- Which dataset to use (see Dataset);
- Which columns are used as input and which columns are used as target;
- Which metrics should be used to evaluate performance on this task;
- A predefined, static train-test split to use during evaluation.
import numpy as np
from polaris.benchmark import SingleTaskBenchmarkSpecification
# For the sake of simplicity, we use a very simple, ordered split
split = (
    np.arange(3000).tolist(),  # train
    (np.arange(521) + 3000).tolist(),  # test
)
benchmark = SingleTaskBenchmarkSpecification(
dataset=dataset,
target_cols="LOG SOLUBILITY PH 6.8 (ug/mL)",
input_cols="SMILES",
split=split,
metrics="mean_absolute_error",
)
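The ordered split above is only for demonstration. In practice you would typically use a random (or structure-based) split. A minimal sketch of a shuffled split over the same rows, using only NumPy; the 85/15 ratio is an arbitrary choice for illustration:
# A random 85/15 split as an alternative to the ordered split above.
rng = np.random.default_rng(0)  # fixed seed so the split is reproducible
indices = rng.permutation(len(table))
n_train = int(0.85 * len(indices))
random_split = (indices[:n_train].tolist(), indices[n_train:].tolist())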
Metrics must be supported by the Polaris framework. For more information, see the Metric class.
from polaris.evaluate import Metric
list(Metric)
[<Metric.mean_absolute_error: MetricInfo(fn=<function mean_absolute_error at 0x169779c60>, is_multitask=False)>, <Metric.mean_squared_error: MetricInfo(fn=<function mean_squared_error at 0x16977a020>, is_multitask=False)>, <Metric.accuracy: MetricInfo(fn=<function accuracy_score at 0x169758540>, is_multitask=False)>]
To support the flexibility needed when specifying benchmarks, there are different classes that correspond to different types of benchmarks. Each of these subclasses makes the data-model or logic more specific to a particular case. For example, trying to create a multi-task benchmark with the same arguments as we used above will throw an error, because only a single target column is specified.
from polaris.benchmark import MultiTaskBenchmarkSpecification
benchmark = MultiTaskBenchmarkSpecification(
dataset=dataset,
target_cols="LOG SOLUBILITY PH 6.8 (ug/mL)",
input_cols="SMILES",
split=split,
metrics="mean_absolute_error",
)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
/Users/cas.wognum/Documents/repositories/polaris/docs/tutorials/custom_dataset_benchmark.ipynb Cell 25 line 3
      1 from polaris.benchmark import MultiTaskBenchmarkSpecification
----> 3 benchmark = MultiTaskBenchmarkSpecification(
      4     dataset=dataset,
      5     target_cols="LOG SOLUBILITY PH 6.8 (ug/mL)",
      6     input_cols="SMILES",
      7     split=split,
      8     metrics="mean_absolute_error",
      9 )

File ~/micromamba/envs/polaris/lib/python3.11/site-packages/pydantic/main.py:164, in BaseModel.__init__(__pydantic_self__, **data)
    162 # `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks
    163 __tracebackhide__ = True
--> 164 __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)

ValidationError: 1 validation error for MultiTaskBenchmarkSpecification
target_cols
  Value error, A multi-task benchmark should specify at least two target columns [type=value_error, input_value='LOG SOLUBILITY PH 6.8 (ug/mL)', input_type=str]
    For further information visit https://errors.pydantic.dev/2.4/v/value_error
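As the error message explains, a valid multi-task benchmark simply needs two or more target columns. A minimal sketch using two of the endpoints from the table above, assuming the same single-task metric can be reused across targets:
# Sketch of a valid multi-task benchmark: at least two target columns.
benchmark_multitask = MultiTaskBenchmarkSpecification(
    dataset=dataset,
    target_cols=[
        "LOG SOLUBILITY PH 6.8 (ug/mL)",
        "LOG HLM_CLint (mL/min/kg)",
    ],
    input_cols="SMILES",
    split=split,
    metrics="mean_absolute_error",
)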
Save and load the benchmark
Saving the benchmark is easy and can be done with a single line of code.
save_dir = dm.fs.join(temp_dir, "benchmark")
path = benchmark.to_json(save_dir)
fs = dm.fs.get_mapper(save_dir).fs
fs.ls(save_dir)
['/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/benchmark/table.parquet', '/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/benchmark/benchmark.json', '/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/benchmark/dataset.json']
This created three files: two JSON files and a single Parquet file. The Parquet file saves the tabular structure at the base of the Dataset class, whereas the JSON files save all the meta-data for the Dataset and the BenchmarkSpecification.
As before, loading the benchmark can be done through the JSON file.
benchmark = po.load_benchmark(path)
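Once loaded, the benchmark is typically used to retrieve the predefined train and test sets before building a model. A short sketch of that usage, assuming the standard get_train_test_split method from the Polaris benchmark API:
# Retrieve the predefined train and test sets (get_train_test_split assumed).
train, test = benchmark.get_train_test_split()
print(len(train), len(test))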
And as before, we can also upload the benchmark directly to the hub.
# NOTE: Commented out to not flood the DB
# with PolarisHubClient() as client:
# client.upload_benchmark(benchmark=benchmark)
The End.