Data Models
In short
This tutorial walks you through the dataset and benchmark data-structures. After creating our own custom dataset and benchmark, we will learn how to upload them to the Hub!
We have already seen how easy it is to load a benchmark or dataset from the Polaris Hub. Let's now learn a bit more about the underlying data model by creating our own dataset and benchmark!
Create the dataset
A dataset in Polaris is at its core a tabular data-structure in which each row stores a single datapoint. For this example, we will process a multi-task DMPK dataset from Fang et al. For the sake of simplicity, we don't do any curation and will just download the dataset as-is from their GitHub.
The importance of curation
While we do not address it in this tutorial, data curation is essential to an impactful benchmark. Because of this, we have not just made several high-quality benchmarks readily available on the Polaris Hub, but also open-sourced some of the tools we've built to curate these datasets.
import pandas as pd
PATH = (
"https://raw.githubusercontent.com/molecularinformatics/Computational-ADME/main/ADME_public_set_3521.csv"
)
table = pd.read_csv(PATH)
table.head(5)
| | Internal ID | Vendor ID | SMILES | CollectionName | LOG HLM_CLint (mL/min/kg) | LOG MDR1-MDCK ER (B-A/A-B) | LOG SOLUBILITY PH 6.8 (ug/mL) | LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound) | LOG PLASMA PROTEIN BINDING (RAT) (% unbound) | LOG RLM_CLint (mL/min/kg) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mol1 | 317714313 | CNc1cc(Nc2cccn(-c3ccccn3)c2=O)nn2c(C(=O)N[C@@H... | emolecules | 0.675687 | 1.493167 | 0.089905 | 0.991226 | 0.518514 | 1.392169 |
| 1 | Mol2 | 324056965 | CCOc1cc2nn(CCC(C)(C)O)cc2cc1NC(=O)c1cccc(C(F)F)n1 | emolecules | 0.675687 | 1.040780 | 0.550228 | 0.099681 | 0.268344 | 1.027920 |
| 2 | Mol3 | 304005766 | CN(c1ncc(F)cn1)[C@H]1CCCNC1 | emolecules | 0.675687 | -0.358806 | NaN | 2.000000 | 2.000000 | 1.027920 |
| 3 | Mol4 | 194963090 | CC(C)(Oc1ccc(-c2cnc(N)c(-c3ccc(Cl)cc3)c2)cc1)C... | emolecules | 0.675687 | 1.026662 | 1.657056 | -1.158015 | -1.403403 | 1.027920 |
| 4 | Mol5 | 324059015 | CC(C)(O)CCn1cc2cc(NC(=O)c3cccc(C(F)(F)F)n3)c(C... | emolecules | 0.996380 | 1.010597 | NaN | 1.015611 | 1.092264 | 1.629093 |
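Note the NaN entries in the solubility column: not every endpoint is measured for every molecule, which is typical for multi-task ADME data. As an optional, quick check, we can count how many measurements each endpoint column contains:
# Count non-missing measurements per endpoint column.
# All endpoint columns in this table happen to start with "LOG".
endpoint_cols = [c for c in table.columns if c.startswith("LOG")]
table[endpoint_cols].notna().sum()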
While not required, a good dataset will specify additional meta-data to further explain the data it contains. This can be done both on the column level and on the dataset level.
from polaris.dataset import ColumnAnnotation
# Additional meta-data on the column level
# Of course, for a real dataset we should annotate all columns.
annotations = {
"LOG HLM_CLint (mL/min/kg)": ColumnAnnotation(
desription="Microsomal stability",
user_attributes={"unit": "mL/min/kg"},
),
"SMILES": ColumnAnnotation(desription="Molecule SMILES string", modality="molecule"),
}
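For a real dataset we would annotate every column. Below is a minimal sketch of how the remaining endpoint columns could be annotated in one go; the description strings are illustrative placeholders, not curated annotations:
# Sketch: annotate the remaining endpoint columns programmatically.
# The descriptions below are placeholders for illustration only.
endpoint_cols = [c for c in table.columns if c.startswith("LOG")]
annotations.update(
    {
        col: ColumnAnnotation(description=f"Experimental ADME endpoint: {col}")
        for col in endpoint_cols
        if col not in annotations
    }
)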
from polaris.dataset import Dataset
from polaris.utils.types import HubOwner
dataset = Dataset(
# The table is the core data-structure required to construct a dataset
table=table,
# Additional meta-data on the dataset level.
name="Fang_2023_DMPK",
description="120 prospective data sets, collected over 20 months across six ADME in vitro endpoints",
source="https://doi.org/10.1021/acs.jcim.3c00160",
annotations=annotations,
tags=["DMPK", "ADME"],
owner=HubOwner(user_id="cwognum", slug="cwognum"),
license="CC-BY-4.0",
user_attributes={"year": "2023"},
)
Save and load the dataset
We can now save the dataset either to a local path or directly to the hub!
import tempfile
temp_dir = tempfile.mkdtemp()  # a temporary directory that persists for the rest of this tutorial
import datamol as dm
save_dir = dm.fs.join(temp_dir, "dataset")
path = dataset.to_json(save_dir)
Looking at the save destination, we see this created two files: a JSON file with all the meta-data and a .parquet file with the tabular data.
fs = dm.fs.get_mapper(save_dir).fs
fs.ls(save_dir)
['/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/dataset/table.parquet', '/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/dataset/dataset.json']
Loading the dataset can be done through this JSON file.
import polaris as po
dataset = po.load_dataset(path)
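As a quick sanity check, the reloaded dataset should contain the same table we saved. A small check, assuming the table is exposed as the dataset's table attribute (mirroring the constructor argument):
# Sanity-check the round trip (the `table` attribute is assumed to mirror the constructor argument).
assert len(dataset.table) == len(table)
assert set(dataset.table.columns) == set(table.columns)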
We can also upload the dataset to the hub!
# from polaris.hub.client import PolarisHubClient
# NOTE: Commented out to not flood the DB
# with PolarisHubClient() as client:
# client.upload_dataset(dataset=dataset)
Create the benchmark specification
A benchmark is represented by the BenchmarkSpecification, which wraps a Dataset with additional data to produce a benchmark.
It specifies:
- Which dataset to use (see Dataset);
- Which columns are used as input and which columns are used as target;
- Which metrics should be used to evaluate performance on this task;
- A predefined, static train-test split to use during evaluation.
import numpy as np
from polaris.benchmark import SingleTaskBenchmarkSpecification
# For the sake of simplicity, we use a very simple, ordered split
split = (
    np.arange(3000).tolist(),  # train
    (np.arange(521) + 3000).tolist(),  # test
)
benchmark = SingleTaskBenchmarkSpecification(
dataset=dataset,
target_cols="LOG SOLUBILITY PH 6.8 (ug/mL)",
input_cols="SMILES",
split=split,
metrics="mean_absolute_error",
)
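The ordered split above is only for demonstration. In practice you would typically use a random (or structure-based) split. A minimal sketch of a shuffled split over the same rows, using only NumPy; the 85/15 ratio is an arbitrary choice for illustration:
# A random 85/15 split as an alternative to the ordered split above.
rng = np.random.default_rng(0)  # fixed seed so the split is reproducible
indices = rng.permutation(len(table))
n_train = int(0.85 * len(indices))
random_split = (indices[:n_train].tolist(), indices[n_train:].tolist())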
Metrics must be supported by the Polaris framework. For more information, see the Metric class.
from polaris.evaluate import Metric
list(Metric)
[<Metric.mean_absolute_error: MetricInfo(fn=<function mean_absolute_error at 0x169779c60>, is_multitask=False)>, <Metric.mean_squared_error: MetricInfo(fn=<function mean_squared_error at 0x16977a020>, is_multitask=False)>, <Metric.accuracy: MetricInfo(fn=<function accuracy_score at 0x169758540>, is_multitask=False)>]
To support the flexibility needed when specifying benchmarks, there are different classes that correspond to different types of benchmarks. Each of these subclasses makes the data-model or logic more specific to a particular case. For example, trying to create a multi-task benchmark with the same arguments as we used above will throw an error, because only a single target column is specified.
from polaris.benchmark import MultiTaskBenchmarkSpecification
benchmark = MultiTaskBenchmarkSpecification(
dataset=dataset,
target_cols="LOG SOLUBILITY PH 6.8 (ug/mL)",
input_cols="SMILES",
split=split,
metrics="mean_absolute_error",
)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
/Users/cas.wognum/Documents/repositories/polaris/docs/tutorials/custom_dataset_benchmark.ipynb Cell 25 line 3
      1 from polaris.benchmark import MultiTaskBenchmarkSpecification
----> 3 benchmark = MultiTaskBenchmarkSpecification(
      4     dataset=dataset,
      5     target_cols="LOG SOLUBILITY PH 6.8 (ug/mL)",
      6     input_cols="SMILES",
      7     split=split,
      8     metrics="mean_absolute_error",
      9 )

File ~/micromamba/envs/polaris/lib/python3.11/site-packages/pydantic/main.py:164, in BaseModel.__init__(__pydantic_self__, **data)
    162 # `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks
    163 __tracebackhide__ = True
--> 164 __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)

ValidationError: 1 validation error for MultiTaskBenchmarkSpecification
target_cols
  Value error, A multi-task benchmark should specify at least two target columns [type=value_error, input_value='LOG SOLUBILITY PH 6.8 (ug/mL)', input_type=str]
    For further information visit https://errors.pydantic.dev/2.4/v/value_error
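As the error message explains, a valid multi-task benchmark simply needs two or more target columns. A minimal sketch using two of the endpoints from the table above, assuming the same single-task metric can be reused across targets:
# Sketch of a valid multi-task benchmark: at least two target columns.
benchmark_multitask = MultiTaskBenchmarkSpecification(
    dataset=dataset,
    target_cols=[
        "LOG SOLUBILITY PH 6.8 (ug/mL)",
        "LOG HLM_CLint (mL/min/kg)",
    ],
    input_cols="SMILES",
    split=split,
    metrics="mean_absolute_error",
)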
Save and load the benchmark
Saving the benchmark is easy and can be done with a single line of code.
save_dir = dm.fs.join(temp_dir, "benchmark")
path = benchmark.to_json(save_dir)
fs = dm.fs.get_mapper(save_dir).fs
fs.ls(save_dir)
['/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/benchmark/table.parquet', '/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/benchmark/benchmark.json', '/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/benchmark/dataset.json']
This created three files: two JSON files and a single Parquet file. The Parquet file saves the tabular structure at the base of the Dataset class, whereas the JSON files save all the meta-data for the Dataset and the BenchmarkSpecification.
As before, loading the benchmark can be done through the JSON file.
benchmark = po.load_benchmark(path)
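Once loaded, the benchmark is typically used to retrieve the predefined train and test sets before building a model. A short sketch of that usage, assuming the standard get_train_test_split method from the Polaris benchmark API:
# Retrieve the predefined train and test sets (get_train_test_split assumed).
train, test = benchmark.get_train_test_split()
print(len(train), len(test))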
And as before, we can also upload the benchmark directly to the hub.
# NOTE: Commented out to not flood the DB
# with PolarisHubClient() as client:
# client.upload_benchmark(benchmark=benchmark)
The End.