The Basics¶
In short
This tutorial walks you through the basic usage of Polaris. We will first log in to the hub and then see how easy it is to load a dataset or benchmark from it. Finally, we will train a simple baseline to submit a first set of results!
Polaris is designed to standardize the process of constructing datasets, specifying benchmarks and evaluating novel machine learning techniques within the realm of drug discovery.
While the Polaris library can be used independently from the Polaris Hub, the two were designed to seamlessly work together. The hub provides various pre-made, high quality datasets and benchmarks to develop and evaluate novel ML methods. In this tutorial, we will see how easy it is to load and use these datasets and benchmarks.
import polaris as po
from polaris.hub.client import PolarisHubClient
Login¶
To complete this step, you will need a Polaris Hub account. Go to https://polarishub.io/ to create one. You only have to log in once at the start, or again when you haven't used your account in a while.
client = PolarisHubClient()
client.login()
Instead of the Python API, you can also log in through the Polaris CLI. See:
polaris login --help
Load from the Hub¶
Both datasets and benchmarks are identified by an owner/name id. You can easily find and copy these ids through the Hub. Once you have the id, loading a dataset or benchmark takes a single line of code.
dataset = po.load_dataset("polaris/hello-world")
benchmark = po.load_benchmark("polaris/hello-world-benchmark")
Use the benchmark¶
The polaris library is designed to make it easy to participate in a benchmark. In just a few lines of code, we can get the train and test partitions, access the associated data in various ways, and evaluate our predictions. There are two main API endpoints:

- get_train_test_split(): Creates objects through which we can access the different dataset partitions.
- evaluate(): Evaluates a set of predictions in accordance with the benchmark protocol.
train, test = benchmark.get_train_test_split()
The created objects support various flavours of accessing the data:
- The objects are iterable;
- The objects can be indexed;
- The objects have properties to access all data at once.
# The objects are iterable
for x, y in train:
    pass

# The objects can be indexed
for i in range(len(train)):
    x, y = train[i]

# The objects have properties to access all data at once
x = train.inputs
y = train.targets
To avoid accidental access to the test targets, the test object does not expose the labels and will throw an error if you try to access them explicitly.
# The test object is iterable, but yields inputs only
for x in test:
    pass

# Indexing likewise returns inputs only
for i in range(len(test)):
    x = test[i]

x = test.inputs

# NOTE: The below will throw an error!
# y = test.targets
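The guard above can be pictured with a small, self-contained sketch. The ToyTestSplit class below is purely illustrative and is not Polaris's actual implementation; it only demonstrates the pattern of a split object that serves inputs while refusing to reveal its labels:

```python
# Illustrative only: a toy stand-in for how a test split can hide its labels.
# Polaris's real implementation differs; this just shows the guard pattern.

class HiddenTargetError(AttributeError):
    """Raised when code tries to read the held-out test labels."""


class ToyTestSplit:
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self._targets = targets  # kept private, never exposed

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, i):
        # Indexing returns the input only, never the label
        return self.inputs[i]

    @property
    def targets(self):
        raise HiddenTargetError("Test targets are hidden; submit predictions instead.")


split = ToyTestSplit(inputs=["CCO", "CCN"], targets=[0.3, 0.7])
print(split[0])  # inputs are freely accessible
try:
    split.targets  # any attempt to read the labels raises
except HiddenTargetError as err:
    print("blocked:", err)
```

Keeping the labels behind a raising property (rather than simply omitting them) means a mistake fails loudly instead of silently training on test data.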
Partake in the benchmark¶
To complete our example, let's participate in the benchmark. We will train a simple random forest model on ECFP fingerprints using scikit-learn and datamol.
import datamol as dm
from sklearn.ensemble import RandomForestRegressor
# Load the benchmark (automatically loads the underlying dataset as well)
benchmark = po.load_benchmark("polaris/hello-world-benchmark")
# Get the split and convert SMILES to ECFP fingerprints by passing a featurization function.
train, test = benchmark.get_train_test_split(featurization_fn=dm.to_fp)
# Define a model and train
model = RandomForestRegressor(max_depth=2, random_state=0)
model.fit(train.X, train.y)
RandomForestRegressor(max_depth=2, random_state=0)
To evaluate a model within Polaris, you should use the evaluate() endpoint. This only requires you to provide the predictions; the targets of the test set are extracted automatically, minimizing the chance that you ever touch the test labels.
predictions = model.predict(test.X)
results = benchmark.evaluate(predictions)
results
name | None
description |
tags |
user_attributes |
owner | None
polaris_version | 0.0.2.dev191+g82e7db2
benchmark_name | hello-world-benchmark
benchmark_owner |
github_url | None
paper_url | None
contributors | None
artifact_id | None
benchmark_artifact_id | polaris/hello-world-benchmark
results |
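Conceptually, evaluate() scores your predictions against targets the benchmark holds internally, so the labels never pass through your hands. Here is a minimal stdlib sketch of that contract; the ToyBenchmark class and the choice of mean absolute error are assumptions for illustration, not Polaris's actual code or metrics:

```python
# Illustrative only: a toy benchmark that scores predictions against
# targets it keeps internally, mimicking the evaluate() contract.

class ToyBenchmark:
    def __init__(self, test_targets):
        self._test_targets = test_targets  # held internally, never exposed

    def evaluate(self, predictions):
        if len(predictions) != len(self._test_targets):
            raise ValueError("Need one prediction per test sample.")
        # Mean absolute error as a stand-in for the benchmark's real metrics.
        mae = sum(
            abs(p - t) for p, t in zip(predictions, self._test_targets)
        ) / len(predictions)
        return {"mean_absolute_error": mae}


toy = ToyBenchmark(test_targets=[0.1, 0.5, 0.9])
print(toy.evaluate([0.2, 0.4, 1.0]))
```

Because the benchmark object owns the targets and only returns aggregate scores, the user-facing API surface never leaks per-sample test labels.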
Before uploading the results to the Hub, you can provide some additional information about the results that will be displayed on the Polaris Hub.
# For a complete list of meta-data, check out the BenchmarkResults object
results.name = "hello-world-result"
results.github_url = "https://github.com/polaris-hub/polaris-hub"
results.paper_url = "https://polarishub.io/"
results.description = "Hello, World!"
Finally, let's upload the results to the Hub! The result will be private, but by visiting the link printed in the logs you can decide to make it public through the Hub.
client.upload_results(results, owner="cwognum")
client.close()
That's it! Just like that you have partaken in your first Polaris benchmark. In the next tutorials, we will consider more advanced use cases of Polaris, such as creating and uploading your own datasets and benchmarks.
The End.