The Basics¶
In short
This tutorial walks you through the basic usage of Polaris. We will first log in to the hub and then see how easy it is to load a dataset or benchmark from it. Finally, we will train a simple baseline to submit a first set of results!
Polaris is designed to standardize the process of constructing datasets, specifying benchmarks and evaluating novel machine learning techniques within the realm of drug discovery.
While the Polaris library can be used independently from the Polaris Hub, the two were designed to seamlessly work together. The hub provides various pre-made, high quality datasets and benchmarks to develop and evaluate novel ML methods. In this tutorial, we will see how easy it is to load and use these datasets and benchmarks.
import polaris as po
from polaris.hub.client import PolarisHubClient
Login¶
To complete this step, you will need a Polaris Hub account. Go to https://polarishub.io/ to create one. You only have to log in once at the start, or again when you haven't used your account in a while.
client = PolarisHubClient()
client.login()
Instead of the Python API, you can also log in through the Polaris CLI. See:
polaris login --help
Load from the Hub¶
Both datasets and benchmarks are identified by an owner/name id. You can easily find and copy these ids through the Hub. Once you have the id, loading a dataset or benchmark takes a single line of code.
dataset = po.load_dataset("polaris/hello-world")
benchmark = po.load_benchmark("polaris/hello-world-benchmark")
Use the benchmark¶
The polaris library is designed to make it easy to participate in a benchmark. In just a few lines of code, we can get the train and test partitions, access the associated data in various ways, and evaluate our predictions. There are two main API endpoints:

- get_train_test_split(): Creates objects through which we can access the different dataset partitions.
- evaluate(): Evaluates a set of predictions in accordance with the benchmark protocol.
train, test = benchmark.get_train_test_split()
The created objects support various flavours of accessing the data:
- The objects are iterable;
- The objects can be indexed;
- The objects have properties to access all data at once.
# The objects are iterable
for x, y in train:
    pass

# The objects can be indexed
for i in range(len(train)):
    x, y = train[i]

# The objects have properties to access all data at once
x = train.inputs
y = train.targets
To avoid accidental access to the test targets, the test object does not expose the labels and will throw an error if you try to access them explicitly.
# The test object is iterable, but yields inputs only
for x in test:
    pass

# Indexing likewise returns inputs only
for i in range(len(test)):
    x = test[i]

x = test.inputs

# NOTE: The below will throw an error!
# y = test.targets
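The guard above can be pictured with a small, self-contained sketch. The ToyTestSplit class below is purely illustrative and is not Polaris's actual implementation; it only demonstrates the pattern of a split object that serves inputs while refusing to reveal its labels:

```python
# Illustrative only: a toy stand-in for how a test split can hide its labels.
# Polaris's real implementation differs; this just shows the guard pattern.

class HiddenTargetError(AttributeError):
    """Raised when code tries to read the held-out test labels."""


class ToyTestSplit:
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self._targets = targets  # kept private, never exposed

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, i):
        # Indexing returns the input only, never the label
        return self.inputs[i]

    @property
    def targets(self):
        raise HiddenTargetError("Test targets are hidden; submit predictions instead.")


split = ToyTestSplit(inputs=["CCO", "CCN"], targets=[0.3, 0.7])
print(split[0])  # inputs are freely accessible
try:
    split.targets  # any attempt to read the labels raises
except HiddenTargetError as err:
    print("blocked:", err)
```

Keeping the labels behind a raising property (rather than simply omitting them) means a mistake fails loudly instead of silently training on test data.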
Partake in the benchmark¶
To complete our example, let's participate in the benchmark. We will train a simple random forest model on ECFP fingerprints using scikit-learn and datamol.
import datamol as dm
from sklearn.ensemble import RandomForestRegressor
# Load the benchmark (automatically loads the underlying dataset as well)
benchmark = po.load_benchmark("polaris/hello-world-benchmark")
# Get the split and convert SMILES to ECFP fingerprints by passing a featurization function.
train, test = benchmark.get_train_test_split(featurization_fn=dm.to_fp)
# Define a model and train
model = RandomForestRegressor(max_depth=2, random_state=0)
model.fit(train.X, train.y)
RandomForestRegressor(max_depth=2, random_state=0)
To evaluate a model within Polaris, you should use the evaluate() endpoint. This only requires you to provide the predictions; the targets of the test set are extracted automatically, minimizing the chance that you ever touch the test labels.
predictions = model.predict(test.X)
results = benchmark.evaluate(predictions)
results
name | None
description |
tags |
user_attributes |
owner | None
polaris_version | 0.0.2.dev191+g82e7db2
benchmark_name | hello-world-benchmark
benchmark_owner |
github_url | None
paper_url | None
contributors | None
artifact_id | None
benchmark_artifact_id | polaris/hello-world-benchmark
results |
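Conceptually, evaluate() scores your predictions against targets the benchmark holds internally, so the labels never pass through your hands. Here is a minimal stdlib sketch of that contract; the ToyBenchmark class and the choice of mean absolute error are assumptions for illustration, not Polaris's actual code or metrics:

```python
# Illustrative only: a toy benchmark that scores predictions against
# targets it keeps internally, mimicking the evaluate() contract.

class ToyBenchmark:
    def __init__(self, test_targets):
        self._test_targets = test_targets  # held internally, never exposed

    def evaluate(self, predictions):
        if len(predictions) != len(self._test_targets):
            raise ValueError("Need one prediction per test sample.")
        # Mean absolute error as a stand-in for the benchmark's real metrics.
        mae = sum(
            abs(p - t) for p, t in zip(predictions, self._test_targets)
        ) / len(predictions)
        return {"mean_absolute_error": mae}


toy = ToyBenchmark(test_targets=[0.1, 0.5, 0.9])
print(toy.evaluate([0.2, 0.4, 1.0]))
```

Because the benchmark object owns the targets and only returns aggregate scores, the user-facing API surface never leaks per-sample test labels.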
Before uploading the results to the Hub, you can provide some additional information about the results that will be displayed on the Polaris Hub.
# For a complete list of meta-data, check out the BenchmarkResults object
results.name = "hello-world-result"
results.github_url = "https://github.com/polaris-hub/polaris-hub"
results.paper_url = "https://polarishub.io/"
results.description = "Hello, World!"
Finally, let's upload the results to the Hub! The result will be private, but by visiting the link printed in the logs you can decide to make it public through the Hub.
client.upload_results(results, owner="cwognum")
client.close()
That's it! Just like that you have partaken in your first Polaris benchmark. In the next tutorials, we will consider more advanced use cases of Polaris, such as creating and uploading your own datasets and benchmarks.
The End.