Base class

polaris.benchmark.BenchmarkSpecification

Bases: BaseArtifactModel, ChecksumMixin

This class wraps a Dataset with additional data to specify the evaluation logic.

Specifically, it specifies:

  1. Which dataset to use (see Dataset);
  2. Which columns are used as input and which columns are used as target;
  3. Which metrics should be used to evaluate performance on this task;
  4. A predefined, static train-test split to use during evaluation.

Subclasses

Polaris includes various subclasses of BenchmarkSpecification that provide a more precise data model or additional logic, e.g. SingleTaskBenchmarkSpecification.

Examples:

Basic API usage:

import polaris as po

# Load the benchmark from the Hub
benchmark = po.load_benchmark("polaris/hello-world-benchmark")

# Get the train and test data-loaders
train, test = benchmark.get_train_test_split()

# Use the training data to train your model.
# Get the inputs and targets as arrays via 'train.inputs' and 'train.targets',
# or simply iterate over the train object.
for x, y in train:
    ...

# Work your magic to accurately predict the test set
predictions = [0.0 for x in test]

# Evaluate your predictions
results = benchmark.evaluate(predictions)

# Submit your results
results.upload_to_hub(owner="dummy-user")

Attributes:

  • dataset (Union[DatasetV1, CompetitionDataset, str, dict[str, Any]]): The dataset the benchmark specification is based on.
  • target_cols (ColumnsType): The column(s) of the original dataset that should be used as target.
  • input_cols (ColumnsType): The column(s) of the original dataset that should be used as input.
  • split (SplitType): The predefined train-test split to use for evaluation.
  • metrics (Union[str, Metric, list[str | Metric]]): The metrics to use for evaluating performance.
  • main_metric (str | Metric | None): The main metric used to rank methods. If None, the first entry of the metrics field is used.
  • readme (str): Markdown text that can be used to provide a formatted description of the benchmark. If using the Polaris Hub, note that this field is more easily edited through the Hub UI, which provides a rich text editor for writing markdown.
  • target_types (dict[str, Union[TargetType, str, None]]): A dictionary that maps target columns to their type. If not specified, this is automatically inferred.

For additional meta-data attributes, see the BaseArtifactModel class.
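
As a rough illustration of how these attributes fit together, the sketch below constructs a benchmark programmatically. The dataset slug, column names, metric identifier, and split indices are placeholders, and the exact accepted formats for split and metrics may differ from what is shown here:

import polaris as po
from polaris.benchmark import SingleTaskBenchmarkSpecification

# Load (or build) the dataset the benchmark should be based on
dataset = po.load_dataset("my-org/my-dataset")  # hypothetical dataset slug

benchmark = SingleTaskBenchmarkSpecification(
    dataset=dataset,
    input_cols="smiles",                             # assumed input column name
    target_cols="expt",                              # assumed target column name
    split=(list(range(80)), list(range(80, 100))),   # assumed (train, test) index format
    metrics=["mean_absolute_error"],                 # assumed metric identifier
    main_metric="mean_absolute_error",
)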

n_train_datapoints property

n_train_datapoints: int

The size of the train set.

n_test_sets property

n_test_sets: int

The number of test sets.

n_test_datapoints property

n_test_datapoints: dict[str, int]

The size of (each of) the test set(s).

n_classes property

n_classes: dict[str, int]

The number of classes for each of the target columns.

task_type property

task_type: str

The high-level task type of the benchmark.
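
A quick way to inspect these properties before training; the printed values are benchmark-specific:

import polaris as po

benchmark = po.load_benchmark("polaris/hello-world-benchmark")

print(benchmark.task_type)           # high-level task type of the benchmark
print(benchmark.n_train_datapoints)  # size of the train set
print(benchmark.n_test_sets)         # number of test sets
print(benchmark.n_test_datapoints)   # dict mapping test-set name to its size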

get_train_test_split

get_train_test_split(featurization_fn: Optional[Callable] = None) -> tuple[Subset, Union[Subset, dict[str, Subset]]]

Construct the train and test sets, given the split in the benchmark specification.

Returns Subset objects, which offer several ways of accessing the data and can thus easily serve as a basis for framework-specific (e.g. PyTorch, TensorFlow) data loaders.

Parameters:

  • featurization_fn (Optional[Callable], default: None): A function to apply to the input data. For multi-input benchmarks, this function expects an input in the format specified by the input_format parameter.

Returns:

  • tuple[Subset, Union[Subset, dict[str, Subset]]]: A tuple with the train Subset and the test Subset object(s). If there are multiple test sets, these are returned in a dictionary keyed by test-set name. The targets of the test set(s) cannot be accessed.
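
A hedged sketch of passing a featurization_fn; the featurizer below is a toy stand-in for a real molecular featurizer (e.g. a fingerprint function):

import numpy as np
import polaris as po

def featurize(smiles: str) -> np.ndarray:
    # Toy featurizer for illustration only: a single feature, the SMILES string length
    return np.array([len(smiles)], dtype=float)

benchmark = po.load_benchmark("polaris/hello-world-benchmark")
train, test = benchmark.get_train_test_split(featurization_fn=featurize)

# The Subset objects expose the (featurized) inputs and targets
X_train, y_train = train.inputs, train.targets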

evaluate

evaluate(y_pred: Optional[PredictionsType] = None, y_prob: Optional[PredictionsType] = None) -> BenchmarkResults

Execute the evaluation protocol for the benchmark, given a set of predictions.

What about y_true?

Contrary to other frameworks that you might be familiar with, we opted for a signature that includes just the predictions. This reduces the chance of accidentally using the test targets during training.

Expected structure for y_pred and y_prob arguments

The supplied y_pred and y_prob arguments must adhere to a certain structure, depending on the number of tasks and test sets included in the benchmark. Refer to the following for guidance on the correct structure when creating your y_pred and y_prob objects (a minimal sketch follows the list):

  • Single task, single set: [values...]
  • Multi-task, single set: {task_name_1: [values...], task_name_2: [values...]}
  • Single task, multi-set: {test_set_1: {task_name: [values...]}, test_set_2: {task_name: [values...]}}
  • Multi-task, multi-set: {test_set_1: {task_name_1: [values...], task_name_2: [values...]}, test_set_2: {task_name_1: [values...], task_name_2: [values...]}}
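
A minimal sketch of assembling y_pred for a hypothetical multi-task, multi-test-set benchmark. The test-set and task names below are invented; use benchmark.target_cols and the keys of the test dictionary returned by get_train_test_split() to find the real labels:

import numpy as np

y_pred = {
    "test": {
        "task_a": np.array([0.1, 0.4, 0.3]),
        "task_b": np.array([1.2, 0.8, 0.5]),
    },
    "test_ood": {
        "task_a": np.array([0.2, 0.6, 0.7]),
        "task_b": np.array([0.9, 1.1, 0.4]),
    },
}

results = benchmark.evaluate(y_pred=y_pred)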

For this method, we make the following assumptions:

  1. There can be one or multiple test set(s);
  2. There can be one or multiple target(s);
  3. The metrics are constant across test sets;
  4. The metrics are constant across targets;
  5. There can be metrics which measure across tasks.

Parameters:

  • y_pred (Optional[PredictionsType], default: None): The predictions for the test set, as NumPy arrays. If there are multiple targets, the predictions should be wrapped in a dictionary with the target labels as keys. If there are multiple test sets, the predictions should be further wrapped in a dictionary with the test subset labels as keys.
  • y_prob (Optional[PredictionsType], default: None): The predicted probabilities for the test set, as NumPy arrays.

Returns:

  • BenchmarkResults: A BenchmarkResults object, which can be directly submitted to the Polaris Hub.

Examples:

  1. For regression benchmarks:

     pred_scores = your_model.predict_score(molecules)  # predict continuous score values
     benchmark.evaluate(y_pred=pred_scores)

  2. For classification benchmarks:
    • If roc_auc and pr_auc are in the metric list, both class probabilities and label predictions are required:

      pred_probs = your_model.predict_proba(molecules)    # predict probabilities
      pred_labels = your_model.predict_labels(molecules)  # predict class labels
      benchmark.evaluate(y_pred=pred_labels, y_prob=pred_probs)

    • Otherwise:

      benchmark.evaluate(y_pred=pred_labels)

upload_to_hub

upload_to_hub(settings: Optional[PolarisHubSettings] = None, cache_auth_token: bool = True, access: Optional[AccessType] = 'private', owner: Union[HubOwner, str, None] = None, **kwargs: dict)

A lightweight, convenient wrapper around the PolarisHubClient.upload_benchmark method.
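
A minimal usage sketch; the owner value below is a placeholder and the access level shown simply mirrors the default:

benchmark.upload_to_hub(owner="my-username", access="private")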

to_json

to_json(destination: str) -> str

Save the benchmark to a destination directory as a JSON file.

Multiple files

Perhaps unintuitively, this method creates multiple files in the destination directory, as it also saves the dataset it is based on to the specified destination. See the docstring of Dataset.to_json for more information.

Parameters:

  • destination (str, required): The directory to save the associated data to.

Returns:

  • str: The path to the JSON file.
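
A short sketch of saving a benchmark locally; the destination directory is a placeholder:

import polaris as po

benchmark = po.load_benchmark("polaris/hello-world-benchmark")
path = benchmark.to_json("./my-benchmark")  # hypothetical destination directory
print(path)  # path to the generated JSON file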


Subclasses

polaris.benchmark.SingleTaskBenchmarkSpecification

Bases: BenchmarkSpecification

Subclass for any single-task benchmark specification.

In addition to the data model and logic of the base class, this class verifies that there is just a single target column.

task_type property

task_type: str

The high-level task type of the benchmark.


polaris.benchmark.MultiTaskBenchmarkSpecification

Bases: BenchmarkSpecification

Subclass for any multi-task benchmark specification.

In addition to the data model and logic of the base class, this class verifies that there are multiple target columns.

task_type property

task_type: str

The high-level task type of the benchmark.