Base class

polaris.benchmark.BenchmarkSpecification

Bases: BaseArtifactModel, ChecksumMixin

This class wraps a Dataset with additional data to specify the evaluation logic. Specifically, it specifies:
- Which dataset to use (see Dataset);
- Which columns are used as input and which columns are used as target;
- Which metrics should be used to evaluate performance on this task;
- A predefined, static train-test split to use during evaluation.
Subclasses

Polaris includes various subclasses of the BenchmarkSpecification that provide a more precise data model or additional logic, e.g. SingleTaskBenchmarkSpecification.
Examples:

Basic API usage:

```python
import polaris as po

# Load the benchmark from the Hub
benchmark = po.load_benchmark("polaris/hello-world-benchmark")

# Get the train and test data-loaders
train, test = benchmark.get_train_test_split()

# Use the training data to train your model.
# Get the inputs as an array with `train.inputs` and the targets with `train.targets`,
# or simply iterate over the train object.
for x, y in train:
    ...

# Work your magic to accurately predict the test set
predictions = [0.0 for x in test]

# Evaluate your predictions
results = benchmark.evaluate(predictions)

# Submit your results
results.upload_to_hub(owner="dummy-user")
```
Attributes:

Name | Type | Description |
---|---|---|
dataset | Union[DatasetV1, CompetitionDataset, str, dict[str, Any]] | The dataset the benchmark specification is based on. |
target_cols | ColumnsType | The column(s) of the original dataset that should be used as target. |
input_cols | ColumnsType | The column(s) of the original dataset that should be used as input. |
split | SplitType | The predefined train-test split to use for evaluation. |
metrics | set[Metric] | The metrics to use for evaluating performance. |
main_metric | Metric \| str | The main metric used to rank methods. |
readme | str | Markdown text that can be used to provide a formatted description of the benchmark. If using the Polaris Hub, note that this field is more easily edited through the Hub UI, which provides a rich text editor for writing markdown. |
target_types | dict[str, Union[TargetType, str, None]] | A dictionary that maps target columns to their type. If not specified, this is automatically inferred. |
For additional metadata attributes, see the BaseArtifactModel class.
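Continuing from the basic example above, the documented attributes can be inspected directly on a loaded benchmark; a minimal sketch:

```python
# Inspect the benchmark's data model (the attributes documented above).
print(benchmark.target_cols)  # target column(s)
print(benchmark.input_cols)   # input column(s)
print(benchmark.metrics)      # metrics used for evaluation
print(benchmark.main_metric)  # the metric used to rank methods
```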
get_train_test_split

get_train_test_split(featurization_fn: Optional[Callable] = None) -> tuple[Subset, Union[Subset, dict[str, Subset]]]

Construct the train and test sets, given the split in the benchmark specification.

Returns Subset objects, which offer several ways of accessing the data and can thus easily serve as a basis for building framework-specific (e.g. PyTorch, TensorFlow) data loaders.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
featurization_fn | Optional[Callable] | A function to apply to the input data. For a multi-input benchmark, this function expects an input in the benchmark's specified input format. | None |
Returns:

Type | Description |
---|---|
tuple[Subset, Union[Subset, dict[str, Subset]]] | A tuple with the train Subset and the test Subset (or, if there are multiple test sets, a dictionary of test Subsets keyed by test-set label). |
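A minimal sketch of both ideas together, assuming the benchmark inputs are single SMILES strings and that Subset implements the sequence protocol (`__len__`/`__getitem__`); the featurizer below is a hypothetical placeholder, not part of Polaris:

```python
import numpy as np
from torch.utils.data import DataLoader


def featurize(smiles: str) -> np.ndarray:
    # Hypothetical placeholder featurizer: swap in a real fingerprint or
    # descriptor computation for your model.
    return np.zeros(16, dtype=np.float32)


# The featurization function is applied to the input data on access.
train, test = benchmark.get_train_test_split(featurization_fn=featurize)

# Since Subset offers indexed access, it can back a map-style PyTorch DataLoader.
train_loader = DataLoader(train, batch_size=64, shuffle=True)
```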
evaluate

evaluate(y_pred: IncomingPredictionsType | None = None, y_prob: IncomingPredictionsType | None = None) -> BenchmarkResults

Execute the evaluation protocol for the benchmark, given a set of predictions.

What about y_true?

Contrary to other frameworks that you might be familiar with, we opted for a signature that includes just the predictions. This reduces the chance of accidentally using the test targets during training.

For this method, we make the following assumptions:

- There can be one or multiple test set(s);
- There can be one or multiple target(s);
- The metrics are constant across test sets;
- The metrics are constant across targets;
- There can be metrics which measure across tasks.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
y_pred | IncomingPredictionsType \| None | The predictions for the test set, as NumPy arrays. If there are multiple targets, the predictions should be wrapped in a dictionary with the target labels as keys. If there are multiple test sets, the predictions should be further wrapped in a dictionary with the test subset labels as keys. | None |
y_prob | IncomingPredictionsType \| None | The predicted probabilities for the test set, formatted similarly to the predictions, based on the number of tasks and test sets. | None |
Returns:

Type | Description |
---|---|
BenchmarkResults | A BenchmarkResults object with the evaluation results. |
Examples:

For regression benchmarks:

```python
pred_scores = your_model.predict_score(molecules)  # predict continuous score values
benchmark.evaluate(y_pred=pred_scores)
```

For classification benchmarks:

- If roc_auc and pr_auc are in the metric list, both class probabilities and label predictions are required:

```python
pred_probs = your_model.predict_proba(molecules)  # predict probabilities
pred_labels = your_model.predict_labels(molecules)  # predict class labels
benchmark.evaluate(y_pred=pred_labels, y_prob=pred_probs)
```

- Otherwise:

```python
benchmark.evaluate(y_pred=pred_labels)
```
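For multi-target or multi-test-set benchmarks, the y_pred description above implies a nested dictionary. A minimal sketch with hypothetical target and test-set labels (the real keys come from the benchmark's target columns and the test-set labels of its split):

```python
import numpy as np

# Hypothetical labels for illustration; use the benchmark's actual target
# columns and test-set labels.
y_pred = {
    "test": {                       # one entry per test set
        "target_a": np.zeros(100),  # one prediction array per target
        "target_b": np.zeros(100),
    },
    "test_ood": {
        "target_a": np.zeros(50),
        "target_b": np.zeros(50),
    },
}
results = benchmark.evaluate(y_pred=y_pred)
```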
upload_to_hub

upload_to_hub(settings: Optional[PolarisHubSettings] = None, cache_auth_token: bool = True, access: Optional[AccessType] = 'private', owner: Union[HubOwner, str, None] = None, **kwargs: dict)

A very light, convenient wrapper around the PolarisHubClient.upload_benchmark method.
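A minimal usage sketch; "your-username" is a placeholder owner and authentication with the Polaris Hub is assumed:

```python
# Upload the benchmark to the Polaris Hub under a placeholder owner.
# The access level defaults to 'private'; it is shown explicitly here.
benchmark.upload_to_hub(owner="your-username", access="private")
```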
to_json

to_json(destination: str) -> str

Save the benchmark to a destination directory as a JSON file.

Multiple files

Perhaps unintuitively, this method creates multiple files in the destination directory, as it also saves the dataset it is based on to the specified destination. See the docstring of Dataset.to_json for more information.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
destination | str | The directory to save the associated data to. | required |

Returns:

Type | Description |
---|---|
str | The path to the JSON file. |
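A minimal save-and-reload sketch; that load_benchmark also accepts a local path is an assumption here, not confirmed above:

```python
# Save the benchmark (and the dataset it is based on) to a local directory.
path = benchmark.to_json("./my-benchmark")

# `path` points to the benchmark's JSON file; the dataset files are written
# alongside it. Assuming load_benchmark also accepts local paths, the
# benchmark can be reloaded from disk:
benchmark = po.load_benchmark(path)
```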
Subclasses

polaris.benchmark.SingleTaskBenchmarkSpecification

Bases: BenchmarkSpecification

Subclass for any single-task benchmark specification.

In addition to the data model and logic of the base class, this class verifies that there is just a single target column.

polaris.benchmark.MultiTaskBenchmarkSpecification

Bases: BenchmarkSpecification

Subclass for any multi-task benchmark specification.

In addition to the data model and logic of the base class, this class verifies that there are multiple target columns.
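Since both subclasses share the base-class API, code can branch on the task type when needed; a minimal sketch:

```python
from polaris.benchmark import (
    MultiTaskBenchmarkSpecification,
    SingleTaskBenchmarkSpecification,
)

# Branch on the benchmark's task type; both subclasses expose the same base API.
if isinstance(benchmark, SingleTaskBenchmarkSpecification):
    print("Single target column:", benchmark.target_cols)
elif isinstance(benchmark, MultiTaskBenchmarkSpecification):
    print("Multiple target columns:", benchmark.target_cols)
```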