Bases: BaseArtifactModel

Class for saving benchmarking results

This object is returned by BenchmarkSpecification.evaluate. In addition to the metrics on the test set, it contains additional meta-data and logic to integrate the results with the Polaris Hub.

The actual results are saved in the results field using the following tabular format:

| Test set | Target label | Metric | Score |
|----------|--------------|--------|-------|
| test_iid | EGFR_WT | AUC | 0.9 |
| test_ood | EGFR_WT | AUC | 0.75 |
| ... | ... | ... | ... |
| test_ood | EGFR_L858R | AUC | 0.79 |
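
The tabular format above can be sketched with pandas. This is purely illustrative; the column names mirror the table in this documentation and are not guaranteed to match the library's internal schema:

```python
import pandas as pd

# Illustrative results table in the format described above:
# one row per (test set, target label, metric) combination.
results = pd.DataFrame(
    {
        "Test set": ["test_iid", "test_ood", "test_ood"],
        "Target label": ["EGFR_WT", "EGFR_WT", "EGFR_L858R"],
        "Metric": ["AUC", "AUC", "AUC"],
        "Score": [0.9, 0.75, 0.79],
    }
)

print(results)
```

Because the format is a plain table, it round-trips easily through a JSON-compatible dict (e.g. `results.to_dict(orient="records")`), which matches the serialized form mentioned for the `results` field below.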
Categorizing methods

An open question is how best to categorize a methodology (e.g. a model). This is needed because we would also like to aggregate results across benchmarks, to say something about which (types of) methods perform best in general.


| Name | Type | Description |
|------|------|-------------|
| results | ResultsType | The benchmark results, stored either directly as a dataframe or as a serialized, JSON-compatible dict that can be decoded into the associated tabular format. |
| benchmark_name | SlugCompatibleStringType | The name of the benchmark for which these results were generated. Together with the benchmark owner, this uniquely identifies the benchmark on the Hub. |
| benchmark_owner | Optional[HubOwner] | The owner of the benchmark for which these results were generated. Together with the benchmark name, this uniquely identifies the benchmark on the Hub. |
| github_url | Optional[HttpUrlString] | The URL of the GitHub repository with the code used to generate these results. |
| paper_url | Optional[HttpUrlString] | The URL of the paper describing the methodology used to generate these results. |
| contributors | Optional[list[HubUser]] | The users credited for these results. |
| _created_at | datetime | The timestamp at which the results were created. Set automatically. |

For additional meta-data attributes, see the BaseArtifactModel class.


upload_to_hub(env_file: Optional[Union[str, os.PathLike]] = None, settings: Optional[PolarisHubSettings] = None, cache_auth_token: bool = True, access: Optional[AccessType] = 'private', owner: Optional[Union[HubOwner, str]] = None, **kwargs: dict)

A lightweight convenience wrapper around the PolarisHubClient.upload_results method.


Bases: BaseModel

Metric metadata


| Name | Type | Description |
|------|------|-------------|
| fn | Callable | The callable that actually computes the metric. |
| is_multitask | bool | Whether the metric expects a single set of predictions or a dict of predictions. |
| kwargs | dict | Additional parameters required for the metric. |
| direction | DirectionType | The direction for ranking of the metric: "max" for maximization, "min" for minimization. |


absolute_average_fold_error(y_true: np.ndarray, y_pred: np.ndarray) -> float

Calculate the Absolute Average Fold Error (AAFE) metric. It measures the fold change between predicted values and observed values. The implementation is based on this paper.


| Name | Type | Description | Default |
|------|------|-------------|---------|
| y_true | ndarray | The true target values of shape (n_samples,). | |
| y_pred | ndarray | The predicted target values of shape (n_samples,). | |



| Name | Type | Description |
|------|------|-------------|
| aafe | float | The Absolute Average Fold Error. |
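
The metric can be sketched in NumPy. This assumes the commonly used definition AAFE = 10^(mean(|log10(y_pred / y_true)|)); the referenced paper should be consulted for the library's exact formulation:

```python
import numpy as np


def absolute_average_fold_error(y_true, y_pred) -> float:
    """Illustrative AAFE: 10 ** mean(|log10(y_pred / y_true)|).

    A sketch of the common definition, not the library's exact
    implementation. Assumes strictly positive values, since fold
    changes are undefined otherwise.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true <= 0) or np.any(y_pred <= 0):
        raise ValueError("AAFE requires strictly positive values.")
    return float(10 ** np.mean(np.abs(np.log10(y_pred / y_true))))


# A perfect prediction gives an AAFE of 1.0;
# a uniform 10-fold over-prediction gives 10.0.
print(absolute_average_fold_error([1.0, 10.0], [1.0, 10.0]))    # 1.0
print(absolute_average_fold_error([1.0, 10.0], [10.0, 100.0]))  # 10.0
```

Note that averaging in log space makes the metric symmetric: a 2-fold over-prediction and a 2-fold under-prediction contribute equally.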


Bases: Enum

A metric within the Polaris ecosystem is uniquely identified by its name and is associated with additional metadata in a MetricInfo instance.

Implemented as an enum.


score(y_true: np.ndarray, y_pred: Optional[np.ndarray] = None, y_prob: Optional[np.ndarray] = None) -> float

Endpoint for computing the metric.

For convenience, calling a Metric will result in this method being called.

```python
metric = Metric.mean_absolute_error
assert metric.score(y_true=first, y_pred=second) == metric(y_true=first, y_pred=second)
```
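
The enum-with-metadata pattern described above can be sketched as follows. All names here are a hypothetical reconstruction: only one metric is shown, and the real class carries more machinery (e.g. handling y_prob and multitask predictions):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional

import numpy as np


@dataclass
class MetricInfo:
    """Metadata attached to each metric, per the table above (sketch)."""
    fn: Callable                   # callable that actually computes the metric
    is_multitask: bool = False     # single set of predictions vs a dict
    kwargs: dict = field(default_factory=dict)  # extra parameters for fn
    direction: str = "min"         # "max" or "min", for ranking


def _mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


class Metric(Enum):
    """Sketch: each member is uniquely identified by its name and
    carries a MetricInfo instance as its value."""

    mean_absolute_error = MetricInfo(fn=_mean_absolute_error, direction="min")

    def score(self, y_true, y_pred=None, y_prob=None) -> float:
        # Dispatch to the underlying callable with any stored kwargs.
        return self.value.fn(y_true, y_pred, **self.value.kwargs)

    # Calling a member is equivalent to calling .score()
    __call__ = score


metric = Metric.mean_absolute_error
assert metric(y_true=[0.0, 1.0], y_pred=[1.0, 1.0]) == metric.score(
    y_true=[0.0, 1.0], y_pred=[1.0, 1.0]
)
```

Using an enum gives a fixed, discoverable registry of metrics (`Metric["mean_absolute_error"]` works for lookup by name), while the attached MetricInfo keeps the callable and its ranking direction in one place.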