
Evaluation

polaris.evaluate.BenchmarkResults

Bases: BaseArtifactModel

Class for saving benchmarking results

This object is returned by BenchmarkSpecification.evaluate. In addition to the metrics on the test set, it contains additional meta-data and logic to integrate the results with the Polaris Hub.

The actual results are saved in the results field using the following tabular format:

| Test set | Target label | Metric | Score |
|----------|--------------|--------|-------|
| test_iid | EGFR_WT | AUC | 0.9 |
| test_ood | EGFR_WT | AUC | 0.75 |
| ... | ... | ... | ... |
| test_ood | EGFR_L858R | AUC | 0.79 |
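To give a rough sense of where this object appears in a typical workflow, the sketch below loads a benchmark, evaluates placeholder predictions, and inspects the resulting table. The benchmark slug is a placeholder and the exact accessor names may vary between Polaris versions.

```python
import polaris as po

# Load a benchmark from the Polaris Hub (the slug below is a placeholder).
benchmark = po.load_benchmark("my-org/my-benchmark")
train, test = benchmark.get_train_test_split()

# Placeholder predictions: a real workflow would train a model on `train`
# and predict on the test inputs instead of using constants.
y_pred = [0.0 for _ in range(len(test))]

# evaluate() returns a BenchmarkResults instance; its `results` field
# holds the tabular scores in the format shown above.
results = benchmark.evaluate(y_pred)
print(results.results)
```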
Categorizing methods

An open question is how to best categorize a methodology (e.g. a model). This is needed because we would also like to aggregate results across benchmarks, to say something about which (types of) methods perform best in general.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| results | ResultsType | Benchmark results are stored directly in a dataframe or in a serialized, JSON compatible dict that can be decoded into the associated tabular format. |
| benchmark_name | SlugCompatibleStringType | The name of the benchmark for which these results were generated. Together with the benchmark owner, this uniquely identifies the benchmark on the Hub. |
| benchmark_owner | Optional[HubOwner] | The owner of the benchmark for which these results were generated. Together with the benchmark name, this uniquely identifies the benchmark on the Hub. |
| github_url | Optional[HttpUrlString] | The URL to the GitHub repository of the code used to generate these results. |
| paper_url | Optional[HttpUrlString] | The URL to the paper describing the methodology used to generate these results. |
| contributors | Optional[list[HubUser]] | The users that are credited for these results. |
| _created_at | datetime | The timestamp at which the results were created. Automatically set. |

For additional meta-data attributes, see the BaseArtifactModel class.

upload_to_hub

upload_to_hub(env_file: Optional[Union[str, os.PathLike]] = None, settings: Optional[PolarisHubSettings] = None, cache_auth_token: bool = True, access: Optional[AccessType] = 'private', owner: Optional[Union[HubOwner, str]] = None, **kwargs: dict)

A thin convenience wrapper around the PolarisHubClient.upload_results method.
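A hedged usage sketch, assuming `results` is the BenchmarkResults object returned by BenchmarkSpecification.evaluate above; the owner slug and metadata URLs are placeholders.

```python
# `results` is the BenchmarkResults object returned by evaluate() above.
results.github_url = "https://github.com/my-org/my-model"  # placeholder URL
results.paper_url = "https://example.com/my-paper"         # placeholder URL

# Upload the results to the Polaris Hub; both arguments are optional.
results.upload_to_hub(owner="my-org", access="private")
```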


polaris.evaluate.MetricInfo

Bases: BaseModel

Metric metadata

Attributes:

| Name | Type | Description |
|------|------|-------------|
| fn | Callable | The callable that actually computes the metric. |
| is_multitask | bool | Whether the metric expects a single set of predictions or a dict of predictions. |
| kwargs | dict | Additional parameters required for the metric. |
| direction | DirectionType | The direction for ranking of the metric: "max" for maximization and "min" for minimization. |
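As a small sketch of how this metadata might be inspected, assuming that each Metric enum member's value is its MetricInfo instance (see Metric below):

```python
from polaris.evaluate import Metric

# Assumption: the value of each Metric member is a MetricInfo instance.
info = Metric.mean_absolute_error.value

print(info.is_multitask)  # whether a dict of per-target predictions is expected
print(info.direction)     # "min" for error metrics, "max" for e.g. AUC
print(info.kwargs)        # extra keyword arguments passed to `fn`
```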

polaris.evaluate._metric.absolute_average_fold_error

absolute_average_fold_error(y_true: np.ndarray, y_pred: np.ndarray) -> float

Calculate the Absolute Average Fold Error (AAFE) metric. It measures the fold change between predicted values and observed values. The implementation is based on this paper.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| y_true | ndarray | The true target values of shape (n_samples,). | required |
| y_pred | ndarray | The predicted target values of shape (n_samples,). | required |

Returns:

| Name | Type | Description |
|------|------|-------------|
| aafe | float | The Absolute Average Fold Error. |
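A minimal usage sketch with arbitrary toy values; a real use case would pass experimental and predicted quantities on the same positive scale.

```python
import numpy as np
from polaris.evaluate._metric import absolute_average_fold_error

# Toy observed and predicted values (arbitrary numbers for illustration).
y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.5, 1.0, 8.0])

aafe = absolute_average_fold_error(y_true=y_true, y_pred=y_pred)
print(aafe)
```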


polaris.evaluate.Metric

Bases: Enum

A metric within the Polaris ecosystem is uniquely identified by its name and is associated with additional metadata in a MetricInfo instance.

Implemented as an enum.
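Because it is a standard Python Enum, the available metrics can be listed by iterating over its members; the exact set of names depends on the installed Polaris version.

```python
from polaris.evaluate import Metric

# List the names of all metrics registered in this Polaris version.
for metric in Metric:
    print(metric.name)
```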

score

score(y_true: np.ndarray, y_pred: Optional[np.ndarray] = None, y_prob: Optional[np.ndarray] = None) -> float

Endpoint for computing the metric.

For convenience, calling a Metric will result in this method being called.

metric = Metric.mean_absolute_error
assert metric.score(y_true=first, y_pred=second) == metric(y_true=first, y_pred=second)
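A toy example with hand-checked numbers (the mean absolute error of these arrays is 0.5); the metric name is taken from the snippet above.

```python
import numpy as np
from polaris.evaluate import Metric

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

metric = Metric.mean_absolute_error
print(metric.score(y_true=y_true, y_pred=y_pred))  # 0.5
print(metric(y_true=y_true, y_pred=y_pred))        # same value via __call__
```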