Evaluation

polaris.evaluate.BenchmarkPredictions

Bases: BaseModel

Base model to represent predictions in the Polaris code base.

Guided by Postel's Law, this class normalizes different formats to a single, internal representation.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| predictions | PredictionsType | The predictions for the benchmark. |
| target_labels | list[str] | The target columns for the associated benchmark. |
| test_set_labels | list[str] | The names of the test sets for the associated benchmark. |

check_test_set_size

check_test_set_size() -> Self

Verify that the size of all predictions matches the expected test set sizes

get_subset

get_subset(test_set_subset: list[str] | None = None, target_subset: list[str] | None = None) -> BenchmarkPredictions

Return a subset of the original predictions

get_size

get_size(test_set_subset: list[str] | None = None, target_subset: list[str] | None = None) -> int

Return the total number of predictions, allowing for filtering by test set and target

flatten

flatten() -> np.ndarray

Return the predictions as a single, flat numpy array

__len__

__len__() -> int

Return the total number of predictions
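
A minimal sketch of working with this class directly, assuming it can be constructed from the documented fields; in typical workflows it is created for you during evaluation rather than by hand. The test_set_sizes argument shown below is an assumption used to satisfy the size check and may not be needed on every version.

```python
import numpy as np
from polaris.evaluate import BenchmarkPredictions

# Sketch only: a plain array for a single-target, single-test-set benchmark is
# assumed to normalize to the same internal representation as the fully nested
# {test_set: {target: array}} form.
preds = BenchmarkPredictions(
    predictions=np.array([0.1, 0.9, 0.4]),
    target_labels=["EGFR_WT"],
    test_set_labels=["test"],
    test_set_sizes={"test": 3},  # assumption: expected size per test set
)

print(len(preds))       # 3, the total number of predictions
flat = preds.flatten()  # a single, flat numpy array
subset = preds.get_subset(target_subset=["EGFR_WT"])
```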


polaris.evaluate.ResultsMetadata

Bases: BaseArtifactModel

Base class for evaluation results

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| github_url | HttpUrlString \| None | The URL to the GitHub repository of the code used to generate these results. |
| paper_url | HttpUrlString \| None | The URL to the paper describing the methodology used to generate these results. |
| contributors | list[HubUser] | The users that are credited for these results. |
| _created_at | datetime | The time-stamp at which the results were created. Automatically set. |

For additional meta-data attributes, see the BaseArtifactModel class.


polaris.evaluate.EvaluationResult

Bases: ResultsMetadata

Class for saving evaluation results

The actual results are saved in the results field using the following tabular format:

| Test set | Target label | Metric | Score |
| --- | --- | --- | --- |
| test_iid | EGFR_WT | AUC | 0.9 |
| test_ood | EGFR_WT | AUC | 0.75 |
| ... | ... | ... | ... |
| test_ood | EGFR_L858R | AUC | 0.79 |
Categorizing methods

An open question is how best to categorize a methodology (e.g. a model). This is needed because we would also like to aggregate results across benchmarks and competitions, to say something about which (types of) methods perform best in general.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| results | DataFrame | Evaluation results are stored directly in a dataframe or in a serialized, JSON-compatible dict that can be decoded into the associated tabular format. |

For additional meta-data attributes, see the ResultsMetadata class.
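
As a hedged illustration of how the tabular results can be consumed: the results field behaves like a regular pandas DataFrame, so it can be filtered per test set, target label, or metric. The column names below are taken from the example table above and may differ slightly in your installed version.

```python
import pandas as pd

# `result` is assumed to be an existing EvaluationResult instance.
df: pd.DataFrame = result.results

# Keep only the out-of-distribution AUC rows from the tabular results.
ood_auc = df[(df["Test set"] == "test_ood") & (df["Metric"] == "AUC")]
print(ood_auc[["Target label", "Score"]])
```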


polaris.evaluate.BenchmarkResults

Bases: EvaluationResult

Class specific to results for standard benchmarks.

This object is returned by BenchmarkSpecification.evaluate. In addition to the metrics on the test set, it contains additional meta-data and logic to integrate the results with the Polaris Hub.

Attributes:

benchmark_name: The name of the benchmark for which these results were generated. Together with the benchmark owner, this uniquely identifies the benchmark on the Hub.

benchmark_owner: The owner of the benchmark for which these results were generated. Together with the benchmark name, this uniquely identifies the benchmark on the Hub.

upload_to_hub

upload_to_hub(access: AccessType = 'private', owner: HubOwner | str | None = None, **kwargs: dict) -> BenchmarkResults

A very light, convenient wrapper around the PolarisHubClient.upload_results method.
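
For context, a typical end-to-end flow (modelled on the Polaris tutorials) looks roughly like the sketch below. The benchmark slug, user names, and the fitted model are illustrative placeholders, and the optional metadata fields come from ResultsMetadata.

```python
import polaris as po

# Load a benchmark from the Polaris Hub (the slug here is illustrative).
benchmark = po.load_benchmark("polaris/hello-world-benchmark")
train, test = benchmark.get_train_test_split()

# `model` is a hypothetical, already-fitted estimator.
y_pred = model.predict(test.X)

# evaluate() returns a BenchmarkResults object that can be pushed to the Hub.
results = benchmark.evaluate(y_pred)
results.name = "my-first-result"
results.github_url = "https://github.com/my-org/my-model"  # optional metadata

results.upload_to_hub(owner="my-username", access="private")
```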


polaris.evaluate.MetricInfo

Bases: BaseModel

Metric metadata

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| fn | Callable | The callable that actually computes the metric. |
| is_multitask | bool | Whether the metric expects a single set of predictions or a dict of predictions. |
| kwargs | dict | Additional parameters required for the metric. |
| direction | DirectionType | The direction for ranking of the metric, "max" for maximization and "min" for minimization. |
| y_type | PredictionKwargs | The type of predictions expected by the metric interface. |


polaris.evaluate.Metric

Bases: BaseModel

A Metric in Polaris.

A metric consists of a default metric, which is a callable labeled with additional metadata, as well as a config. The config can change how the metric is computed, for example by grouping the data before computing the metric.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| label | MetricLabel | The label of the default metric that is at the core of the metric implementation. |
| custom_name | str \| None | An optional, custom name for the metric. Names should be unique within the context of a benchmark. |
| config | GroupedMetricConfig \| None | For more complex metrics, this object should hold all parameters for the metric. |
| fn | Callable | The callable that actually computes the metric, automatically set based on the label. |
| is_multitask | bool | Whether the metric expects a single set of predictions or a dict of predictions, automatically set based on the label. |
| kwargs | dict | Additional parameters required for the metric, automatically set based on the label. |
| direction | DirectionType | The direction for ranking of the metric, "max" for maximization and "min" for minimization, automatically set based on the label. |
| y_type | PredictionKwargs | The type of predictions expected by the metric interface, automatically set based on the label. |

score

score(y_true: GroundTruth, y_pred: BenchmarkPredictions | None = None, y_prob: BenchmarkPredictions | None = None) -> float

Compute the metric.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| y_true | GroundTruth | The true target values. | required |
| y_pred | BenchmarkPredictions \| None | The predicted target values, if any. | None |
| y_prob | BenchmarkPredictions \| None | The predicted target probabilities, if any. | None |

polaris.evaluate.metrics.generic_metrics

pearsonr

pearsonr(y_true: np.ndarray, y_pred: np.ndarray)

Calculate the Pearson r correlation

spearman

spearman(y_true: np.ndarray, y_pred: np.ndarray)

Calculate a Spearman correlation

absolute_average_fold_error

absolute_average_fold_error(y_true: np.ndarray, y_pred: np.ndarray) -> float

Calculate the Absolute Average Fold Error (AAFE) metric. It measures the fold change between predicted values and observed values. The implementation is based on this paper.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| y_true | ndarray | The true target values of shape (n_samples,). | required |
| y_pred | ndarray | The predicted target values of shape (n_samples,). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| aafe | float | The Absolute Average Fold Error. |
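
For reference, the commonly used definition of AAFE is 10 raised to the mean absolute log10 fold error, so a value of 1 indicates perfect agreement and 2 indicates predictions that are off by two-fold on average. The snippet below is a sketch of that textbook formula, not necessarily a verbatim copy of the implementation here.

```python
import numpy as np

def aafe_sketch(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Textbook definition: 10 ** mean(|log10(y_pred / y_true)|)
    return float(10 ** np.mean(np.abs(np.log10(y_pred / y_true))))

# Predictions off by two-fold in either direction yield an AAFE of ~2.0.
print(aafe_sketch(np.array([1.0, 10.0]), np.array([2.0, 5.0])))
```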

cohen_kappa_score

cohen_kappa_score(y_true, y_pred, **kwargs)

Scikit-learn cohen_kappa_score wrapper with renamed arguments

average_precision_score

average_precision_score(y_true, y_score, **kwargs)

Scikit-learn average_precision_score wrapper that raises an error if y_true has no positive class
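
As a rough sketch of why the renaming is needed: scikit-learn's cohen_kappa_score names its two label arrays y1 and y2, so the wrapper lets the metric share the y_true/y_pred convention used by the other Polaris metrics. The following is an illustrative re-implementation, not the exact Polaris code.

```python
from sklearn.metrics import cohen_kappa_score as _sk_cohen_kappa_score

def cohen_kappa_score(y_true, y_pred, **kwargs):
    # scikit-learn calls these arguments y1 and y2; renaming them here keeps the
    # common y_true/y_pred metric interface.
    return _sk_cohen_kappa_score(y1=y_true, y2=y_pred, **kwargs)
```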

polaris.evaluate.metrics.docking_metrics

rmsd_coverage

rmsd_coverage(y_pred: Union[str, List[dm.Mol]], y_true: Union[str, list[dm.Mol]], max_rsmd: float = 2)

Calculate the coverage of molecules with an RMSD less than a threshold (2 Å by default) compared to the reference molecule conformer.

It is assumed that the predicted binding conformers are extracted from the docking output, where the receptor (protein) coordinates have been aligned with the original crystal structure.

Parameters:

| Name | Description |
| --- | --- |
| y_pred | List of predicted binding conformers. |
| y_true | List of ground truth binding conformers. |
| max_rsmd | The RMSD threshold for counting a predicted pose as acceptable. |
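
A hedged usage sketch: the SDF paths are placeholders, and it assumes both files contain 3D poses for the same ligands in matching order.

```python
import datamol as dm
from polaris.evaluate.metrics.docking_metrics import rmsd_coverage

# Placeholder paths; predicted and reference poses must be aligned to the same
# receptor frame, as noted above.
predicted_poses = dm.read_sdf("predicted_poses.sdf")
reference_poses = dm.read_sdf("crystal_poses.sdf")

# Coverage of predicted poses within 2 Å RMSD of the reference conformers.
coverage = rmsd_coverage(y_pred=predicted_poses, y_true=reference_poses, max_rsmd=2.0)
print(coverage)
```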