Create a Benchmark
Polaris explicitly distinguishes datasets from benchmarks. A benchmark defines the ML task and evaluation logic (e.g. the split and metrics) for a dataset. Because of this, a single dataset can be the basis of multiple benchmarks.
To create a benchmark, you need to instantiate the BenchmarkV2Specification class. This requires you to specify:
- The dataset, which can be stored either locally or on the Hub.
- The task, where a task is defined by input and target columns.
- The split, where a split is defined by sets of indices (e.g. a training set and a test set).
- The metric, where a metric needs to be officially supported by Polaris.
- The metadata to contextualize your benchmark.
Define the dataset¶
To learn how to create a dataset, see this tutorial.
Alternatively, you can load an existing dataset from the Hub, as sketched below.
Not all Hub datasets are supported
You can only create benchmarks for DatasetV2 instances, not for DatasetV1 instances. Some of the datasets stored on the Hub are still V1 datasets.
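As a minimal sketch, loading a Hub dataset could look like the following. The dataset identifier is a placeholder; replace it with the owner/name of an actual DatasetV2 on the Hub.
import polaris as po
# Load a dataset from the Polaris Hub by its "owner/name" identifier
# (the identifier below is hypothetical)
dataset = po.load_dataset("your-username/my-first-dataset")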
Define the task¶
Currently, Polaris only supports predictive tasks. To define one, you simply list the input and target columns.
input_columns = ["SMILES"]
target_columns = ["LOG_SOLUBILITY"]
In this case, we specified just a single input and a single target column, but a benchmark can have multiple of each (e.g. a multi-task benchmark), as sketched below.
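For instance, a multi-task benchmark simply lists several target columns. The column names below are hypothetical and would need to exist in your dataset.
# Hypothetical multi-task setup: two target columns predicted from one input column
input_columns = ["SMILES"]
target_columns = ["LOG_SOLUBILITY", "LOG_PERMEABILITY"]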
Define the split¶
To ensure reproducible results, Polaris represents a split through sets of indices.
There is a catch, though: we want Polaris to scale to very large datasets. Naively storing millions of indices as lists of integers would impose a significant memory footprint. We therefore use bitmaps, more specifically roaring bitmaps, to store the splits in a memory-efficient way.
from polaris.benchmark._split_v2 import IndexSet
# To specify a set of integers, you can directly pass in a list of integers
# This will automatically convert the indices to a BitMap
training = IndexSet(indices=[0, 1])
test = IndexSet(indices=[2])
from pyroaring import BitMap
# Or you can create the BitMap manually and iteratively
indices = BitMap()
indices.add(0)
indices.add(1)
training = IndexSet(indices=indices)
from polaris.benchmark._split_v2 import SplitV2
# Finally, we create the actual split object
split = SplitV2(training=training, test=test)
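In practice, you will often already have train and test indices from your own splitting logic. A minimal sketch, assuming a random 80/20 split over a hypothetical number of rows, shows how such indices plug into IndexSet directly:
from sklearn.model_selection import train_test_split
from polaris.benchmark._split_v2 import IndexSet, SplitV2
# Hypothetical: replace with the actual number of rows in your dataset
n_rows = 100
train_idx, test_idx = train_test_split(list(range(n_rows)), test_size=0.2, random_state=42)
# IndexSet converts the lists of integers to BitMaps under the hood
split = SplitV2(training=IndexSet(indices=train_idx), test=IndexSet(indices=test_idx))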
Define the metrics¶
Even something as widely used as Mean Absolute Error (MAE) can be implemented in subtly different ways. Some people apply a log transform first, others might clip outliers, and sometimes an off-by-one or a bug creeps in. Over time, these variations add up. We decided to codify each metric for a Polaris benchmark in a single, transparent implementation. Our priority here is eliminating “mystery differences” that have nothing to do with actual model performance. Learn more here.
Specifying a metric is easy: you simply pass its label.
metrics = ["mean_absolute_error", "mean_squared_error"]
You can also specify a main metric, which will be the metric used to rank the leaderboard.
main_metric = "mean_absolute_error"
To get a list of all supported metrics, you can use:
from polaris.evaluate._metric import DEFAULT_METRICS
DEFAULT_METRICS.keys()
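As a quick sanity check, you can verify that the labels you plan to use are among the supported metrics:
# Confirm the chosen labels are registered as supported metrics
assert "mean_absolute_error" in DEFAULT_METRICS
assert "mean_squared_error" in DEFAULT_METRICS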
You can also create more complex metrics that wrap these base metrics.
from polaris.evaluate import Metric
mae_agg = Metric(
    label="mean_absolute_error",
    config={"group_by": "UNIQUE_ID", "on_error": "ignore", "aggregation": "mean"},
)
metrics.append(mae_agg)
What if my metric isn't supported yet?
Using a metric that isn't supported yet currently requires adding it to the Polaris codebase. We're always looking to improve support. Reach out to us on GitHub and we're happy to help!
Bringing it all together¶
Now we can create the BenchmarkV2Specification instance.
# Double-check that we are working with a DatasetV2 instance
type(dataset)
from polaris.benchmark._benchmark_v2 import BenchmarkV2Specification
benchmark = BenchmarkV2Specification(
# 1. The dataset
dataset=dataset,
# 2. The task
input_cols=input_columns,
target_cols=target_columns,
# 3. The split
split=split,
# 4. The metrics
metrics=metrics,
main_metric=main_metric,
# 5. The metadata
name="my-first-benchmark",
owner="your-username",
description="Created using the Polaris tutorial",
tags=["tutorial"],
user_attributes={"Key": "Value"}
)
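Before sharing it, you may want to sanity-check the benchmark locally. A minimal sketch, assuming the V2 benchmark exposes the same get_train_test_split() accessor as other Polaris benchmarks:
# Retrieve the train and test subsets defined by the split
train, test = benchmark.get_train_test_split()
print(len(train), len(test))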
Share your benchmark¶
Want to share your benchmark with the community? Upload it to the Polaris Hub!
benchmark.upload_to_hub(owner="your-username")
The End.