Subset

polaris.dataset.Subset

The Subset class provides easy access to a single partition of a split dataset.

No need to create this class manually

You should not have to create this class manually. In most use cases, you obtain a Subset through the get_train_test_split method of a BenchmarkSpecification object.

Featurize your inputs

Not all datasets are already featurized. For example, a small-molecule task might simply provide the SMILES string. To easily featurize the inputs, you can pass or set a transformation function. For example:

import datamol as dm

benchmark.get_train_test_split(..., featurization_fn=dm.to_fp)
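A custom featurization function can be written in the same spirit. The sketch below is only an illustration: it assumes the function is called once per input with a SMILES string (the way dm.to_fp would be), and the names VOCAB and char_count_featurizer are hypothetical, not part of Polaris or datamol.

```python
# Hypothetical toy featurizer: counts a few characters in a SMILES string.
# This is a character-level feature for illustration only, not a chemically
# meaningful fingerprint (e.g. "Cl" is counted as a "C").
VOCAB = ("C", "N", "O", "c", "n", "=", "#")


def char_count_featurizer(smiles: str) -> list[int]:
    """Map a SMILES string to a fixed-length vector of character counts."""
    return [smiles.count(symbol) for symbol in VOCAB]


features = char_count_featurizer("CCO")  # ethanol
```

Under that assumption, it could be passed as featurization_fn=char_count_featurizer in the call to get_train_test_split.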

A Subset should be the starting point for any framework-specific (e.g. PyTorch, TensorFlow) data-loader implementation. How the data is loaded in Polaris can be non-trivial, so this class is provided to abstract away the details. To make it easy to build framework-specific data-loaders, a Subset supports three styles of accessing the data:

  1. In memory: Loads the entire dataset in memory and returns a single array with all datapoints. This style is accessible through the subset.targets and subset.inputs properties.
  2. List: Index the subset like a list. This style is accessible through the subset[idx] syntax.
  3. Iterator: Iterate over the subset. This style is accessible through the iter(subset) syntax.

Examples:

The different styles of accessing the data:

import polaris as po

benchmark = po.load_benchmark(...)
train, test = benchmark.get_train_test_split()

# Load the entire dataset in memory, useful for e.g. scikit-learn.
X = train.inputs
y = train.targets

# Access a single datapoint as with a list, useful for e.g. PyTorch.
x, y = train[0]

# Iterate over the dataset, useful for very large datasets.
for x, y in train:
    ...
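As a starting point for a framework-agnostic data-loader, the iterator style can be wrapped in a small batching helper. This is a minimal sketch, not part of Polaris: iter_batches is a hypothetical helper, it assumes each item yielded by the subset is an (x, y) pair as in the training example above, and a plain list of pairs stands in for a real Subset.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def iter_batches(
    subset: Iterable[tuple[Any, Any]], batch_size: int
) -> Iterator[tuple[list, list]]:
    """Yield (inputs, targets) lists holding at most batch_size datapoints."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(subset)
    while True:
        # Pull the next batch_size items without loading the whole subset.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        inputs, targets = zip(*chunk)
        yield list(inputs), list(targets)


# Stand-in data: a plain list of (x, y) pairs instead of a real Subset.
toy_subset = [("CCO", 0.1), ("CCN", 0.2), ("CCC", 0.3)]
batches = list(iter_batches(toy_subset, batch_size=2))
```

Because the helper relies only on iteration, it never needs the full dataset in memory. The last batch may be smaller than batch_size.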

Raises:

Type              Description
TestAccessError   When trying to access the targets of the test set (specified by the hide_targets attribute).