Subset
polaris.dataset.Subset
The Subset class provides easy access to a single partition of a split dataset.
No need to create this class manually
You should not have to create this class manually. In most use-cases, you can create a Subset through the get_train_test_split method of a BenchmarkSpecification object.
Featurize your inputs
Not all datasets are already featurized. For example, a small-molecule task might simply provide the SMILES string. To easily featurize the inputs, you can pass or set a transformation function.
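As a rough sketch, a transformation function is just a callable that maps one raw input to a feature vector. The `featurize_smiles` helper below is hypothetical and deliberately naive (case-insensitive letter counts rather than a real molecular fingerprint). How the function is handed to Polaris is an assumption here, so that call appears only as a comment.

```python
from typing import List

# Hypothetical, deliberately naive featurizer: counts a few element
# letters in a SMILES string. A real pipeline would use a proper
# molecular fingerprint from a cheminformatics library.
ATOMS = ("C", "N", "O")


def featurize_smiles(smiles: str) -> List[int]:
    """Map a SMILES string to a fixed-length vector of letter counts."""
    upper = smiles.upper()  # aromatic atoms are written in lowercase
    return [upper.count(atom) for atom in ATOMS]


features = featurize_smiles("CCO")  # ethanol

# Assumption (not executed here): the split method accepts the function
# as a keyword argument, e.g.
# benchmark.get_train_test_split(featurization_fn=featurize_smiles)
```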
This class should be the starting point for any framework-specific (e.g. PyTorch, TensorFlow) data-loader implementation.
How the data is loaded in Polaris can be non-trivial, so this class is provided to abstract away the details.
To easily build framework-specific data-loaders, a Subset supports various styles of accessing the data:
- In memory: Loads the entire dataset in memory and returns a single array with all datapoints. This style is accessible through the subset.targets and subset.inputs properties.
- List: Index the subset like a list. This style is accessible through the subset[idx] syntax.
- Iterator: Iterate over the subset. This style is accessible through the iter(subset) syntax.
Examples:
The different styles of accessing the data:
import polaris as po
benchmark = po.load_benchmark(...)
train, test = benchmark.get_train_test_split()
# Load the entire dataset in memory, useful for e.g. scikit-learn.
X = train.inputs
y = train.targets
# Access a single datapoint as with a list, useful for e.g. PyTorch.
x, y = train[0]
# Iterate over the dataset, useful for very large datasets.
for x, y in train:
    ...
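As a possible starting point for a framework-specific data loader, the iterator style can be wrapped in a small batching helper. The sketch below relies only on iteration yielding (x, y) pairs, as in the loop above. `iter_batches` is a hypothetical helper, not part of Polaris, and a plain list stands in for a real Subset so the snippet runs on its own.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Tuple


def iter_batches(
    subset: Iterable[Tuple[Any, Any]], batch_size: int
) -> Iterator[Tuple[List[Any], List[Any]]]:
    """Yield (inputs, targets) lists holding at most batch_size datapoints."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(subset)
    while True:
        # Pull the next batch_size datapoints without loading everything.
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        xs, ys = zip(*chunk)
        yield list(xs), list(ys)


# Stand-in for a Subset: any iterable of (x, y) pairs works here.
toy_subset = [("CCO", 0.1), ("CCN", 0.2), ("CCC", 0.3)]
batches = list(iter_batches(toy_subset, batch_size=2))
```

Because the helper only consumes an iterator, it never needs the whole dataset in memory, which matches the use case the iterator style is meant for.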
Raises:
Type | Description |
---|---|
TestAccessError | When trying to access the targets of the test set (specified by the |
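A rough sketch of what this can look like from the caller's side. Both `TestAccessError` and `ToyTestSubset` below are local stand-ins defined only for illustration; they are not the Polaris classes, and the real behavior may differ in its details.

```python
class TestAccessError(Exception):
    """Stand-in for the Polaris error of the same name (illustration only)."""


class ToyTestSubset:
    """Hypothetical test partition that exposes inputs but hides targets."""

    def __init__(self, inputs, targets):
        self.inputs = inputs
        self._targets = targets  # kept private, never handed out

    @property
    def targets(self):
        raise TestAccessError("Targets of the test set are not accessible.")


test = ToyTestSubset(inputs=["CCO", "CCN"], targets=[0.1, 0.2])

try:
    y_test = test.targets
except TestAccessError:
    # Only the inputs are available; predictions must be made without labels.
    y_test = None
```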