Dataset

polaris.dataset.Dataset

Bases: BaseArtifactModel

Basic data-model for a Polaris dataset, implemented as a Pydantic model.

At its core, a dataset in Polaris is a tabular data structure that stores data-points in a row-wise manner. A Dataset can have multiple modalities or targets, can be sparse and can be part of one or multiple BenchmarkSpecification objects.

Pointer columns

Whereas a Dataset contains all information required to construct a dataset, the data itself is not necessarily stored in the table. For complex data, such as images, we support storing the content in external blobs of data. In that case, the table contains pointers to these blobs that are dynamically loaded when needed.
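The pointer-column mechanism can be pictured with a small, self-contained sketch. This is plain Python, not the Polaris implementation: a toy row-wise table where one column holds file paths that are dereferenced on access.

```python
import tempfile
from pathlib import Path

# Write two small "blobs" to disk to stand in for, e.g., image files.
blob_dir = Path(tempfile.mkdtemp())
(blob_dir / "img_0.txt").write_text("pixel-data-0")
(blob_dir / "img_1.txt").write_text("pixel-data-1")

# A toy table: each column is a list. The "image" column is a pointer
# column, so its cells hold file paths rather than the data itself.
table = {
    "smiles": ["CCO", "CCN"],
    "image": [str(blob_dir / "img_0.txt"), str(blob_dir / "img_1.txt")],
}
pointer_columns = {"image"}

def get_cell(row: int, col: str):
    """Return the cell value, dereferencing pointer columns on access."""
    value = table[col][row]
    if col in pointer_columns:
        # Dynamically load the referenced blob only when it is needed.
        return Path(value).read_text()
    return value
```

Regular columns return their value directly, while pointer columns incur a load on every access, which is why Polaris also offers ways to pre-load data (see `Dataset.load_to_memory`).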

Attributes:

Name Type Description
table Union[DataFrame, str]

The core data-structure, storing data-points in a row-wise manner. Can be specified as either a path to a .parquet file or a pandas.DataFrame.

default_adapters Dict[str, Adapter]

The adapters that the Dataset recommends to use by default to change the format of the data for specific columns.

zarr_root_path Optional[str]

The data for any pointer column should be saved in the Zarr archive this path points to.

md5sum Optional[str]

The checksum is used to verify the version of the dataset specification. If specified, it will raise an error if the specified checksum doesn't match the computed checksum.

readme str

Markdown text that can be used to provide a formatted description of the dataset. If using the Polaris Hub, it is worth noting that this field is more easily edited through the Hub UI as it provides a rich text editor for writing markdown.

annotations Dict[str, ColumnAnnotation]

Each column can be annotated with a ColumnAnnotation object. Importantly, this is used to annotate whether a column is a pointer column.

source Optional[HttpUrlString]

The data source, e.g. a DOI, Github repo or URI.

license Optional[SupportedLicenseType]

The dataset license. Polaris only supports some Creative Commons licenses. See SupportedLicenseType for accepted ID values.

curation_reference Optional[HttpUrlString]

A reference to the curation process, e.g. a DOI, Github repo or URI.

For additional meta-data attributes, see the BaseArtifactModel class.
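The `md5sum` semantics described above can be sketched as follows. The exact serialization Polaris hashes is an implementation detail, so the helper names here are hypothetical; the sketch only illustrates the documented behaviour of raising when a specified checksum disagrees with the computed one.

```python
import hashlib
from typing import Optional

def compute_checksum(serialized_table: bytes) -> str:
    # Hypothetical stand-in for Polaris' internal checksum computation.
    return hashlib.md5(serialized_table).hexdigest()

def verify_checksum(serialized_table: bytes, expected: Optional[str]) -> None:
    # Mirrors the documented behaviour: only raise when a checksum was
    # specified AND it disagrees with the computed one. Polaris raises
    # PolarisChecksumError; a plain ValueError stands in for it here.
    if expected is not None and compute_checksum(serialized_table) != expected:
        raise ValueError("checksum mismatch")
```

If no checksum is specified, verification is skipped entirely, so loading an unversioned dataset never fails on this check.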

Raises:

Type Description
InvalidDatasetError

If the dataset does not conform to the Pydantic data-model specification.

PolarisChecksumError

If the specified checksum does not match the computed checksum.

client property

client

The Polaris Hub client used to interact with the Polaris Hub.

zarr_data property

zarr_data

Get the Zarr data.

This is different from the Zarr root: to optimize the efficiency of data loading, a user can choose to load the data into memory as a numpy array.

General purpose dataloader.

The goal of Polaris is to provide general-purpose datasets that serve a wide variety of use cases well. This also implies that you should be able to optimize things further for a specific use case if needed.

zarr_root property

zarr_root

Get the zarr Group object corresponding to the root.

Opens the zarr archive in read-write mode if it is not already open.

Different from zarr_data

The zarr_data attribute references either the Zarr archive or an in-memory copy of the data. See also Dataset.load_to_memory.

n_rows property

n_rows: int

The number of rows in the dataset.

n_columns property

n_columns: int

The number of columns in the dataset.

rows property

rows: list

Return all row indices for the dataset.

columns property

columns: list

Return all columns for the dataset.

load_to_memory

load_to_memory()

Pre-load the entire dataset into memory.

Make sure the uncompressed dataset fits in-memory.

This method will load the uncompressed dataset into memory. Make sure you actually have enough memory to store the dataset.
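The trade-off that `load_to_memory` makes can be illustrated with a minimal stdlib sketch (not the Polaris implementation): a file-backed column that reads from disk on every access until it is pre-loaded, after which access becomes a list lookup.

```python
import tempfile
from pathlib import Path

class LazyColumn:
    """File-backed column that can optionally be pre-loaded into memory."""

    def __init__(self, paths):
        self._paths = list(paths)
        self._cache = None  # filled by load_to_memory()

    def load_to_memory(self):
        # Trades memory for speed: every subsequent access becomes a list
        # lookup instead of a file read. Only safe if the uncompressed
        # data actually fits in RAM.
        self._cache = [Path(p).read_bytes() for p in self._paths]

    def __getitem__(self, i):
        if self._cache is not None:
            return self._cache[i]
        return Path(self._paths[i]).read_bytes()
```

Access behaves identically before and after pre-loading; only the cost profile changes.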

get_data

get_data(row: str | int, col: str, adapters: Optional[List[Adapter]] = None) -> np.ndarray

Since the dataset might contain pointers to external files, data retrieval is more complicated than just indexing the table attribute. This method provides an end-point for seamlessly accessing the underlying data.

Parameters:

Name Type Description Default
row str | int

The row index in the Dataset.table attribute

required
col str

The column index in the Dataset.table attribute

required
adapters Optional[List[Adapter]]

The adapters to apply to the data before returning it. If None, will use the default adapters specified for the dataset.

None

Returns:

Type Description
ndarray

A numpy array with the data at the specified indices. If the column is a pointer column, the content of the referenced file is loaded to memory.
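The adapter mechanism in `get_data` can be pictured as a pipeline of callables applied to the raw cell value before it is returned. This sketch uses plain functions in place of Polaris `Adapter` objects; the fallback-to-defaults behaviour mirrors the documented `adapters=None` semantics.

```python
def apply_adapters(value, adapters=None, default_adapters=()):
    # If the caller passes no adapters, fall back to the dataset's
    # defaults, mirroring get_data(..., adapters=None).
    chain = default_adapters if adapters is None else adapters
    for adapter in chain:
        value = adapter(value)
    return value
```

Passing an explicit list overrides the defaults entirely rather than extending them, under the assumption sketched here.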

upload_to_hub

upload_to_hub(access: Optional[AccessType] = 'private', owner: Optional[Union[HubOwner, str]] = None)

A thin, convenient wrapper around the PolarisHubClient.upload_dataset method.

from_json classmethod

from_json(path: str)

Loads a dataset from a JSON file. Overrides the method from the base class to remove the caching dir from the file to load from, as that should be user-dependent.

Parameters:

Name Type Description Default
path str

The path to the JSON file to load the dataset from.

required

to_json

to_json(destination: str) -> str

Save the dataset to a destination directory as a JSON file.

Multiple files

Perhaps unintuitively, this method creates multiple files.

  1. /path/to/destination/dataset.json: This file can be loaded with Dataset.from_json.
  2. /path/to/destination/table.parquet: The Dataset.table attribute is saved here.
  3. (Optional) /path/to/destination/data/*: Any additional blobs of data referenced by the pointer columns will be stored here.
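The resulting directory layout can be sketched with the standard library. The file names follow the list above; a plain byte string stands in for the Parquet-serialized table, and the metadata dict stands in for the dataset's other attributes.

```python
import json
import tempfile
from pathlib import Path

def save_dataset(destination: str, metadata: dict, table_bytes: bytes) -> str:
    """Illustrative sketch of the to_json layout, not the Polaris code."""
    dest = Path(destination)
    dest.mkdir(parents=True, exist_ok=True)
    # table.parquet: the serialized table is written next to the JSON file.
    table_path = dest / "table.parquet"
    table_path.write_bytes(table_bytes)
    # dataset.json: stores the metadata plus a pointer to the table file.
    json_path = dest / "dataset.json"
    json_path.write_text(json.dumps({**metadata, "table": str(table_path)}))
    return str(json_path)
```

Splitting metadata and table keeps `dataset.json` small and human-readable while the (potentially large) table stays in a columnar format.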

Parameters:

Name Type Description Default
destination str

The directory to save the associated data to.

required

Returns:

Type Description
str

The path to the JSON file.

cache

cache(cache_dir: Optional[str] = None) -> str

Caches the dataset by downloading all additional data for pointer columns to a local directory.
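The caching idea can be sketched as follows: copy each blob referenced by a pointer column into a local cache directory (a local copy stands in for downloading) and repoint the table at the cached files. The fallback cache location here is illustrative, not Polaris' actual `Dataset.cache_dir`.

```python
import shutil
import tempfile
from pathlib import Path

def cache_pointer_column(paths, cache_dir=None):
    """Illustrative sketch of pointer-column caching, not the Polaris code."""
    # Fall back to a default directory when none is given (the default
    # location here is hypothetical).
    cache = Path(cache_dir) if cache_dir else Path(tempfile.gettempdir()) / "cache"
    cache.mkdir(parents=True, exist_ok=True)
    new_paths = []
    for p in paths:
        target = cache / Path(p).name
        if not target.exists():
            shutil.copy(p, target)    # stand-in for downloading the blob
        new_paths.append(str(target))  # the table now points at the local copy
    return new_paths, str(cache)
```

Because existing files are skipped, repeated calls are cheap and idempotent.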

Parameters:

Name Type Description Default
cache_dir Optional[str]

The directory to cache the data to. If not provided, this will fall back to the Dataset.cache_dir attribute.

None

Returns:

Type Description
str

The path to the cache directory.


polaris.dataset.ColumnAnnotation

Bases: BaseModel

The ColumnAnnotation class is used to annotate the columns of the Dataset object. This mostly just stores meta-data and does not affect the logic. The exception is the is_pointer attribute.

Attributes:

Name Type Description
is_pointer bool

Annotates whether a column is a pointer column. If so, it does not contain data, but rather contains references to blobs of data from which the data is loaded.

modality Union[str, Modality]

The data modality describes the data type and is used to categorize datasets on the Hub. It does not affect the logic in this library, but it does affect the logic of the Hub.

description Optional[str]

Describes how the data was generated.

user_attributes Dict[str, str]

Any additional meta-data can be stored in the user attributes.