
Dataset

polaris.dataset.Dataset

Bases: BaseDataset, ChecksumMixin

First version of a Polaris Dataset.

Stores datapoints in a Pandas DataFrame and implements pointer columns to support the storage of XL data outside the DataFrame in a Zarr archive.

Pointer columns

For complex data, such as images, we support storing the content in external blobs of data. In that case, the table contains pointers to these blobs that are dynamically loaded when needed.

Attributes:

Name Type Description
table DataFrame

The core data-structure, storing data-points in a row-wise manner. Can be specified as either a path to a .parquet file or a pandas.DataFrame.

For additional meta-data attributes, see the BaseDataset class.

Raises:

Type Description
InvalidDatasetError

If the dataset does not conform to the Pydantic data-model specification.

zarr_md5sum_manifest property

zarr_md5sum_manifest: List[ZarrFileChecksum]

The Zarr Checksum manifest stores the checksums of all files in a Zarr archive. If the dataset doesn't use Zarr, this will simply return an empty list.

rows property

rows: list[str | int]

Return all row indices for the dataset

columns property

columns: list[str]

Return all columns for the dataset

dtypes property

dtypes: dict[str, dtype]

Return the dtype for each of the columns for the dataset

load_zarr_root_from_hub

load_zarr_root_from_hub()

Loads a Zarr archive from the Hub.

get_data

get_data(row: str | int, col: str, adapters: dict[str, Adapter] | None = None) -> np.ndarray | Any

Since the dataset might contain pointers to external files, data retrieval is more complicated than just indexing the table attribute. This method provides an end-point for seamlessly accessing the underlying data.

Parameters:

Name Type Description Default
row str | int

The row index in the Dataset.table attribute

required
col str

The column index in the Dataset.table attribute

required
adapters dict[str, Adapter] | None

The adapters to apply to the data before returning it. If None, will use the default adapters specified for the dataset.

None

Returns:

Type Description
ndarray | Any

A numpy array with the data at the specified indices. If the column is a pointer column, the content of the referenced file is loaded to memory.
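
The lookup described above can be illustrated without Polaris itself. The following is a minimal sketch, not the library's implementation: the pointer format (`"<blob key>#<index>"`), the `table`, `blobs`, and `pointer_columns` objects, and the idea that an adapter is a plain callable are all assumptions made for illustration.

```python
from typing import Any, Callable, Optional

# Hypothetical stand-ins: a row-wise table and an external blob store.
table = {
    "mol_0": {"smiles": "CCO", "image": "images#0"},
    "mol_1": {"smiles": "c1ccccc1", "image": "images#1"},
}
blobs = {"images": [[0, 1, 2], [3, 4, 5]]}
pointer_columns = {"image"}

# Assumption: an adapter is any callable that converts a value.
Adapter = Callable[[Any], Any]
default_adapters: dict[str, Adapter] = {"image": tuple}


def get_data(row: str, col: str, adapters: Optional[dict[str, Adapter]] = None) -> Any:
    """Return the value at (row, col), resolving pointers and applying adapters."""
    value = table[row][col]
    if col in pointer_columns:
        # Assumed pointer layout: "<blob key>#<index>".
        key, index = value.split("#")
        value = blobs[key][int(index)]
    # Fall back to the defaults only when no adapters are passed at all.
    active = default_adapters if adapters is None else adapters
    adapter = active.get(col)
    return adapter(value) if adapter is not None else value
```

In this sketch, passing `adapters=None` uses the defaults, while passing an empty dictionary returns the raw value.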

upload_to_hub

upload_to_hub(access: AccessType = 'private', owner: HubOwner | str | None = None)

Very light, convenient wrapper around the PolarisHubClient.upload_dataset method.

from_json classmethod

from_json(path: str)

Loads a dataset from a JSON file.

Parameters:

Name Type Description Default
path str

The path to the JSON file to load the dataset from.

required

to_json

to_json(destination: str | Path, if_exists: ZarrConflictResolution = 'replace') -> str

Save the dataset to a destination directory as a JSON file.

Multiple files

Perhaps unintuitively, this method creates multiple files:

  1. /path/to/destination/[dataset.slug].json: This file can be loaded with Dataset.from_json.
  2. /path/to/destination/table.parquet: The Dataset.table attribute is saved here.
  3. (Optional) /path/to/destination/[dataset.zarr_root]: Any additional blobs of data referenced by the pointer columns will be stored here.

Parameters:

Name Type Description Default
destination str | Path

The directory to save the associated data to.

required
if_exists ZarrConflictResolution

Action for handling existing files in the Zarr archive. Options are 'raise' to throw an error, 'replace' to overwrite, or 'skip' to proceed without altering the existing files.

'replace'

Returns:

Type Description
str

The path to the JSON file.
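
The three `if_exists` options can be read as a per-file conflict policy. The sketch below shows one way such a policy could behave for a single file. The `write_file` helper is hypothetical, and how Polaris actually applies the policy to a Zarr archive may differ.

```python
import tempfile
from pathlib import Path


def write_file(path: Path, content: bytes, if_exists: str = "replace") -> bool:
    """Write `content` to `path`; return True if the file was written."""
    if if_exists not in ("raise", "replace", "skip"):
        raise ValueError(f"Unknown if_exists option: {if_exists!r}")
    if path.exists():
        if if_exists == "raise":
            raise FileExistsError(f"{path} already exists")
        if if_exists == "skip":
            return False  # leave the existing file untouched
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "archive" / "chunk.0"
    first = write_file(target, b"v1")
    skipped = write_file(target, b"v2", if_exists="skip")
    after_skip = target.read_bytes()
    replaced = write_file(target, b"v3", if_exists="replace")
    after_replace = target.read_bytes()
```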

cache

cache(destination: str | PathLike | None = None, if_exists: ZarrConflictResolution = 'replace', verify_checksum: bool = True) -> str

Caches the dataset by downloading all additional data for pointer columns to a local directory.

Parameters:

Name Type Description Default
destination str | PathLike | None

The directory to cache the data to. If None, will use the default cache directory.

None
if_exists ZarrConflictResolution

Action for handling existing files at the destination. Options are 'raise' to throw an error, 'replace' to overwrite, or 'skip' to proceed without altering the existing files.

'replace'
verify_checksum bool

Whether to verify the checksum of the dataset after caching.

True

Returns:

Type Description
str

The path to the directory where data has been cached to.

should_verify_checksum

should_verify_checksum(strategy: ChecksumStrategy) -> bool

Determines whether to verify the checksum of the dataset based on the strategy.
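
As a rough illustration only, the decision could look like the sketch below. The strategy names (`"verify"`, `"verify_unless_zarr"`, `"ignore"`) are assumptions that this page does not confirm, and the function is a standalone stand-in that takes `uses_zarr` as an argument rather than reading it from a dataset.

```python
def should_verify_checksum(strategy: str, uses_zarr: bool) -> bool:
    """Decide whether to verify a checksum (strategy names are assumed)."""
    if strategy == "ignore":
        return False
    if strategy == "verify":
        return True
    if strategy == "verify_unless_zarr":
        # Skip verification only for datasets backed by a Zarr archive.
        return not uses_zarr
    raise ValueError(f"Unknown strategy: {strategy!r}")
```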


polaris.dataset._base.BaseDataset

Bases: BaseArtifactModel, ABC

Base data-model for a Polaris dataset, implemented as a Pydantic model.

At its core, a dataset in Polaris can be thought of as a tabular data structure that stores datapoints row-wise, where each column corresponds to a variable associated with that datapoint.

A Dataset can have multiple modalities or targets, can be sparse and can be part of one or multiple BenchmarkSpecification objects.

Attributes:

Name Type Description
default_adapters dict[str, Adapter]

The adapters that the Dataset recommends using by default to change the format of the data for specific columns.

zarr_root_path str | None

The data for any pointer column should be saved in the Zarr archive this path points to.

readme str

Markdown text that can be used to provide a formatted description of the dataset. If using the Polaris Hub, it is worth noting that this field is more easily edited through the Hub UI as it provides a rich text editor for writing markdown.

annotations dict[str, ColumnAnnotation]

Each column can be annotated with a ColumnAnnotation object. Importantly, this is used to annotate whether a column is a pointer column.

source HttpUrlString | None

The data source, e.g. a DOI, GitHub repo or URI.

license SupportedLicenseType | None

The dataset license. Polaris only supports some Creative Commons licenses. See SupportedLicenseType for accepted ID values.

curation_reference HttpUrlString | None

A reference to the curation process, e.g. a DOI, GitHub repo or URI.

For additional meta-data attributes, see the BaseArtifactModel class.

Raises:

Type Description
InvalidDatasetError

If the dataset does not conform to the Pydantic data-model specification.

uses_zarr property

uses_zarr: bool

Whether any of the data in this dataset is stored in a Zarr Archive.

zarr_data property

zarr_data

Get the Zarr data.

This differs from the Zarr root because, to speed up data loading, a user can choose to load the data into memory as a NumPy array.

General purpose dataloader.

The goal with Polaris is to provide general purpose datasets that serve as good options for a wide variety of use cases. This also implies you should be able to optimize things further for a specific use case if needed.

zarr_root property

zarr_root: Group | None

Get the zarr Group object corresponding to the root.

Opens the zarr archive in read-write mode if it is not already open.

Different to zarr_data

The zarr_data attribute refers either to the Zarr archive or to an in-memory copy of the data. See also Dataset.load_to_memory.

n_rows property

n_rows: int

The number of rows in the dataset.

n_columns property

n_columns: int

The number of columns in the dataset.

rows abstractmethod property

rows: list[str | int]

Return all row indices for the dataset

columns abstractmethod property

columns: list[str]

Return all columns for the dataset

dtypes abstractmethod property

dtypes: dict[str, dtype]

Return the dtype for each of the columns for the dataset

load_zarr_root_from_hub abstractmethod

load_zarr_root_from_hub()

Loads a Zarr archive from the Polaris Hub.

load_zarr_root_from_local

load_zarr_root_from_local()

Loads a locally stored Zarr archive.

We use memory mapping by default because our experiments show that it is consistently faster.

load_to_memory

load_to_memory()

Loads data from Zarr files into memory.

Make sure the uncompressed dataset fits in-memory.

This method will load the uncompressed dataset into memory. Make sure you actually have enough memory to store the dataset.

get_data abstractmethod

get_data(row: str | int, col: str, adapters: dict[str, Adapter] | None = None) -> np.ndarray | Any

Since the dataset might contain pointers to external files, data retrieval is more complicated than just indexing the table attribute. This method provides an end-point for seamlessly accessing the underlying data.

Parameters:

Name Type Description Default
row str | int

The row index in the Dataset.table attribute

required
col str

The column index in the Dataset.table attribute

required
adapters dict[str, Adapter] | None

The adapters to apply to the data before returning it. If None, will use the default adapters specified for the dataset.

None

Returns:

Type Description
ndarray | Any

A numpy array with the data at the specified indices. If the column is a pointer column, the content of the referenced file is loaded to memory.

upload_to_hub abstractmethod

upload_to_hub(access: AccessType = 'private', owner: HubOwner | str | None = None)

Uploads the dataset to the Polaris Hub.

from_json abstractmethod classmethod

from_json(path: str)

Loads a dataset from a JSON file.

Parameters:

Name Type Description Default
path str

The path to the JSON file to load the dataset from.

required

to_json abstractmethod

to_json(destination: str | Path, if_exists: ZarrConflictResolution = 'replace') -> str

Save the dataset to a destination directory as a JSON file.

Parameters:

Name Type Description Default
destination str | Path

The directory to save the associated data to.

required
if_exists ZarrConflictResolution

Action for handling existing files in the Zarr archive. Options are 'raise' to throw an error, 'replace' to overwrite, or 'skip' to proceed without altering the existing files.

'replace'

Returns:

Type Description
str

The path to the JSON file.


polaris.dataset.ColumnAnnotation

Bases: BaseModel

The ColumnAnnotation class is used to annotate the columns of the Dataset object. This mostly just stores meta-data and does not affect the logic. The exception is the is_pointer attribute.

Attributes:

Name Type Description
is_pointer bool

Annotates whether a column is a pointer column. If so, it does not contain data, but rather contains references to blobs of data from which the data is loaded.

modality Modality

The data modality describes the data type and is used to categorize datasets on the Hub. It does not affect the logic of this library, but it does affect the logic of the Hub.

description str | None

Describes how the data was generated.

user_attributes dict[str, str]

Any additional meta-data can be stored in the user attributes.

content_type KnownContentType | str | None

Specifies the column's IANA content type. If the content type matches a known type for molecules (e.g. "chemical/x-smiles"), visualization of its content is activated on the Hub.


polaris.dataset.zarr

ZarrFileChecksum

Bases: BaseModel

This data is sent to the Hub to verify the integrity of the Zarr archive on upload.

Attributes:

Name Type Description
path str

The path of the file relative to the Zarr root.

md5sum str

The md5sum of the file.

size int

The size of the file in bytes.
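
To make the three fields concrete, the sketch below builds a list of comparable entries for every file under a directory. It uses a hypothetical `FileChecksum` dataclass and `build_manifest` helper rather than the Polaris classes, and only assumes that `md5sum` is the hex digest of the file's raw bytes.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FileChecksum:
    """Hypothetical stand-in with the three documented fields."""

    path: str
    md5sum: str
    size: int


def build_manifest(root: Path) -> list[FileChecksum]:
    """Collect path (relative to root), md5 and size for every file."""
    entries = []
    for file in sorted(p for p in root.rglob("*") if p.is_file()):
        data = file.read_bytes()
        entries.append(
            FileChecksum(
                path=file.relative_to(root).as_posix(),
                md5sum=hashlib.md5(data).hexdigest(),
                size=len(data),
            )
        )
    return entries


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a").mkdir()
    (root / "a" / "b").write_bytes(b"chunk")
    (root / "c").write_bytes(b"{}")
    manifest = build_manifest(root)
```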

MemoryMappedDirectoryStore

Bases: DirectoryStore

A Zarr Store that opens chunks as memory-mapped files. See also this GitHub issue.

Memory mapping leverages low-level OS functionality to reduce the time it takes to read the content of a file by directly mapping to memory.
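
The underlying idea can be shown with the standard library alone. This is a sketch of memory-mapped reading in general, not of the store class itself. The `read_chunk_mmap` helper and the file name are hypothetical.

```python
import mmap
import tempfile
from pathlib import Path


def read_chunk_mmap(path: Path) -> memoryview:
    """Return a read-only view of the file backed by a memory map."""
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; the map stays valid after fh is closed.
        mapped = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    return memoryview(mapped)


with tempfile.TemporaryDirectory() as tmp:
    chunk_path = Path(tmp) / "0.0"
    chunk_path.write_bytes(bytes(range(256)))

    view = read_chunk_mmap(chunk_path)
    first_bytes = bytes(view[:4])  # only the touched pages need to be read
    total = len(view)

    # Release the view before closing the map, then let the directory be removed.
    mapped = view.obj
    view.release()
    mapped.close()
```

Slicing the view copies only the requested bytes, which is why this approach can avoid reading a whole file up front.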

compute_zarr_checksum

compute_zarr_checksum(zarr_root_path: str) -> Tuple[_ZarrDirectoryDigest, List[ZarrFileChecksum]]

Implements an algorithm to compute the Zarr checksum.

This checksum is sensitive to Zarr configuration.

This checksum is sensitive to changes in the Zarr structure. For example, if you change the chunk size, the checksum will also change.

To understand how this works, consider the following directory structure:

       . (root)
      / \
     a   c
    /
   b

Within zarr, this would for example be:

  • root: A Zarr Group with a single Array.
  • a: A Zarr Array
  • b: A single chunk of the Zarr Array
  • c: A metadata file (i.e. .zarray, .zattrs or .zgroup)

To compute the checksum, we first find all the leaves of the tree, in this case b and c. We compute the hash of the content (the raw bytes) of each of these files.

We then work our way up the tree. For any node (directory), we find all of its children. In sorted order, we then serialize a list containing, for each child, its checksum, size, and number of children. The hash of the directory is then the hash of the serialized JSON.

The Polaris implementation is heavily based on the zarr-checksum package. This method is the biggest deviation from the original code.
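
The bottom-up procedure can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the Polaris or zarr-checksum code: the `digest` and `tree_digest` helpers are hypothetical, the JSON layout is invented for the example, and each child's name is included in the serialized list so that renaming a file changes the result.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def digest(path: Path) -> dict:
    """Return name, md5, size and child count for a file or directory."""
    if path.is_file():
        data = path.read_bytes()
        # Leaf: hash the raw bytes of the file.
        return {"name": path.name, "md5": hashlib.md5(data).hexdigest(),
                "size": len(data), "count": 0}
    # Directory: digest the children in sorted order, then hash the
    # serialized list of their summaries.
    children = [digest(child) for child in sorted(path.iterdir())]
    payload = json.dumps(children, separators=(",", ":")).encode("utf-8")
    return {"name": path.name, "md5": hashlib.md5(payload).hexdigest(),
            "size": sum(c["size"] for c in children), "count": len(children)}


METADATA = b'{"zarr_format": 2}'


def tree_digest(chunk: bytes) -> dict:
    """Build the example tree (root/a/b and root/c) and digest its root."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a").mkdir()
        (root / "a" / "b").write_bytes(chunk)
        (root / "c").write_bytes(METADATA)
        return digest(root)


d1 = tree_digest(b"chunk-0")
d2 = tree_digest(b"chunk-0")
d3 = tree_digest(b"chunk-1")
```

Because the root's hash depends only on its children, `d1` and `d2` agree even though they were built in different temporary directories, while changing the bytes of the single chunk in `d3` changes the hash all the way up to the root.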

generate_zarr_manifest

generate_zarr_manifest(zarr_root_path: str, output_dir: str) -> str

Entry point function which triggers the creation of a Zarr manifest for a V2 dataset.

Parameters:

Name Type Description Default
zarr_root_path str

The path to the root of a Zarr archive

required
output_dir str

The path to the directory which will hold the generated manifest

required