Dataset

polaris.dataset.Dataset

Bases: BaseDataset, ChecksumMixin

First version of a Polaris Dataset.

Stores datapoints in a Pandas DataFrame and implements pointer columns to support the storage of XL data outside the DataFrame in a Zarr archive.

Pointer columns

For complex data, such as images, we support storing the content in external blobs of data. In that case, the table contains pointers to these blobs that are dynamically loaded when needed.
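To make the idea concrete, here is a minimal sketch of how a pointer column can be resolved. It does not use the Polaris API: `table`, `blob_store`, `pointer_columns` and `resolve` are hypothetical stand-ins, with plain dictionaries in place of the Pandas DataFrame and the Zarr archive.

```python
from typing import Any

# Hypothetical stand-in for the table: column name -> {row index -> value}.
table = {
    "smiles": {0: "CCO", 1: "c1ccccc1"},
    # This column holds pointers (keys into the blob store), not the data itself.
    "image": {0: "images/0", 1: "images/1"},
}

# Hypothetical stand-in for the external archive that holds the large blobs.
blob_store = {
    "images/0": [[0, 1], [1, 0]],
    "images/1": [[1, 1], [0, 0]],
}

# Columns whose values must be dereferenced before use.
pointer_columns = {"image"}


def resolve(row: int, col: str) -> Any:
    """Return the cell value, following the pointer for pointer columns."""
    value = table[col][row]
    if col in pointer_columns:
        # The blob is only looked up when it is actually requested.
        return blob_store[value]
    return value
```

Regular columns are returned as stored, while pointer columns pay the cost of loading a blob only for the cells that are accessed.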

Attributes:

Name	Type	Description
`table`	`DataFrame`	The core data-structure, storing data-points in a row-wise manner. Can be specified as either a path to a `.parquet` file or a `pandas.DataFrame`.

For additional meta-data attributes, see the BaseDataset class.

Raises:

Type	Description
`InvalidDatasetError`	If the dataset does not conform to the Pydantic data-model specification.

zarr_md5sum_manifest `property`

zarr_md5sum_manifest: List[ZarrFileChecksum]

The Zarr Checksum manifest stores the checksums of all files in a Zarr archive. If the dataset doesn't use Zarr, this will simply return an empty list.

rows `property`

rows: list[str | int]

Return all row indices for the dataset

columns `property`

columns: list[str]

Return all columns for the dataset

dtypes `property`

dtypes: dict[str, dtype]

Return the dtype for each of the columns for the dataset

load_zarr_root_from_hub

load_zarr_root_from_hub()

Loads a Zarr archive from the Hub.

get_data

get_data(row: str | int, col: str, adapters: dict[str, Adapter] | None = None) -> np.ndarray | Any

Since the dataset might contain pointers to external files, data retrieval is more complicated than just indexing the table attribute. This method provides an end-point for seamlessly accessing the underlying data.

Parameters:

Name	Type	Description	Default
`row`	`str \| int`	The row index in the `Dataset.table` attribute	required
`col`	`str`	The column index in the `Dataset.table` attribute	required
`adapters`	`dict[str, Adapter] \| None`	The adapters to apply to the data before returning it. If None, will use the default adapters specified for the dataset.	`None`

Returns:

Type	Description
`ndarray \| Any`	A numpy array with the data at the specified indices. If the column is a pointer column, the content of the referenced file is loaded to memory.
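The fallback behaviour of the `adapters` argument can be sketched as follows. This is an illustration only: `ToyDataset` is a hypothetical class, and adapters are modelled as plain callables, which is an assumption and may not match how Polaris represents an `Adapter`.

```python
from typing import Any, Callable, Dict, Optional

# Assumption: an adapter is modelled here as a function that converts a value.
AdapterFn = Callable[[Any], Any]


class ToyDataset:
    """Hypothetical stand-in for a dataset with per-column default adapters."""

    def __init__(self, table: Dict[str, Dict[int, Any]],
                 default_adapters: Optional[Dict[str, AdapterFn]] = None):
        self.table = table
        self.default_adapters = default_adapters or {}

    def get_data(self, row: int, col: str,
                 adapters: Optional[Dict[str, AdapterFn]] = None) -> Any:
        # None means "use the dataset defaults"; an explicit dict overrides them.
        if adapters is None:
            adapters = self.default_adapters
        value = self.table[col][row]
        adapter = adapters.get(col)
        return adapter(value) if adapter is not None else value
```

Note that passing an empty dictionary is different from passing `None`: in this sketch, the former disables all adapters, while the latter falls back to the defaults.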

upload_to_hub

upload_to_hub(access: AccessType = 'private', owner: HubOwner | str | None = None)

Very light, convenient wrapper around the PolarisHubClient.upload_dataset method.

from_json `classmethod`

from_json(path: str)

Loads a dataset from a JSON file.

Parameters:

Name	Type	Description	Default
`path`	`str`	The path to the JSON file to load the dataset from.	required

to_json

to_json(destination: str | Path, if_exists: ZarrConflictResolution = 'replace') -> str

Save the dataset to a destination directory as a JSON file.

Multiple files

Perhaps unintuitively, this method creates multiple files.

/path/to/destination/[dataset.slug].json: This file can be loaded with Dataset.from_json.
/path/to/destination/table.parquet: The Dataset.table attribute is saved here.
(Optional) /path/to/destination/[dataset.zarr_root]: Any additional blobs of data referenced by the pointer columns will be stored here.
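The layout listed above can be sketched with a small helper that only computes the expected paths. `expected_files` and its parameters are hypothetical names used for illustration; the helper does not write anything and is not part of Polaris.

```python
from pathlib import Path
from typing import List, Optional


def expected_files(destination: str, slug: str,
                   zarr_root_name: Optional[str] = None) -> List[Path]:
    """List the paths that a save to `destination` is expected to produce."""
    dest = Path(destination)
    files = [
        dest / f"{slug}.json",   # the file that can be loaded again from JSON
        dest / "table.parquet",  # the tabular part of the dataset
    ]
    if zarr_root_name is not None:
        # Only present when the dataset has pointer columns.
        files.append(dest / zarr_root_name)
    return files
```

Such a listing can be useful to check a destination directory before moving or archiving a saved dataset, since copying only the JSON file would leave the table and blobs behind.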

Parameters:

Name	Type	Description	Default
`destination`	`str \| Path`	The directory to save the associated data to.	required
`if_exists`	`ZarrConflictResolution`	Action for handling existing files in the Zarr archive. Options are 'raise' to throw an error, 'replace' to overwrite, or 'skip' to proceed without altering the existing files.	`'replace'`

Returns:

Type	Description
`str`	The path to the JSON file.
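The three `if_exists` options can be illustrated for a single file. This is a sketch of the described semantics only; `write_with_policy` is a hypothetical helper, not the function Polaris uses internally.

```python
from pathlib import Path

VALID_POLICIES = ("raise", "replace", "skip")


def write_with_policy(path: Path, content: str, if_exists: str = "replace") -> bool:
    """Write `content` to `path`, honouring the conflict policy.

    Returns True if the file was written and False if it was skipped.
    """
    if if_exists not in VALID_POLICIES:
        raise ValueError(f"Unknown policy: {if_exists!r}")
    if path.exists():
        if if_exists == "raise":
            raise FileExistsError(f"{path} already exists")
        if if_exists == "skip":
            return False  # leave the existing file untouched
    # Either the file is new, or the policy is 'replace'.
    path.write_text(content)
    return True
```

With `'skip'`, a partially written archive can be resumed without rewriting what is already there, at the cost of not noticing stale files; `'raise'` is the conservative choice when overwriting would be a mistake.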

cache

cache(destination: str | PathLike | None = None, if_exists: ZarrConflictResolution = 'replace', verify_checksum: bool = True) -> str

Caches the dataset by downloading all additional data for pointer columns to a local directory.

Parameters:

Name	Type	Description	Default
`destination`	`str \| PathLike \| None`	The directory to cache the data to. If None, will use the default cache directory.	`None`
`if_exists`	`ZarrConflictResolution`	Action for handling existing files at the destination. Options are 'raise' to throw an error, 'replace' to overwrite, or 'skip' to proceed without altering the existing files.	`'replace'`
`verify_checksum`	`bool`	Whether to verify the checksum of the dataset after caching.	`True`

Returns:

Type	Description
`str`	The path to the directory where data has been cached to.

should_verify_checksum

should_verify_checksum(strategy: ChecksumStrategy) -> bool

Determines whether to verify the checksum of the dataset based on the strategy.

polaris.dataset._base.BaseDataset

Bases: BaseArtifactModel, ABC

Base data-model for a Polaris dataset, implemented as a Pydantic model.

At its core, a dataset in Polaris can be thought of as a tabular data structure that stores data-points in a row-wise manner, where each column corresponds to a variable associated with that data-point.

A Dataset can have multiple modalities or targets, can be sparse and can be part of one or multiple BenchmarkSpecification objects.

Attributes:

Name	Type	Description
`default_adapters`	`dict[str, Adapter]`	The adapters that the Dataset recommends to use by default to change the format of the data for specific columns.
`zarr_root_path`	`str \| None`	The data for any pointer column should be saved in the Zarr archive this path points to.
`readme`	`str`	Markdown text that can be used to provide a formatted description of the dataset. If using the Polaris Hub, it is worth noting that this field is more easily edited through the Hub UI as it provides a rich text editor for writing markdown.
`annotations`	`dict[str, ColumnAnnotation]`	Each column can be annotated with a `ColumnAnnotation` object. Importantly, this is used to annotate whether a column is a pointer column.
`source`	`HttpUrlString \| None`	The data source, e.g. a DOI, Github repo or URI.
`license`	`SupportedLicenseType \| None`	The dataset license. Polaris only supports some Creative Commons licenses. See `SupportedLicenseType` for accepted ID values.
`curation_reference`	`HttpUrlString \| None`	A reference to the curation process, e.g. a DOI, Github repo or URI.

For additional meta-data attributes, see the BaseArtifactModel class.

Raises:

Type	Description
`InvalidDatasetError`	If the dataset does not conform to the Pydantic data-model specification.

uses_zarr `property`

uses_zarr: bool

Whether any of the data in this dataset is stored in a Zarr Archive.

zarr_data `property`

zarr_data

Get the Zarr data.

This is different from the Zarr root: to make data loading more efficient, a user can choose to load the data into memory as a NumPy array.

General purpose dataloader.

The goal with Polaris is to provide general purpose datasets that serve as good options for a wide variety of use cases. This also implies you should be able to optimize things further for a specific use case if needed.

zarr_root `property`

zarr_root: Group | None

Get the zarr Group object corresponding to the root.

Opens the zarr archive in read-write mode if it is not already open.

Different to zarr_data

The zarr_data attribute refers either to the Zarr archive or to an in-memory copy of the data. See also Dataset.load_to_memory.

n_rows `property`

n_rows: int

The number of rows in the dataset.

n_columns `property`

n_columns: int

The number of columns in the dataset.

rows `abstractmethod` `property`

rows: list[str | int]

Return all row indices for the dataset

columns `abstractmethod` `property`

columns: list[str]

Return all columns for the dataset

dtypes `abstractmethod` `property`

dtypes: dict[str, dtype]

Return the dtype for each of the columns for the dataset

load_zarr_root_from_hub `abstractmethod`

load_zarr_root_from_hub()

Loads a Zarr archive from the Polaris Hub.

load_zarr_root_from_local

load_zarr_root_from_local()

Loads a locally stored Zarr archive.

We use memory mapping by default because our experiments show that it is consistently faster.

load_to_memory

load_to_memory()

Loads the data from the Zarr files into memory.

Make sure the uncompressed dataset fits in-memory.

This method will load the uncompressed dataset into memory. Make sure you actually have enough memory to store the dataset.

get_data `abstractmethod`

get_data(row: str | int, col: str, adapters: dict[str, Adapter] | None = None) -> np.ndarray | Any

Since the dataset might contain pointers to external files, data retrieval is more complicated than just indexing the table attribute. This method provides an end-point for seamlessly accessing the underlying data.

Parameters:

Name	Type	Description	Default
`row`	`str \| int`	The row index in the `Dataset.table` attribute	required
`col`	`str`	The column index in the `Dataset.table` attribute	required
`adapters`	`dict[str, Adapter] \| None`	The adapters to apply to the data before returning it. If None, will use the default adapters specified for the dataset.	`None`

Returns:

Type	Description
`ndarray \| Any`	A numpy array with the data at the specified indices. If the column is a pointer column, the content of the referenced file is loaded to memory.

upload_to_hub `abstractmethod`

upload_to_hub(access: AccessType = 'private', owner: HubOwner | str | None = None)

Uploads the dataset to the Polaris Hub.

from_json `abstractmethod` `classmethod`

from_json(path: str)

Loads a dataset from a JSON file.

Parameters:

Name	Type	Description	Default
`path`	`str`	The path to the JSON file to load the dataset from.	required

to_json `abstractmethod`

to_json(destination: str | Path, if_exists: ZarrConflictResolution = 'replace') -> str

Save the dataset to a destination directory as a JSON file.

Parameters:

Name	Type	Description	Default
`destination`	`str \| Path`	The directory to save the associated data to.	required
`if_exists`	`ZarrConflictResolution`	Action for handling existing files in the Zarr archive. Options are 'raise' to throw an error, 'replace' to overwrite, or 'skip' to proceed without altering the existing files.	`'replace'`

Returns:

Type	Description
`str`	The path to the JSON file.

polaris.dataset.ColumnAnnotation

Bases: BaseModel

The ColumnAnnotation class is used to annotate the columns of the Dataset object. This mostly just stores meta-data and does not affect the logic. The exception is the is_pointer attribute.

Attributes:

Name	Type	Description
`is_pointer`	`bool`	Annotates whether a column is a pointer column. If so, it does not contain data, but rather contains references to blobs of data from which the data is loaded.
`modality`	`Modality`	The data modality describes the data type and is used to categorize datasets on the Hub. It does not affect the logic in this library, but it does affect the logic of the Hub.
`description`	`str \| None`	Describes how the data was generated.
`user_attributes`	`dict[str, str]`	Any additional meta-data can be stored in the user attributes.
`content_type`	`KnownContentType \| str \| None`	Specifies the column's IANA content type. If the content type matches a known type for molecules (e.g. "chemical/x-smiles"), visualization of its content will be activated on the Hub side.

polaris.dataset.zarr

ZarrFileChecksum

Bases: BaseModel

This data is sent to the Hub to verify the integrity of the Zarr archive on upload.

Attributes:

Name	Type	Description
`path`	`str`	The path of the file relative to the Zarr root.
`md5sum`	`str`	The md5sum of the file.
`size`	`int`	The size of the file in bytes.
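A manifest with these three fields can be sketched using only the standard library. `FileChecksum` and `build_manifest` are hypothetical names chosen for this illustration; the sketch mirrors the attributes above but is not the Polaris implementation.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class FileChecksum:
    """Hypothetical stand-in with the same three fields as described above."""
    path: str    # path relative to the archive root, with forward slashes
    md5sum: str  # hex digest of the raw file bytes
    size: int    # size in bytes


def build_manifest(root: str) -> List[FileChecksum]:
    """Collect one entry per file found below `root`."""
    root_path = Path(root)
    entries = []
    for file in sorted(p for p in root_path.rglob("*") if p.is_file()):
        data = file.read_bytes()
        entries.append(
            FileChecksum(
                path=file.relative_to(root_path).as_posix(),
                md5sum=hashlib.md5(data).hexdigest(),
                size=len(data),
            )
        )
    return entries
```

Reading each file fully into memory keeps the sketch short; for large chunks, hashing in fixed-size blocks would be the more careful choice.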

MemoryMappedDirectoryStore

Bases: DirectoryStore

A Zarr Store to open chunks as memory-mapped files. See also this Github issue.

Memory mapping leverages low-level OS functionality to reduce the time it takes to read the content of a file by directly mapping to memory.
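As a rough illustration of the underlying mechanism, the standard-library `mmap` module can map a file into memory and read a slice of it without an explicit read of the whole file. `read_slice_mmap` is a hypothetical helper and is unrelated to how the store itself is implemented.

```python
import mmap


def read_slice_mmap(path: str, start: int, length: int) -> bytes:
    """Read `length` bytes starting at `start` through a memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; note that empty files cannot be mapped.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing the map returns a copy, so it stays valid after closing.
            return bytes(mm[start:start + length])
```

The operating system pages in only the parts of the file that are touched, which is where the potential speed-up for chunked reads comes from.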

compute_zarr_checksum

compute_zarr_checksum(zarr_root_path: str) -> Tuple[_ZarrDirectoryDigest, List[ZarrFileChecksum]]

Implements an algorithm to compute the Zarr checksum.

This checksum is sensitive to Zarr configuration.

This checksum is sensitive to change in the Zarr structure. For example, if you change the chunk size, the checksum will also change.

To understand how this works, consider the following directory structure:

       . (root)
      / \
     a   c
    /
   b

Within zarr, this would for example be:

root: A Zarr Group with a single Array.
a: A Zarr Array
b: A single chunk of the Zarr Array
c: A metadata file (i.e. .zarray, .zattrs or .zgroup)

To compute the checksum, we first find all the leaves (files) in the tree, in this case b and c. We compute the hash of the content (the raw bytes) for each of these files.

We then work our way up the tree. For any node (directory), we find all children of that node. In sorted order, we then serialize a list with, for each of the children, the checksum, size, and number of children. The hash of the directory is then equal to the hash of the serialized JSON.

The Polaris implementation is heavily based on the zarr-checksum package. This method is the biggest deviation from the original code.
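The bottom-up procedure described above can be sketched as a short recursive function. This is a simplified illustration under stated assumptions, not the Polaris or zarr-checksum implementation: `digest_tree` is a hypothetical name, MD5 is used for the hashes, and the JSON layout and the meaning of `count` (number of files in the subtree) are choices made for this example, so the digests will not match those produced by either library.

```python
import hashlib
import json
from pathlib import Path
from typing import Tuple


def digest_tree(path: Path) -> Tuple[str, int, int]:
    """Return (hex digest, total size in bytes, file count) for a file or directory."""
    if path.is_file():
        # Leaf: hash the raw bytes of the file.
        data = path.read_bytes()
        return hashlib.md5(data).hexdigest(), len(data), 1

    children = []
    total_size = 0
    total_count = 0
    # Visit children in sorted order so the result is deterministic.
    for child in sorted(path.iterdir(), key=lambda p: p.name):
        digest, size, count = digest_tree(child)
        children.append(
            {"name": child.name, "digest": digest, "size": size, "count": count}
        )
        total_size += size
        total_count += count

    # Directory: hash the serialized list describing its children.
    serialized = json.dumps(children, sort_keys=True, separators=(",", ":"))
    digest = hashlib.md5(serialized.encode("utf-8")).hexdigest()
    return digest, total_size, total_count
```

Because each directory digest depends on the digests of its children, changing a single chunk changes every digest on the path up to the root. The same holds for rechunking, which alters the set of chunk files.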

generate_zarr_manifest

generate_zarr_manifest(zarr_root_path: str, output_dir: str) -> str

Entry point function which triggers the creation of a Zarr manifest for a V2 dataset.

Parameters:

Name	Type	Description	Default
`zarr_root_path`	`str`	The path to the root of a Zarr archive	required
`output_dir`	`str`	The path to the directory which will hold the generated manifest	required
