Dataset
polaris.dataset.Dataset
Bases: BaseDataset, ChecksumMixin
First version of a Polaris Dataset.
Stores datapoints in a Pandas DataFrame and implements pointer columns to support the storage of XL data outside the DataFrame in a Zarr archive.
Pointer columns
For complex data, such as images, we support storing the content in external blobs of data. In that case, the table contains pointers to these blobs that are dynamically loaded when needed.
Attributes:
Name | Type | Description |
---|---|---|
table | DataFrame | The core data-structure, storing data-points in a row-wise manner. Can be specified as either a path to a |
For additional meta-data attributes, see the BaseDataset class.
Raises:
Type | Description |
---|---|
InvalidDatasetError | If the dataset does not conform to the Pydantic data-model specification. |
zarr_md5sum_manifest
property
The Zarr Checksum manifest stores the checksums of all files in a Zarr archive. If the dataset doesn't use Zarr, this will simply return an empty list.
get_data
Since the dataset might contain pointers to external files, data retrieval is more complicated than just indexing the table attribute. This method provides an end-point for seamlessly accessing the underlying data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
row | str \| int | The row index in the | required |
col | str | The column index in the | required |
adapters | dict[str, Adapter] \| None | The adapters to apply to the data before returning it. If None, will use the default adapters specified for the dataset. | None |
Returns:
Type | Description |
---|---|
ndarray \| Any | A numpy array with the data at the specified indices. If the column is a pointer column, the content of the referenced file is loaded to memory. |
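The pointer-column lookup described for get_data can be pictured with a small toy model. This is only a sketch and not the Polaris implementation: the class `ToyPointerDataset`, the dict-based table, and the dict standing in for the external archive are all assumptions made for illustration.

```python
# Toy illustration only: this is NOT the Polaris implementation.
# The "table" is a plain dict of columns, and `blobs` is a dict that stands in
# for an external archive such as a Zarr store.


class ToyPointerDataset:
    def __init__(self, table, blobs, pointer_columns):
        self.table = table  # column name -> list of cell values
        self.blobs = blobs  # blob key -> externally stored content
        self.pointer_columns = set(pointer_columns)

    def get_data(self, row, col):
        value = self.table[col][row]
        if col in self.pointer_columns:
            # The cell only holds a reference; resolve it when requested.
            return self.blobs[value]
        return value


dataset = ToyPointerDataset(
    table={"smiles": ["CCO", "c1ccccc1"], "image": ["images/0", "images/1"]},
    blobs={"images/0": [[0, 1], [1, 0]], "images/1": [[1, 1], [0, 0]]},
    pointer_columns=["image"],
)

plain_value = dataset.get_data(0, "smiles")   # read directly from the table
loaded_blob = dataset.get_data(1, "image")    # resolved through the pointer
```

The caller uses the same call for both kinds of column, which is the point of routing access through a single method instead of indexing the table directly.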
upload_to_hub
Very light, convenient wrapper around the PolarisHubClient.upload_dataset method.
from_json
classmethod
Loads a dataset from a JSON file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The path to the JSON file to load the dataset from. | required |
to_json
Save the dataset to a destination directory as a JSON file.
Multiple files
Perhaps unintuitively, this method creates multiple files:
- /path/to/destination/[dataset.slug].json: This file can be loaded with Dataset.from_json.
- /path/to/destination/table.parquet: The Dataset.table attribute is saved here.
- (Optional) /path/to/destination/[dataset.zarr_root]: Any additional blobs of data referenced by the pointer columns will be stored here.
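As a rough sketch of the layout listed above, the helper below builds the expected paths for a given destination. The function name `expected_output_files` and its parameters are hypothetical and are not part of the Polaris API; the name of the Zarr directory in particular is an assumption supplied by the caller.

```python
from pathlib import Path


def expected_output_files(destination, slug, zarr_root_name=None):
    """Hypothetical helper: list the paths described in the text above."""
    destination = Path(destination)
    files = [
        destination / f"{slug}.json",     # loadable with Dataset.from_json
        destination / "table.parquet",    # the Dataset.table attribute
    ]
    if zarr_root_name is not None:
        # Only present when the dataset has pointer columns.
        files.append(destination / zarr_root_name)
    return files


without_zarr = expected_output_files("/path/to/destination", "my-dataset")
with_zarr = expected_output_files("/path/to/destination", "my-dataset", "data.zarr")
```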
Parameters:
Name | Type | Description | Default |
---|---|---|---|
destination | str \| Path | The directory to save the associated data to. | required |
if_exists | ZarrConflictResolution | Action for handling existing files in the Zarr archive. Options are 'raise' to throw an error, 'replace' to overwrite, or 'skip' to proceed without altering the existing files. | 'replace' |
Returns:
Type | Description |
---|---|
str | The path to the JSON file. |
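The three if_exists options can be illustrated with a small file-writing helper. This is a sketch of the general conflict-resolution pattern, not the code Polaris uses for Zarr archives; `write_with_policy` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def write_with_policy(path, content, if_exists="replace"):
    """Hypothetical helper. Returns True if the file was written, False if skipped."""
    if if_exists not in ("raise", "replace", "skip"):
        raise ValueError(f"Unknown if_exists option: {if_exists!r}")
    path = Path(path)
    if path.exists():
        if if_exists == "raise":
            raise FileExistsError(f"{path} already exists")
        if if_exists == "skip":
            return False  # leave the existing file untouched
    path.write_bytes(content)  # new file, or 'replace' on an existing one
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "chunk.bin"
    first_write = write_with_policy(target, b"old")
    skipped_write = write_with_policy(target, b"new", if_exists="skip")
    content_after_skip = target.read_bytes()
    write_with_policy(target, b"new", if_exists="replace")
    content_after_replace = target.read_bytes()
    try:
        write_with_policy(target, b"again", if_exists="raise")
        raised = False
    except FileExistsError:
        raised = True
```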
cache
cache(destination: str | PathLike | None = None, if_exists: ZarrConflictResolution = 'replace', verify_checksum: bool = True) -> str
Caches the dataset by downloading all additional data for pointer columns to a local directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
destination | str \| PathLike \| None | The directory to cache the data to. If None, will use the default cache directory. | None |
if_exists | ZarrConflictResolution | Action for handling existing files at the destination. Options are 'raise' to throw an error, 'replace' to overwrite, or 'skip' to proceed without altering the existing files. | 'replace' |
verify_checksum | bool | Whether to verify the checksum of the dataset after caching. | True |
Returns:
Type | Description |
---|---|
str | The path to the directory where data has been cached to. |
polaris.dataset._base.BaseDataset
Bases: BaseArtifactModel, ABC
Base data-model for a Polaris dataset, implemented as a Pydantic model.
At its core, a dataset in Polaris can conceptually be thought of as a tabular data structure that stores data-points in a row-wise manner, where each column corresponds to a variable associated with that datapoint.
A Dataset can have multiple modalities or targets, can be sparse and can be part of one or multiple BenchmarkSpecification objects.
Attributes:
Name | Type | Description |
---|---|---|
default_adapters | dict[str, Adapter] | The adapters that the Dataset recommends to use by default to change the format of the data for specific columns. |
zarr_root_path | str \| None | The data for any pointer column should be saved in the Zarr archive this path points to. |
readme | str | Markdown text that can be used to provide a formatted description of the dataset. If using the Polaris Hub, it is worth noting that this field is more easily edited through the Hub UI as it provides a rich text editor for writing markdown. |
annotations | dict[str, ColumnAnnotation] | Each column can be annotated with a |
source | HttpUrlString \| None | The data source, e.g. a DOI, Github repo or URI. |
license | SupportedLicenseType \| None | The dataset license. Polaris only supports some Creative Commons licenses. See |
curation_reference | HttpUrlString \| None | A reference to the curation process, e.g. a DOI, Github repo or URI. |
For additional meta-data attributes, see the BaseArtifactModel class.
Raises:
Type | Description |
---|---|
InvalidDatasetError | If the dataset does not conform to the Pydantic data-model specification. |
uses_zarr
property
Whether any of the data in this dataset is stored in a Zarr Archive.
zarr_data
property
Get the Zarr data.
This is different from the Zarr Root because, to optimize the efficiency of data loading, a user can choose to load the data into memory as a numpy array.
General purpose dataloader.
The goal with Polaris is to provide general purpose datasets that serve as good options for a wide variety of use cases. This also implies you should be able to optimize things further for a specific use case if needed.
zarr_root
property
Get the zarr Group object corresponding to the root.
Opens the zarr archive in read-write mode if it is not already open.
Different to zarr_data
The zarr_data attribute refers either to the Zarr archive or to an in-memory copy of the data. See also Dataset.load_to_memory.
dtypes
abstractmethod
property
Return the dtype for each of the columns of the dataset.
load_zarr_root_from_hub
abstractmethod
Loads a Zarr archive from the Polaris Hub.
load_zarr_root_from_local
Loads a locally stored Zarr archive.
We use memory mapping by default because our experiments show that it's consistently faster.
load_to_memory
Load data from Zarr files into memory.
Make sure the uncompressed dataset fits in-memory.
This method will load the uncompressed dataset into memory. Make sure you actually have enough memory to store the dataset.
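One way to check the memory warning above before loading is to estimate the uncompressed size from array shapes and element sizes. The helpers below are a back-of-the-envelope sketch with hypothetical names; they do not query a real Zarr archive, and the example shapes are made up.

```python
import math


def uncompressed_nbytes(shape, itemsize):
    """Size in bytes of a dense array with the given shape and per-element size."""
    return math.prod(shape) * itemsize


def fits_in_memory(arrays, available_bytes):
    """`arrays` is an iterable of (shape, itemsize) pairs."""
    total = sum(uncompressed_nbytes(shape, itemsize) for shape, itemsize in arrays)
    return total <= available_bytes


# Made-up example: 10,000 RGB images of 224 x 224 stored as 1-byte integers,
# plus 10,000 8-byte floating point targets.
arrays = [((10_000, 224, 224, 3), 1), ((10_000,), 8)]
total_bytes = sum(uncompressed_nbytes(shape, itemsize) for shape, itemsize in arrays)
```

The estimate ignores compression entirely, which is the relevant number here because the data is held uncompressed once it is in memory.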
get_data
abstractmethod
Since the dataset might contain pointers to external files, data retrieval is more complicated than just indexing the table attribute. This method provides an end-point for seamlessly accessing the underlying data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
row | str \| int | The row index in the | required |
col | str | The column index in the | required |
adapters | dict[str, Adapter] \| None | The adapters to apply to the data before returning it. If None, will use the default adapters specified for the dataset. | None |
Returns:
Type | Description |
---|---|
ndarray \| Any | A numpy array with the data at the specified indices. If the column is a pointer column, the content of the referenced file is loaded to memory. |
upload_to_hub
abstractmethod
Uploads the dataset to the Polaris Hub.
from_json
abstractmethod
classmethod
Loads a dataset from a JSON file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The path to the JSON file to load the dataset from. | required |
to_json
abstractmethod
Save the dataset to a destination directory as a JSON file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
destination | str \| Path | The directory to save the associated data to. | required |
if_exists | ZarrConflictResolution | Action for handling existing files in the Zarr archive. Options are 'raise' to throw an error, 'replace' to overwrite, or 'skip' to proceed without altering the existing files. | 'replace' |
Returns:
Type | Description |
---|---|
str | The path to the JSON file. |
polaris.dataset.ColumnAnnotation
Bases: BaseModel
The ColumnAnnotation class is used to annotate the columns of the Dataset object.
This mostly just stores meta-data and does not affect the logic. The exception is the is_pointer attribute.
Attributes:
Name | Type | Description |
---|---|---|
is_pointer | bool | Annotates whether a column is a pointer column. If so, it does not contain data, but rather contains references to blobs of data from which the data is loaded. |
modality | Modality | The data modality describes the data type and is used to categorize datasets on the Hub. It does not affect the logic in this library, but it does affect the logic of the Hub. |
description | str \| None | Describes how the data was generated. |
user_attributes | dict[str, str] | Any additional meta-data can be stored in the user attributes. |
content_type | KnownContentType \| str \| None | Specifies the column's IANA content type. If the content type matches a known type for molecules (e.g. "chemical/x-smiles"), visualization of its content will be activated on the Hub side. |
polaris.dataset.zarr
ZarrFileChecksum
Bases: BaseModel
This data is sent to the Hub to verify the integrity of the Zarr archive on upload.
Attributes:
Name | Type | Description |
---|---|---|
path | str | The path of the file relative to the Zarr root. |
md5sum | str | The md5sum of the file. |
size | int | The size of the file in bytes. |
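To make the three attributes concrete, the sketch below walks a directory and produces one entry with path, md5sum, and size per file. It is an illustration under assumptions, not the Polaris code: `build_manifest` is a hypothetical helper, entries are plain dicts instead of ZarrFileChecksum models, and each file is read fully into memory for simplicity.

```python
import hashlib
import os
import tempfile


def build_manifest(root):
    """Hypothetical helper: one {path, md5sum, size} entry per file under `root`."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            with open(full_path, "rb") as f:
                content = f.read()
            entries.append(
                {
                    # Path relative to the root, with forward slashes.
                    "path": os.path.relpath(full_path, root).replace(os.sep, "/"),
                    "md5sum": hashlib.md5(content).hexdigest(),
                    "size": len(content),
                }
            )
    return sorted(entries, key=lambda entry: entry["path"])


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a"))
    with open(os.path.join(root, ".zgroup"), "wb") as f:
        f.write(b"{}")
    with open(os.path.join(root, "a", "0"), "wb") as f:
        f.write(b"hello")
    manifest = build_manifest(root)
```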
MemoryMappedDirectoryStore
Bases: DirectoryStore
A Zarr Store to open chunks as memory-mapped files. See also this Github issue.
Memory mapping leverages low-level OS functionality to reduce the time it takes to read the content of a file by directly mapping to memory.
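The general mechanism can be seen with Python's standard mmap module. This is a minimal sketch of memory-mapped reading of a single file and does not show how MemoryMappedDirectoryStore itself is implemented; the file name and contents are made up.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    chunk_path = os.path.join(tmp, "chunk.bin")
    with open(chunk_path, "wb") as f:
        f.write(bytes(range(256)))  # stand-in for a chunk file

    with open(chunk_path, "rb") as f:
        # A length of 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Slicing copies only the requested bytes out of the mapping.
            window = bytes(mapped[16:20])
            mapped_size = len(mapped)
```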
compute_zarr_checksum
Implements an algorithm to compute the Zarr checksum.
This checksum is sensitive to Zarr configuration.
This checksum is sensitive to change in the Zarr structure. For example, if you change the chunk size, the checksum will also change.
To understand how this works, consider the following directory structure:
    . (root)
   / \
  a   c
 /
b
Within zarr, this would for example be:
- root: A Zarr Group with a single Array.
- a: A Zarr Array
- b: A single chunk of the Zarr Array
- c: A metadata file (i.e. .zarray, .zattrs or .zgroup)
To compute the checksum, we first find all the leaves of the tree, in this case b and c. We compute the hash of the content (the raw bytes) for each of these files.
We then work our way up the tree. For any node (directory), we find all children of that node. In sorted order, we then serialize a list with, for each of the children, the checksum, size, and number of children. The hash of the directory is then equal to the hash of the serialized JSON.
The Polaris implementation is heavily based on the zarr-checksum package.
This method is the biggest deviation from the original code.
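The bottom-up procedure can be sketched on an in-memory tree, where bytes are files and dicts are directories. This is a simplified illustration, not the Polaris or zarr-checksum implementation: the use of MD5, the inclusion of the child name in each entry, the exact JSON layout, and treating a directory's size as the sum of its children are all assumptions.

```python
import hashlib
import json


def checksum_node(node):
    """Return (checksum, size, n_children) for a file (bytes) or directory (dict)."""
    if isinstance(node, bytes):
        # Leaf: hash the raw bytes. A file has no children.
        return hashlib.md5(node).hexdigest(), len(node), 0

    entries = []
    total_size = 0
    for name in sorted(node):  # sorted order keeps the result deterministic
        digest, size, n_children = checksum_node(node[name])
        entries.append(
            {"name": name, "checksum": digest, "size": size, "children": n_children}
        )
        total_size += size

    # The directory hash is the hash of the serialized list of child entries.
    serialized = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.md5(serialized).hexdigest(), total_size, len(node)


# Same shape as the example tree: root -> {a -> {b}, c}
tree = {"a": {"b": b"chunk-bytes"}, "c": b'{"zarr_format": 2}'}
root_checksum, root_size, root_children = checksum_node(tree)

# Changing a single chunk changes the checksum of every ancestor directory.
modified = {"a": {"b": b"other-bytes"}, "c": b'{"zarr_format": 2}'}
modified_checksum, _, _ = checksum_node(modified)
```

Because every directory hash depends on the hashes of its children, two archives with the same root checksum have the same files with the same content in the same layout, up to hash collisions.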
generate_zarr_manifest
Entry point function which triggers the creation of a Zarr manifest for a V2 dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
zarr_root_path | str | The path to the root of a Zarr archive. | required |
output_dir | str | The path to the directory which will hold the generated manifest. | required |