Zarr Datasets
In short
This tutorial shows how to create datasets with more advanced data modalities through the .zarr format.
Pointer columns
Not all data fits the tabular format, e.g. images or conformers. For such cases, we have pointer columns. Pointer columns do not contain the data itself, but rather store a reference to an external file from which the content can be loaded.
For now, we only support .zarr files as references. To learn more about .zarr, visit the Zarr documentation. Their tutorial specifically is a good read to better understand the main features.
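If you have never worked with Zarr before, the gist is that a Zarr archive is a hierarchy of groups and chunked, compressed N-dimensional arrays. A minimal sketch of the basics (an in-memory group, purely illustrative):
import zarr
import numpy as np

# A Zarr archive is a hierarchy of groups and chunked N-dimensional arrays
root = zarr.group()  # an in-memory group, used here only for illustration
arr = root.zeros("example", shape=(100, 100), chunks=(10, 10), dtype="f8")
arr[0, :] = np.arange(100)  # NumPy-style indexing; data is stored chunk by chunk
print(root.tree())  # prints the hierarchy, e.g. /example (100, 100) float64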
Dummy example
For the sake of simplicity, let's assume we have just two datapoints. We will use this to demonstrate the idea behind pointer columns.
import zarr
import platformdirs
import numpy as np
import datamol as dm
import pandas as pd
SAVE_DIR = dm.fs.join(platformdirs.user_cache_dir(appname="polaris-tutorials"), "dataset_zarr")
# Create two images and save them to a Zarr archive
base_path = dm.fs.join(SAVE_DIR, "data.zarr")
inp_col_name = "images"
images = np.random.random((2, 64, 64, 3))
root = zarr.open(base_path, "w")
root.array(inp_col_name, images)
<zarr.core.Array '/images' (2, 64, 64, 3) float64>
# Consolidate the dataset for efficient loading from the cloud bucket
zarr.consolidate_metadata(base_path)
<zarr.hierarchy.Group '/'>
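Consolidating the metadata merges all the .zarray / .zgroup files into a single key, so readers can discover the archive layout in one request rather than one per array or group, which mainly pays off on remote stores. A consolidated archive is opened with zarr.open_consolidated():
# Opening via the consolidated metadata avoids one request per metadata file,
# which speeds up loading from cloud buckets
root = zarr.open_consolidated(base_path)
root[inp_col_name].shape  # (2, 64, 64, 3)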
# For performance reasons, Polaris expects all data related to a column to be saved in a single Zarr array.
# To index a specific element in that array, the pointer path can have a suffix to specify the index.
train_path = f"{inp_col_name}#0"
test_path = f"{inp_col_name}#1"
tgt_col_name = "target"
table = pd.DataFrame(
{
inp_col_name: [train_path, test_path], # Instead of the content, we specify paths
tgt_col_name: np.random.random(2),
}
)
from polaris.dataset import Dataset, ColumnAnnotation
dataset = Dataset(
table=table,
# To indicate that we are dealing with a pointer column here,
# we need to annotate the column.
annotations={"images": ColumnAnnotation(is_pointer=True)},
# We also need to specify the path to the root of the Zarr archive
zarr_root_path=base_path,
)
Note how the table does not contain the image data, but rather stores a path relative to the root of the Zarr.
dataset.table.loc[0, "images"]
'images#0'
To load the data that is being pointed to, you can simply use the Dataset.get_data() utility method.
dataset.get_data(col="images", row=0).shape
(64, 64, 3)
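Conceptually, resolving a pointer just means splitting the path on # and indexing into the corresponding array. The sketch below illustrates the idea with a hypothetical helper; it is not how Polaris implements Dataset.get_data() internally.
# Hypothetical helper, for illustration only: resolve a pointer path by hand
def resolve_pointer(zarr_root, pointer):
    array_path, _, index = pointer.partition("#")  # e.g. "images#0"
    array = zarr_root[array_path]
    return array[int(index)] if index else array[:]

resolve_pointer(root, "images#0").shape  # (64, 64, 3)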
Creating a benchmark and the associated Subset objects will automatically load the data being pointed to!
from polaris.benchmark import SingleTaskBenchmarkSpecification
benchmark = SingleTaskBenchmarkSpecification(
dataset=dataset,
input_cols=inp_col_name,
target_cols=tgt_col_name,
metrics="mean_absolute_error",
split=([0], [1]),
)
train, test = benchmark.get_train_test_split()
for x, y in train:
# At this point, the content is loaded from the path specified in the table
print(x.shape)
(64, 64, 3)
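From here on, everything works as in the other benchmark tutorials. As a sketch, assume a trivial "model" that predicts the mean pixel value of each image (recall that the test subset yields inputs only):
# A trivial "model", for illustration only: predict the mean pixel value
y_pred = [x.mean() for x in test]
results = benchmark.evaluate(y_pred)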
Creating datasets from .zarr arrays
While the above example works, creating the table with all paths from scratch is time-consuming when datasets get large. Instead, you can also automatically parse a Zarr archive into the expected tabular data structure.
A Zarr archive can contain groups and arrays, where each group can again contain groups and arrays. Within Polaris, we expect the root to be a flat hierarchy that contains a single array per column.
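For example, a hypothetical two-column dataset could be backed by an archive like this (names and shapes are illustrative):
# A flat root with exactly one array per column (illustrative example)
multi_path = dm.fs.join(SAVE_DIR, "multi_column.zarr")
root = zarr.open(multi_path, "w")
root.array("column_a", np.random.random((10, 64, 64, 3)))  # e.g. images
root.array("column_b", np.random.random((10, 128)))  # e.g. embeddings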
A single array for all datapoints
Polaris expects a flat zarr hierarchy, with a single array per pointer column:
/
  column_a
Which will get parsed into a table like:
| column_a |
|---|
| column_a/array#1 |
| column_a/array#2 |
| ... |
| column_a/array#N |
Note
Notice the # suffix in the path, which indicates the index at which the datapoint is stored within the big array.
# Let's first create a dummy dataset with 1000 64x64 "images"
images = np.random.random((1000, 64, 64, 3))
path = dm.fs.join(SAVE_DIR, "zarr", "data.zarr")
root = zarr.open(path, "w")
root.array(inp_col_name, images)
To create a dataset from a Zarr archive, we can use the convenience function create_dataset_from_file().
from polaris.dataset import create_dataset_from_file
# Because Polaris might restructure the Zarr archive,
# we need to specify a location to save the Zarr file to.
dataset = create_dataset_from_file(path, zarr_root_path=dm.fs.join(SAVE_DIR, "zarr", "processed.zarr"))
# The pointer path is relative to the root of the Zarr archive
dataset.table.iloc[0][inp_col_name]
'images#0'
dataset.get_data(col=inp_col_name, row=0).shape
(64, 64, 3)
Saving the dataset
We can easily save the dataset to disk. All the pointer columns will be automatically updated.
savedir = dm.fs.join(SAVE_DIR, "json")
json_path = dataset.to_json(savedir)
2024-07-21 13:11:49.273 | INFO | polaris._mixins:md5sum:27 - Computing the checksum. This can be slow for large datasets.
Finding all files in the Zarr archive: 100%|██████████| 131/131 [00:00<00:00, 396.17it/s]
2024-07-21 13:11:49.616 | INFO | polaris.dataset._dataset:to_json:431 - Copying Zarr archive to /mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/json/data.zarr. This may take a while.
fs = dm.fs.get_mapper(path).fs
fs.ls(SAVE_DIR)
['/mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/json', '/mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/data.zarr', '/mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/zarr']
Besides the table.parquet and dataset.yaml, we can now also see a data.zarr folder, which stores the content of the pointer columns.
Load the dataset
Dataset.from_json(json_path)
2024-07-21 13:12:16.485 | INFO | polaris._mixins:md5sum:27 - Computing the checksum. This can be slow for large datasets.
Finding all files in the Zarr archive: 100%|██████████| 131/131 [00:00<00:00, 246.81it/s]
| Attribute | Value |
|---|---|
| name | None |
| description | |
| tags | |
| user_attributes | |
| owner | None |
| polaris_version | dev |
| default_adapters | |
| zarr_root_path | /mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/zarr/processed.zarr |
| readme | |
| annotations | |
| source | None |
| license | None |
| curation_reference | None |
| cache_dir | /mnt/ps/home/CORP/lu.zhu/.cache/polaris/datasets/97d642a2-001c-40aa-ac98-0e24353005d2 |
| md5sum | b7c52acfbda1f9bba47ae218e9c4717f |
| artifact_id | None |
| n_rows | 1000 |
| n_columns | 1 |
Upload the Zarr dataset to the Hub
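Uploading requires you to be authenticated with the Polaris Hub. If you have not logged in yet, you can do so through the client (the polaris login CLI command achieves the same):
# One-time authentication with the Polaris Hub
# (equivalent to running `polaris login` in a terminal)
from polaris.hub.client import PolarisHubClient

client = PolarisHubClient()
client.login()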
# Define the zarr dataset metadata before uploading
dataset.name = "tutorial_zarr"
dataset.license = "CC-BY-4.0"
dataset.source = "https://github.com/polaris-hub/polaris"
dataset.upload_to_hub(owner="polaris")
2024-07-21 13:19:12.188 | INFO | polaris.hub.client:upload_dataset:602 - Copying Zarr archive to the Hub. This may take a while.
✅ SUCCESS: Your dataset has been successfully uploaded to the Hub. View it here: https://polarishub.io/datasets/polaris/tutorial_zarr
The End.