Zarr Datasets
In short
This tutorial shows how to create datasets with more advanced data modalities through the .zarr format.
Pointer columns
Not all data fits the tabular format, e.g. images or conformers. For such cases, we have pointer columns. Pointer columns do not contain the data itself, but rather store a reference to an external file from which the content can be loaded.
For now, we only support .zarr files as references. To learn more about .zarr, visit the Zarr documentation. Their tutorial specifically is a good read to better understand the main features.
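If you have never worked with Zarr before, the gist is that a Zarr archive is a hierarchy of groups and chunked, compressed N-dimensional arrays. A minimal sketch of the basics (an in-memory group, purely illustrative):
import zarr
import numpy as np

# A Zarr archive is a hierarchy of groups and chunked N-dimensional arrays
root = zarr.group()  # an in-memory group, used here only for illustration
arr = root.zeros("example", shape=(100, 100), chunks=(10, 10), dtype="f8")
arr[0, :] = np.arange(100)  # NumPy-style indexing; data is stored chunk by chunk
print(root.tree())  # prints the hierarchy, e.g. /example (100, 100) float64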
Dummy example
For the sake of simplicity, let's assume we have just two datapoints. We will use this to demonstrate the idea behind pointer columns.
import zarr
import platformdirs
import numpy as np
import datamol as dm
import pandas as pd
SAVE_DIR = dm.fs.join(platformdirs.user_cache_dir(appname="polaris-tutorials"), "dataset_zarr")
# Create two images and save them to a Zarr archive
base_path = dm.fs.join(SAVE_DIR, "data.zarr")
inp_col_name = "images"
images = np.random.random((2, 64, 64, 3))
root = zarr.open(base_path, "w")
root.array(inp_col_name, images)
<zarr.core.Array '/images' (2, 64, 64, 3) float64>
# Consolidate the dataset for efficient loading from the cloud bucket
zarr.consolidate_metadata(base_path)
<zarr.hierarchy.Group '/'>
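Consolidating the metadata merges all the .zarray / .zgroup files into a single key, so readers can discover the archive layout in one request rather than one per array or group, which mainly pays off on remote stores. A consolidated archive is opened with zarr.open_consolidated():
# Opening via the consolidated metadata avoids one request per metadata file,
# which speeds up loading from cloud buckets
root = zarr.open_consolidated(base_path)
root[inp_col_name].shape  # (2, 64, 64, 3)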
# For performance reasons, Polaris expects all data related to a column to be saved in a single Zarr array.
# To index a specific element in that array, the pointer path can have a suffix to specify the index.
train_path = f"{inp_col_name}#0"
test_path = f"{inp_col_name}#1"
tgt_col_name = "target"
table = pd.DataFrame(
{
inp_col_name: [train_path, test_path], # Instead of the content, we specify paths
tgt_col_name: np.random.random(2),
}
)
from polaris.dataset import Dataset, ColumnAnnotation
dataset = Dataset(
table=table,
# To indicate that we are dealing with a pointer column here,
# we need to annotate the column.
annotations={"images": ColumnAnnotation(is_pointer=True)},
# We also need to specify the path to the root of the Zarr archive
zarr_root_path=base_path,
)
Note how the table does not contain the image data, but rather stores a path relative to the root of the Zarr.
dataset.table.loc[0, "images"]
'images#0'
To load the data that is being pointed to, you can simply use the Dataset.get_data() utility method.
dataset.get_data(col="images", row=0).shape
(64, 64, 3)
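Conceptually, resolving a pointer just means splitting the path on # and indexing into the corresponding array. The sketch below illustrates the idea with a hypothetical helper; it is not how Polaris implements Dataset.get_data() internally.
# Hypothetical helper, for illustration only: resolve a pointer path by hand
def resolve_pointer(zarr_root, pointer):
    array_path, _, index = pointer.partition("#")  # e.g. "images#0"
    array = zarr_root[array_path]
    return array[int(index)] if index else array[:]

resolve_pointer(root, "images#0").shape  # (64, 64, 3)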
Creating a benchmark and the associated Subset objects will automatically load the data being pointed to!
from polaris.benchmark import SingleTaskBenchmarkSpecification
benchmark = SingleTaskBenchmarkSpecification(
dataset=dataset,
input_cols=inp_col_name,
target_cols=tgt_col_name,
metrics="mean_absolute_error",
split=([0], [1]),
)
train, test = benchmark.get_train_test_split()
for x, y in train:
# At this point, the content is loaded from the path specified in the table
print(x.shape)
(64, 64, 3)
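From here on, everything works as in the other benchmark tutorials. As a sketch, assume a trivial "model" that predicts the mean pixel value of each image (recall that the test subset yields inputs only):
# A trivial "model", for illustration only: predict the mean pixel value
y_pred = [x.mean() for x in test]
results = benchmark.evaluate(y_pred)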
Creating datasets from .zarr arrays
While the above example works, creating the table with all paths from scratch is time-consuming when datasets get large. Instead, you can also automatically parse a Zarr archive into the expected tabular data structure.
A Zarr archive can contain groups and arrays, where each group can again contain groups and arrays. Within Polaris, we expect the root to be a flat hierarchy that contains a single array per column.
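For example, a hypothetical two-column dataset could be backed by an archive like this (names and shapes are illustrative):
# A flat root with exactly one array per column (illustrative example)
multi_path = dm.fs.join(SAVE_DIR, "multi_column.zarr")
root = zarr.open(multi_path, "w")
root.array("column_a", np.random.random((10, 64, 64, 3)))  # e.g. images
root.array("column_b", np.random.random((10, 128)))  # e.g. embeddings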
A single array for all datapoints
Polaris expects a flat zarr hierarchy, with a single array per pointer column:
/
  column_a
Which will get parsed into a table like:
| column_a |
|---|
| column_a/array#1 |
| column_a/array#2 |
| ... |
| column_a/array#N |
Note
Notice the # suffix in the path, which indicates the index at which the datapoint is stored within the big array.
# Let's first create a dummy dataset with 1000 64x64 "images"
images = np.random.random((1000, 64, 64, 3))
path = dm.fs.join(SAVE_DIR, "zarr", "data.zarr")
root = zarr.open(path, "w")
root.array(inp_col_name, images)
To create a dataset from a Zarr archive, we can use the convenience function create_dataset_from_file().
from polaris.dataset import create_dataset_from_file
# Because Polaris might restructure the Zarr archive,
# we need to specify a location to save the Zarr file to.
dataset = create_dataset_from_file(path, zarr_root_path=dm.fs.join(SAVE_DIR, "zarr", "processed.zarr"))
# The pointer path is relative to the root of the Zarr archive
dataset.table.iloc[0][inp_col_name]
'images#0'
dataset.get_data(col=inp_col_name, row=0).shape
(64, 64, 3)
Saving the dataset
We can easily save the dataset to disk. All the pointer columns will be automatically updated.
savedir = dm.fs.join(SAVE_DIR, "json")
json_path = dataset.to_json(savedir)
2024-07-21 13:11:49.273 | INFO | polaris._mixins:md5sum:27 - Computing the checksum. This can be slow for large datasets.
Finding all files in the Zarr archive: 100%|██████████| 131/131 [00:00<00:00, 396.17it/s]
2024-07-21 13:11:49.616 | INFO | polaris.dataset._dataset:to_json:431 - Copying Zarr archive to /mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/json/data.zarr. This may take a while.
fs = dm.fs.get_mapper(path).fs
fs.ls(SAVE_DIR)
['/mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/json', '/mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/data.zarr', '/mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/zarr']
Besides the table.parquet and dataset.yaml, we can now also see a data.zarr folder, which stores the content of the pointer columns.
Load the dataset
Dataset.from_json(json_path)
2024-07-21 13:12:16.485 | INFO | polaris._mixins:md5sum:27 - Computing the checksum. This can be slow for large datasets.
Finding all files in the Zarr archive: 100%|██████████| 131/131 [00:00<00:00, 246.81it/s]
| Attribute | Value |
|---|---|
| name | None |
| description | |
| tags | |
| user_attributes | |
| owner | None |
| polaris_version | dev |
| default_adapters | |
| zarr_root_path | /mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/zarr/processed.zarr |
| readme | |
| annotations | |
| source | None |
| license | None |
| curation_reference | None |
| cache_dir | /mnt/ps/home/CORP/lu.zhu/.cache/polaris/datasets/97d642a2-001c-40aa-ac98-0e24353005d2 |
| md5sum | b7c52acfbda1f9bba47ae218e9c4717f |
| artifact_id | None |
| n_rows | 1000 |
| n_columns | 1 |
Upload the Zarr dataset to the Hub
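Uploading requires you to be authenticated with the Polaris Hub. If you have not logged in yet, you can do so through the client (the polaris login CLI command achieves the same):
# One-time authentication with the Polaris Hub
# (equivalent to running `polaris login` in a terminal)
from polaris.hub.client import PolarisHubClient

client = PolarisHubClient()
client.login()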
# Define the zarr dataset metadata before uploading
dataset.name = "tutorial_zarr"
dataset.license = "CC-BY-4.0"
dataset.source = "https://github.com/polaris-hub/polaris"
dataset.upload_to_hub(owner="polaris")
2024-07-21 13:19:12.188 | INFO | polaris.hub.client:upload_dataset:602 - Copying Zarr archive to the Hub. This may take a while.
✅ SUCCESS: Your dataset has been successfully uploaded to the Hub. View it here: https://polarishub.io/datasets/polaris/tutorial_zarr
The End.