Optimization
In short
This tutorial shows how to optimize a Polaris dataset to improve its efficiency.
No magic bullet
What works best really depends on the specific dataset you're working with, and you will benefit from trying out different ways of storing the data.
Datasets that fit in memory¶
Through the Polaris Subset class, we aim to provide a general-purpose data loader that serves as a good default for a variety of use cases.
As a dataset creator, it is important to be mindful of the design decisions you make, because these have the biggest impact on performance for your downstream users.
As a dataset user, you can use the Dataset.load_to_memory() method to load the uncompressed dataset into memory. This is limited, though, because there is only so much we can do automatically without risking data integrity.
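For example, a minimal sketch of what that looks like for a user (the dataset name here is made up for illustration):
import polaris as po
# Hypothetical dataset name, used only for illustration
dataset = po.load_dataset("some-org/some-dataset")
# Load the uncompressed data into memory to speed up later access
dataset.load_to_memory()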
Despite our best efforts to provide a data loader that is as efficient as possible, you will always be able to optimize things further for a specific use case if needed.
Without Zarr¶
Without pointer columns, the best way to optimize your dataset's performance is by making sure you use the appropriate dtype. A smaller memory footprint not only reduces storage requirements, but also speeds up moving data around (e.g. to the GPU or to create torch.Tensor objects).
import numpy as np
import pandas as pd
# Let's create a dummy dataset with two columns
rng = np.random.default_rng(0)
col_a = rng.choice(list(range(100)), 10000)
col_b = rng.random(10000)
table = pd.DataFrame({"A": col_a, "B": col_b})
By default, Pandas (and NumPy) use 64-bit dtypes (int64 and float64 on most platforms).
table.dtypes
A      int64
B    float64
dtype: object
table.memory_usage().sum()
160132
However, we know that column A only has values between 0 and 99, so we won't need the full int64 dtype. The np.int16 dtype is already more appropriate!
table["A"] = table["A"].astype(np.int16)
table.memory_usage().sum()
100132
We managed to reduce the number of bytes by ~60,000 (or 60 KB). That's 37.5% less!
Now imagine we were talking about a gigabyte-sized dataset!
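If you don't want to pick a dtype by hand, you can also let Pandas downcast for you: pd.to_numeric with downcast picks the smallest integer dtype that fits the values, which for values between 0 and 99 is even int8. A quick sketch:
# Let Pandas pick the smallest integer dtype that fits the values
table["A"] = pd.to_numeric(table["A"], downcast="integer")
table.memory_usage().sum()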
With Zarr¶
If part of the dataset is stored in a Zarr archive - and that Zarr archive fits in memory (remember to optimize the dtype) - the most efficient thing to do is to just convert from Zarr to a NumPy array. Zarr is not built to support this use case specifically and NumPy is optimized for it. For more information, see e.g. this GitHub issue.
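If you were to do this conversion by hand, it boils down to slicing the Zarr array into a NumPy array. A minimal sketch with a dummy array:
import numpy as np
import zarr
# A dummy Zarr array, just to illustrate the conversion
z = zarr.zeros((10000,), chunks=(1000,))
# Slicing a Zarr array loads it into an in-memory NumPy array
arr = z[:]
type(arr)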
Luckily, you don't have to do this yourself. You can use Polaris's Dataset.load_to_memory() method.
Let's again start by creating a dummy dataset!
import os
import zarr
from tempfile import mkdtemp
tmpdir = mkdtemp()
# For those familiar with Zarr: this is not optimized at all.
# If you didn't want to convert to NumPy, you would want to
# optimize the chunking / compression.
path = os.path.join(tmpdir, "data.zarr")
root = zarr.open(path, "w")
root.array("A", rng.random(10000))
root.array("B", rng.random(10000));
from polaris.dataset import create_dataset_from_file
root_path = os.path.join(tmpdir, "data", "data.zarr")
dataset = create_dataset_from_file(path, zarr_root_path=root_path)
from polaris.dataset import Subset
subset = Subset(dataset, np.arange(len(dataset)), "A", "B")
For the sake of this example, we will use PyTorch.
from torch.utils.data import DataLoader
dataloader = DataLoader(subset, batch_size=64, shuffle=True)
Let's see how fast this is!
%%timeit
for batch in dataloader:
    pass
1.45 s ± 22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's pretty slow... Let's see if Polaris's optimization helps.
dataset.load_to_memory()
%%timeit
for batch in dataloader:
    pass
99.4 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
That's a lot faster!
Now all that's left to do is to clean up the temporary directory.
from shutil import rmtree
rmtree(tmpdir)
Datasets that fit on a local disk¶
For datasets that don't fit in memory, but that can be stored on a local disk, the most impactful design decision is how the dataset is chunked.
Zarr datasets are chunked. When you try to load one piece of data, the entire chunk it is part of has to be loaded into memory and decompressed. Remember that in ML, data access is typically random, which is a terrible access pattern for chunked storage because you are likely to load and decompress the same chunks over and over.
The most efficient approach is thus to chunk the data such that each chunk contains only a single data point (see the sketch after this list).
- Benefit: You no longer pay a performance penalty for loading additional data into memory that you don't need.
- Downside: You might lose some compression, because the compressor can no longer exploit similarities across data points.
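To make this concrete, here is a sketch of what per-data-point chunking could look like in Zarr, assuming a dummy dataset of 10,000 data points with 100 features each:
import numpy as np
import zarr
data = np.random.default_rng(0).random((10000, 100))
# One chunk per data point: reading a single row only decompresses that row
z = zarr.array(data, chunks=(1, 100))
z.chunks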
A note on rechunking: Within Polaris, you do not have control over how a dataset on the Hub is chunked. In that case, rechunking is needed. This can induce a one-time, but nevertheless big, performance penalty (see also the Zarr docs). We don't expect this to be an issue in the short term given the size of the datasets we will be working with, but Zarr recommends using the rechunker Python package to improve performance.
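For reference, a sketch of what rechunking with the rechunker package could look like (the paths, chunk sizes and memory budget here are made up for the example):
# pip install rechunker
import zarr
from rechunker import rechunk
source = zarr.open("data.zarr")["X"]  # hypothetical source array
plan = rechunk(
    source,
    target_chunks=(1, 100),        # one data point per chunk
    max_mem="1GB",                 # memory budget for the rechunking job
    target_store="rechunked.zarr",
    temp_store="rechunk-tmp.zarr",
)
plan.execute()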
Remote Datasets¶
In this case, you benefit most from reducing the amount of data that has to be transferred, for example by trying out different compressors.
See also this article.
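As a starting point, here is a sketch of how you could try a different compressor when writing a Zarr array (this uses Blosc with Zstandard from numcodecs and Zarr's compressor argument; what works best depends on your data):
import numpy as np
import zarr
from numcodecs import Blosc
data = np.random.default_rng(0).random((10000, 100))
# Zstandard typically trades a bit of speed for a better compression ratio
compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE)
z = zarr.array(data, chunks=(1, 100), compressor=compressor)
# Compare the stored size across compressors to see what works best
z.nbytes_stored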
The End.