Dataset Factory

polaris.dataset.DatasetFactory

The DatasetFactory makes it easier to create complex datasets.

It is based on the factory design pattern and lets you register handlers (i.e. Converter objects) for different file types. These converters turn file types commonly used in drug discovery into something that can be used within Polaris, while losing as little information as possible.

In addition, it contains utility methods to incrementally build out a dataset from different sources.

Try quickly converting one of your datasets

The DatasetFactory is designed to give you full control. If your dataset is saved in a single file and you don't need anything fancy, you can try using create_dataset_from_file instead.

from polaris.dataset import create_dataset_from_file
dataset = create_dataset_from_file("path/to/my_dataset.sdf")
How to make adding meta-data easier?

The DatasetFactory is designed to make it easier to pull together data from different sources. However, adding meta-data remains a laborious process. How could we make this simpler through the Python API?

zarr_root_path property

zarr_root_path: Group

The root of the zarr archive for the Dataset that is being built. All data for a single dataset is expected to be stored in the same Zarr archive.

zarr_root property

zarr_root: Group

The root of the zarr archive for the Dataset that is being built. All data for a single dataset is expected to be stored in the same Zarr archive.

register_converter

register_converter(ext: str, converter: Converter)

Registers a new converter for a specific file type.

Parameters:

ext (str, required): The file extension for which the converter should be used. There can only be a single converter per file extension.

converter (Converter, required): The handler for the file type. This should convert the file to a Polaris-compatible format.
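
The one-converter-per-extension rule can be pictured as a dictionary keyed by file extension. The sketch below is a toy model and not the Polaris implementation: ToyRegistry, first_converter and second_converter are hypothetical names, converters are modelled as plain callables, and it assumes that registering a second converter for the same extension replaces the first.

```python
from typing import Callable, Dict, List

# A converter is modelled here as any callable that turns a file path into rows.
Converter = Callable[[str], List[dict]]


class ToyRegistry:
    """Toy stand-in for the converter bookkeeping of a factory (hypothetical)."""

    def __init__(self) -> None:
        self._converters: Dict[str, Converter] = {}

    def register_converter(self, ext: str, converter: Converter) -> None:
        # A dict holds one value per key, so a later registration for the
        # same extension replaces the earlier one. This is an assumption
        # about the behaviour, not something the reference spells out.
        self._converters[ext] = converter

    def converter_for(self, ext: str) -> Converter:
        return self._converters[ext]


def first_converter(path: str) -> List[dict]:
    return [{"source": path, "parsed_by": "first"}]


def second_converter(path: str) -> List[dict]:
    return [{"source": path, "parsed_by": "second"}]


registry = ToyRegistry()
registry.register_converter("sdf", first_converter)
registry.register_converter("sdf", second_converter)

# Only the most recently registered converter is used for "sdf" files.
rows = registry.converter_for("sdf")("molecules.sdf")
```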

add_column

add_column(column: pd.Series, annotation: Optional[ColumnAnnotation] = None, adapters: Optional[Adapter] = None)

Adds a single column to the DataFrame.

We require:

  1. The name attribute of the column to be set.
  2. The name attribute of the column to be unique.
  3. If the column is a pointer column, the zarr_root_path needs to be set.
  4. The length of the column to match the length of the already constructed table.

Parameters:

column (Series, required): The column to add to the dataset.

annotation (Optional[ColumnAnnotation], default: None): The annotation for the column. If None, a default annotation will be used.
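
The four requirements for add_column can be sketched as a series of checks. This is an illustration on plain Python lists rather than pandas objects: check_new_column is a hypothetical helper, the use of ValueError is an assumption, and the numeric values are made up.

```python
from typing import Dict, List, Optional


def check_new_column(
    table: Dict[str, List[object]],
    name: Optional[str],
    values: List[object],
    is_pointer: bool = False,
    zarr_root_path: Optional[str] = None,
) -> None:
    """Raise ValueError if a column breaks one of the four requirements."""
    if name is None:
        raise ValueError("The column needs a name.")
    if name in table:
        raise ValueError(f"A column named {name!r} already exists.")
    if is_pointer and zarr_root_path is None:
        raise ValueError("Pointer columns need zarr_root_path to be set.")
    # Lengths can only be compared once the table has at least one column.
    if table:
        expected = len(next(iter(table.values())))
        if len(values) != expected:
            raise ValueError(f"Expected {expected} values, got {len(values)}.")


table = {"smiles": ["CCO", "c1ccccc1", "CC(=O)O"]}

# A named, unique column of matching length passes all checks.
check_new_column(table, "logp", [-0.31, 2.13, -0.17])
table["logp"] = [-0.31, 2.13, -0.17]

# Each of these candidates violates one requirement.
errors = []
for name, values in [(None, [1, 2, 3]), ("logp", [1, 2, 3]), ("mw", [46.07])]:
    try:
        check_new_column(table, name, values)
    except ValueError as exc:
        errors.append(str(exc))
```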

add_columns

add_columns(df: pd.DataFrame, annotations: Optional[Dict[str, ColumnAnnotation]] = None, adapters: Optional[Dict[str, Adapter]] = None, merge_on: Optional[str] = None)

Add multiple columns to the dataset based on another dataframe.

To have more control over how the two dataframes are combined, you can specify a column to merge on. This will always do an outer join.

If you do not specify a key to merge on, the columns are simply added to the dataset that has been built so far, without any reordering. They must therefore meet all the same requirements as for add_column.

Parameters:

df (DataFrame, required): A Pandas DataFrame with the columns that we want to add to the dataset.

annotations (Optional[Dict[str, ColumnAnnotation]], default: None): The annotations for the columns. If None, default annotations will be used.

merge_on (Optional[str], default: None): The column to merge on, if any.
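
To make the effect of an outer join concrete, here is a minimal pure-Python sketch. It is not how Polaris performs the merge: outer_join is a hypothetical helper that assumes the key is unique within each input, and the values are made up. With pandas, the analogous operation is DataFrame.merge with how="outer".

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def outer_join(left: List[Row], right: List[Row], on: str) -> List[Row]:
    """Outer-join two row lists on a key that is unique within each list."""
    # Collect every column name that appears on either side.
    columns: List[str] = []
    for row in left + right:
        for col in row:
            if col not in columns:
                columns.append(col)

    merged: Dict[object, Row] = {}
    for row in left + right:
        # Start from an all-None row the first time a key is seen, then
        # fill in whatever values this row provides.
        target = merged.setdefault(row[on], {col: None for col in columns})
        target.update(row)
    return list(merged.values())


built_so_far = [
    {"smiles": "CCO", "logp": -0.31},
    {"smiles": "CCN", "logp": -0.13},
]
new_columns = [
    {"smiles": "CCO", "mw": 46.07},
    {"smiles": "CCC", "mw": 44.10},
]

result = outer_join(built_so_far, new_columns, on="smiles")
```

Rows that appear on only one side are kept, and the columns they lack are filled with None. An outer join can therefore increase the number of rows in the table.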

add_from_file

add_from_file(path: str)

Uses the registered converters to parse the data from a specific file and add it to the dataset. If no converter is found for the file extension, it raises an error.

Parameters:

path (str, required): The path to the file that should be parsed.
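
The lookup step described above can be sketched as follows. pick_converter and parse_sdf are hypothetical names, and two details are assumptions rather than documented behaviour: that extensions are stored without the leading dot, and that the error raised is a ValueError.

```python
from pathlib import Path
from typing import Callable, Dict

Converter = Callable[[str], object]


def pick_converter(path: str, converters: Dict[str, Converter]) -> Converter:
    """Return the converter registered for the extension of `path`."""
    # Path.suffix yields the final extension including the dot, e.g. ".sdf".
    ext = Path(path).suffix.lstrip(".")
    if ext not in converters:
        raise ValueError(f"No converter registered for extension {ext!r}.")
    return converters[ext]


def parse_sdf(path: str) -> object:
    # Placeholder: a real converter would read and convert the file here.
    return {"parsed": path}


converters = {"sdf": parse_sdf}
chosen = pick_converter("data/molecules.sdf", converters)

try:
    pick_converter("data/molecules.xyz", converters)
    message = ""
except ValueError as exc:
    message = str(exc)
```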

build

build() -> Dataset

Returns a Dataset based on the current state of the factory.

reset

reset(zarr_root_path: Optional[str] = None)

Resets the factory to its initial state to start building the next dataset from scratch. Note that this will not reset the registered converters.

Parameters:

zarr_root_path (Optional[str], default: None): The root path of the zarr hierarchy. If you want to use pointer columns for your next dataset, this argument needs to be passed.
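
The split between per-dataset state and registered converters can be illustrated with a toy class. ToyFactory and its attributes are hypothetical and only mirror the behaviour described for reset: the table and the zarr root path are cleared, while converters survive.

```python
from typing import Callable, Dict, List, Optional


class ToyFactory:
    """Toy model of a factory whose reset keeps the registered converters."""

    def __init__(self, zarr_root_path: Optional[str] = None) -> None:
        self._converters: Dict[str, Callable] = {}
        self.reset(zarr_root_path)

    def register_converter(self, ext: str, converter: Callable) -> None:
        self._converters[ext] = converter

    def add_column(self, name: str, values: List[object]) -> None:
        self._columns[name] = values

    def reset(self, zarr_root_path: Optional[str] = None) -> None:
        # Per-dataset state is discarded ...
        self._columns: Dict[str, List[object]] = {}
        self.zarr_root_path = zarr_root_path
        # ... while self._converters is deliberately left untouched.


factory = ToyFactory(zarr_root_path="first_dataset.zarr")
factory.register_converter("sdf", lambda path: path)
factory.add_column("smiles", ["CCO", "CCN"])

# Start the next dataset; no path is passed, so pointer columns
# would not be usable until reset is called again with one.
factory.reset()
```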

polaris.dataset.create_dataset_from_file

create_dataset_from_file(path: str, zarr_root_path: Optional[str] = None) -> Dataset

A convenience function that creates a dataset from a single file.

It sets up the dataset factory with sensible defaults for the converters. For creating more complicated datasets, please use the DatasetFactory directly.