The DatasetFactory makes it easier to create complex datasets.

It is based on the factory design pattern and lets you register handlers (i.e. Converter objects) for different file types. These converters turn file types commonly used in drug discovery into a format that Polaris can use, while losing as little information as possible.

In addition, it contains utility methods to incrementally build a dataset from different sources.
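The registration-and-dispatch idea can be sketched in plain Python. The ToyFactory class and its lambda converter are hypothetical stand-ins, not the Polaris implementation. They only show how a factory can map a file extension to a single handler and build up data incrementally.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical stand-in for a Converter: a callable that turns a path into rows.
ToyConverter = Callable[[str], List[dict]]


class ToyFactory:
    """Illustrative only; this is not the Polaris DatasetFactory."""

    def __init__(self) -> None:
        self._converters: Dict[str, ToyConverter] = {}
        self._rows: List[dict] = []

    def register_converter(self, ext: str, converter: ToyConverter) -> None:
        # One converter per extension: in this toy, re-registering replaces it.
        self._converters[ext] = converter

    def add_from_file(self, path: str) -> None:
        # Pick the handler based on the file extension (without the dot).
        ext = Path(path).suffix.lstrip(".")
        if ext not in self._converters:
            raise ValueError(f"No converter registered for extension '{ext}'")
        self._rows.extend(self._converters[ext](path))

    def build(self) -> List[dict]:
        # Return a copy of everything that has been added so far.
        return list(self._rows)

    def reset(self) -> None:
        # Clear the data but keep the registered converters.
        self._rows = []


factory = ToyFactory()
factory.register_converter("csv", lambda path: [{"source": path}])
factory.add_from_file("a.csv")
factory.add_from_file("b.csv")
result = factory.build()
```

Keeping the registry separate from the accumulated data is what allows a reset to start a new dataset without re-registering the handlers.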

Try quickly converting one of your datasets

The DatasetFactory is designed to give you full control. If your dataset is saved in a single file and you don't need anything fancy, you can try using create_dataset_from_file instead.

from polaris.dataset import create_dataset_from_file
dataset = create_dataset_from_file("path/to/my_dataset.sdf")

How to make adding meta-data easier?

The DatasetFactory is designed to make it easier to pull together data from different sources. However, adding meta-data remains a laborious process. How could we make this simpler through the Python API?

zarr_root_path property

zarr_root_path: Group

The root of the zarr archive for the Dataset that is being built. All data for a single dataset is expected to be stored in the same Zarr archive.

register_converter(ext: str, converter: Converter)

Registers a new converter for a specific file type.


| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ext | str | The file extension for which the converter should be used. There can only be a single converter per file extension. | required |
| converter | Converter | The handler for the file type. This should convert the file to a Polaris-compatible format. | required |



add_column(column: pd.Series, annotation: Optional[ColumnAnnotation] = None, adapters: Optional[Adapter] = None)

Adds a single column to the DataFrame.

We require:

  1. The name attribute of the column to be set.
  2. The name attribute of the column to be unique.
  3. If the column is a pointer column, the zarr_root_path needs to be set.
  4. The length of the column to match the length of the already constructed table.
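The four requirements above can be expressed as simple checks. The check_column helper and its is_pointer flag are hypothetical and only mirror the list; they are not how Polaris performs its validation internally.

```python
from typing import Optional

import pandas as pd


def check_column(
    table: pd.DataFrame,
    column: pd.Series,
    zarr_root_path: Optional[str],
    is_pointer: bool = False,
) -> None:
    """Hypothetical helper mirroring the four listed requirements."""
    # 1. The column must have a name.
    if column.name is None:
        raise ValueError("The column needs a name.")
    # 2. The name must not clash with an existing column.
    if column.name in table.columns:
        raise ValueError(f"A column named '{column.name}' already exists.")
    # 3. Pointer columns need somewhere to point to.
    if is_pointer and zarr_root_path is None:
        raise ValueError("Pointer columns need zarr_root_path to be set.")
    # 4. The length must match the table built so far (if it has any columns).
    if len(table.columns) > 0 and len(column) != len(table):
        raise ValueError("The column length does not match the table length.")


table = pd.DataFrame({"smiles": ["CCO", "CCN", "CCC"]})

# A named, unique, correctly sized column passes without raising.
check_column(table, pd.Series([0.1, 0.2, 0.3], name="logp"), zarr_root_path=None)
```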


| Name | Type | Description | Default |
| --- | --- | --- | --- |
| column | Series | The column to add to the dataset. | required |
| annotation | Optional[ColumnAnnotation] | The annotation for the column. If None, a default annotation will be used. | None |



add_columns(df: pd.DataFrame, annotations: Optional[Dict[str, ColumnAnnotation]] = None, adapters: Optional[Dict[str, Adapter]] = None, merge_on: Optional[str] = None)

Adds multiple columns to the dataset based on another DataFrame.

To have more control over how the two DataFrames are combined, you can specify a column to merge on. This will always do an outer join.

If you do not specify a key to merge on, the columns are simply added to the dataset that has been built so far, without any reordering. They are therefore expected to meet the same requirements as for add_column.
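To illustrate what an outer join on a key column does, here is a small sketch in plain pandas. The frames and the mol_id column are made up for the example, and the use of DataFrame.merge is only an illustration of outer-join semantics, not a statement about how the factory merges internally.

```python
import pandas as pd

# Made-up data: the table built so far and a frame with new columns.
built_so_far = pd.DataFrame({"mol_id": ["a", "b", "c"], "logp": [0.1, 0.2, 0.3]})
new_columns = pd.DataFrame({"mol_id": ["b", "c", "d"], "solubility": [1.0, 2.0, 3.0]})

# An outer join keeps every key from both frames and fills the gaps with NaN.
merged = built_so_far.merge(new_columns, on="mol_id", how="outer")
```

Because no key is dropped, rows that exist in only one of the two frames end up with missing values in the columns contributed by the other.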


| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | A Pandas DataFrame with the columns that we want to add to the dataset. | required |
| annotations | Optional[Dict[str, ColumnAnnotation]] | The annotations for the columns. If None, default annotations will be used. | None |
| merge_on | Optional[str] | The column to merge on, if any. | None |



add_from_file(path: str)

Uses the registered converters to parse the data from a specific file and add it to the dataset. If no converter is found for the file extension, it raises an error.


| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | The path to the file that should be parsed. | required |



add_from_files(paths: List[str], axis: Literal[0, 1, 'index', 'columns'])

Uses the registered converters to parse the data from multiple files and add it to the dataset. If no converter is found for a file extension, it raises an error.


| Name | Type | Description | Default |
| --- | --- | --- | --- |
| paths | List[str] | The list of paths that should be parsed. | required |
| axis | Literal[0, 1, 'index', 'columns'] | Axis along which the files should be added. 0 or 'index': append the files as rows; the files must be of the same type. 1 or 'columns': append the files as columns; the files can be of different types. | required |
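The two axis options follow the usual pandas convention. The sketch below uses pandas.concat on made-up frames that stand in for already parsed files; it only illustrates the row-wise versus column-wise semantics and is not the factory's own code path.

```python
import pandas as pd

# Made-up frames standing in for the parsed contents of three files.
file_a = pd.DataFrame({"smiles": ["CCO", "CCN"], "logp": [0.1, 0.2]})
file_b = pd.DataFrame({"smiles": ["CCC"], "logp": [0.3]})
file_c = pd.DataFrame({"solubility": [1.0, 2.0]})

# axis=0 (or "index"): stack files with the same columns as additional rows.
rows_appended = pd.concat([file_a, file_b], axis=0, ignore_index=True)

# axis=1 (or "columns"): place files side by side as additional columns.
columns_appended = pd.concat([file_a, file_c], axis=1)
```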



build() -> DatasetV1

Returns a Dataset based on the current state of the factory.
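A typical register, add, build sequence might look like the sketch below. The import locations of DatasetFactory and SDFConverter, the zarr_root_path constructor argument, and the build_example_dataset wrapper are assumptions not confirmed by this page; check them against your installed Polaris version. The function is only defined here, not called.

```python
def build_example_dataset(sdf_path: str, zarr_root_path: str):
    """Sketch of the register -> add -> build workflow (hypothetical wrapper)."""
    # Assumed import locations; verify against your installed Polaris version.
    from polaris.dataset import DatasetFactory
    from polaris.dataset.converters import SDFConverter

    # Assumption: the constructor accepts zarr_root_path, mirroring reset().
    factory = DatasetFactory(zarr_root_path=zarr_root_path)

    # Register a handler for .sdf files, parse one file, and build the dataset.
    factory.register_converter("sdf", SDFConverter())
    factory.add_from_file(sdf_path)
    return factory.build()
```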


reset(zarr_root_path: Optional[str] = None)

Resets the factory to its initial state to start building the next dataset from scratch. Note that this will not reset the registered converters.


| Name | Type | Description | Default |
| --- | --- | --- | --- |
| zarr_root_path | Optional[str] | The root path of the zarr hierarchy. If you want to use pointer columns for your next dataset, this argument needs to be passed. | None |



create_dataset_from_file(path: str, zarr_root_path: Optional[str] = None) -> DatasetV1

A convenience function to create a dataset from a single file.

It sets up the dataset factory with sensible defaults for the converters. For creating more complicated datasets, please use the DatasetFactory directly.


create_dataset_from_files(paths: List[str], zarr_root_path: Optional[str] = None, axis: Literal[0, 1, 'index', 'columns'] = 0) -> DatasetV1

A convenience function to create a dataset from multiple files.

It sets up the dataset factory with sensible defaults for the converters. For creating more complicated datasets, please use the DatasetFactory directly.


| Name | Type | Description | Default |
| --- | --- | --- | --- |
| axis | Literal[0, 1, 'index', 'columns'] | Axis along which the files should be added. 0 or 'index': append the files as rows; the files must be of the same type. 1 or 'columns': append the files as columns; the files can be of different types. | 0 |