Factory

polaris.dataset.DatasetFactory

The DatasetFactory makes it easier to create complex datasets.

It is based on the factory design pattern and allows a user to specify specific handlers (i.e. Converter objects) for different file types. These converters are used to convert commonly used file types in drug discovery to something that can be used within Polaris while losing as little information as possible.

In addition, it contains utility methods to incrementally build out a dataset from different sources.

Try quickly converting one of your datasets

The DatasetFactory is designed to give you full control. If your dataset is saved in a single file and you don't need anything fancy, you can try using create_dataset_from_file instead.

from polaris.dataset import create_dataset_from_file
dataset = create_dataset_from_file("path/to/my_dataset.sdf")

How to make adding metadata easier?

The DatasetFactory is designed to make it easier to pull together data from different sources. However, adding metadata remains a laborious process. How could we make this simpler through the Python API?

zarr_root_path `property`

zarr_root_path: Group

The root of the zarr archive for the Dataset that is being built. All data for a single dataset is expected to be stored in the same Zarr archive.

zarr_root `property`

zarr_root: Group

The root of the zarr archive for the Dataset that is being built. All data for a single dataset is expected to be stored in the same Zarr archive.

register_converter

register_converter(ext: str, converter: Converter)

Registers a new converter for a specific file type.

Parameters:

Name	Type	Description	Default
`ext`	`str`	The file extension for which the converter should be used. There can only be a single converter per file extension.	required
`converter`	`Converter`	The handler for the file type. This should convert the file to a Polaris-compatible format.	required
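
To illustrate the idea of one handler per file extension, here is a minimal plain-Python sketch of an extension-keyed registry. It is not Polaris's implementation: `ConverterRegistry`, `fake_sdf_converter`, and the choice to let a later registration replace an earlier one are all assumptions made for this example.

```python
from pathlib import Path
from typing import Callable

# Hypothetical stand-in for a converter: takes a file path, returns rows.
Converter = Callable[[str], list[dict]]


class ConverterRegistry:
    """Toy registry that maps a file extension to exactly one handler."""

    def __init__(self) -> None:
        self._converters: dict[str, Converter] = {}

    def register(self, ext: str, converter: Converter) -> None:
        # Assumption of this sketch: registering an extension again
        # replaces the previous handler, so there is only ever one.
        self._converters[ext.lower().lstrip(".")] = converter

    def convert(self, path: str) -> list[dict]:
        ext = Path(path).suffix.lower().lstrip(".")
        if ext not in self._converters:
            raise ValueError(f"No converter registered for '.{ext}' files")
        return self._converters[ext](path)


def fake_sdf_converter(path: str) -> list[dict]:
    # A real converter would parse the file; this one only records the source.
    return [{"source": path, "format": "sdf"}]


registry = ConverterRegistry()
registry.register("sdf", fake_sdf_converter)
rows = registry.convert("molecules.sdf")
```

Keying on the extension keeps the lookup trivial, at the cost of not being able to distinguish two formats that share an extension.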

add_column

add_column(
    column: Series,
    annotation: ColumnAnnotation | None = None,
    adapters: Adapter | None = None,
)

Adds a single column to the DataFrame.

We require:

The name attribute of the column to be set.
The name attribute of the column to be unique.
If the column is a pointer column, the zarr_root_path needs to be set.
The length of the column to match the length of the already constructed table.

Parameters:

Name	Type	Description	Default
`column`	`Series`	The column to add to the dataset.	required
`annotation`	`ColumnAnnotation \| None`	The annotation for the column. If None, a default annotation will be used.	`None`
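
The four requirements listed for add_column can be sketched as a sequence of checks. This is a plain-Python illustration under stated assumptions, not the library's code: `Column` and `TableBuilder` are hypothetical stand-ins for a pandas Series and the factory, and the error messages are invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Column:
    """Hypothetical stand-in for a named column of values."""
    name: Optional[str]
    values: list
    is_pointer: bool = False


@dataclass
class TableBuilder:
    """Toy builder that enforces the four requirements described above."""
    zarr_root_path: Optional[str] = None
    columns: dict = field(default_factory=dict)

    def add_column(self, column: Column) -> None:
        if column.name is None:
            raise ValueError("The column needs a name")
        if column.name in self.columns:
            raise ValueError(f"A column named '{column.name}' already exists")
        if column.is_pointer and self.zarr_root_path is None:
            raise ValueError("Pointer columns require zarr_root_path to be set")
        if self.columns:
            # Compare against the length of any column that was already added.
            expected = len(next(iter(self.columns.values())))
            if len(column.values) != expected:
                raise ValueError(f"Expected {expected} rows, got {len(column.values)}")
        self.columns[column.name] = column.values


builder = TableBuilder()
builder.add_column(Column("smiles", ["CCO", "c1ccccc1"]))
builder.add_column(Column("activity", [1.0, 2.0]))  # placeholder values
```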

add_columns

add_columns(
    df: DataFrame,
    annotations: dict[str, ColumnAnnotation] | None = None,
    adapters: dict[str, Adapter] | None = None,
    merge_on: str | None = None,
)

Add multiple columns to the dataset based on another dataframe.

To have more control over how the two dataframes are combined, you can specify a column to merge on. This will always do an outer join.

If not specifying a key to merge on, the columns will simply be added to the dataset that has been built so far without any reordering. They are therefore expected to meet all the same expectations as for add_column().

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A Pandas DataFrame with the columns that we want to add to the dataset.	required
`annotations`	`dict[str, ColumnAnnotation] \| None`	The annotations for the columns. If None, default annotations will be used.	`None`
`merge_on`	`str \| None`	The column to merge on, if any.	`None`
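
Because merging on a key always performs an outer join, rows whose key appears in only one of the two tables are kept and the missing cells stay empty. The sketch below shows that behaviour in plain Python, comparable to what pandas `merge(..., how="outer")` does. The `outer_join` helper and the sample rows are hypothetical, and the helper assumes each key occurs at most once per table.

```python
def outer_join(left: list[dict], right: list[dict], key: str) -> list[dict]:
    """Combine two row lists on `key`, keeping keys found on either side."""
    merged: dict = {}
    all_columns: list = []
    for row in left + right:
        # Rows that share a key are folded into one combined row.
        merged.setdefault(row[key], {}).update(row)
        for name in row:
            if name not in all_columns:
                all_columns.append(name)
    # Columns a row never received are filled with None.
    return [{name: row.get(name) for name in all_columns} for row in merged.values()]


built_so_far = [{"id": "a", "smiles": "CCO"}, {"id": "b", "smiles": "CCN"}]
new_columns = [{"id": "b", "score": 1.0}, {"id": "c", "score": 2.0}]
merged_rows = outer_join(built_so_far, new_columns, key="id")
```

Here the row for "a" has no score and the row for "c" has no SMILES string, yet both survive the merge.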

add_from_file

add_from_file(path: str)

Uses the registered converters to parse the data from a specific file and add it to the dataset. If no converter is found for the file extension, it raises an error.

Parameters:

Name	Type	Description	Default
`path`	`str`	The path to the file that should be parsed.	required

add_from_files

add_from_files(paths: list[str], axis: Literal[0, 1, 'index', 'columns'])

Uses the registered converters to parse the data from the specified files and add it to the dataset. If no converter is found for a file's extension, it raises an error.

Parameters:

Name	Type	Description	Default
`paths`	`list[str]`	The list of paths that should be parsed.	required
`axis`	`Literal[0, 1, 'index', 'columns']`	Axis along which the files should be added. - 0 or 'index': append the files as rows. Files must be of the same type. - 1 or 'columns': append the files as columns. Files can be of different types.	required
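
The difference between the two axis values can be sketched with plain lists of rows. This is only an illustration of the row-wise versus column-wise idea: the `combine` helper is hypothetical, and its requirement that all tables have the same number of rows when adding columns is an assumption of this sketch.

```python
def combine(tables: list[list[dict]], axis: int) -> list[dict]:
    """Combine row lists: axis 0 stacks rows, axis 1 places columns side by side."""
    if axis == 0:
        return [row for table in tables for row in table]
    if axis == 1:
        if len({len(table) for table in tables}) > 1:
            raise ValueError("All tables need the same number of rows to add columns")
        # Merge the i-th row of every table into one wider row.
        return [{k: v for row in rows for k, v in row.items()} for rows in zip(*tables)]
    raise ValueError("axis must be 0 or 1")


file_a = [{"smiles": "CCO"}, {"smiles": "CCN"}]
file_b = [{"smiles": "CCC"}]
file_c = [{"score": 1.0}, {"score": 2.0}]  # placeholder values

stacked = combine([file_a, file_b], axis=0)
side_by_side = combine([file_a, file_c], axis=1)
```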

build

build() -> DatasetV1

Returns a Dataset based on the current state of the factory.

reset

reset(zarr_root_path: str | None = None)

Resets the factory to its initial state to start building the next dataset from scratch. Note that this will not reset the registered converters.

Parameters:

Name	Type	Description	Default
`zarr_root_path`	`str \| None`	The root path of the zarr hierarchy. If you want to use pointer columns for your next dataset, this argument needs to be passed.	`None`

polaris.dataset.create_dataset_from_file

create_dataset_from_file(
    path: str, zarr_root_path: str | None = None
) -> DatasetV1

A convenience function to create a dataset from a single file.

It sets up the dataset factory with sensible defaults for the converters. For creating more complicated datasets, please use the DatasetFactory directly.

polaris.dataset.create_dataset_from_files

create_dataset_from_files(
    paths: list[str],
    zarr_root_path: str | None = None,
    axis: Literal[0, 1, "index", "columns"] = 0,
) -> DatasetV1

A convenience function to create a dataset from multiple files.

It sets up the dataset factory with sensible defaults for the converters. For creating more complicated datasets, please use the DatasetFactory directly.

Parameters:

Name	Type	Description	Default
`axis`	`Literal[0, 1, 'index', 'columns']`	Axis along which the files should be added. - 0 or 'index': append the files as rows. Files must be of the same type. - 1 or 'columns': append the files as columns. Files can be of different types.	`0`
