Dataset Factory
polaris.dataset.DatasetFactory
The DatasetFactory
makes it easier to create complex datasets.
It is based on the factory design pattern and lets a user register a handler
(i.e. a Converter
object) per file type.
These converters turn file types commonly used in drug discovery
into something that can be used within Polaris while losing as little information as possible.
In addition, the factory offers utility methods to incrementally build up a dataset from different sources.
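To sketch the typical flow (a minimal example, not a definitive recipe: it assumes the SDFConverter shipped with polaris under polaris.dataset.converters and a build() method that finalizes the dataset; check your installed version for the exact names):

```python
from polaris.dataset import DatasetFactory
from polaris.dataset.converters import SDFConverter  # assumed import path

# All pointer-column data for one dataset lives in a single Zarr archive.
factory = DatasetFactory(zarr_root_path="data/archive.zarr")

# One converter per file extension; here we handle SDF files.
factory.register_converter("sdf", SDFConverter())

# Parse the file with the registered converter and add its contents.
factory.add_from_file("data/compounds.sdf")

dataset = factory.build()  # assumed finalization step
```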
Try quickly converting one of your datasets
The DatasetFactory
is designed to give you full control.
If your dataset is saved in a single file and you don't need anything fancy, you can use
create_dataset_from_file
instead.
How can we make adding metadata easier?
The DatasetFactory
is designed to make it easier to pull together data from different sources.
However, adding metadata remains a laborious process. How could we make this simpler through
the Python API?
zarr_root_path
property
The path to the root of the Zarr archive for the Dataset that is being built. All data for a single dataset is expected to be stored in the same Zarr archive.
zarr_root
property
The root group of the Zarr archive for the Dataset that is being built. All data for a single dataset is expected to be stored in the same Zarr archive.
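For illustration, assuming the factory accepts the archive path at construction time:

```python
from polaris.dataset import DatasetFactory

factory = DatasetFactory(zarr_root_path="data/archive.zarr")  # assumed constructor argument

print(factory.zarr_root_path)  # "data/archive.zarr"
print(factory.zarr_root)       # the Zarr group rooted at that path
```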
register_converter
Registers a new converter for a specific file type.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
ext | str | The file extension for which the converter should be used. There can only be a single converter per file extension. | required |
converter | Converter | The handler for the file type. This should convert the file to a Polaris-compatible format. | required |
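A short sketch (SDFConverter is assumed to live under polaris.dataset.converters; since there is only one converter per extension, registering "sdf" again would replace the previous handler):

```python
from polaris.dataset.converters import SDFConverter  # assumed import path

factory.register_converter("sdf", SDFConverter())
```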
add_column
add_column(column: pd.Series, annotation: ColumnAnnotation | None = None, adapters: Adapter | None = None)
Add a single column to the DataFrame.

We require:

- The name attribute of the column to be set.
- The name attribute of the column to be unique.
- If the column is a pointer column, the zarr_root_path needs to be set.
- The length of the column to match the length of the already constructed table.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
column | Series | The column to add to the dataset. | required |
annotation | ColumnAnnotation \| None | The annotation for the column. If None, a default annotation will be used. | None |
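For example (a minimal sketch; it assumes ColumnAnnotation can be imported from polaris.dataset and accepts a description field):

```python
import pandas as pd
from polaris.dataset import ColumnAnnotation  # assumed import path

# The series needs a unique name and must match the table's current length.
col = pd.Series([0.12, 0.56, 0.91], name="logP")

factory.add_column(
    col,
    annotation=ColumnAnnotation(description="Computed logP"),  # assumed field
)
```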
add_columns
add_columns(df: pd.DataFrame, annotations: dict[str, ColumnAnnotation] | None = None, adapters: dict[str, Adapter] | None = None, merge_on: str | None = None)
Add multiple columns to the dataset based on another dataframe.

To have more control over how the two dataframes are combined, you can specify a column to merge on. This will always do an outer join.

If no merge key is specified, the columns are simply added to the dataset that has been built so far, without any reordering. They are therefore expected to meet the same requirements as for add_column.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A Pandas DataFrame with the columns that we want to add to the dataset. | required |
annotations | dict[str, ColumnAnnotation] \| None | The annotations for the columns. If None, default annotations will be used. | None |
merge_on | str \| None | The column to merge on, if any. | None |
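For instance, merging in extra measurements on a shared key (the column names here are hypothetical):

```python
import pandas as pd

extra = pd.DataFrame({
    "compound_id": ["C1", "C2", "C3"],
    "solubility": [1.2, 3.4, 0.8],
})

# Outer-joins on the shared key; omit merge_on to append the columns as-is.
factory.add_columns(extra, merge_on="compound_id")
```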
add_from_file
Uses the registered converters to parse the data from a specific file and add it to the dataset. If no converter is found for the file extension, it raises an error.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
path | str | The path to the file that should be parsed. | required |
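For example, given the converter registered earlier:

```python
factory.add_from_file("data/compounds.sdf")  # raises if no converter is registered for "sdf"
```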
add_from_files
Uses the registered converters to parse the data from specific files and add them to the dataset. If no converter is found for a file extension, it raises an error.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
paths | list[str] | The list of paths that should be parsed. | required |
axis | Literal[0, 1, 'index', 'columns'] | Axis along which the files should be added. 0 or 'index': append the files as rows; the files must be of the same type. 1 or 'columns': append the files as columns; the files can be of different types. | required |
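For example, stacking several files of the same type as additional rows:

```python
# axis=0 appends the files as rows; axis=1 would add them as new columns.
factory.add_from_files(["data/batch_1.sdf", "data/batch_2.sdf"], axis=0)
```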
reset
Resets the factory to its initial state to start building the next dataset from scratch. Note that this will not reset the registered converters.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
zarr_root_path | str \| None | The root path of the Zarr hierarchy. If you want to use pointer columns for your next dataset, this argument needs to be passed. | None |
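A sketch of building a second dataset with the same converters (build() is assumed to be the finalization step):

```python
first = factory.build()  # assumed finalization step

# Converters stay registered; the table and Zarr state are cleared.
factory.reset(zarr_root_path="data/next_archive.zarr")
```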
polaris.dataset.create_dataset_from_file
A convenience function to create a dataset from a single file.
It sets up the dataset factory with sensible defaults for the converters.
For more complicated datasets, please use the DatasetFactory
directly.
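Usage sketch (assuming it accepts the same zarr_root_path argument as create_dataset_from_files):

```python
from polaris.dataset import create_dataset_from_file

dataset = create_dataset_from_file("data/compounds.sdf", zarr_root_path="data/archive.zarr")
```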
polaris.dataset.create_dataset_from_files
create_dataset_from_files(paths: list[str], zarr_root_path: str | None = None, axis: Literal[0, 1, 'index', 'columns'] = 0) -> DatasetV1
A convenience function to create a dataset from multiple files.
It sets up the dataset factory with sensible defaults for the converters.
For more complicated datasets, please use the DatasetFactory
directly.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
axis | Literal[0, 1, 'index', 'columns'] | Axis along which the files should be added. 0 or 'index': append the files as rows; the files must be of the same type. 1 or 'columns': append the files as columns; the files can be of different types. | 0 |
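For example, following the documented signature (the file paths are hypothetical):

```python
from polaris.dataset import create_dataset_from_files

dataset = create_dataset_from_files(
    ["data/batch_1.sdf", "data/batch_2.sdf"],
    zarr_root_path="data/archive.zarr",
    axis=0,  # append the files as rows
)
```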