Data Converters

polaris.dataset.converters.Converter

Bases: ABC

convert abstractmethod

convert(path: str) -> FactoryProduct

Converts a file into a table and, optionally, annotations.

get_pointer staticmethod

get_pointer(column: str, index: Union[int, slice]) -> str

Creates a pointer string from a column name and an index or slice.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| column | str | The name of the column. Each column has its own group in the root. | required |
| index | Union[int, slice] | The index or slice of the pointer. | required |
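
As an illustration of what such a pointer might look like, the sketch below builds a string from a column name and an index or slice. The `column#index` format and the helper name `make_pointer` are assumptions made for this example and are not confirmed by the text above; the Polaris source is the authority on the actual encoding.

```python
from typing import Union


def make_pointer(column: str, index: Union[int, slice]) -> str:
    """Build a pointer string (hypothetical format: 'column#index')."""
    if isinstance(index, slice):
        # A slice is rendered as 'start:stop'; the step is ignored in this sketch.
        index_str = f"{index.start}:{index.stop}"
    else:
        index_str = str(index)
    return f"{column}#{index_str}"


# Point at a single entry, then at a range of entries, in the same column.
single = make_pointer("molecule", 3)
ranged = make_pointer("molecule", slice(0, 5))
```

Keeping the column name in the pointer means a table cell can refer to data stored elsewhere (for example, in the column's own group) without embedding the data itself.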

polaris.dataset.converters.SDFConverter

Bases: Converter

Converts an SDF file into a Polaris dataset.

Binary strings for serialization

This class converts the molecules to binary strings (for ML purposes, this should be lossless). This approach may not be the most storage-efficient, but it is the fastest and the easiest to maintain. See this GitHub Discussion for more info.

Properties defined on the molecule level in the SDF file can be extracted into separate columns or can be kept in the molecule object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| mol_column | str | The name of the column that will contain the pointers to the molecules. | 'molecule' |
| smiles_column | Optional[str] | The name of the column that will contain the SMILES strings. | 'smiles' |
| use_isomeric_smiles | bool | Whether to use isomeric SMILES. | True |
| mol_id_column | Optional[str] | The name of the column that will contain the molecule names. | None |
| mol_prop_as_cols | bool | Whether to extract properties defined on the molecule level in the SDF file into separate columns. | True |
| groupby_key | Optional[str] | The name of the column to group by. If set, the dataset can combine multiple pointers to the molecules into a single datapoint. | None |
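
To make the effect of `groupby_key` more concrete, here is a rough sketch of how several molecule entries that share a key could collapse into one pointer per datapoint. It is not the library's implementation: the function `group_pointers`, the `column#start:stop` pointer format, and the requirement that records are already sorted by key are all assumptions made for illustration.

```python
from itertools import groupby
from typing import Dict, List


def group_pointers(keys: List[str], column: str = "molecule") -> Dict[str, str]:
    """Map each group key to one pointer covering all of its molecules.

    Assumes `keys[i]` is the group key of the molecule stored at index i and
    that equal keys are adjacent (hypothetical pointer format: 'column#index').
    """
    pointers: Dict[str, str] = {}
    start = 0
    for key, members in groupby(keys):
        count = sum(1 for _ in members)
        if key in pointers:
            raise ValueError("keys must be sorted so that each group is contiguous")
        if count == 1:
            pointers[key] = f"{column}#{start}"
        else:
            # Several molecules share this key: point at the whole range.
            pointers[key] = f"{column}#{start}:{start + count}"
        start += count
    return pointers


# Six stored molecules belonging to three groups yield three datapoints.
grouped = group_pointers(["a", "a", "b", "c", "c", "c"])
```

In this sketch each group becomes a single row whose molecule cell refers to a contiguous range, which is one way a datapoint can hold several molecules (for example, multiple conformers of the same compound).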

polaris.dataset.converters.ZarrConverter

Bases: Converter

Parse a .zarr archive into a Polaris Dataset.

Tutorial

To learn more about the zarr format, see the tutorial.

Loading from .zarr

Loading and saving datasets from and to .zarr is still experimental and currently not fully supported by the Hub.

A .zarr file can contain groups and arrays, where each group can again contain groups and arrays. Within Polaris, the Zarr archive is expected to have a flat hierarchy where each array corresponds to a single column and each array contains the values for all datapoints in that column.
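
The expected layout can be sketched without any Zarr dependency by modelling groups as dictionaries and arrays as lists. The helper `check_flat_layout` below is hypothetical and only illustrates the rule described above: no nested groups, one array per column, and (as an assumption drawn from "the values for all datapoints") the same length for every array.

```python
from typing import Any, Dict


def check_flat_layout(root: Dict[str, Any]) -> int:
    """Return the number of datapoints if `root` is flat and consistent."""
    lengths = set()
    for name, node in root.items():
        if isinstance(node, dict):
            # A dict stands in for a nested group, which the flat layout forbids.
            raise ValueError(f"'{name}' is a nested group; expected an array")
        lengths.add(len(node))
    if len(lengths) > 1:
        raise ValueError(f"columns differ in length: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


# One array per column, each holding the values for every datapoint.
flat = {"smiles": ["CCO", "c1ccccc1"], "activity": [0.5, 1.2]}
# A nested group like this would not match the expected layout.
nested = {"features": {"fingerprint": [[0, 1], [1, 0]]}}

n_datapoints = check_flat_layout(flat)
```

With a real archive the same idea would apply to the groups and arrays exposed by the Zarr library rather than to plain dictionaries and lists.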