Data Converters


Bases: ABC

convert abstractmethod

convert(path: str, append: bool = False) -> FactoryProduct

Converts a file into a table and, where applicable, annotations.
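As an illustration of the abstract-method contract, here is a minimal sketch using only the standard library. The `Converter` class below is a stand-in that mirrors the documented `convert` signature, not the library's own class. `CsvConverter` and the `(table, annotations)` tuple used in place of `FactoryProduct` are hypothetical, since the structure of `FactoryProduct` is not described on this page.

```python
import csv
from abc import ABC, abstractmethod


class Converter(ABC):
    """Stand-in for the abstract base class documented above."""

    @abstractmethod
    def convert(self, path: str, append: bool = False):
        """Convert the file at `path` into a table and, possibly, annotations."""


class CsvConverter(Converter):
    """Hypothetical subclass that reads a CSV file into a list-of-dicts table."""

    def convert(self, path: str, append: bool = False):
        with open(path, newline="") as handle:
            table = list(csv.DictReader(handle))
        annotations = {}  # assumption: no annotations are derived from a CSV
        # A plain tuple stands in for FactoryProduct in this sketch.
        return table, annotations
```

Because `convert` is marked with `abstractmethod`, instantiating `Converter` directly raises a `TypeError`; only subclasses that implement `convert` can be instantiated.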

get_pointer staticmethod

get_pointer(column: str, index: Union[int, slice]) -> str

Creates a pointer.


Parameters:

column (str): The name of the column. Each column has its own group in the root.

index (Union[int, slice]): The index or slice of the pointer.
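The following sketch shows one way a pointer could be built from a column name and an index or slice. The `column#index` string layout is an assumption made for illustration; this page does not specify the actual pointer format.

```python
from typing import Union


def get_pointer(column: str, index: Union[int, slice]) -> str:
    """Build a pointer string. The 'column#index' layout is an assumption."""
    if isinstance(index, slice):
        # A slice points at a contiguous range of entries in the column.
        index_part = f"{index.start}:{index.stop}"
    else:
        index_part = str(index)
    return f"{column}#{index_part}"


single = get_pointer("molecule", 3)            # points at one entry
ranged = get_pointer("molecule", slice(0, 5))  # points at entries 0 to 4
```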



Bases: Converter

Converts an SDF file into a Polaris dataset.

Binary strings for serialization

This class converts the molecules to binary strings (for ML purposes, this should be lossless). This may not be the most storage-efficient approach, but it is the fastest and easiest to maintain. See this GitHub Discussion for more info.

Properties defined on the molecule level in the SDF file can be extracted into separate columns or can be kept in the molecule object.


Parameters:

mol_column (str): The name of the column that will contain the pointers to the molecules.

smiles_column (Optional[str]): The name of the column that will contain the SMILES strings.

use_isomeric_smiles (bool): Whether to use isomeric SMILES.

mol_id_column (Optional[str]): The name of the column that will contain the molecule names.

mol_prop_as_cols (bool): Whether to extract properties defined on the molecule level in the SDF file into separate columns.

groupby_key (Optional[str]): The name of the column to group by. If set, the dataset can combine multiple pointers to the molecules into a single datapoint.
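To make the effect of `groupby_key` more concrete, the sketch below groups molecule records by a shared property and emits one datapoint per group, each pointing at a contiguous range of molecules. It is a standard-library illustration under stated assumptions, not the converter's implementation: the function `group_pointers`, the record layout, and the `column#start:stop` pointer format are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def group_pointers(records, key, mol_column="molecule"):
    """Sort records by `key` and build one range pointer per group."""
    # Sorting makes every group contiguous, so a single slice can cover it.
    ordered = sorted(records, key=itemgetter(key))
    rows = []
    start = 0
    for value, members in groupby(ordered, key=itemgetter(key)):
        count = len(list(members))
        # Assumed pointer layout: "<column>#<start>:<stop>".
        rows.append({key: value, mol_column: f"{mol_column}#{start}:{start + count}"})
        start += count
    return ordered, rows


records = [
    {"name": "m1", "target": "B"},
    {"name": "m2", "target": "A"},
    {"name": "m3", "target": "B"},
]
ordered, rows = group_pointers(records, key="target")
```

Here the two molecules that share target `B` end up in a single row whose pointer spans both of them.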



Bases: Converter

Parse a .zarr archive into a Polaris Dataset.


To learn more about the Zarr format, see the tutorial.

Loading from .zarr

Loading and saving datasets from and to .zarr is still experimental and currently not fully supported by the Hub.

A .zarr file can contain groups and arrays, where each group can again contain groups and arrays. Within Polaris, the Zarr archive is expected to have a flat hierarchy where each array corresponds to a single column and each array contains the values for all datapoints in that column.
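The expected layout can be modelled with plain Python containers. In the sketch below, a nested `dict` stands in for a Zarr group and a `list` stands in for a Zarr array; the function `check_flat_layout` is hypothetical and does not use the zarr library. Requiring all columns to have the same length is an inference from "each array contains the values for all datapoints", not a documented rule.

```python
def check_flat_layout(root: dict) -> int:
    """Return the number of datapoints if `root` is flat, else raise ValueError."""
    lengths = {}
    for name, node in root.items():
        if isinstance(node, dict):
            # A dict models a nested group, which a flat hierarchy does not allow.
            raise ValueError(f"'{name}' is a nested group; expected an array")
        lengths[name] = len(node)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns differ in length: {lengths}")
    return next(iter(lengths.values()), 0)


flat = {"smiles": ["CCO", "CCN"], "logp": [-0.3, -0.1]}
nested = {"features": {"fingerprint": [[0, 1], [1, 0]]}}

n_datapoints = check_flat_layout(flat)  # two columns, two datapoints each
```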


Bases: Converter

Converts PDB files into a Polaris dataset based on fastpdb.

Only the most essential structural information of a protein is retained

This conversion saves the 3D coordinates, chain ID, residue ID, insertion code, residue name, heteroatom indicator, atom name, element, atom ID, B-factor, occupancy, and charge. Records such as CONECT (connectivity information), ANISOU (anisotropic temperature factors), and HETATM (heteroatoms and ligands) are handled by fastpdb. We believe this makes for a good ML-ready format, but let us know if you require any other information to be saved.
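To show where the retained fields live in a PDB file, the sketch below slices a single ATOM/HETATM line using the fixed-column layout of the PDB format. It is a standard-library illustration only: the converter itself relies on fastpdb, and the function names and dictionary keys here are chosen for this example.

```python
def parse_charge(field: str) -> int:
    """Turn a PDB charge field such as '1+' or '2-' into a signed integer."""
    field = field.strip()
    if not field:
        return 0  # assumption: a blank charge column is treated as neutral
    return int(field[-1] + field[:-1])  # '2-' becomes '-2'


def parse_atom_record(line: str) -> dict:
    """Slice one ATOM/HETATM line by its fixed column positions."""
    line = line.ljust(80)  # pad short lines so every slice is valid
    return {
        "hetero": line[0:6].strip() == "HETATM",
        "atom_id": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21].strip(),
        "res_id": int(line[22:26]),
        "ins_code": line[26].strip(),
        "coord": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
        "charge": parse_charge(line[78:80]),
    }


# The sample line is assembled field by field to keep the column widths explicit.
SAMPLE = (
    "ATOM  " "    1" " " " N  " " " "MET" " " "A" "   1" " " "   "
    "  27.340" "  24.430" "   2.614" "  1.00" "  9.67" "          " " N" "  "
)
atom = parse_atom_record(SAMPLE)
```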

PDBs as ND-arrays using biotite

To save PDBs in a Polaris-compatible format, we convert them to ND-arrays using fastpdb and biotite. We then save these ND-arrays to Zarr archives. For more info, see fastpdb and biotite.


Parameters:

pdb_column (str): The name of the column that will contain the pointers to the PDBs.

n_jobs (int): The number of jobs to run in parallel.

zarr_chunks (Sequence[Optional[int]]): The chunk size for the Zarr arrays.



convert(path, factory: DatasetFactory, append: bool = False) -> FactoryProduct

Converts a single PDB file or a list of PDB files into Zarr.