Converters

polaris.dataset.converters.Converter

Bases: ABC

convert `abstractmethod`

convert(path: str, append: bool = False) -> FactoryProduct

Converts a file into a table and, optionally, annotations.
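The abstract-method contract above can be illustrated with a small stand-in. This is a sketch only: `ToyConverter` and `ToyCsvConverter` are hypothetical names, the `(rows, annotations)` tuple is an assumed stand-in for `FactoryProduct`, and nothing here imports Polaris.

```python
from __future__ import annotations

import csv
from abc import ABC, abstractmethod


class ToyConverter(ABC):
    """Hypothetical stand-in for the base class; not the Polaris Converter."""

    @abstractmethod
    def convert(self, path: str, append: bool = False) -> tuple[list[dict], dict]:
        """Return (table rows, annotations) for the file at `path`."""


class ToyCsvConverter(ToyConverter):
    """Reads a CSV file into rows plus one annotation entry per column."""

    def convert(self, path: str, append: bool = False) -> tuple[list[dict], dict]:
        # `append` is accepted for signature parity but ignored in this sketch.
        with open(path, newline="") as handle:
            rows = list(csv.DictReader(handle))
        columns = rows[0].keys() if rows else []
        annotations = {name: {"source": path} for name in columns}
        return rows, annotations
```

Because `convert` is abstract, instantiating `ToyConverter` directly raises `TypeError`; only subclasses that implement `convert` can be created.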

get_pointer `staticmethod`

get_pointer(column: str, index: int | slice) -> str

Creates a pointer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `column` | `str` | The name of the column. Each column has its own group in the root. | required |
| `index` | `int \| slice` | The index or slice of the pointer. | required |
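This page does not spell out the pointer format. As a rough sketch, assuming a pointer is the column name and the index joined by `#` (with `start:stop` for slices), a pair of helpers could look like the following. `make_pointer` and `parse_pointer` are hypothetical names, and the `column#index` layout is an assumption rather than documented behaviour.

```python
from __future__ import annotations


def make_pointer(column: str, index: int | slice) -> str:
    """Build a pointer string; the `column#index` layout is an assumption."""
    if isinstance(index, slice):
        return f"{column}#{index.start}:{index.stop}"
    return f"{column}#{index}"


def parse_pointer(pointer: str) -> tuple[str, int | slice]:
    """Invert make_pointer: split on the last `#` and rebuild the index."""
    column, _, index = pointer.rpartition("#")
    if ":" in index:
        start, stop = index.split(":")
        return column, slice(int(start), int(stop))
    return column, int(index)


print(make_pointer("molecule", 3))            # molecule#3
print(make_pointer("molecule", slice(2, 5)))  # molecule#2:5
```

A slice pointer lets one datapoint refer to several consecutive entries in the same column.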

polaris.dataset.converters.SDFConverter

Bases: Converter

Converts an SDF file into a Polaris dataset.

Binary strings for serialization

This class converts the molecules to binary strings (for ML purposes, this should be lossless). This might not be the most storage-efficient option, but it is the fastest and the easiest to maintain. See this GitHub Discussion for more info.

Properties defined on the molecule level in the SDF file can be extracted into separate columns or can be kept in the molecule object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `mol_column` | `str` | The name of the column that will contain the pointers to the molecules. | `'molecule'` |
| `smiles_column` | `Optional[str]` | The name of the column that will contain the SMILES strings. | `'smiles'` |
| `use_isomeric_smiles` | `bool` | Whether to use isomeric SMILES. | `True` |
| `mol_id_column` | `Optional[str]` | The name of the column that will contain the molecule names. | `None` |
| `mol_prop_as_cols` | `bool` | Whether to extract properties defined on the molecule level in the SDF file into separate columns. | `True` |
| `groupby_key` | `Optional[str]` | The name of the column to group by. If set, the dataset can combine multiple pointers to the molecules into a single datapoint. | `None` |
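To make "properties extracted into separate columns" concrete, the sketch below reads the `> <NAME>` data items of each SDF record and pivots them into one column per property. It is a minimal hand-written illustration, not the converter's implementation: `read_sdf_properties` and `to_columns` are hypothetical helpers, and the molecule blocks themselves are ignored.

```python
from __future__ import annotations

SAMPLE = """\
ethanol
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <logP>
-0.31

> <source>
demo

$$$$
water
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <logP>
-1.38

$$$$
"""


def read_sdf_properties(text: str) -> list[dict[str, str]]:
    """Collect the `> <NAME>` data items of each record in an SDF string."""
    records = []
    for block in text.split("$$$$"):  # `$$$$` terminates each record
        if not block.strip():
            continue
        props: dict[str, str] = {}
        lines = block.splitlines()
        i = 0
        while i < len(lines):
            line = lines[i]
            if line.startswith(">") and "<" in line and ">" in line[1:]:
                name = line[line.index("<") + 1 : line.rindex(">")]
                value_lines = []
                i += 1
                # The value runs until the next blank line.
                while i < len(lines) and lines[i].strip():
                    value_lines.append(lines[i])
                    i += 1
                props[name] = "\n".join(value_lines)
            i += 1
        records.append(props)
    return records


def to_columns(records: list[dict[str, str]]) -> dict[str, list]:
    """Pivot per-record properties into columns; missing values become None."""
    names = sorted({name for rec in records for name in rec})
    return {name: [rec.get(name) for rec in records] for name in names}


columns = to_columns(read_sdf_properties(SAMPLE))
```

Here `columns` holds a `logP` column with a value for both records and a `source` column whose second entry is `None`, since only the first record defines that property.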

polaris.dataset.converters.ZarrConverter

Bases: Converter

Parse a .zarr archive into a Polaris Dataset.

Loading from .zarr

Loading and saving datasets from and to .zarr is still experimental and currently not fully supported by the Hub.

A .zarr file can contain groups and arrays, where each group can again contain groups and arrays. Within Polaris, the Zarr archive is expected to have a flat hierarchy where each array corresponds to a single column and each array contains the values for all datapoints in that column.
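The flat-hierarchy expectation can be sketched without any Zarr dependency by modelling a group as a `dict` and an array as a `list`. This modelling, the helper name `flat_columns`, and the equal-length check (inferred from "values for all datapoints") are assumptions made for illustration, not the converter's actual validation.

```python
from __future__ import annotations


def flat_columns(archive: dict) -> dict[str, list]:
    """Return the columns of a flat archive, or raise if it is nested or ragged.

    In this toy model a dict is a group and a list is an array.
    """
    nested = [name for name, node in archive.items() if isinstance(node, dict)]
    if nested:
        raise ValueError(f"expected a flat hierarchy, found groups: {nested}")
    lengths = {name: len(values) for name, values in archive.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns differ in length: {lengths}")
    return dict(archive)


flat = {"smiles": ["CCO", "O"], "logP": [-0.31, -1.38]}
nested = {"features": {"logP": [-0.31, -1.38]}, "smiles": ["CCO", "O"]}

columns = flat_columns(flat)  # accepted: one array per column
try:
    flat_columns(nested)      # rejected: "features" is a group
except ValueError as err:
    message = str(err)
```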

polaris.dataset.converters.PDBConverter

Bases: Converter

Converts PDB files into a Polaris dataset using fastpdb.

Only the most essential structural information of a protein is retained

This conversion saves the 3D coordinates, chain ID, residue ID, insertion code, residue name, heteroatom indicator, atom name, element, atom ID, B-factor, occupancy, and charge. Records such as CONECT (connectivity information), ANISOU (anisotropic temperature factors), and HETATM (heteroatoms and ligands) are handled by fastpdb. We believe this makes for a good ML-ready format, but let us know if you require any other information to be saved.

PDBs as ND-arrays using biotite

To save PDBs in a Polaris-compatible format, we convert them to ND-arrays using fastpdb and biotite. We then save these ND-arrays to Zarr archives. For more info, see fastpdb and biotite.
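The idea of storing a structure as one array per field can be sketched with the standard library alone. The example below slices the fixed-width ATOM/HETATM columns of the PDB format into per-field lists. It is an illustration of the column-oriented layout only: it does not use fastpdb or biotite, the field names and the helpers `atom_line` and `pdb_to_columns` are hypothetical, and the charge field is left out for brevity.

```python
from __future__ import annotations

# 0-based (start, stop, cast) slices for fixed-width ATOM/HETATM fields.
FIELDS = {
    "atom_id": (6, 11, int),
    "atom_name": (12, 16, str),
    "res_name": (17, 20, str),
    "chain_id": (21, 22, str),
    "res_id": (22, 26, int),
    "ins_code": (26, 27, str),
    "x": (30, 38, float),
    "y": (38, 46, float),
    "z": (46, 54, float),
    "occupancy": (54, 60, float),
    "b_factor": (60, 66, float),
    "element": (76, 78, str),
}


def atom_line(record, serial, name, res, chain, seq, x, y, z, occ, b, elem):
    """Format one coordinate record so every field lands in its PDB column."""
    return (
        f"{record:<6}{serial:>5} {name:<4} {res:>3} {chain}{seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}          {elem:>2}"
    )


def pdb_to_columns(lines):
    """Turn ATOM/HETATM records into one list per field (column-oriented)."""
    columns = {name: [] for name in FIELDS}
    columns["hetero"] = []
    for line in lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue  # skip HEADER, END, and other record types
        columns["hetero"].append(record == "HETATM")
        for name, (start, stop, cast) in FIELDS.items():
            columns[name].append(cast(line[start:stop].strip()))
    return columns


lines = [
    "HEADER    DEMO",
    atom_line("ATOM", 1, " N", "MET", "A", 1, 27.340, 24.430, 2.614, 1.00, 9.67, "N"),
    atom_line("ATOM", 2, " CA", "MET", "A", 1, 26.266, 25.413, 2.842, 1.00, 10.38, "C"),
    atom_line("HETATM", 3, " O", "HOH", "A", 101, 10.0, 11.5, 12.25, 1.00, 20.0, "O"),
    "END",
]
columns = pdb_to_columns(lines)
```

Each list in `columns` has one entry per atom, which is the shape that maps naturally onto one array per field in a Zarr archive.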

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `pdb_column` | `str` | The name of the column that will contain the pointers to the PDBs. | `'pdb'` |
| `n_jobs` | `int` | The number of jobs to run in parallel. | `1` |
| `zarr_chunks` | `Sequence[Optional[int]]` | The chunk size for the Zarr arrays. | `(1,)` |

convert

convert(path, factory: DatasetFactory, append: bool = False) -> FactoryProduct

Converts a single PDB file or a list of PDB files into Zarr.
