Data Converters
polaris.dataset.converters.Converter
Bases: ABC
convert
abstractmethod
Converts a file into a table and, optionally, annotations.
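The contract described above (a file goes in, a table plus optional annotations come out) can be sketched with a small stand-in. Everything in this sketch is an assumption made for illustration: the `Converter` class defined here is a local substitute rather than the Polaris class, the `convert(path)` signature and its return shape are guesses, and `CSVConverter` is a hypothetical subclass.

```python
import csv
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Converter(ABC):
    """Local stand-in for an abstract converter (not the Polaris class)."""

    @abstractmethod
    def convert(self, path: str) -> tuple[list[dict], dict]:
        """Return (table, annotations) for the file at `path`."""


class CSVConverter(Converter):
    """Hypothetical subclass: one table row per CSV row."""

    def convert(self, path: str) -> tuple[list[dict], dict]:
        with open(path, newline="") as handle:
            table = list(csv.DictReader(handle))
        # Annotations are modeled here as simple per-column metadata.
        columns = list(table[0]) if table else []
        annotations = {name: {"source": Path(path).name} for name in columns}
        return table, annotations


# Usage: write a tiny CSV file and convert it.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "example.csv"
    csv_path.write_text("smiles,label\nCCO,1\n")
    table, annotations = CSVConverter().convert(str(csv_path))
```

Because `convert` is abstract, instantiating the base class directly raises a `TypeError`; only concrete subclasses can be used.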
polaris.dataset.converters.SDFConverter
Bases: Converter
Converts an SDF file into a Polaris dataset.
Binary strings for serialization
This class converts the molecules to binary strings (for ML purposes, this should be lossless). This might not be the most storage-efficient approach, but it is the fastest and easiest to maintain. See this GitHub Discussion for more info.
Properties defined on the molecule level in the SDF file can be extracted into separate columns or can be kept in the molecule object.
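To illustrate what "properties defined on the molecule level" look like, here is a minimal pure-Python sketch that reads the `> <NAME>` data items of each SDF record into one dictionary per molecule, which is the shape needed to turn them into columns. It is not how `SDFConverter` is implemented; the function name `read_sdf_properties` is hypothetical, the molblock is ignored entirely, and the property values are made up.

```python
def read_sdf_properties(sdf_text: str) -> list[dict[str, str]]:
    """Collect the data items of every SDF record (records end with '$$$$')."""
    rows = []
    for record in sdf_text.split("$$$$"):
        if not record.strip():
            continue  # skip the empty tail after the last delimiter
        lines = record.strip("\n").splitlines()
        row: dict[str, str] = {}
        i = 0
        while i < len(lines):
            line = lines[i]
            # A data header looks like "> <NAME>"; its value lines follow
            # until the next blank line.
            if line.startswith(">") and "<" in line and ">" in line[1:]:
                name = line[line.index("<") + 1 : line.rindex(">")]
                i += 1
                values = []
                while i < len(lines) and lines[i].strip():
                    values.append(lines[i])
                    i += 1
                row[name] = "\n".join(values)
            i += 1
        rows.append(row)
    return rows


# Two single-atom records with made-up property values.
SDF_TEXT = """methane
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <activity>
1.5

> <source>
assay-A

$$$$
water
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <activity>
0.2

$$$$
"""

rows = read_sdf_properties(SDF_TEXT)
```

Each key in `rows` is a candidate column; a record that lacks a property simply has no entry for it.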
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol_column | str | The name of the column that will contain the pointers to the molecules. | 'molecule' |
smiles_column | Optional[str] | The name of the column that will contain the SMILES strings. | 'smiles' |
use_isomeric_smiles | bool | Whether to use isomeric SMILES. | True |
mol_id_column | Optional[str] | The name of the column that will contain the molecule names. | None |
mol_prop_as_cols | bool | Whether to extract properties defined on the molecule level in the SDF file into separate columns. | True |
groupby_key | Optional[str] | The name of the column to group by. If set, the dataset can combine multiple pointers to the molecules into a single datapoint. | None |
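The effect of `groupby_key` can be pictured as grouping row indices by a shared key, so that one datapoint refers to several molecules. The sketch below is only an illustration of that idea: `group_pointers` is a hypothetical helper, plain row indices stand in for pointers, and how Polaris actually encodes pointers is not covered here.

```python
from collections import defaultdict


def group_pointers(rows: list[dict], groupby_key: str) -> dict[str, list[int]]:
    """Map each distinct key value to the indices of the rows that share it."""
    groups: dict[str, list[int]] = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[groupby_key]].append(index)
    return dict(groups)


# Three molecules (e.g. conformers), two of which belong to compound "A".
rows = [
    {"compound_id": "A"},
    {"compound_id": "A"},
    {"compound_id": "B"},
]
grouped = group_pointers(rows, "compound_id")
```

Here `grouped` has two entries, so three molecule records collapse into two datapoints.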
polaris.dataset.converters.ZarrConverter
Bases: Converter
Parse a .zarr archive into a Polaris Dataset.
Tutorial
To learn more about the zarr format, see the tutorial.
Loading from .zarr
Loading and saving datasets from and to .zarr is still experimental and currently not fully supported by the Hub.
A .zarr file can contain groups and arrays, where each group can again contain groups and arrays.
Within Polaris, the Zarr archive is expected to have a flat hierarchy where each array corresponds to a single column and each array contains the values for all datapoints in that column.
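The flat-hierarchy expectation can be sketched without the Zarr library by modeling groups as nested dictionaries and arrays as lists. This is a simplified stand-in, not the check Polaris performs: `check_flat_columns` is a hypothetical helper, and the requirement that all columns have the same length is an interpretation of "each array contains the values for all datapoints".

```python
def check_flat_columns(archive: dict) -> int:
    """Return the number of datapoints if `archive` is flat and aligned.

    Groups are modeled as dicts and arrays as lists; a nested dict means
    the hierarchy is not flat.
    """
    lengths = {}
    for name, node in archive.items():
        if isinstance(node, dict):
            raise ValueError(f"'{name}' is a nested group; expected an array")
        lengths[name] = len(node)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns differ in length: {lengths}")
    return next(iter(lengths.values()), 0)


# Flat: every top-level entry is an array, one value per datapoint.
flat = {"smiles": ["CCO", "CCN"], "label": [1, 0]}
n_datapoints = check_flat_columns(flat)

# Not flat: "features" is a group that contains another array.
nested = {"smiles": ["CCO"], "features": {"fingerprint": [[0, 1]]}}
```

Calling `check_flat_columns(nested)` raises a `ValueError`, because the `features` entry is a group rather than an array.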
polaris.dataset.converters.PDBConverter
Bases: Converter
Converts PDB files into a Polaris dataset using fastpdb.
Only the most essential structural information of a protein is retained
This conversion saves the 3D coordinates, chain ID, residue ID, insertion code, residue name, heteroatom indicator, atom name, element, atom ID, B-factor, occupancy, and charge.
Records such as CONECT (connectivity information), ANISOU (anisotropic temperature factors), and HETATM (heteroatoms and ligands) are handled by fastpdb.
We believe this makes for a good ML-ready format, but let us know if you require any other information to be saved.
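For readers unfamiliar with where the retained fields live in a PDB file, the following pure-Python sketch reads them from a single fixed-width ATOM/HETATM record using the standard PDB column layout. It only illustrates the fields listed above and is not how fastpdb or the converter parses files; `parse_atom_record` and the dictionary keys are names chosen for this example.

```python
def parse_atom_record(line: str) -> dict:
    """Extract the retained fields from one ATOM or HETATM line."""
    line = line.ljust(80)  # tolerate lines with trailing columns trimmed
    return {
        "hetero": line[0:6] == "HETATM",       # heteroatom indicator
        "atom_id": int(line[6:11]),            # columns 7-11
        "atom_name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),       # columns 18-20
        "chain_id": line[21].strip(),          # column 22
        "res_id": int(line[22:26]),            # columns 23-26
        "ins_code": line[26].strip(),          # column 27
        "coord": (
            float(line[30:38]),                # x, columns 31-38
            float(line[38:46]),                # y, columns 39-46
            float(line[46:54]),                # z, columns 47-54
        ),
        "occupancy": float(line[54:60]),       # columns 55-60
        "b_factor": float(line[60:66]),        # columns 61-66
        "element": line[76:78].strip(),        # columns 77-78
        "charge": line[78:80].strip(),         # columns 79-80
    }


# An 80-column ATOM record assembled field by field for readability.
line = (
    "ATOM  " "    1" " " " N  " " " "ALA" " " "A" "   1" " " "   "
    "  11.104" "   6.134" "  -6.504" "  1.00" "  0.00"
    "          " " N" "  "
)
atom = parse_atom_record(line)
```

Stacking these per-atom values across a structure gives one array per field, which is the kind of layout that maps naturally onto ND-arrays.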
PDBs as ND-arrays using biotite
To save PDBs in a Polaris-compatible format, we convert them to ND-arrays using fastpdb and biotite.
We then save these ND-arrays to Zarr archives.
For more info, see fastpdb and biotite.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pdb_column | str | The name of the column that will contain the pointers to the pdbs. | 'pdb' |
n_jobs | int | The number of jobs to run in parallel. | 1 |
zarr_chunks | Sequence[Optional[int]] | The chunk size for the Zarr arrays. | (1) |