PDB Datasets
In short
This tutorial shows how to create datasets from PDB files, stored in the .zarr format.
This feature is still very new.
The features shown in this tutorial are still experimental. We would love to hear from the community how we can make creating datasets easier.
Dummy PDB example
In [1]:
import platformdirs
import datamol as dm
from polaris.dataset import DatasetFactory
from polaris.dataset.converters import PDBConverter
SAVE_DIR = dm.fs.join(platformdirs.user_cache_dir(appname="polaris-tutorials"), "dataset_pdb")
Fetch PDB files from RCSB PDB
In [ ]:
import biotite.database.rcsb as rcsb
pdb_path = rcsb.fetch("6s89", "pdb", SAVE_DIR)
print(pdb_path)
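Before handing a downloaded file to the factory, it can help to confirm that it actually contains coordinate records. The sketch below is a minimal sanity check that counts records by their record name, which the PDB format stores in the first six columns of each line. The helper `count_pdb_records` and the toy records are illustrative assumptions, not part of Polaris or biotite, and the toy lines are not taken from 6s89.

```python
from collections import Counter


def count_pdb_records(pdb_text: str) -> Counter:
    """Count PDB records by record name (the first six columns of each line)."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.strip():
            counts[line[:6].strip()] += 1
    return counts


# Toy records for illustration only (hypothetical, not from a fetched entry).
TOY_PDB = """\
HEADER    TOY EXAMPLE
ATOM      1  N   ASN A   1      -8.901   4.127  -0.555  1.00  0.00           N
ATOM      2  CA  ASN A   1      -8.608   3.135  -1.618  1.00  0.00           C
HETATM    3  O   HOH A   2       1.000   2.000   3.000  1.00  0.00           O
END
"""

counts = count_pdb_records(TOY_PDB)
print(counts["ATOM"], counts["HETATM"])
```

For the file fetched above, the same helper could be applied to the text read from `pdb_path`, assuming that path points to a plain-text `.pdb` file.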
Create dataset from PDB file
In [14]:
save_dst = dm.fs.join(SAVE_DIR, "tutorial_pdb.zarr")
factory = DatasetFactory(zarr_root_path=save_dst)
factory.reset(save_dst)
factory.register_converter("pdb", PDBConverter(pdb_column="pdb"))
factory.add_from_file(pdb_path)
# Build the dataset
dataset = factory.build()
Check the dataset
In [15]:
dataset
Out[15]:
name | None
---|---
description |
tags |
user_attributes |
owner | None
polaris_version | 0.7.10.dev22+g8edf177.d20240814
default_adapters |
zarr_root_path | /Users/lu.zhu/Library/Caches/polaris-tutorials/002/tutorial_pdb.zarr
readme |
annotations |
source | None
license | None
curation_reference | None
cache_dir | /Users/lu.zhu/Library/Caches/polaris/datasets/b0895f92-5a11-4e48-953f-3f969c6a9ca6
md5sum | 66f3c7774e655bc6d48c907100d6912f
artifact_id | None
n_rows | 1
n_columns | 1
Check data table
In [16]:
dataset.table
Out[16]:
  | pdb
---|---
0 | pdb/6s89
In [ ]:
dataset.get_data(0, "pdb")
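The data table holds a string such as `pdb/6s89` rather than the structure itself, which suggests that the column stores a reference into the Zarr archive and that `get_data` resolves it. The toy sketch below illustrates that idea with plain dictionaries. It is not Polaris's implementation: `ToyDataset`, the nested-dict "archive", and the placeholder payload are all hypothetical.

```python
class ToyDataset:
    """Toy stand-in: a table of pointers plus a nested dict acting as the archive."""

    def __init__(self, table, archive):
        self.table = table      # list of rows, each a dict of column -> value
        self.archive = archive  # dict of group name -> dict of key -> payload

    def get_data(self, row, col):
        value = self.table[row][col]
        group, sep, key = str(value).partition("/")
        # Treat "<group>/<key>" as a pointer only if it resolves in the archive.
        if sep and group in self.archive and key in self.archive[group]:
            return self.archive[group][key]
        # Otherwise the cell already holds the value itself.
        return value


archive = {"pdb": {"6s89": "<structure data for 6s89>"}}
table = [{"pdb": "pdb/6s89"}]

toy = ToyDataset(table, archive)
resolved = toy.get_data(0, "pdb")
print(resolved)
```

The design point is that the table stays small and tabular while the bulky per-row data lives in the archive, keyed by the pointer string.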
Create dataset from multiple PDB files
In [7]:
pdb_paths = rcsb.fetch(["1l2y", "4i23"], "pdb", SAVE_DIR)
print(pdb_paths)
['/Users/lu.zhu/Library/Caches/polaris-tutorials/002/1l2y.pdb', '/Users/lu.zhu/Library/Caches/polaris-tutorials/002/4i23.pdb']
In [8]:
factory = DatasetFactory(dm.fs.join(SAVE_DIR, "pdbs.zarr"))
converter = PDBConverter()
factory.register_converter("pdb", converter)
factory.add_from_files(pdb_paths, axis=0)
dataset = factory.build()
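Because `SAVE_DIR` is a plain string, the archive path has to be built with a path-joining helper such as `dm.fs.join`, as in the earlier cell that creates `tutorial_pdb.zarr`. The string method `str.join` does something different: it uses the string as a separator between the items of its argument. The sketch below shows the difference with the standard library's `posixpath.join` as a stand-in for `dm.fs.join`; the directory is a made-up example.

```python
import posixpath

# Hypothetical cache directory, used only for illustration.
save_dir = "/tmp/polaris-tutorials/dataset_pdb"

# Path joining appends the file name to the directory.
good = posixpath.join(save_dir, "pdbs.zarr")

# str.join inserts the string between the characters of its argument.
bad = "/x".join("ab")

print(good)
print(bad)
```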
In [9]:
dataset.table
Out[9]:
  | pdb
---|---
0 | pdb/1l2y
1 | pdb/4i23
In [ ]:
dataset.get_data(1, "pdb")
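With `axis=0`, each file contributed one row, as the two-row table shows. The sketch below illustrates row-wise versus column-wise stacking with plain Python lists. The helper `stack_tables` and the `label` column are hypothetical, and the assumption that `axis=1` would add columns instead of rows is my reading of the parameter rather than something demonstrated in this tutorial.

```python
def stack_tables(tables, axis=0):
    """Combine tables, each given as a list of row dicts.

    axis=0 appends rows; axis=1 merges columns row by row.
    """
    if axis == 0:
        return [dict(row) for table in tables for row in table]
    if axis == 1:
        n_rows = len(tables[0])
        if any(len(table) != n_rows for table in tables):
            raise ValueError("column-wise stacking needs equal row counts")
        merged = []
        for rows in zip(*tables):
            combined = {}
            for row in rows:
                combined.update(row)
            merged.append(combined)
        return merged
    raise ValueError("axis must be 0 or 1")


# One single-row table per file, mirroring the two PDB entries above.
per_file = [[{"pdb": "pdb/1l2y"}], [{"pdb": "pdb/4i23"}]]
by_rows = stack_tables(per_file, axis=0)

# Hypothetical second column merged onto the same row.
by_cols = stack_tables([[{"pdb": "pdb/1l2y"}], [{"label": 1}]], axis=1)

print(by_rows)
print(by_cols)
```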
Completing the dataset's metadata and uploading it to the Hub follows the same steps as in the dataset_zarr.ipynb tutorial.
The End.