Curation

detect_streoisomer_activity_cliff

detect_streoisomer_activity_cliff(dataset: pd.DataFrame, stereoisomer_id_col: str, y_cols: List[str], threshold: float = 2.0, prefix: str = 'AC_') -> pd.DataFrame

Detect activity cliffs among stereoisomers, based on the classification label or, for continuous values, on a predefined threshold.

Parameters:

Name Type Description Default
dataset DataFrame

The input dataframe.

required
stereoisomer_id_col str

Column which identifies the stereoisomers

required
y_cols List[str]

List of columns for bioactivities

required
threshold float

Threshold used to identify an activity cliff. Currently, the difference in z-scores between isomers is used for identification.

2.0
prefix str

Prefix for the added columns.

'AC_'
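The z-score criterion described for threshold can be illustrated with plain pandas. This is only a sketch of the idea, not the library's implementation: the helper name flag_stereo_cliffs, the column names, the use of the max-min spread within a group, and the population standard deviation are all assumptions made for the example.

```python
import pandas as pd


def flag_stereo_cliffs(df, group_col, y_col, threshold=2.0, prefix="AC_"):
    """Hypothetical helper: flag stereoisomer groups whose z-scored
    activities differ by more than `threshold`."""
    out = df.copy()
    # z-score over the whole column (population std; an assumption)
    z = (out[y_col] - out[y_col].mean()) / out[y_col].std(ddof=0)
    # largest z-score difference between isomers of the same group
    spread = z.groupby(out[group_col]).transform(lambda s: s.max() - s.min())
    out[f"{prefix}{y_col}"] = spread > threshold
    return out


df = pd.DataFrame(
    {
        "isomer_group": ["a", "a", "b", "b", "c"],  # illustrative IDs
        "pIC50": [5.0, 9.0, 6.0, 6.1, 7.0],         # illustrative values
    }
)
flagged = flag_stereo_cliffs(df, "isomer_group", "pIC50")
print(flagged)
```

Here only group "a" is flagged, because its two isomers sit almost three standard deviations apart, while the isomers in group "b" are nearly identical.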

deduplicate

deduplicate(dataset: pd.DataFrame, deduplicate_on: Optional[Union[str, List[str]]] = None, y_cols: Optional[Union[str, List[str]]] = None, keep: Literal['first', 'last'] = 'first', method: Literal['mean', 'median'] = 'median') -> pd.DataFrame

Deduplicate a dataframe.

If deduplicate_on specifies a subset of all columns in the dataset and y_cols specifies a set of non-overlapping columns, data will be grouped by deduplicate_on and the y_cols will be aggregated to a single value per group according to method.

Parameters:

Name Type Description Default
dataset DataFrame

The dataset to deduplicate.

required
deduplicate_on Optional[Union[str, List[str]]]

A subset of the columns to deduplicate on (may be left at the default, None).

None
y_cols Optional[Union[str, List[str]]]

The columns to aggregate.

None
keep Literal['first', 'last']

Whether to keep the first or last copy of the duplicates.

'first'
method Literal['mean', 'median']

The method to aggregate the data.

'median'
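The grouping behavior described above (group by deduplicate_on, aggregate y_cols with method) resembles a pandas group-by. The following is a minimal sketch under that assumption. It does not call the library, and how the remaining columns are resolved via keep is a guess made for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "smiles": ["CCO", "CCO", "CCN", "CCO"],
        "source": ["lab1", "lab2", "lab1", "lab3"],
        "activity": [1.0, 3.0, 5.0, 10.0],
    }
)

deduplicate_on = "smiles"
method = "median"  # or "mean"
keep = "first"     # or "last"; assumed to apply to non-aggregated columns

# One row per group: y columns are aggregated, other columns keep one copy.
agg_spec = {"activity": method, "source": keep}
deduped = df.groupby(deduplicate_on, as_index=False, sort=False).agg(agg_spec)
print(deduped)
```

The three "CCO" rows collapse into one row with the median activity 3.0, and the single "CCN" row is kept unchanged.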

discretize

discretize(X: np.ndarray, thresholds: Union[np.ndarray, list], inplace: bool = False, allow_nan: bool = True, label_order: Literal['ascending', 'descending'] = 'ascending') -> np.ndarray

Threshold an array-like or scipy.sparse matrix into binary or multiclass labels.

Parameters:

Name Type Description Default
X ndarray

The data to discretize, element by element. scipy.sparse matrices should be in CSR or CSC format to avoid an unnecessary copy.

required
thresholds Union[ndarray, list]

Interval boundaries that include the right bin edge.

required
inplace bool

Set to True to perform inplace discretization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR / CSC matrix and if axis is 1).

False
allow_nan bool

Set to True to allow NaNs in the array during discretization. Otherwise, an error is raised.

True
label_order Literal['ascending', 'descending']

The continuous values are discretized to labels 0, 1, 2, ..., N with respect to the given threshold bins [threshold_1, threshold_2, ..., threshold_n]. When set to 'ascending', the class labels are in ascending order with the threshold bins: 0 represents the negative or lowest class, while 1, 2, 3 represent higher classes. When set to 'descending', the class labels are in descending order with the threshold bins. This is useful when the positive labels lie on the left side of the provided threshold. E.g., for binarization with threshold [0.5] where the positive label is defined by X < 0.5, label_order should be 'descending'.

'ascending'

Returns:

Name Type Description
X_tr ndarray

The transformed data.
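The effect of thresholds and label_order can be sketched with np.digitize. This is an illustration of the described behavior rather than the library's code: the helper discretize_sketch is hypothetical, NaNs are simply passed through, and values exactly on a threshold are placed in the lower bin on the assumption that bins include their right edge.

```python
import numpy as np


def discretize_sketch(X, thresholds, label_order="ascending"):
    """Hypothetical stand-in that mimics the documented labelling."""
    X = np.asarray(X, dtype=float)
    thresholds = np.sort(np.asarray(thresholds, dtype=float))
    # right=True: a value equal to a threshold falls in the lower bin
    labels = np.digitize(X, thresholds, right=True).astype(float)
    if label_order == "descending":
        # flip the labels so the lowest bin gets the highest class
        labels = len(thresholds) - labels
    labels[np.isnan(X)] = np.nan  # keep missing values missing
    return labels


x = np.array([0.2, 0.7, 0.9, np.nan])
asc = discretize_sketch(x, [0.5])                 # 1 means X > 0.5
desc = discretize_sketch(x, [0.5], "descending")  # 1 means X below 0.5
multi = discretize_sketch([0.5, 1.5, 2.5], [1.0, 2.0])
print(asc, desc, multi)
```

With a single threshold the two label orders are mirror images of each other. With two thresholds the values fall into classes 0, 1 and 2.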

curate_molecules

curate_molecules(mols: List[Union[str, dm.Mol]], progress: bool = True, remove_stereo: bool = False, fix_mol: bool = True, count_stereoisomers: bool = True, count_stereocenters: bool = True, **parallelized_kwargs) -> Tuple

Curate a list of molecules.

Parameters:

Name Type Description Default
mols List[Union[str, Mol]]

List of molecules.

required
progress bool

Whether to show curation progress.

True
fix_mol bool

Whether to fix errors in the molecule.

True
remove_stereo bool

Whether to remove stereochemistry information from the molecule.

False
count_stereoisomers bool

Whether to count the number of stereoisomers of the molecule.

True
count_stereocenters bool

Whether to count the number of stereocenters of the molecule.

True

Returns:

Name Type Description
mol_dict Tuple

Dictionary of molecules and additional metadata

num_invalid Tuple

Number of invalid molecules

detect_outliers

detect_outliers(X: np.ndarray, method: OutlierDetectionMethod = 'zscore', **kwargs: Any)

Functional interface for detecting outliers

Parameters:

Name Type Description Default
X ndarray

The observations that we want to classify as inliers or outliers.

required
method OutlierDetectionMethod

The method to use for outlier detection.

'zscore'
**kwargs Any

Keyword arguments for the outlier detection method.

{}
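For intuition, the default 'zscore' method can be approximated as below. This is a sketch under stated assumptions, not the library's implementation: the helper zscore_outliers and its threshold parameter (set to 3.0 here) are hypothetical, and the actual keyword arguments accepted through **kwargs are not documented in this section.

```python
import numpy as np


def zscore_outliers(X, threshold=3.0):
    """Hypothetical helper: True where |z-score| exceeds `threshold`."""
    X = np.asarray(X, dtype=float)
    # NaN-aware mean and (population) standard deviation
    z = (X - np.nanmean(X)) / np.nanstd(X)
    return np.abs(z) > threshold


values = np.array(
    [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 25.0]
)
mask = zscore_outliers(values)
print(values[mask])
```

Only the last observation is flagged. Note that with very few observations a single extreme value cannot reach a z-score of 3, so small datasets may need a lower threshold or a different method.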