Curation

detect_streoisomer_activity_cliff

detect_streoisomer_activity_cliff(dataset: pd.DataFrame, stereoisomer_id_col: str, y_cols: List[str], threshold: float = 2.0, prefix: str = 'AC_') -> pd.DataFrame

Detect activity cliffs among stereoisomers, based either on classification labels or on a pre-defined threshold for continuous values.

Parameters:

Name	Type	Description	Default
`dataset`	`DataFrame`	Dataframe	required
`stereoisomer_id_col`	`str`	Column which identifies the stereoisomers	required
`y_cols`	`List[str]`	List of columns for bioactivities	required
`threshold`	`float`	Threshold used to identify an activity cliff. Currently, the difference in z-scores between isomers is used for identification.	`2.0`
`prefix`	`str`	Prefix for the added columns.	`'AC_'`
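
The z-score criterion described for `threshold` can be illustrated with plain pandas. This is only a sketch of the idea, not the library's implementation: the helper name `flag_stereo_cliffs`, the column names, and the choice to z-score each column over the whole dataset with a population standard deviation are all assumptions made for illustration.

```python
import pandas as pd


def flag_stereo_cliffs(df, group_col, y_col, threshold=2.0, prefix="AC_"):
    """Hypothetical helper: flag stereoisomer groups with a large z-score spread."""
    out = df.copy()
    # Z-score the activity column over the whole dataset (assumption).
    z = (out[y_col] - out[y_col].mean()) / out[y_col].std(ddof=0)
    # Spread of z-scores within each stereoisomer group, broadcast to every row.
    grouped = z.groupby(out[group_col])
    spread = grouped.transform("max") - grouped.transform("min")
    # One boolean column per activity column, named with the prefix.
    out[f"{prefix}{y_col}"] = spread > threshold
    return out


data = pd.DataFrame(
    {
        "stereo_id": ["a", "a", "b", "b", "c"],
        "pIC50": [5.0, 9.0, 6.0, 6.1, 7.0],
    }
)
flagged = flag_stereo_cliffs(data, "stereo_id", "pIC50")
```

In this toy data only group `a` contains isomers whose activities differ strongly, so only its rows are flagged.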

deduplicate

deduplicate(dataset: pd.DataFrame, deduplicate_on: Optional[Union[str, List[str]]] = None, y_cols: Optional[Union[str, List[str]]] = None, keep: Literal['first', 'last'] = 'first', method: Literal['mean', 'median'] = 'median') -> pd.DataFrame

Deduplicate a dataframe.

If deduplicate_on specifies a subset of all columns in the dataset and y_cols specifies a set of non-overlapping columns, data will be grouped by deduplicate_on and the y_cols will be aggregated to a single value per group according to method.

Parameters:

Name	Type	Description	Default
`dataset`	`DataFrame`	The dataset to deduplicate.	required
`deduplicate_on`	`Optional[Union[str, List[str]]]`	A subset of the columns to deduplicate on. Optional.	`None`
`y_cols`	`Optional[Union[str, List[str]]]`	The columns to aggregate.	`None`
`keep`	`Literal['first', 'last']`	Whether to keep the first or last copy of the duplicates.	`'first'`
`method`	`Literal['mean', 'median']`	The method to aggregate the data.	`'median'`
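
The group-and-aggregate behaviour described above can be sketched with pandas alone. The helper `dedup_sketch` and the toy columns are hypothetical, and the way the remaining (non-aggregated) columns are resolved with `keep` is an assumption rather than a statement about the library's internals.

```python
import pandas as pd


def dedup_sketch(df, on, y_cols, keep="first", method="median"):
    """Hypothetical helper: one row per group, y columns aggregated."""
    # Aggregate the activity columns to a single value per group.
    agg = df.groupby(on, sort=False)[y_cols].agg(method).reset_index()
    # For all other columns, keep the first (or last) row of each group.
    rest = df.drop(columns=y_cols).drop_duplicates(subset=on, keep=keep)
    return rest.merge(agg, on=on)


df = pd.DataFrame(
    {
        "smiles": ["CCO", "CCO", "CCN", "CCO"],
        "source": ["x", "y", "z", "w"],
        "pIC50": [5.0, 6.0, 7.0, 10.0],
    }
)
result = dedup_sketch(df, on=["smiles"], y_cols=["pIC50"])
```

The three `CCO` rows collapse into one row carrying the median of their `pIC50` values, while `CCN` is left unchanged.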

discretize

discretize(X: np.ndarray, thresholds: Union[np.ndarray, list], inplace: bool = False, allow_nan: bool = True, label_order: Literal['ascending', 'descending'] = 'ascending') -> np.ndarray

Thresholding of array-like or scipy.sparse matrix into binary or multiclass labels.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	The data to discretize, element by element. scipy.sparse matrices should be in CSR or CSC format to avoid an unnecessary copy.	required
`thresholds`	`Union[ndarray, list]`	Interval boundaries that include the right bin edge.	required
`inplace`	`bool`	Set to True to perform inplace discretization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR / CSC matrix and if axis is 1).	`False`
`allow_nan`	`bool`	Set to True to allow nans in the array for discretization. Otherwise, an error will be raised instead.	`True`
`label_order`	`Literal['ascending', 'descending']`	The continuous values are discretized to labels 0, 1, 2, ..., N with respect to the given threshold bins [threshold_1, threshold_2, ..., threshold_n]. When set to 'ascending', the class labels are in ascending order with the threshold bins, so that `0` represents the negative or lowest class and 1, 2, 3 represent higher classes. When set to 'descending', the class labels are in descending order with the threshold bins. This is useful when the positive labels are on the left side of the provided threshold. E.g., for binarization with threshold [0.5] where the positive label is defined by `X < 0.5`, `label_order` should be `'descending'`.	`'ascending'`

Returns:

Name	Type	Description
`X_tr`	`ndarray`	The transformed data.
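
The effect of `thresholds`, `allow_nan`, and `label_order` can be approximated with `numpy.digitize`. This is a sketch under stated assumptions, not the library's code: `discretize_sketch` is a hypothetical name, values equal to a threshold are assigned to the lower bin (one reading of "include the right bin edge"), and NaNs are passed through unchanged when allowed.

```python
import numpy as np


def discretize_sketch(X, thresholds, allow_nan=True, label_order="ascending"):
    """Hypothetical helper: map continuous values to integer class labels."""
    X = np.asarray(X, dtype=float)
    nan_mask = np.isnan(X)
    if nan_mask.any() and not allow_nan:
        raise ValueError("NaN values found but allow_nan is False")
    bins = np.sort(np.asarray(thresholds, dtype=float))
    # right=True: a value equal to a threshold falls into the lower bin.
    labels = np.digitize(X, bins, right=True).astype(float)
    if label_order == "descending":
        # Flip the labels so that the lowest values get the highest class.
        labels = len(bins) - labels
    # Keep missing values as NaN instead of assigning them a class.
    labels[nan_mask] = np.nan
    return labels


ascending = discretize_sketch([0.2, 0.7, np.nan], [0.5])
descending = discretize_sketch([0.2, 0.7, np.nan], [0.5], label_order="descending")
multiclass = discretize_sketch([1.0, 5.0, 9.0], [3.0, 7.0])
```

With `label_order="descending"`, the value below the threshold becomes the positive class, which matches the `X < 0.5` case described for `label_order`.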

curate_molecules

curate_molecules(mols: List[Union[str, dm.Mol]], progress: bool = True, remove_stereo: bool = False, fix_mol: bool = True, count_stereoisomers: bool = True, count_stereocenters: bool = True, **parallelized_kwargs) -> Tuple

Curate a list of molecules.

Parameters:

Name	Type	Description	Default
`mols`	`List[Union[str, Mol]]`	List of molecules.	required
`progress`	`bool`	Whether to show curation progress.	`True`
`fix_mol`	`bool`	Whether to fix errors in the molecule.	`True`
`remove_stereo`	`bool`	Whether to remove stereochemistry information from the molecule.	`False`
`count_stereoisomers`	`bool`	Whether to count the number of stereoisomers of the molecule.	`True`
`count_stereocenters`	`bool`	Whether to count the number of stereocenters of the molecule.	`True`

Returns:

Name	Type	Description
`mol_dict`	`Tuple`	Dictionary of molecules and additional metadata.
`num_invalid`	`Tuple`	Number of invalid molecules.

detect_outliers

detect_outliers(X: np.ndarray, method: OutlierDetectionMethod = 'zscore', **kwargs: Any)

Functional interface for detecting outliers.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	The observations that we want to classify as inliers or outliers.	required
`method`	`OutlierDetectionMethod`	The method to use for outlier detection.	`'zscore'`
`**kwargs`	`Any`	Keyword arguments for the outlier detection method.	`{}`
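
To make the default `'zscore'` method concrete, the sketch below flags observations by their absolute z-score using NumPy only. It is an illustration under assumptions, not the library's implementation: the helper name `zscore_outliers`, the `threshold` argument, and its value of 3.0 are hypothetical.

```python
import numpy as np


def zscore_outliers(X, threshold=3.0):
    """Hypothetical helper: True where |z-score| exceeds the threshold."""
    X = np.asarray(X, dtype=float)
    # Column-wise z-scores, ignoring NaNs when estimating mean and std.
    z = (X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0)
    return np.abs(z) > threshold


# Nineteen identical observations and one extreme value.
values = np.array([1.0] * 19 + [50.0])
is_outlier = zscore_outliers(values)
```

Only the last observation lies more than three standard deviations from the mean, so it is the only one marked as an outlier.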