Curation
detect_streoisomer_activity_cliff
detect_streoisomer_activity_cliff(dataset: pd.DataFrame, stereoisomer_id_col: str, y_cols: List[str], threshold: float = 2.0, prefix: str = 'AC_') -> pd.DataFrame
Detect activity cliffs among stereoisomers, based on the classification label or on a pre-defined threshold for continuous values.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| dataset | DataFrame | Dataframe containing the stereoisomers and their bioactivity columns. | required | 
| stereoisomer_id_col | str | Column which identifies the stereoisomers | required | 
| y_cols | List[str] | List of columns for bioactivities | required | 
| threshold | float | Threshold used to identify an activity cliff. Currently, the difference in z-scores between isomers is used for identification. | 2.0 | 
| prefix | str | Prefix for the added columns. | 'AC_' | 
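The z-score criterion described for `threshold` can be illustrated with plain pandas. This is a minimal sketch and not the library's implementation: the helper `flag_stereo_cliffs`, the column names, and the choice to compare the largest and smallest z-score within each stereoisomer group are all assumptions made for illustration.

```python
import pandas as pd


def flag_stereo_cliffs(df, group_col, y_col, threshold=2.0, prefix="AC_"):
    """Hypothetical helper: flag stereoisomer groups whose z-score spread exceeds `threshold`."""
    out = df.copy()
    # Standardize the activity column over the whole dataset.
    z = (out[y_col] - out[y_col].mean()) / out[y_col].std()
    # Spread of z-scores within each stereoisomer group, broadcast back to every row.
    spread = z.groupby(out[group_col]).transform(lambda s: s.max() - s.min())
    out[f"{prefix}{y_col}"] = spread > threshold
    return out


data = pd.DataFrame(
    {
        "stereo_id": ["a", "a", "b", "b", "c", "c"],
        "pIC50": [5.0, 9.0, 6.0, 6.1, 7.0, 7.2],
    }
)
flagged = flag_stereo_cliffs(data, "stereo_id", "pIC50", threshold=2.0)
print(flagged)
```

Only group `a`, whose two isomers differ by four log units, is flagged in the new `AC_pIC50` column.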
deduplicate
deduplicate(dataset: pd.DataFrame, deduplicate_on: Optional[Union[str, List[str]]] = None, y_cols: Optional[Union[str, List[str]]] = None, keep: Literal['first', 'last'] = 'first', method: Literal['mean', 'median'] = 'median') -> pd.DataFrame
Deduplicate a dataframe.
If deduplicate_on specifies a subset of all columns in the dataset and y_cols specifies a set
of non-overlapping columns, data will be grouped by deduplicate_on and the y_cols will be aggregated
to a single value per group according to method.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| dataset | DataFrame | The dataset to deduplicate. | required | 
| deduplicate_on | Optional[Union[str, List[str]]] | A subset of the columns to deduplicate on (can be default). | None | 
| y_cols | Optional[Union[str, List[str]]] | The columns to aggregate. | None | 
| keep | Literal['first', 'last'] | Whether to keep the first or last copy of the duplicates. | 'first' | 
| method | Literal['mean', 'median'] | The method to aggregate the data. | 'median' | 
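The grouping behavior described above can be sketched with a pandas `groupby`. This is an assumption-laden illustration rather than the function's actual code: the column names are made up, and keeping the first value of the non-aggregated column stands in for `keep='first'`.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "smiles": ["CCO", "CCO", "CCN", "CCO"],
        "source": ["x", "y", "z", "w"],
        "activity": [1.0, 3.0, 5.0, 8.0],
    }
)

# Group on the deduplication key, keep the first copy of the other columns,
# and collapse the activity column to its median per group.
deduped = data.groupby("smiles", as_index=False, sort=False).agg(
    {"source": "first", "activity": "median"}
)
print(deduped)
```

The three `CCO` rows collapse into one row with the median activity, while `CCN` is left unchanged.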
discretize
discretize(X: np.ndarray, thresholds: Union[np.ndarray, list], inplace: bool = False, allow_nan: bool = True, label_order: Literal['ascending', 'descending'] = 'ascending') -> np.ndarray
Thresholding of array-like or scipy.sparse matrix into binary or multiclass labels.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| X | ndarray | The data to discretize, element by element. scipy.sparse matrices should be in CSR or CSC format to avoid an unnecessary copy. | required | 
| thresholds | Union[ndarray, list] | Interval boundaries that include the right bin edge. | required | 
| inplace | bool | Set to True to perform inplace discretization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR / CSC matrix and if axis is 1). | False | 
| allow_nan | bool | Set to True to allow nans in the array for discretization. Otherwise, an error will be raised instead. | True | 
| label_order | Literal['ascending', 'descending'] | The continuous values are discretized to labels 0, 1, 2, ..., N with respect to the given threshold bins [threshold_1, threshold_2, ..., threshold_n]. When set to 'ascending', the class label is in ascending order with the threshold bins that | 'ascending' | 
Returns:
| Name | Type | Description | 
|---|---|---|
| X_tr | ndarray | The transformed data. | 
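Thresholding with right-inclusive bin edges can be illustrated with `numpy.digitize`. This is a sketch of the general technique, not of this function's internals: reading "include the right bin edge" as `right=True`, and preserving NaN values as NaN labels, are assumptions.

```python
import numpy as np

values = np.array([4.0, 5.0, 6.0, 7.0, 8.0, np.nan])
thresholds = [5.0, 7.0]

# right=True puts a value equal to a threshold into the lower bin,
# i.e. the bins are (-inf, 5], (5, 7], (7, inf).
labels = np.digitize(values, thresholds, right=True).astype(float)

# np.digitize assigns NaN to the last bin, so restore missing values explicitly.
labels[np.isnan(values)] = np.nan
print(labels)
```

Two thresholds produce three classes (0, 1, 2), and the missing value stays missing.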
curate_molecules
curate_molecules(mols: List[Union[str, dm.Mol]], progress: bool = True, remove_stereo: bool = False, fix_mol: bool = True, count_stereoisomers: bool = True, count_stereocenters: bool = True, **parallelized_kwargs) -> Tuple
Curate a list of molecules.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| mols | List[Union[str, Mol]] | List of molecules. | required | 
| progress | bool | Whether to show curation progress. | True | 
| fix_mol | bool | Whether to fix errors in the molecule. | True | 
| remove_stereo | bool | Whether to remove stereochemistry information from the molecule. | False | 
| count_stereoisomers | bool | Whether to count the number of stereoisomers of the molecule. | True | 
| count_stereocenters | bool | Whether to count the number of stereocenters of the molecule. | True | 
Returns:
| Name | Type | Description | 
|---|---|---|
| mol_dict | Tuple | Dictionary of molecules and additional metadata. | 
| num_invalid | Tuple | Number of invalid molecules. | 
detect_outliers
Functional interface for detecting outliers.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| X | ndarray | The observations that we want to classify as inliers or outliers. | required | 
| method | OutlierDetectionMethod | The method to use for outlier detection. | 'zscore' | 
| **kwargs | Any | Keyword arguments for the outlier detection method. | {} |
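For orientation, a z-score based outlier check (the default `method`) can be sketched in NumPy as follows. The helper `zscore_outliers` and its cutoff of 3.0 are hypothetical; the keyword arguments and return format of the actual function are not documented here.

```python
import numpy as np


def zscore_outliers(x, threshold=3.0):
    """Hypothetical helper: mark values whose absolute z-score exceeds `threshold`."""
    x = np.asarray(x, dtype=float)
    # Population z-score, ignoring NaN values when estimating mean and spread.
    z = (x - np.nanmean(x)) / np.nanstd(x)
    return np.abs(z) > threshold


observations = [10.0] * 10 + [100.0]
mask = zscore_outliers(observations)
print(mask)
```

Only the final observation is marked as an outlier; the ten identical values are inliers.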