Curation
detect_streoisomer_activity_cliff
detect_streoisomer_activity_cliff(dataset: pd.DataFrame, stereoisomer_id_col: str, y_cols: List[str], threshold: float = 2.0, prefix: str = 'AC_') -> pd.DataFrame
Detect activity cliffs among stereoisomers, based on classification labels or on a predefined threshold for continuous values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | DataFrame | The input dataframe. | required |
stereoisomer_id_col | str | Column that identifies the stereoisomers. | required |
y_cols | List[str] | List of columns holding the bioactivities. | required |
threshold | float | Threshold used to identify an activity cliff. Currently, the difference in z-scores between isomers is used for identification. | 2.0 |
prefix | str | Prefix for the added columns. | 'AC_' |
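The z-score criterion described for threshold can be illustrated with a minimal, dependency-free sketch. This is not the library's implementation: the helper name flag_activity_cliffs, the use of plain dictionaries instead of a DataFrame, and the choice to standardize over the whole column before comparing isomers are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_activity_cliffs(records, group_key, y_key, threshold=2.0):
    """Hypothetical helper: flag groups whose z-score spread exceeds `threshold`."""
    values = [r[y_key] for r in records]
    mu, sigma = mean(values), pstdev(values)

    # Collect the z-score of every member of each stereoisomer group.
    # Assumption: z-scores are computed over the whole column.
    groups = defaultdict(list)
    for r in records:
        z = (r[y_key] - mu) / sigma if sigma > 0 else 0.0
        groups[r[group_key]].append(z)

    # A group counts as a cliff when its largest z-score gap exceeds the threshold.
    return {g: (max(zs) - min(zs)) > threshold for g, zs in groups.items()}


records = [
    {"group": "A", "pIC50": 5.0},
    {"group": "A", "pIC50": 9.0},
    {"group": "B", "pIC50": 6.0},
    {"group": "B", "pIC50": 6.2},
    {"group": "C", "pIC50": 5.5},
]
cliffs = flag_activity_cliffs(records, "group", "pIC50")
```

In this toy data only group A, whose two isomers differ by four log units, is flagged.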
deduplicate
deduplicate(dataset: pd.DataFrame, deduplicate_on: Optional[Union[str, List[str]]] = None, y_cols: Optional[Union[str, List[str]]] = None, keep: Literal['first', 'last'] = 'first', method: Literal['mean', 'median'] = 'median') -> pd.DataFrame
Deduplicate a dataframe.
If deduplicate_on specifies a subset of all columns in the dataset and y_cols specifies a set of non-overlapping columns, data will be grouped by deduplicate_on and the y_cols will be aggregated to a single value per group according to method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | DataFrame | The dataset to deduplicate. | required |
deduplicate_on | Optional[Union[str, List[str]]] | A subset of the columns to deduplicate on (can be default). | None |
y_cols | Optional[Union[str, List[str]]] | The columns to aggregate. | None |
keep | Literal['first', 'last'] | Whether to keep the first or last copy of the duplicates. | 'first' |
method | Literal['mean', 'median'] | The method to aggregate the data. | 'median' |
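The group-then-aggregate behaviour described above can be sketched without any dependencies. The helper aggregate_duplicates and its list-of-dictionaries input are hypothetical stand-ins chosen for illustration. The actual function operates on a DataFrame and may handle edge cases differently.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_duplicates(rows, group_cols, y_cols, method="median"):
    """Hypothetical helper: collapse rows that share the same `group_cols` values."""
    agg = {"mean": mean, "median": median}[method]

    # Bucket rows by the tuple of values found in the grouping columns.
    buckets = defaultdict(list)
    for row in rows:
        buckets[tuple(row[c] for c in group_cols)].append(row)

    result = []
    for key, members in buckets.items():
        merged = dict(zip(group_cols, key))
        # Reduce every target column to a single value per group.
        for y in y_cols:
            merged[y] = agg([m[y] for m in members])
        result.append(merged)
    return result


rows = [
    {"smiles": "CCO", "pIC50": 5.0},
    {"smiles": "CCO", "pIC50": 7.0},
    {"smiles": "CCO", "pIC50": 9.0},
    {"smiles": "CCN", "pIC50": 6.0},
]
deduplicated = aggregate_duplicates(rows, ["smiles"], ["pIC50"])
```

Here the three CCO measurements collapse into one row holding their median, while the single CCN row is kept unchanged.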
discretize
discretize(X: np.ndarray, thresholds: Union[np.ndarray, list], inplace: bool = False, allow_nan: bool = True, label_order: Literal['ascending', 'descending'] = 'ascending') -> np.ndarray
Thresholding of array-like or scipy.sparse matrix into binary or multiclass labels.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | The data to discretize, element by element. scipy.sparse matrices should be in CSR or CSC format to avoid an unnecessary copy. | required |
thresholds | Union[ndarray, list] | Interval boundaries that include the right bin edge. | required |
inplace | bool | Set to True to perform inplace discretization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR / CSC matrix and if axis is 1). | False |
allow_nan | bool | Set to True to allow NaNs in the array for discretization. Otherwise, an error is raised. | True |
label_order | Literal['ascending', 'descending'] | The continuous values are discretized to labels 0, 1, 2, .., N with respect to the given threshold bins [threshold_1, threshold_2,.., threshold_n]. When set to 'ascending', the class label is in ascending order with the threshold bins that | 'ascending' |
Returns:
Name | Type | Description |
---|---|---|
X_tr | ndarray | The transformed data. |
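To make the "include the right bin edge" rule concrete, here is a small standard-library sketch. The helper discretize_values is hypothetical and works on plain lists. In particular, reading 'descending' as a simple reversal of the label order is an assumption rather than a documented behaviour.

```python
from bisect import bisect_left
from math import isnan


def discretize_values(values, thresholds, label_order="ascending"):
    """Hypothetical helper: map each value to a bin index.

    A value equal to a threshold falls in the lower bin (right edge included).
    """
    thresholds = sorted(thresholds)
    labels = []
    for v in values:
        if isnan(v):
            labels.append(v)  # NaNs pass through unchanged in this sketch
            continue
        label = bisect_left(thresholds, v)  # number of thresholds strictly below v
        if label_order == "descending":
            # Assumed meaning of 'descending': reverse the label order.
            label = len(thresholds) - label
        labels.append(label)
    return labels


labels = discretize_values([0.5, 1.0, 1.5, 2.0, 3.0], thresholds=[1.0, 2.0])
```

With thresholds 1.0 and 2.0, the values 1.0 and 2.0 themselves land in the lower of the two bins they border, so the five inputs map to the labels 0, 0, 1, 1, 2.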
curate_molecules
curate_molecules(mols: List[Union[str, dm.Mol]], progress: bool = True, remove_stereo: bool = False, fix_mol: bool = True, count_stereoisomers: bool = True, count_stereocenters: bool = True, **parallelized_kwargs) -> Tuple
Curate a list of molecules.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mols | List[Union[str, Mol]] | List of molecules. | required |
progress | bool | Whether to show curation progress. | True |
fix_mol | bool | Whether to fix errors in the molecule. | True |
remove_stereo | bool | Whether to remove stereochemistry information from the molecule. | False |
count_stereoisomers | bool | Whether to count the number of stereoisomers of the molecule. | True |
count_stereocenters | bool | Whether to count the number of stereocenters of the molecule. | True |
Returns:
Name | Type | Description |
---|---|---|
mol_dict | Tuple | Dictionary of molecule and additional metadata |
num_invalid | Tuple | Number of invalid molecules |
detect_outliers
Functional interface for detecting outliers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | The observations that we want to classify as inliers or outliers. | required |
method | OutlierDetectionMethod | The method to use for outlier detection. | 'zscore' |
**kwargs | Any | Keyword arguments for the outlier detection method. | {} |
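As an illustration of what a z-score based method typically does, the sketch below flags observations that lie far from the mean in units of standard deviation. The helper zscore_outliers, its threshold parameter, and the default of 3.0 are assumptions for this example. They are not taken from the library, whose method-specific keyword arguments are not listed here.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Hypothetical helper: True marks an observation whose |z-score| exceeds `threshold`."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # All observations are identical, so none can be an outlier.
        return [False] * len(values)
    return [abs((v - mu) / sigma) > threshold for v in values]


values = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 25.0]
flags = zscore_outliers(values, threshold=2.5)
```

Only the last observation (25.0) is flagged. Note that with small samples a single extreme value also inflates the standard deviation, which is why a threshold below 3.0 is used in this ten-point example.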