Curator
auroris.curation.Curator
Bases: BaseModel
A curator is a serializable collection of actions that are applied to a dataset.
Attributes:
Name | Type | Description |
---|---|---|
steps | List[BaseAction] | Ordered list of curation actions to apply to the dataset. |
src_dataset_path | Optional[str] | An optional path to load the source dataset from. Can be used to specify a reproducible workflow. |
verbosity | VerbosityLevel | Verbosity level for logging. |
parallelized_kwargs | dict | Keyword arguments to affect parallelization in the steps. |
transform
Runs the curation process.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Optional[DataFrame] | The dataset to be curated. If | None |
Returns:
Type | Description |
---|---|
Tuple[DataFrame, CurationReport] | A tuple of the curated dataset and a report summarizing the changes made. |
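The pattern described above, an ordered list of steps applied in sequence with a report returned alongside the result, can be sketched in plain Python. This is only an illustration under stated assumptions: `run_steps`, `drop_missing`, `deduplicate`, and the list-of-dicts dataset are hypothetical stand-ins, not part of auroris, and the real `transform` works on a DataFrame and returns a CurationReport.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Dataset = List[Row]
# Hypothetical step signature: takes a dataset, returns (new dataset, log message).
Step = Callable[[Dataset], Tuple[Dataset, str]]


def drop_missing(dataset: Dataset) -> Tuple[Dataset, str]:
    """Illustrative step: remove rows with a missing 'activity' value."""
    kept = [row for row in dataset if row.get("activity") is not None]
    return kept, f"drop_missing: removed {len(dataset) - len(kept)} row(s)"


def deduplicate(dataset: Dataset) -> Tuple[Dataset, str]:
    """Illustrative step: keep the first row seen for each 'smiles' value."""
    seen = set()
    kept = []
    for row in dataset:
        if row["smiles"] not in seen:
            seen.add(row["smiles"])
            kept.append(row)
    return kept, f"deduplicate: removed {len(dataset) - len(kept)} row(s)"


def run_steps(dataset: Dataset, steps: List[Step]) -> Tuple[Dataset, List[str]]:
    """Apply each step in order and collect one report entry per step."""
    report = []
    for step in steps:
        dataset, message = step(dataset)
        report.append(message)
    return dataset, report


data = [
    {"smiles": "CCO", "activity": 1.5},
    {"smiles": "CCO", "activity": 1.7},
    {"smiles": "c1ccccc1", "activity": None},
]
curated, report = run_steps(data, [drop_missing, deduplicate])
```

Because each step receives the output of the previous one, the order of the steps affects both the curated result and the report entries.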
load_dataset
staticmethod
Loads a dataset, to be curated, from a path.
File-format support
This currently only supports CSV and Parquet files and uses the default parameters for pd.read_csv and pd.read_parquet. If you need more flexibility, consider loading the data yourself and passing it directly to Curator.transform(dataset=...).
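As a rough sketch of loading the data yourself, the snippet below reads a semicolon-delimited file with a custom missing-value marker, which the default pd.read_csv parameters would not parse as intended. The column names and the in-memory sample are made up for illustration, and the final call to the curator is shown only as a comment because `curator` is assumed to be an already configured Curator instance.

```python
import io

import pandas as pd

# Illustrative sample: semicolon-delimited, with "missing" marking absent values.
raw = io.StringIO("smiles;activity\nCCO;1.5\nc1ccccc1;missing\n")

# Non-default parameters: a custom separator and a custom NA marker.
dataset = pd.read_csv(raw, sep=";", na_values=["missing"])

# The loaded frame can then be passed to the curator directly, e.g.:
# curated, report = curator.transform(dataset=dataset)
```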
from_json
classmethod
Loads a curation workflow from a JSON file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The path to load from | required |
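The idea of restoring a workflow from a JSON file through a classmethod can be illustrated with the standard library alone. This is a hedged sketch, not the auroris implementation: `WorkflowSketch`, its `to_json` helper, the step dictionaries, and the file names are all hypothetical, and the real Curator is a pydantic BaseModel whose steps are BaseAction objects.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class WorkflowSketch:
    """Hypothetical stand-in for a serializable workflow; not the auroris class."""

    steps: List[Dict[str, object]] = field(default_factory=list)
    src_dataset_path: Optional[str] = None

    def to_json(self, path: str) -> None:
        """Write the workflow definition to a JSON file (hypothetical helper)."""
        with open(path, "w", encoding="utf-8") as fd:
            json.dump(asdict(self), fd)

    @classmethod
    def from_json(cls, path: str) -> "WorkflowSketch":
        """Rebuild a workflow from a JSON file written by to_json."""
        with open(path, "r", encoding="utf-8") as fd:
            return cls(**json.load(fd))


original = WorkflowSketch(
    steps=[{"name": "deduplicate"}, {"name": "outlier_detection", "threshold": 3.0}],
    src_dataset_path="data.csv",
)

# Round trip through a temporary file: save, then load via the classmethod.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "workflow.json")
    original.to_json(path)
    restored = WorkflowSketch.from_json(path)
```

Storing the source dataset path next to the steps is what lets a saved file describe a reproducible workflow rather than only a list of operations.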