Curator

auroris.curation.Curator

Bases: BaseModel

A curator is a serializable collection of actions that are applied to a dataset.

Attributes:

Name	Type	Description
`steps`	`List[BaseAction]`	Ordered list of curation actions to apply to the dataset.
`src_dataset_path`	`Optional[str]`	An optional path to load the source dataset from. Can be used to specify a reproducible workflow.
`verbosity`	`VerbosityLevel`	Verbosity level for logging.
`parallelized_kwargs`	`dict`	Keyword arguments that control parallelization within the steps.

transform

transform(dataset: Optional[pd.DataFrame] = None) -> Tuple[pd.DataFrame, CurationReport]

Runs the curation process.

Parameters:

Name	Type	Description	Default
`dataset`	`Optional[DataFrame]`	The dataset to be curated. If `src_dataset_path` is set, this parameter is ignored.	`None`

Returns:

Type	Description
`Tuple[DataFrame, CurationReport]`	A tuple of the curated dataset and a report summarizing the changes made.
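
The contract described above (an ordered list of steps, each applied in turn, with a report returned alongside the curated data) can be sketched without the library. This is a minimal stand-in, not auroris code: `MiniCurator`, `drop_missing`, and `deduplicate` are hypothetical names, rows are plain dictionaries rather than a `DataFrame`, and the report is a list of strings rather than a `CurationReport`.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
# A step takes the rows and returns the new rows plus a one-line summary.
Step = Callable[[List[Row]], Tuple[List[Row], str]]


def drop_missing(rows: List[Row]) -> Tuple[List[Row], str]:
    """Hypothetical step: remove rows that contain a None value."""
    kept = [r for r in rows if all(v is not None for v in r.values())]
    return kept, f"drop_missing: removed {len(rows) - len(kept)} row(s)"


def deduplicate(rows: List[Row]) -> Tuple[List[Row], str]:
    """Hypothetical step: keep the first occurrence of each identical row."""
    seen = set()
    kept: List[Row] = []
    for r in rows:
        key = tuple(sorted(r.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept, f"deduplicate: removed {len(rows) - len(kept)} row(s)"


class MiniCurator:
    """Hypothetical stand-in that applies its steps in order."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = steps

    def transform(self, rows: List[Row]) -> Tuple[List[Row], List[str]]:
        current = [dict(r) for r in rows]  # copy so the input is not mutated
        report: List[str] = []
        for step in self.steps:
            current, summary = step(current)
            report.append(summary)
        return current, report


raw = [
    {"smiles": "CCO", "activity": 1.2},
    {"smiles": "CCO", "activity": 1.2},
    {"smiles": "c1ccccc1", "activity": None},
]
curated, report = MiniCurator([drop_missing, deduplicate]).transform(raw)
```

Because the steps run in list order, swapping them can change both the result and the per-step summaries in the report.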

load_dataset `staticmethod`

load_dataset(path: str)

Loads the dataset to be curated from a path.

File-format support

Only CSV and Parquet files are currently supported, and they are read with the default parameters of `pd.read_csv` and `pd.read_parquet`. If you need more flexibility, load the data yourself and pass it directly to `Curator.transform(dataset=...)`.
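
When the defaults do not fit, for example with a semicolon-delimited file, the data can be read with pandas directly. A minimal sketch, assuming an in-memory file and made-up column names (`smiles`, `activity`); the final `transform` call appears only as a comment because it needs a configured `Curator` instance (here called `curator`, a hypothetical variable).

```python
import io

import pandas as pd

# Stand-in for a file on disk: semicolon-delimited, with "?" marking missing values.
raw = io.StringIO("smiles;activity\nCCO;1.2\nc1ccccc1;?\n")

# Non-default parameters, which a loader relying on pandas defaults would not apply.
df = pd.read_csv(raw, sep=";", na_values=["?"])

# The frame can then be passed in directly, for example:
# curated, report = curator.transform(dataset=df)
```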

from_json `classmethod`

from_json(path: str)

Loads a curation workflow from a JSON file.

Parameters:

Name	Type	Description	Default
`path`	`str`	The path to load from.	required

to_json

to_json(path: str)

Saves the curation workflow to a JSON file.

Parameters:

Name	Type	Description	Default
`path`	`str`	The destination to save to.	required
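
A save-and-reload round trip is what makes a workflow reproducible. The sketch below illustrates the idea with the standard library only; `WorkflowSketch` and its fields are hypothetical and do not reflect the JSON layout that `Curator.to_json` actually writes.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class WorkflowSketch:
    """Hypothetical stand-in for a serializable workflow."""

    steps: List[str] = field(default_factory=list)
    src_dataset_path: Optional[str] = None

    def to_json(self, path: str) -> None:
        # Write all fields as a JSON object.
        Path(path).write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def from_json(cls, path: str) -> "WorkflowSketch":
        # Rebuild the object from the fields stored on disk.
        return cls(**json.loads(Path(path).read_text()))


original = WorkflowSketch(
    steps=["deduplicate", "outlier_detection"],
    src_dataset_path="data.csv",
)
with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "workflow.json")
    original.to_json(target)
    restored = WorkflowSketch.from_json(target)
```

Storing the source path alongside the steps, as `src_dataset_path` does for a curator, lets the reloaded workflow be rerun without any further input.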
