Curator

auroris.curation.Curator

Bases: BaseModel

A curator is a serializable collection of actions that are applied to a dataset.

Attributes:

steps (List[BaseAction]): Ordered list of curation actions to apply to the dataset.

src_dataset_path (Optional[str]): An optional path to load the source dataset from. Can be used to specify a reproducible workflow.

verbosity (VerbosityLevel): Verbosity level for logging.

parallelized_kwargs (dict): Keyword arguments to affect parallelization in the steps.

transform

transform(dataset: Optional[pd.DataFrame] = None) -> Tuple[pd.DataFrame, CurationReport]

Runs the curation process.

Parameters:

dataset (Optional[DataFrame], default None): The dataset to be curated. If src_dataset_path is set, this parameter is ignored.

Returns:

Tuple[DataFrame, CurationReport]: A tuple of the curated dataset and a report summarizing the changes made.
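The behaviour described above (steps applied in order, a configured source path taking precedence over the dataset argument, and a (dataset, report) tuple as the result) can be illustrated with a small stand-in. This is a sketch only: ToyCurator, ToyReport, drop_missing and drop_negative are hypothetical names invented for illustration, not auroris classes, and the report format and the defensive copy are assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

import pandas as pd

# Hypothetical stand-ins for illustration; these are NOT the auroris classes.
Step = Callable[[pd.DataFrame], pd.DataFrame]


@dataclass
class ToyReport:
    """Collects one log line per applied step."""

    lines: List[str] = field(default_factory=list)


@dataclass
class ToyCurator:
    steps: List[Tuple[str, Step]]
    src_dataset_path: Optional[str] = None

    def transform(
        self, dataset: Optional[pd.DataFrame] = None
    ) -> Tuple[pd.DataFrame, ToyReport]:
        # As in the parameter description: a configured path wins over the argument.
        if self.src_dataset_path is not None:
            dataset = pd.read_csv(self.src_dataset_path)
        if dataset is None:
            raise ValueError("No dataset given and no src_dataset_path set.")

        report = ToyReport()
        current = dataset.copy()  # sketch-level choice: leave the caller's frame untouched
        for name, step in self.steps:
            before = len(current)
            current = step(current)
            report.lines.append(f"{name}: {before} -> {len(current)} rows")
        return current, report


def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows without a measured value."""
    return df.dropna(subset=["value"])


def drop_negative(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with a negative value."""
    return df[df["value"] >= 0]


raw = pd.DataFrame({"value": [1.0, None, -2.0, 3.0]})
curator = ToyCurator(
    steps=[("drop_missing", drop_missing), ("drop_negative", drop_negative)]
)
curated, report = curator.transform(raw)
print(curated)
print(report.lines)
```

Because the steps are an ordered list, swapping two entries can change both the result and the report, which is why the order is part of the serialized workflow.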

load_dataset staticmethod

load_dataset(path: str)

Loads the dataset to be curated from a path.

File-format support

This currently only supports CSV and Parquet files and uses the default parameters for pd.read_csv and pd.read_parquet. If you need more flexibility, consider loading the data yourself and passing it directly to Curator.transform(dataset=...).
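As an illustration of the "load it yourself" route, the sketch below reads a semicolon-delimited CSV with a custom missing-value marker, which the default pd.read_csv parameters would not parse as intended. The helper name load_semicolon_csv, the file contents and the "missing" marker are assumptions made up for this example.

```python
import os
import tempfile

import pandas as pd


def load_semicolon_csv(path: str) -> pd.DataFrame:
    """Read a ';'-delimited CSV and treat the literal 'missing' as NaN."""
    return pd.read_csv(path, sep=";", na_values=["missing"])


# Write a small example file so the sketch is runnable end to end.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "assay.csv")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("smiles;value\nCCO;1.5\nCCN;missing\n")
    df = load_semicolon_csv(path)

print(df)

# The frame would then be handed over directly, for example:
#   curated, report = curator.transform(dataset=df)
# where `curator` is an already-configured Curator whose src_dataset_path is unset.
```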

from_json classmethod

from_json(path: str)

Loads a curation workflow from a JSON file.

Parameters:

path (str, required): The path to load from.

to_json

to_json(path: str)

Saves the curation workflow to a JSON file.

Parameters:

path (str, required): The destination to save to.
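The to_json / from_json pair amounts to a save-and-restore round trip of the workflow definition. The sketch below shows that pattern with the standard library only. ToyStep and ToyWorkflow are hypothetical stand-ins, and the JSON layout is an assumption of this sketch; the actual schema written by Curator.to_json is not described here.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import List, Optional


# Hypothetical stand-ins for illustration; these are NOT the auroris classes.
@dataclass
class ToyStep:
    name: str
    params: dict = field(default_factory=dict)


@dataclass
class ToyWorkflow:
    steps: List[ToyStep]
    src_dataset_path: Optional[str] = None

    def to_json(self, path: str) -> None:
        """Write the workflow definition (not the data) to a JSON file."""
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(asdict(self), fh, indent=2)

    @classmethod
    def from_json(cls, path: str) -> "ToyWorkflow":
        """Rebuild a workflow from a file written by to_json."""
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        steps = [ToyStep(**step) for step in data.pop("steps")]
        return cls(steps=steps, **data)


workflow = ToyWorkflow(
    steps=[
        ToyStep("drop_missing", {"column": "value"}),
        ToyStep("clip", {"column": "value", "lower": 0.0}),
    ],
    src_dataset_path="data/assay.csv",  # made-up path, only stored, never read here
)

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "workflow.json")
    workflow.to_json(json_path)
    restored = ToyWorkflow.from_json(json_path)

print(restored == workflow)
```

Storing the source path alongside the steps is what lets a saved workflow be re-run later without passing the dataset in by hand.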