Helpers
This module contains helper functionality that is called from the entrypoints.
Dataset Statistics
- class src.helpers.statistics.GenerateTaxonomyStatistics(config: Config, logger: Logger, dataset: list[dict[str, str]], taxonomy: Taxonomy, max_level: int, local_storage_dir: str)[source]
Bases: object
This class calculates the distribution of labels for a given taxonomy, dataset and label depth combination.
- Typical usage example:
>>> dataset_builder = DatasetBuilder(...)
>>> stats = GenerateTaxonomyStatistics(
...     config=Config(),
...     logger=logging.logger,
...     dataset=dataset_builder.train_dataset,
...     taxonomy=Taxonomy(...),
...     max_level=4,
...     local_storage_dir="..."
... )
>>> stats.calculate_stats()
- _generate_level_stats(level: int) None[source]
Internal function that performs the actual calculation of the label distribution statistics.
- Parameters:
level – the level to generate the statistics for
- Returns:
None
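To make the idea of a per-level label distribution concrete, here is a minimal sketch. It is not the project's implementation: the function name `level_label_distribution`, the `labels` record key and the list-of-labels record layout are all assumptions made for illustration.

```python
from collections.abc import Iterable


def level_label_distribution(
    records: Iterable[dict], level_labels: list[str], label_key: str = "labels"
) -> dict[str, int]:
    """Count occurrences of each label on one taxonomy level.

    `label_key` and the record layout are assumptions made for this
    sketch, not the project's actual schema.
    """
    # Start every label at zero so labels that never occur are still reported.
    counts = dict.fromkeys(level_labels, 0)
    for record in records:
        for label in record.get(label_key, []):
            if label in counts:  # ignore labels that belong to other levels
                counts[label] += 1
    return counts


# Hypothetical toy records and level labels
records = [
    {"text": "doc 1", "labels": ["mobility", "housing"]},
    {"text": "doc 2", "labels": ["mobility"]},
    {"text": "doc 3", "labels": ["parking"]},  # label from another level
]
distribution = level_label_distribution(records, ["mobility", "housing", "culture"])
print(distribution)  # {'mobility': 2, 'housing': 1, 'culture': 0}
```

Initialising all counts to zero keeps labels without any examples visible, which is usually the interesting case when tracking labeling progress.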
- _get_level_labels(level: int) list[str][source]
This function is a wrapper around the taxonomy get_level_specific_labels function.
- Parameters:
level – level to retrieve labels from
- Returns:
the list of labels that occur on the provided level
- _prep_dataset() None[source]
This function creates a dataset object from the provided dataset. The dataset object handles most of the remapping needed to calculate the statistics easily.
- Returns:
None
- calculate_stats()[source]
This function calculates the statistics for each level up to the maximum level.
- Returns:
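The level-by-level iteration performed by calculate_stats could look roughly like the sketch below. The helper `stats_per_level` is hypothetical, and the sketch assumes that levels are 1-indexed and that the maximum level is inclusive; the source does not confirm either.

```python
from typing import Callable


def stats_per_level(
    max_level: int, stats_for_level: Callable[[int], dict[str, int]]
) -> dict[int, dict[str, int]]:
    """Run a per-level statistics function for levels 1..max_level."""
    # Assumption: levels are 1-indexed and max_level is inclusive.
    return {level: stats_for_level(level) for level in range(1, max_level + 1)}


# Hypothetical toy data: label counts already grouped by level
toy_counts = {1: {"mobility": 3}, 2: {"parking": 2, "cycling": 1}}
all_stats = stats_per_level(2, lambda level: toy_counts.get(level, {}))
print(all_stats)
```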
- classmethod from_checkpoint(config: Config, logger: Logger, checkpoint_folder: str, max_level: int, local_storage_dir: str) GenerateTaxonomyStatistics[source]
Classmethod to instantiate taxonomy statistics class from a dataset checkpoint.
- Parameters:
config – the general config used throughout the project
logger – the logger object
checkpoint_folder – checkpoint location where we can load the dataset from
max_level – maximum depth defined as integer
local_storage_dir – local storage dir for caching / MLflow artifacts
- Returns:
a GenerateTaxonomyStatistics instance built from the checkpointed dataset
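The alternative-constructor pattern behind from_checkpoint can be sketched as follows. Everything here is illustrative: `LevelStatistics` is a hypothetical stand-in class, and the `dataset.json` file name and JSON layout are assumptions, since the real checkpoint format is not described in this reference.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class LevelStatistics:
    """Hypothetical stand-in for a statistics class; not the project's API."""

    dataset: list[dict[str, str]]
    max_level: int

    @classmethod
    def from_checkpoint(cls, checkpoint_folder: str, max_level: int) -> "LevelStatistics":
        # Assumption: the checkpoint folder holds a `dataset.json` file that
        # contains a list of records. The real checkpoint layout may differ.
        dataset_path = Path(checkpoint_folder) / "dataset.json"
        with dataset_path.open(encoding="utf-8") as handle:
            dataset = json.load(handle)
        return cls(dataset=dataset, max_level=max_level)
```

Keeping the loading logic in a classmethod leaves the regular constructor free of I/O, so the class can still be built directly from an in-memory dataset.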
- classmethod from_sparql(config: Config, logger: Logger, request_handler: RequestHandler, taxonomy_uri: str, max_level: int, local_storage_dir: str, **kwargs) GenerateTaxonomyStatistics[source]
Classmethod that creates the dataset from SPARQL. This is helpful when calculating intermediate statistics on datasets to track the progress of the labeling process.
- Parameters:
config – the general config that is used throughout the project
logger – logger object for logging
request_handler – the instantiated request handler to use
taxonomy_uri – the taxonomy uri
max_level – max depth specified as int
local_storage_dir – local caching / artifact tracking dir
kwargs
- Returns:
a GenerateTaxonomyStatistics instance built from the dataset retrieved via SPARQL