Helpers

This module contains functionality that is called from the entrypoints.

Dataset Statistics

class src.helpers.statistics.GenerateTaxonomyStatistics(config: Config, logger: Logger, dataset: list[dict[str, str]], taxonomy: Taxonomy, max_level: int, local_storage_dir: str)

Bases: object

This class calculates the distribution of labels for a given taxonomy, dataset and label depth combination.

Typical usage example:
>>> import logging
>>> dataset_builder = DatasetBuilder(...)
>>> stats = GenerateTaxonomyStatistics(
...     config=Config(),
...     logger=logging.getLogger(__name__),
...     dataset=dataset_builder.train_dataset,
...     taxonomy=Taxonomy(...),
...     max_level=4,
...     local_storage_dir="..."
... )
>>> stats.calculate_stats()

_generate_level_stats(level: int) -> None

Internal function that performs the actual calculation of the label distribution statistics for a single level.

Parameters:

level – the level to generate the statistics for

Returns:

Nothing.
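
As an illustration of what a per-level label distribution could look like, here is a minimal sketch. It is not the project's implementation: the record layout (a "label" key), the LABEL_PATHS mapping and the level_distribution helper are all hypothetical and stand in for whatever the dataset and Taxonomy objects actually provide.

```python
from collections import Counter

# Hypothetical mapping from a leaf label to its path from the taxonomy root.
# The real project derives this information from its Taxonomy object.
LABEL_PATHS = {
    "apple": ["food", "fruit", "apple"],
    "pear": ["food", "fruit", "pear"],
    "carrot": ["food", "vegetable", "carrot"],
    "hammer": ["tool", "hammer"],
}


def level_distribution(dataset, label_paths, level):
    """Count how often each label of the given 1-based level occurs."""
    counts = Counter()
    for record in dataset:
        path = label_paths.get(record["label"], [])
        # Records whose label path is shallower than `level` are skipped.
        if len(path) >= level:
            counts[path[level - 1]] += 1
    return dict(counts)


# Hypothetical dataset in the list[dict[str, str]] shape from the signature.
dataset = [
    {"text": "a", "label": "apple"},
    {"text": "b", "label": "pear"},
    {"text": "c", "label": "carrot"},
    {"text": "d", "label": "hammer"},
]

level_1 = level_distribution(dataset, LABEL_PATHS, 1)
level_2 = level_distribution(dataset, LABEL_PATHS, 2)
level_3 = level_distribution(dataset, LABEL_PATHS, 3)
```

In this sketch, deeper levels produce finer-grained counts, and records whose label does not reach the requested depth simply drop out of that level's distribution.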

_get_level_labels(level: int) -> list[str]

This function is a wrapper around the taxonomy get_level_specific_labels function.

Parameters:

level – level to retrieve labels from

Returns:

the list of labels that occur on the provided level
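
To make "labels that occur on the provided level" concrete, the sketch below collects the labels at a given depth of a taxonomy stored as nested dictionaries. The nested-dict layout and the labels_at_level helper are assumptions for illustration only; the actual Taxonomy class and its get_level_specific_labels function may work differently.

```python
def labels_at_level(tree, level):
    """Return the labels found at the given 1-based depth of a nested-dict tree."""
    if level < 1:
        return []
    if level == 1:
        # The keys of the current mapping are the labels at this depth.
        return list(tree)
    labels = []
    for children in tree.values():
        # Descend one level into every subtree and collect the results.
        labels.extend(labels_at_level(children, level - 1))
    return labels


# Hypothetical taxonomy: each key is a label, each value holds its children.
TAXONOMY_TREE = {
    "food": {
        "fruit": {"apple": {}, "pear": {}},
        "vegetable": {"carrot": {}},
    },
    "tool": {"hammer": {}},
}

top_labels = labels_at_level(TAXONOMY_TREE, 1)
second_labels = labels_at_level(TAXONOMY_TREE, 2)
```

Requesting a level deeper than the tree returns an empty list in this sketch, which keeps callers from having to special-case shallow branches.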

_prep_dataset() -> None

This function creates a dataset object from the provided dataset. The dataset object handles most of the remapping needed to calculate the statistics easily.

Returns:

Nothing.

calculate_stats()

This function calculates the stats for each level up until the max level.

Returns:

classmethod from_checkpoint(config: Config, logger: Logger, checkpoint_folder: str, max_level: int, local_storage_dir: str) -> GenerateTaxonomyStatistics

Classmethod to instantiate taxonomy statistics class from a dataset checkpoint.

Parameters:
  • config – the general config used throughout the project

  • logger – the logger object

  • checkpoint_folder – checkpoint location where we can load the dataset from

  • max_level – maximum depth defined as integer

  • local_storage_dir – local storage directory for caching/MLflow artifacts

Returns:

a GenerateTaxonomyStatistics instance

classmethod from_sparql(config: Config, logger: Logger, request_handler: RequestHandler, taxonomy_uri: str, max_level: int, local_storage_dir: str, **kwargs) -> GenerateTaxonomyStatistics

Classmethod that creates the dataset from SPARQL. This is helpful for calculating intermediate statistics on datasets to track the progress of the labeling process.

Parameters:
  • config – the general config that is used throughout the project

  • logger – logger object for logging

  • request_handler – the instantiated request handler to use

  • taxonomy_uri – the taxonomy uri

  • max_level – max depth specified as int

  • local_storage_dir – local directory for caching/artifact tracking

  • kwargs

Returns:

a GenerateTaxonomyStatistics instance