Datasets


dataset base (training dataset)

class src.dataset.base.TrainDataset[source]

Bases: Dataset, ABC

Abstract training dataset class.

abstract _get_label(idx: int) → list[int][source]
abstract _get_text(idx: int) → str[source]
abstract property binarized_label_dictionary: dict[str, int]
abstract property candid_labels: list[str]
abstract get_specific_record(idx: int, label_level: int) → dict[str, str | list][source]
abstract property max_label_depth: int
abstract property taxonomy: Taxonomy
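
As an illustration, a concrete subclass is expected to implement the accessors above; the sketch below assumes a hypothetical MyTrainDataset implementation with placeholder data and only shows how the abstract interface is typically consumed:

>>> dataset = MyTrainDataset(...)   # hypothetical concrete subclass
>>> dataset._get_text(0)            # raw text of the first record
'... decision text ...'
>>> dataset._get_label(0)           # binarized label vector of the first record
[0, 1, 0]
>>> dataset.candid_labels           # candidate label names from the taxonomy
['label_a', 'label_b', 'label_c']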

dataset builder

class src.dataset.builder.DatasetBuilder(config: Config, logger: Logger, train_dataset: list[dict[str, str]], test_dataset: list[dict[str, str]], taxonomy: Taxonomy)[source]

Bases: object

The builder class is mainly used to control the creation/loading of datasets. When creating a new dataset, behaviour can be tweaked by setting specific values in the config: you have control over the use of the predefined train-test split, the split size, and more. Further information can be found in the config module.
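
For example, assuming the config exposes split-related fields along these lines (the attribute names below are purely illustrative, not the actual config schema; see the config module for the real names):

>>> config = DataModelConfig()
>>> config.use_predefined_split = False   # hypothetical attribute
>>> config.split_size = 0.2               # hypothetical attribute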

In general, there are two main approaches to interfacing with and loading datasets:

  1. Loading dataset from sparql
    >>> dataset_builder = DatasetBuilder.from_sparql(...)
    >>> train_dataset = dataset_builder.train_dataset
    

    More information can be found at the classmethod from_sparql.

  2. Loading dataset from local checkpoint
    >>> dataset_builder = DatasetBuilder.from_checkpoint(...)
    >>> train_dataset = dataset_builder.train_dataset
    

    More information can be found at the classmethod from_checkpoint.

_dump_json(file_path: str, dictionary: dict | list[dict[Any, Any]]) → None[source]

This function dumps the content of the provided dictionary to the provided file path.

Parameters:
  • file_path – path where to dump the json

  • dictionary – dictionary that will be saved to json file

Returns:

Nothing at all
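
Illustrative call (the path and payload are placeholders):

>>> dataset_builder._dump_json(file_path="out/train.json", dictionary=[{"text": "...", "labels": ["..."]}])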

create_checkpoint(checkpoint_folder: str) → str[source]

This function saves all relevant information (train and test datasets plus the taxonomy) to a checkpoint. These checkpoints are a one-to-one match for loading with the from_checkpoint classmethod.

Example usage:
>>> dataset_builder = DatasetBuilder(...)
>>> dataset_builder.create_checkpoint(checkpoint_folder="...")
Parameters:

checkpoint_folder – folder to save checkpoint to

Returns:

returns the unique checkpoint subfolder where the artifacts were saved to

classmethod from_checkpoint(config: Config, logger: Logger, checkpoint_folder: str)[source]

Classmethod to create an instance of the DatasetBuilder class from a checkpoint. The checkpoint must have been created using the ‘create_checkpoint’ method.

Parameters:
  • config – general config provided

  • logger – logging object that is used throughout the project

  • checkpoint_folder – folder to load the checkpoint from

Returns:

an instance of the DatasetBuilder object
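
Example usage (mirroring the from_sparql example below; the checkpoint path is a placeholder):

>>> dataset_builder = DatasetBuilder.from_checkpoint(
        config = DataModelConfig(),
        logger = logging.logger,
        checkpoint_folder = "..."
    )
>>> train_dataset = dataset_builder.train_dataset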

classmethod from_sparql(config: Config, logger: Logger, request_handler: RequestHandler, taxonomy_uri: str, query_type: DecisionQuery, do_train_test_split: bool = True, **kwargs)[source]

Classmethod for initializing the class from sparql. When provided with a taxonomy URI, it will create a new dataset based on annotated decisions that can be found in the sparql database.

Example usage:
>>> annotation = DatasetBuilder.from_sparql(
        config = DataModelConfig(),
        logger = logging.logger,
        request_handler = RequestHandler(...),
        taxonomy_uri = "...",
        query_type = DecisionQuery.ANNOTATED
    )
Parameters:
  • config – the general DataModelConfig

  • logger – logger object that can be used for logs

  • request_handler – the request wrapper used for sparql requests

  • query_type – the type of query that will be executed

  • taxonomy_uri – which taxonomy to pull when using the annotated query_type

  • do_train_test_split – whether or not to execute the train-test split (controlled here rather than via the config; check the code for details)

Returns:

an instance of the DatasetBuilder Class

src.dataset.builder.binarize(taxonomy, labels) → list[int][source]

This function maps labels to a multilabel binarized vector.

Returns:

list of 0 or 1 values based on provided labels
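
As an illustration, assuming a taxonomy whose candidate labels are ["label_a", "label_b", "label_c"] (placeholder names) and that the output follows that label order:

>>> binarize(taxonomy, ["label_b"])
[0, 1, 0]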

dataset provider

(Can be found in the module’s __init__ file)

src.dataset.create_dataset(config: Config, logger: Logger, dataset: list[dict[str, Any]], taxonomy: Taxonomy, tokenizer: object = None, sub_node: str = None) → TrainDataset | list[dict][source]

Function that creates the dataset based on the configuration that is provided

Parameters:
  • sub_node – sub_node to reselect input data for

  • tokenizer – tokenizer to use in the dataset if provided

  • config – configuration object

  • logger – logger object

  • dataset – the created dataset (list of dict)

  • taxonomy – the taxonomy that is used

Returns:

a TrainDataset instance or a list of dicts, depending on the provided configuration
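
Putting the pieces together, a typical call might look like the following sketch (the dataset_builder, tokenizer and config objects are assumed to exist already):

>>> from src.dataset import create_dataset
>>> train_dataset = create_dataset(
        config = config,
        logger = logger,
        dataset = dataset_builder.train_dataset,
        taxonomy = dataset_builder.taxonomy,
        tokenizer = tokenizer
    )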