Datasets


dataset base (training dataset)

class src.dataset.base.TrainDataset[source]

Bases: Dataset, ABC

Abstract training dataset class.

abstract _get_label(idx: int) → list[int][source]
abstract _get_text(idx: int) → str[source]
abstract property binarized_label_dictionary: dict[str, int]
abstract property candid_labels: list[str]
abstract get_specific_record(idx: int, label_level: int) → dict[str, str | list][source]
abstract property max_label_depth: int
abstract property taxonomy: Taxonomy
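
As an illustration, a concrete subclass is expected to implement the accessors above; the sketch below assumes a hypothetical MyTrainDataset implementation with placeholder data and only shows how the abstract interface is typically consumed:

>>> dataset = MyTrainDataset(...)   # hypothetical concrete subclass
>>> dataset._get_text(0)            # raw text of the first record
'... decision text ...'
>>> dataset._get_label(0)           # binarized label vector of the first record
[0, 1, 0]
>>> dataset.candid_labels           # candidate label names from the taxonomy
['label_a', 'label_b', 'label_c']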

dataset builder

class src.dataset.builder.DatasetBuilder(config: Config, logger: Logger, train_dataset: list[dict[str, str]], test_dataset: list[dict[str, str]], taxonomy: Taxonomy)[source]

Bases: object

The builder class is mainly used to control the creation/loading of datasets. When creating a new dataset, behaviour can be tweaked by setting specific values in the config: you have control over the use of the predefined train-test split, the split size, and more. Further information can be found in the config module.
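
For example, assuming the config exposes split-related fields along these lines (the attribute names below are purely illustrative, not the actual config schema; see the config module for the real names):

>>> config = DataModelConfig()
>>> config.use_predefined_split = False   # hypothetical attribute
>>> config.split_size = 0.2               # hypothetical attribute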

In general, there are two main approaches to interfacing with and loading datasets:

  1. Loading dataset from sparql
    >>> dataset_builder = DatasetBuilder.from_sparql(...)
    >>> train_dataset = dataset_builder.train_dataset
    

    More information can be found at the classmethod from_sparql.

  2. Loading dataset from local checkpoint
    >>> dataset_builder = DatasetBuilder.from_checkpoint(...)
    >>> train_dataset = dataset_builder.train_dataset
    

    More information can be found at the classmethod from_checkpoint.

_dump_json(file_path: str, dictionary: dict | list[dict[Any, Any]]) → None[source]

This function dumps the content of the provided dictionary to the provided file path.

Parameters:
  • file_path – path where to dump the json

  • dictionary – dictionary that will be saved to json file

Returns:

Nothing at all
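
Illustrative call (the path and payload are placeholders):

>>> dataset_builder._dump_json(file_path="out/train.json", dictionary=[{"text": "...", "labels": ["..."]}])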

create_checkpoint(checkpoint_folder: str) → str[source]

This function saves all relevant information (train and test datasets plus the taxonomy) to a checkpoint. These checkpoints are a one-to-one match for loading with the from_checkpoint classmethod.

Example usage:
>>> dataset_builder = DatasetBuilder(...)
>>> dataset_builder.create_checkpoint(checkpoint_folder="...")
Parameters:

checkpoint_folder – folder to save checkpoint to

Returns:

returns the unique checkpoint subfolder where the artifacts were saved to

classmethod from_checkpoint(config: Config, logger: Logger, checkpoint_folder: str)[source]

Classmethod to create an instance of the DatasetBuilder class from a checkpoint. The checkpoint must have been created using the ‘create_checkpoint’ method.

Parameters:
  • config – general config provided

  • logger – logging object that is used throughout the project

  • checkpoint_folder – folder to load the checkpoint from

Returns:

an instance of the DatasetBuilder object
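
Example usage (mirroring the from_sparql example below; the checkpoint path is a placeholder):

>>> dataset_builder = DatasetBuilder.from_checkpoint(
        config = DataModelConfig(),
        logger = logging.logger,
        checkpoint_folder = "..."
    )
>>> train_dataset = dataset_builder.train_dataset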

classmethod from_sparql(config: Config, logger: Logger, request_handler: RequestHandler, taxonomy_uri: str, query_type: DecisionQuery, do_train_test_split: bool = True, **kwargs)[source]

Classmethod for initializing the class from sparql. When provided with a taxonomy URI, it will create a new dataset based on annotated decisions that can be found in the sparql database.

Example usage:
>>> annotation = DatasetBuilder.from_sparql(
        config = DataModelConfig(),
        logger = logging.logger,
        request_handler = RequestHandler(...),
        taxonomy_uri = "...",
        query_type = DecisionQuery.ANNOTATED
    )
Parameters:
  • config – the general DataModelConfig

  • logger – logger object that can be used for logs

  • request_handler – the request wrapper used for sparql requests

  • query_type – the type of query that will be executed

  • taxonomy_uri – which taxonomy to pull when using the annotated query_type

  • do_train_test_split – whether or not to execute the train-test split (controlled here rather than via the config; check the code for details)

Returns:

an instance of the DatasetBuilder Class

src.dataset.builder.binarize(taxonomy, labels) → list[int][source]

This function maps labels to a multilabel binarized vector.

Returns:

list of 0 or 1 values based on provided labels
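
As an illustration, assuming a taxonomy whose candidate labels are ["label_a", "label_b", "label_c"] (placeholder names) and that the output follows that label order:

>>> binarize(taxonomy, ["label_b"])
[0, 1, 0]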

dataset provider

(Can be found in the module’s __init__ file)

src.dataset.create_dataset(config: Config, logger: Logger, dataset: list[dict[str, Any]], taxonomy: Taxonomy, tokenizer: object = None, sub_node: str = None) → TrainDataset | list[dict][source]

Function that creates the dataset based on the configuration that is provided

Parameters:
  • sub_node – sub_node to reselect input data for

  • tokenizer – tokenizer to use in the dataset if provided

  • config – configuration object

  • logger – logger object

  • dataset – the created dataset (list of dict)

  • taxonomy – the taxonomy that is used

Returns:

a TrainDataset instance or a list of dicts, depending on the provided configuration
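
Putting the pieces together, a typical call might look like the following sketch (the dataset_builder, tokenizer and config objects are assumed to exist already):

>>> from src.dataset import create_dataset
>>> train_dataset = create_dataset(
        config = config,
        logger = logger,
        dataset = dataset_builder.train_dataset,
        taxonomy = dataset_builder.taxonomy,
        tokenizer = tokenizer
    )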