Datasets
TOC
- dataset base (training dataset)
- dataset builder
- dataset provider
dataset builder
- class src.dataset.builder.DatasetBuilder(config: Config, logger: Logger, train_dataset: list[dict[str, str]], test_dataset: list[dict[str, str]], taxonomy: Taxonomy)[source]
Bases: object
The builder class mainly controls the creation and loading of datasets. When creating a new dataset, behaviour can be tweaked by setting specific values in the config: you have control over the usage of the predefined train-test split, the split size, … more info can be found in the config module.
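For example, a hedged sketch of tweaking split behaviour through the config (the field names split_size and use_predefined_split are hypothetical; check the config module for the real ones):
>>> config = Config()
>>> config.split_size = 0.2  # hypothetical field name; see the config module
>>> config.use_predefined_split = False  # hypothetical field name
>>> dataset_builder = DatasetBuilder.from_sparql(config, ...)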
In general there are two main approaches to interface/load datasets:
- Loading dataset from sparql
>>> dataset_builder = DatasetBuilder.from_sparql(...)
>>> train_dataset = dataset_builder.train_dataset
More related information can be found at the classmethod from_sparql.
- Loading dataset from local checkpoint
>>> dataset_builder = DatasetBuilder.from_checkpoint(...)
>>> train_dataset = dataset_builder.train_dataset
More related information can be found at the classmethod from_checkpoint.
- _dump_json(file_path: str, dictionary: dict | list[dict[Any, Any]]) → None[source]
Dumps the content of the provided dictionary (or list of dictionaries) as JSON to the provided file path.
- Parameters:
file_path – path where to dump the json
dictionary – dictionary that will be saved to json file
- Returns:
Nothing at all
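For reference, a helper like this is typically a thin wrapper around json.dump; a minimal sketch of the presumed behaviour, not necessarily the actual implementation:
>>> import json
>>> def _dump_json(file_path, dictionary):
...     with open(file_path, "w", encoding="utf-8") as file:
...         json.dump(dictionary, file)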
- create_checkpoint(checkpoint_folder: str) → str[source]
This function saves all relevant information (train and test datasets plus the taxonomy) to a checkpoint. These checkpoints are a one-to-one match for loading with the from_checkpoint classmethod.
- Example usage:
>>> dataset_builder = DatasetBuilder(...)
>>> dataset_builder.create_checkpoint(checkpoint_folder="...")
- Parameters:
checkpoint_folder – folder to save checkpoint to
- Returns:
returns the unique checkpoint subfolder where the artifacts were saved to
- classmethod from_checkpoint(config: Config, logger: Logger, checkpoint_folder: str)[source]
Classmethod to create an instance of the DatasetBuilder class from a checkpoint. This checkpoint is based on the checkpoints created by the create_checkpoint method.
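- Example usage (a sketch in the style of the other examples, assuming a checkpoint created earlier with create_checkpoint):
>>> dataset_builder = DatasetBuilder.from_checkpoint(
...     config=Config(),
...     logger=logging.getLogger(__name__),
...     checkpoint_folder="...",
... )
>>> train_dataset = dataset_builder.train_dataset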
- Parameters:
config – general config provided
logger – logging object that is used throughout the project
checkpoint_folder – folder to load the checkpoint from
- Returns:
an instance of the DatasetBuilder object
- classmethod from_sparql(config: Config, logger: Logger, request_handler: RequestHandler, taxonomy_uri: str, query_type: DecisionQuery, do_train_test_split: bool = True, **kwargs)[source]
Classmethod for class initialization from SPARQL. When provided with a taxonomy URI, it creates a new dataset based on annotated decisions found in the SPARQL database.
- Example usage:
>>> annotation = DatasetBuilder.from_sparql(
...     config=DataModelConfig(),
...     logger=logging.getLogger(__name__),
...     request_handler=RequestHandler(...),
...     taxonomy_uri="...",
...     query_type=DecisionQuery.ANNOTATED,
... )
- Parameters:
config – the general DataModelConfig
logger – logger object that can be used for logs
request_handler – the request wrapper used for sparql requests
taxonomy_uri – which taxonomy to pull when using the annotated-dataset query_type
query_type – what type of query will be executed
do_train_test_split – whether or not to execute the train-test split (controlled here, not via the config; check the code for clarity)
- Returns:
an instance of the DatasetBuilder class
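A variant sketch disabling the train-test split via the do_train_test_split flag (other arguments as in the example above):
>>> dataset_builder = DatasetBuilder.from_sparql(
...     config=DataModelConfig(),
...     logger=logging.getLogger(__name__),
...     request_handler=RequestHandler(...),
...     taxonomy_uri="...",
...     query_type=DecisionQuery.ANNOTATED,
...     do_train_test_split=False,
... )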
dataset provider
(Can be found in the module's __init__ file)
- src.dataset.create_dataset(config: Config, logger: Logger, dataset: list[dict[str, Any]], taxonomy: Taxonomy, tokenizer: object = None, sub_node: str = None) → TrainDataset | list[dict][source]
Function that creates the dataset based on the provided configuration.
- Parameters:
config – configuration object
logger – logger object
dataset – the created dataset (list of dicts)
taxonomy – the taxonomy that is used
tokenizer – tokenizer to use in the dataset, if provided
sub_node – sub_node to reselect input data for
- Returns:
a TrainDataset instance or a list of dicts, depending on the provided configuration
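A hedged usage sketch wiring the builder output into create_dataset, assuming a Taxonomy instance is at hand (e.g. from the builder's checkpoint):
>>> from src.dataset import create_dataset
>>> train_data = create_dataset(
...     config=Config(),
...     logger=logging.getLogger(__name__),
...     dataset=dataset_builder.train_dataset,
...     taxonomy=taxonomy,
... )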