Codebase overview

Modules

Entrypoints

In this section you can find all entrypoint scripts with their technical documentation. These entrypoints are called directly from the Airflow DAGs.

Generate dataset statistics

src.dataset_statistics.main(max_level: int = 4)[source]

This main function provides access to the dataset statistics generation class. It is useful for extracting information about the label/data distribution in the provided taxonomy-annotated decision pool.

The goal here is to create an overview of the class balance and number of samples per label.
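
Example usage (an illustrative call based on the signature above):
>>> from src.dataset_statistics import main
>>> main(max_level=2)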

Parameters:

max_level – The maximum depth to calculate these summary statistics for

Returns:

Nothing; everything is logged as MLflow artifacts.

Export dataset

src.dataset_export.main(dataset_type: DatasetType = DatasetType.UNPROCESSED, taxonomy_uri: str = 'http://stad.gent/id/concepts/gent_words', checkpoint_location: str | None = None)[source]

This function provides access to the creation and retrieval of training datasets. Based on the provided configuration, it will create the respective dataset.
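
Example usage (an illustrative sketch; the import path of DatasetType is an assumption and may differ in the codebase):
>>> from src.dataset_export import main
>>> from src.enums import DatasetType  # assumed import path for the enum
>>> main(
        dataset_type=DatasetType.UNPROCESSED,
        taxonomy_uri="http://stad.gent/id/concepts/gent_words"
    )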

Parameters:
  • checkpoint_location – The location of the pre-downloaded dataset information (This must be of the same taxonomy type)

  • dataset_type – The type of dataset you want to extract

  • taxonomy_uri – The taxonomy you want to create the dataset for.

Returns:

Nothing; all artifacts are logged to MLflow.

Benchmarking

src.benchmarking.main(model_types: list[ModelType] | str, dataset_types: list[DatasetType] | str, model_ids: list[str] | str, taxonomy_uri: str = 'http://stad.gent/id/concepts/gent_words', checkpoint_location: str | None = None)[source]

This main function contains the script for executing benchmarking with customizable inputs. During benchmarking, all permutations of the variables below will be executed, and the results will be accumulated in the MLflow interface under the pre-specified MLflow experiment ID.

Sidenote:

When executing the benchmarking on supervised models, keep in mind that the models are not necessarily trained on the selected taxonomy. They will output labels from the taxonomy they have been trained on (this, however, is specified in the naming convention of the model).

Example usage:

When you specify the model/dataset types explicitly instead of using the prefix, the call should look like this:
>>> main(
        model_types=[
            "embedding_child_labels",
            "embedding_chunked",
            "embedding_sentence"
        ],
        dataset_types=[
            "m1_article_split",
            "m1_shorttitle",
            "m1_general"
        ],
        model_ids=[
            "paraphrase-multilingual-mpnet-base-v2",
            "intfloat/multilingual-e5-small",
            "thenlper/gte-large",
            "multi-qa-mpnet-base-dot-v1"
        ],
        taxonomy_uri="http://stad.gent/id/concepts/gent_words"
    )
When you use the prefix method, it should look like this:
>>> main(
        model_types="embedding",
        dataset_types="m1",
        model_ids=[
            "paraphrase-multilingual-mpnet-base-v2",
            "intfloat/multilingual-e5-small",
            "thenlper/gte-large",
            "multi-qa-mpnet-base-dot-v1"
        ],
        taxonomy_uri="http://stad.gent/id/concepts/gent_words"
    )
Run command:
When running specific model types and dataset types:
>>> python -m src.benchmarking --model_types="embedding_chunked, embedding_sentence" --dataset_types="m1_article_split,m1_general" --model_ids="paraphrase-multilingual-mpnet-base-v2,intfloat/multilingual-e5-small,multi-qa-mpnet-base-dot-v1,thenlper/gte-large" --taxonomy_uri="http://stad.gent/id/concepts/gent_words"
When running with one dataset type prefix and one model type prefix:
>>> python -m src.benchmarking --model_types="embedding" --dataset_types="m1" --model_ids="paraphrase-multilingual-mpnet-base-v2,intfloat/multilingual-e5-small,multi-qa-mpnet-base-dot-v1,thenlper/gte-large" --taxonomy_uri="http://stad.gent/id/concepts/gent_words"
Parameters:
  • checkpoint_location – The location where a dataset checkpoint can be found. This location should contain content from the create_checkpoint function as implemented in the dataset builder class. If it does not contain matching input, it will not load properly or will result in an error.

  • model_types

    The model type parameter is used to specify what model type(s) you are using for your benchmarking. This can be provided in two separate ways:

    1. An actual list with values that can be found in the ModelType enum (most likely passed as string values). Keep in mind that the values in this list need to match the enum values exactly! On the command line, the list should be provided as comma-separated values (see the run command example).

    2. A string prefix; this prefix will be used to identify all matching model types, which will automatically be added to the list of model types to experiment with (a rough sketch of this matching is shown below, after the Returns section).

  • dataset_types

    The dataset type parameter is used to specify what type of dataset(s) you want to use for the benchmarking run. The dataset type can be provided in two separate ways:

    1. An actual list with values that can be found in the DatasetType enum (passed as string values); on the command line this list should be provided as comma-separated values (see the run command example).

    2. A string prefix that will be matched against the current enum values. All matching enum values will be used for the benchmarking run.

  • model_ids – This parameter requires a list of models to run the benchmarking with; these are Hugging Face model references. Keep in mind that zero-shot models might work for the embedding model types, but not the other way around. It is advised not to mix zero-shot, embedding, and classifier models when running experiments; it is better (and more functional) to split these up into multiple benchmarking runs.

  • taxonomy_uri – The string reference to the taxonomy you want to use for the benchmarking process.

Returns:

Nothing
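
To make the prefix matching for model_types and dataset_types more concrete, it can be pictured roughly as in the sketch below. This is a minimal illustration, not the project's actual implementation; the enum members shown are stand-ins and the real matching logic may differ.
>>> from enum import Enum
>>> class ModelType(Enum):  # stand-in enum, not the project's real ModelType
        EMBEDDING_CHILD_LABELS = "embedding_child_labels"
        EMBEDDING_CHUNKED = "embedding_chunked"
        EMBEDDING_SENTENCE = "embedding_sentence"
        ZEROSHOT_SENTENCE = "zeroshot_sentence"
>>> prefix = "embedding"
>>> [m.value for m in ModelType if m.value.startswith(prefix)]
['embedding_child_labels', 'embedding_chunked', 'embedding_sentence']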

Train supervised model with specific taxonomy node

src.train_supervised_specific.main(train_flavour: TrainingFlavours | None = None, dataset_type: DatasetType | None = None, model_id: str | None = None, train_test_split: bool = True, checkpoint_folder: str | None = None, taxonomy_url: str = 'http://stad.gent/id/concepts/gent_words', taxonomy_sub_node: str | None = None, trainer_type: TrainerTypes = TrainerTypes.CUSTOM)[source]

This script provides the functionality to train the supervised models. The training can be configured in multiple ways:

1. You can train a classifier on an entire level of the taxonomy (only relevant if there is enough information available per label in the deeper taxonomy nodes).

2. You can train a classifier on a specific node of the taxonomy (e.g. a level-1 classifier is trained on the start node of the taxonomy, while a node from level 1 can be used for training on a specific sub-node of the taxonomy).

Example usage:
>>> python -m src.train_supervised_specific --train_flavour="bert" --dataset_type="dynamic_general" --taxonomy_url='http://stad.gent/id/concepts/business_capabilities' --checkpoint_folder="data/business_capabilities"
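To train a specialized model for a single sub-node (option 2 above), the same entrypoint can be invoked with taxonomy_sub_node set. The command below is an illustrative sketch: the --taxonomy_sub_node flag is assumed to mirror the Python parameter, and the node URI is a placeholder.
>>> python -m src.train_supervised_specific --train_flavour="bert" --dataset_type="dynamic_general" --taxonomy_url='http://stad.gent/id/concepts/gent_words' --taxonomy_sub_node='<URI of the taxonomy sub-node to specialize on>'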
Parameters:
  • train_flavour – The specific type of model training you want; this is a one-to-one mapping with the training flavour enum. It allows you to select the specific type of model you would like to train (BERT, DISTIL_BERT, SETFIT).

  • dataset_type – The specific type of dataset you want to use for model training; this is another one-to-one mapping with the dataset type enum. For more information about the available dataset types, check out the enums.

  • model_id – The base model to use when training one of the selected model flavours. This can also be left empty; the defaults are provided in the config and will be pulled from there.

  • train_test_split – Flag that enables a train/test split; this forces the code to create a split during the creation of the dataset. The exact behaviour is specified in the config (predefined split yes/no, split size, …).

  • checkpoint_folder – Mainly used for debugging; when you have a pre-downloaded dataset checkpoint, you can provide it here.

  • taxonomy_url – The taxonomy to use for the model training.

  • taxonomy_sub_node – If provided, the model will be trained only on the specific sub-level node that has been selected. This is used to train specialized models that represent a node in the taxonomy.

Returns:

Nothing.

Create blank configs

src.create_blank_config.main()[source]

This script can be used to re-generate a blank model config. This is helpful if the taxonomy is adapted, extended, etc.
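
Run command (assuming the script is invoked as a module like the other entrypoints; it takes no arguments):
>>> python -m src.create_blank_config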

Returns:

Nothing; artifacts are pushed to a designated Airflow run.

Train supervised model for entire taxonomy tree

src.train_supervised_tree.main(train_flavour: TrainingFlavours | None = None, dataset_type: DatasetType | None = None, model_id: str | None = None, train_test_split: bool = True, checkpoint_folder: str | None = None, max_depth: int = 2, taxonomy_url: str = 'http://stad.gent/id/concepts/gent_words', trainer_type: TrainerTypes = TrainerTypes.CUSTOM)[source]

This script provides the functionality to train all supervised models up to a predefined depth. It enables an easy way to bootstrap the models to be trained:

1. You can train a classifier on an entire level of the taxonomy (only relevant if there is enough information available per label in the deeper taxonomy nodes).

2. You can train a classifier on a specific node of the taxonomy (e.g. a level-1 classifier is trained on the start node of the taxonomy, while a node from level 1 can be used for training on a specific sub-node of the taxonomy).

Example usage:
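The command below is an illustrative sketch mirroring the specific-node entrypoint; the --max_depth flag is assumed to mirror the max_depth parameter.
>>> python -m src.train_supervised_tree --train_flavour="bert" --dataset_type="dynamic_general" --taxonomy_url='http://stad.gent/id/concepts/business_capabilities' --checkpoint_folder="data/business_capabilities" --max_depth=2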
Parameters:
  • max_depth – The maximum depth of the taxonomy tree that models should be trained for.

  • train_flavour – The specific type of model training you want; this is a one-to-one mapping with the training flavour enum. It allows you to select the specific type of model you would like to train (BERT, DISTIL_BERT, SETFIT).

  • dataset_type – The specific type of dataset you want to use for model training; this is another one-to-one mapping with the dataset type enum. For more information about the available dataset types, check out the enums.

  • model_id – The base model to use when training one of the selected model flavours. This can also be left empty; the defaults are provided in the config and will be pulled from there.

  • train_test_split – Flag that enables a train/test split; this forces the code to create a split during the creation of the dataset. The exact behaviour is specified in the config (predefined split yes/no, split size, …).

  • checkpoint_folder – Mainly used for debugging; when you have a pre-downloaded dataset checkpoint, you can provide it here.

  • taxonomy_url – The taxonomy to use for the model training.

Returns:

Nothing.

Execute inference based on a provided model tree configuration

src.topic_modeling.main(dataset_type: DatasetType, model_type: ModelType, checkpoint_folder: str | None = None)[source]

This function is the entrypoint for the topic modeling functionality.

It calls the specified class to generate the topic modeling artifacts, which can be used for further analysis.
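
Example usage (an illustrative sketch; the import path and the chosen enum members are assumptions and may differ in the codebase):
>>> from src.topic_modeling import main
>>> from src.enums import DatasetType, ModelType  # assumed import path
>>> main(
        dataset_type=DatasetType.UNPROCESSED,     # any DatasetType member
        model_type=ModelType.EMBEDDING_SENTENCE,  # illustrative; pick a topic modeling model type
    )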

Parameters:
  • dataset_type – The type of dataset to use as input formatting.

  • model_type – The type of topic modeling to use.

  • checkpoint_folder – A checkpoint that can be used to restore the input data from.

Returns:

Nothing.