Benchmarks

Add more information here later!

Abstract Base Class

class src.benchmark.base.BenchmarkBase[source]

Bases: ABC

abstract _create_dataset(checkpoint: str | None) None[source]
abstract _create_model(model_id: str) Model[source]
abstract _create_run_name(model_id: str | None = None) str[source]
abstract property default_description: str
abstract property default_mlflow_tags: dict[str, str]
abstract exec()[source]
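
A minimal subclass sketch (the class name, return values, and comments are purely illustrative, and the imports of BenchmarkBase and Model from the package are assumed; concrete implementations such as BenchmarkWrapper are documented below):
>>> class MyBenchmark(BenchmarkBase):
        def _create_dataset(self, checkpoint: str | None) -> None:
            ...  # build or load the benchmark dataset
        def _create_model(self, model_id: str) -> Model:
            ...  # wrap the selected weights in a Model object
        def _create_run_name(self, model_id: str | None = None) -> str:
            return f"my-benchmark-{model_id}"
        @property
        def default_description(self) -> str:
            return "My benchmark run"
        @property
        def default_mlflow_tags(self) -> dict[str, str]:
            return {"benchmark": "my-benchmark"}
        def exec(self):
            ...  # run the benchmark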

Regular Benchmark

class src.benchmark.regular.BenchmarkWrapper(config: Config, logger: Logger, request_handler: RequestHandler, model_ids: list[str] | str, taxonomy_reference: str, checkpoint_dir: str = 'data', nested_mlflow_run: bool = False)[source]

Bases: BenchmarkBase

This is the base class for benchmarking models.

The main objective of this class is to bring multiple components together; each of these components contains its own custom behaviour for a specific part of the code.

These sub-components are:
  • Dataset: How do we transform the SPARQL data into training/inference data?

  • Model architecture (or type): A wrapper object around a model to allow abstract usage of methods that are universally implemented.

  • Model base: The model weights to load into the predefined model architecture.

  • Taxonomy: A specific taxonomy to use for predictions.

The previously mentioned components are mostly declared in the config; only the model base is provided via the model_ids parameter.

Typical usage:
>>> benchmark = BenchmarkWrapper(
        config=Config(),
        logger=logging.getLogger(__name__),
        request_handler=RequestHandler(),
        model_ids=["...", ...],
        taxonomy_reference="..."
    )
>>> benchmark()
_create_dataset(checkpoint: str | None) None[source]

Internal function responsible for creating the benchmarking dataset. When no checkpoint is provided, it automatically starts building the dataset by pulling all annotated information for the provided taxonomy.

Parameters:

checkpoint – folder or path where the benchmark data can be found

Returns:

Nothing

_create_model(model_id: str) Model[source]
_create_run_name(model_id: str | None = None) str[source]

Internal function that generates custom run names; these are used for verbose naming in the mlflow tracking.

Parameters:

model_id – the model id of the currently selected model

Returns:

a custom string that represents the unique combination of components

property default_description: str

This property provides a getter for the default description that should be provided for mlflow logging

Example usage:
>>> benchmark = BenchmarkWrapper(...)
>>> description = benchmark.default_description
Returns:

string description for mlflow run

property default_mlflow_tags: dict[str, str]

This property provides a getter for the default mlflow tags that are provided by the selection of the class

Example usage:
>>> benchmark = BenchmarkWrapper(...)
>>> mlflow_tags = benchmark.default_mlflow_tags
Returns:

tags for mlflow

exec()[source]

This function is responsible for the execution of the benchmark. It creates (nested) mlflow runs based on the pre-defined config and the selected base model.

For each combination, a (nested) run will appear in the mlflow interface containing all the artifacts created by a benchmark run. (More information about the artifacts can be found in the evaluate class.)

Example usage:
>>> benchmark = BenchmarkWrapper(...)
>>> benchmark.exec()
Returns:

Nothing

Embedding Similarity Benchmark

class src.benchmark.embedding_similarity.EmbeddingSimilarityBenchmark(config: Config, logger: Logger, request_handler: RequestHandler, model_ids: list[str] | str, taxonomy_reference: str = 'http://stad.gent/id/concepts/gent_words', nested_mlflow_run: bool = False, checkpoint_dir: str = 'data')[source]

Bases: BenchmarkWrapper

This is the wrapper class for embedding-similarity model benchmarking; for more information, check out the base class.

>>> benchmark = EmbeddingSimilarityBenchmark(
        config=Config(),
        logger=logging.getLogger(__name__),
        request_handler=RequestHandler(),
        model_ids=["...", ...],
        taxonomy_reference="..."
    )
>>> benchmark()
property default_description

This property provides a getter for the default description that should be provided for mlflow logging

Example usage:
>>> benchmark = EmbeddingSimilarityBenchmark(...)
>>> description = benchmark.default_description
Returns:

string description for mlflow run

property default_mlflow_tags

This property provides a getter for the default mlflow tags that are provided by the selection of the class

Example usage:
>>> benchmark = EmbeddingSimilarityBenchmark(...)
>>> mlflow_tags = benchmark.default_mlflow_tags
Returns:

tags for mlflow

Evaluate

class src.benchmark.evaluate.MultilabelEvaluation(config: Config, logger: Logger, model: Model, dataset: TrainDataset, multilabel: bool = True)[source]

Bases: object

This class is the framework that executes a specific evaluation of a model. The evaluation loops over the dataset and generates predictions for each record; once completed, it executes the metric calculation script.

Example usage:
>>> multilabel_eval = MultilabelEvaluation(
        config=Config(),
        logger=logging.getLogger(__name__),
        model=Model(...),
        dataset=TrainDataset(...),
        multilabel=True
    )
>>> multilabel_eval.evaluate()
evaluate() None[source]

This function starts the evaluation process for the given dataset.

Hybrid Benchmark

class src.benchmark.hybrid.HybridBenchmark(config: Config, logger: Logger, request_handler: RequestHandler, unsupervised_model_ids: list[str] | str, supervised_model_id: str, unsupervised_model_type: ModelType, taxonomy_reference: str = 'http://stad.gent/id/concepts/gent_words', checkpoint_dir: str = 'data', nested_mlflow_run: bool = False)[source]

Bases: BenchmarkWrapper

This is the wrapper class for hybrid model benchmarking; for more information, check out the base class.

>>> benchmark = HybridBenchmark(
        config=Config(),
        logger=logging.getLogger(__name__),
        request_handler=RequestHandler(),
        unsupervised_model_ids=["...", ...],
        supervised_model_id="...",
        unsupervised_model_type=...,
        taxonomy_reference="..."
    )
>>> benchmark()
_create_model(model_id: str) Model[source]
property default_description

This property provides a getter for the default description that should be provided for mlflow logging

Example usage:
>>> benchmark = HybridBenchmark(...)
>>> description = benchmark.default_description
Returns:

string description for mlflow run

property default_mlflow_tags

This property provides a getter for the default mlflow tags that are provided by the selection of the class

Example usage:
>>> benchmark = HybridBenchmark(...)
>>> mlflow_tags = benchmark.default_mlflow_tags
Returns:

tags for mlflow

Metrics

class src.benchmark.metrics.Metrics(config: BenchmarkConfig, logger: Logger, model_id: str, base_folder: str, classes: list[str], average: str = 'weighted')[source]

Bases: object

This class is used to compute metrics for a benchmarking run. It is capable of generating multiple different artifacts that can be used for further model performance analysis.

compute(y_true: array, logits: array, suffix: str | None = None, save: bool = True) DataFrame[source]

This function brings all of the previous calculations together. Paired with the config, you can enable and disable certain artifacts and metrics.

Metrics that can be used:
  1. Hamming score

  2. F1 score

  3. Precision

  4. Recall

Artifacts that can be generated:
  1. Confusion matrix

  2. Classification report

  3. Precision-recall plot

  4. Overview plot

Parameters:
  • y_true – the values that are expected to be predicted

  • logits – the logits for the predicted values

  • suffix – a suffix that can be used for custom naming of the artifacts

  • save – flag to save the artifacts (currently unused; the config handles this)

Returns:

pandas dataframe containing all the metric values
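
Example usage (a sketch; the configuration, folder, class list, suffix, and the y_true/logits arrays are illustrative placeholders):
>>> metrics = Metrics(
        config=BenchmarkConfig(...),
        logger=logging.getLogger(__name__),
        model_id="...",
        base_folder="...",
        classes=["...", ...],
        average="weighted"
    )
>>> results_df = metrics.compute(y_true=y_true, logits=logits, suffix="test")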

f1_score(y_true: array, y_pred: array) array[source]

This function contains the code to compute the F1 score.

F1 score formula:
>>> F1 = 2 * (precision * recall) / (precision + recall)

The F1 score is a statistical measure of a test’s accuracy, particularly in the context of binary classification. It is the harmonic mean of precision and recall, which are two important metrics for evaluating the performance of binary classifiers.

Parameters:
  • y_true – matrix of what the labels should be

  • y_pred – matrix with the predicted labels

Returns:

the F1 scores
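
For reference, an equivalent weighted F1 computation with scikit-learn (a sketch only; whether this class uses scikit-learn internally is an assumption):
>>> from sklearn.metrics import f1_score
>>> f1_score(y_true, y_pred, average="weighted")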

static hamming_score(y_true: array, y_pred: array) array[source]

This function contains the code to compute the Hamming score.

Hamming score formula:
>>> Hamming_score = (Σ (y_true_i == y_pred_i)) / (Σ 1)

The Hamming score is a useful metric for evaluating the performance of multi-label classification models, which are models that predict multiple labels for each instance. In multi-label classification, the Hamming score is more sensitive to errors than accuracy, as it considers both false positives and false negatives.

Parameters:
  • y_true – matrix of what the labels should be

  • y_pred – matrix with the predicted labels

Returns:

the hamming scores
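
As a sketch, the formula above boils down to an element-wise comparison, for example with NumPy (illustrative only):
>>> import numpy as np
>>> hamming = (np.asarray(y_true) == np.asarray(y_pred)).mean()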

precision_score(y_true: array, y_pred: array) array[source]

This function enables the computation of the precision score.

The precision formula:
>>> Precision = TP / (TP + FP)
>>> where:
>>>    TP is the number of true positives
>>>    FP is the number of false positives

In machine learning, precision, also known as positive predictive value (PPV), is the proportion of predicted positives that are actually positive. It is calculated as the number of true positives divided by the total number of predicted positives. Precision is a binary classification metric, meaning it is only relevant when a model is predicting one of two classes.

Parameters:
  • y_true – matrix of what the labels should be

  • y_pred – matrix with the predicted labels

Returns:

the precision scores
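
A minimal sketch of the formula above, assuming binary 0/1 NumPy matrices (illustrative only):
>>> import numpy as np
>>> tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
>>> fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
>>> precision = tp / (tp + fp)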

recall_score(y_true: array, y_pred: array) array[source]

This function computes the recall for the given input.

The recall formula:
>>> Recall = TP / (TP + FN)
>>> where:
>>>     TP is the number of true positives
>>>     FN is the number of false negatives

In machine learning, recall, also known as sensitivity, true positive rate (TPR), or completeness, is the proportion of actual positives that are correctly identified as such by the model. It is calculated as the number of true positives divided by the total number of actual positives. Recall is a binary classification metric, meaning it is only relevant when a model is predicting one of two classes.

Parameters:
  • y_true – matrix of what the labels should be

  • y_pred – matrix with the predicted labels

Returns:

the recall scores
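
A minimal sketch of the formula above, assuming binary 0/1 NumPy matrices (illustrative only):
>>> import numpy as np
>>> tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
>>> fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
>>> recall = tp / (tp + fn)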

Zeroshot Benchmark

class src.benchmark.zeroshot.ZeroshotBenchmark(config: Config, logger: Logger, request_handler: RequestHandler, model_ids: list[str] | str, taxonomy_reference: str = 'http://stad.gent/id/concepts/gent_words', checkpoint_dir: str = 'data', nested_mlflow_run: bool = False)[source]

Bases: BenchmarkWrapper

This is the wrapper class for zero-shot model benchmarking; for more information, check out the base class.

>>> benchmark = ZeroshotBenchmark(
        config=Config(),
        logger=logging.getLogger(__name__),
        request_handler=RequestHandler(),
        model_ids=["...", ...],
        taxonomy_reference="..."
    )
>>> benchmark()
property default_description

This property provides a getter for the default description that should be provided for mlflow logging

Example usage:
>>> benchmark = ZeroshotBenchmark(...)
>>> description = benchmark.default_description
Returns:

string description for mlflow run

property default_mlflow_tags

This property provides a getter for the default mlflow tags that are provided by the selection of the class

Example usage:
>>> benchmark = ZeroshotBenchmark(...)
>>> mlflow_tags = benchmark.default_mlflow_tags
Returns:

tags for mlflow