Benchmarks
Add more information here later!
Abstract Baseclass
Regular Benchmark
- class src.benchmark.regular.BenchmarkWrapper(config: Config, logger: Logger, request_handler: RequestHandler, model_ids: list[str] | str, taxonomy_reference: str, checkpoint_dir: str = 'data', nested_mlflow_run: bool = False)[source]
Bases:
BenchmarkBase
This is the base class for benchmarking models.
The main objective of this class is to bring multiple components together; each component implements its own custom behaviour for a specific part of the benchmark.
- These sub-components are:
Dataset: how do we transform the SPARQL data into training/inference data?
Model architecture (or type): a wrapper object around a model to allow abstract usage of methods that are universally implemented.
Model base: model weights to load into the predefined model architecture.
Taxonomy: a specific taxonomy to use for predictions.
The previously mentioned components are mostly declared in the config; only the model base is provided via the model_ids parameter.
- Typical usage:
>>> benchmark = BenchmarkWrapper(
...     config=Config(),
...     logger=logging.getLogger(__name__),
...     request_handler=RequestHandler(),
...     model_ids=["...", ...],
...     taxonomy_reference="...",
... )
>>> benchmark()
- _create_dataset(checkpoint: str | None) None[source]
Internal function responsible for creating the benchmarking dataset. When no checkpoint is provided, it automatically builds the dataset by pulling all annotated information for the provided taxonomy. A short usage sketch follows this entry.
- Parameters:
checkpoint – folder or path where the benchmark data can be found
- Returns:
None
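A short usage sketch (illustrative only; benchmark is assumed to be an instantiated subclass, and "data" is the default checkpoint directory):
>>> benchmark._create_dataset(checkpoint=None)    # build the dataset from the annotated taxonomy data
>>> benchmark._create_dataset(checkpoint="data")  # reuse a previously saved checkpoint folder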
- _create_run_name(model_id: str | None = None) str[source]
Internal function that generates custom run names; these are used for verbose naming in MLflow tracking.
- Parameters:
model_id – the model id of the currently selected model
- Returns:
a custom string that represents the unique combination of components
- property default_description: str
This property provides a getter for the default description used for MLflow logging.
- Example usage:
>>> benchmark = BenchmarkWrapper(...)
>>> description = benchmark.default_description
- Returns:
string description for mlflow run
- property default_mlflow_tags: dict[str, str]
This property provides a getter for the default MLflow tags determined by the selected class.
- Example usage:
>>> benchmark = BenchmarkWrapper(...)
>>> mlflow_tags = benchmark.default_mlflow_tags
- Returns:
tags for mlflow
- exec()[source]
This function is responsible for the execution of the benchmark. It creates (nested) MLflow runs based on the predefined config and the selected base model.
For each combination, a (nested) run will appear in the MLflow interface containing all the artifacts created by a benchmark run (more information about the artifacts can be found in the evaluate class). A generic sketch of nested MLflow runs is shown after this entry.
- Example usage:
>>> benchmark = BenchmarkWrapper(...)
>>> benchmark.exec()
- Returns:
None
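For readers unfamiliar with nested MLflow runs, the general pattern looks roughly like this (a generic sketch of the MLflow API itself, not this class's exact implementation; the run names, model ids, and metric are placeholders):
>>> import mlflow
>>> with mlflow.start_run(run_name="benchmark"):
...     for model_id in ["model-a", "model-b"]:
...         with mlflow.start_run(run_name=model_id, nested=True):
...             mlflow.set_tags({"model_id": model_id})
...             mlflow.log_metric("f1", 0.0)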
Embedding Similarity Benchmark
- class src.benchmark.embedding_similarity.EmbeddingSimilarityBenchmark(config: Config, logger: Logger, request_handler: RequestHandler, model_ids: list[str] | str, taxonomy_reference: str = 'http://stad.gent/id/concepts/gent_words', nested_mlflow_run: bool = False, checkpoint_dir: str = 'data')[source]
Bases:
BenchmarkWrapper
This is the wrapper class for benchmarking embedding-similarity models; for more information, check out the base class. A conceptual numpy sketch of embedding similarity follows the example below.
>>> benchmark = EmbeddingSimilarityBenchmark(
...     config=Config(),
...     logger=logging.getLogger(__name__),
...     request_handler=RequestHandler(),
...     model_ids=["...", ...],
...     taxonomy_reference="...",
... )
>>> benchmark()
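The general idea behind embedding-similarity classification, sketched with plain numpy (conceptual only; the embedding dimensions and threshold are placeholders, not this class's internal code):
>>> import numpy as np
>>> doc_emb = np.random.rand(4, 384)      # 4 documents, 384-dimensional embeddings
>>> label_emb = np.random.rand(10, 384)   # 10 taxonomy concepts
>>> doc_n = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
>>> label_n = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
>>> similarity = doc_n @ label_n.T        # cosine similarity, shape (4, 10)
>>> predictions = similarity > 0.5        # labels scoring above the threshold are predicted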
- property default_description
This property provides a getter for the default description used for MLflow logging.
- Example usage:
>>> benchmark = EmbeddingSimilarityBenchmark(...)
>>> description = benchmark.default_description
- Returns:
string description for mlflow run
- property default_mlflow_tags
This property provides a getter for the default MLflow tags determined by the selected class.
- Example usage:
>>> benchmark = EmbeddingSimilarityBenchmark(...)
>>> mlflow_tags = benchmark.default_mlflow_tags
- Returns:
tags for mlflow
Evaluate
- class src.benchmark.evaluate.MultilabelEvaluation(config: Config, logger: Logger, model: Model, dataset: TrainDataset, multilabel: bool = True)[source]
Bases:
object
This class is the framework that executes a specific evaluation of a model. The evaluation loops over the dataset and generates predictions for each record; once completed, it executes the metric calculation script. A conceptual sketch of this loop follows the example below.
- Example usage:
>>> multilabel_eval = MultilabelEvaluation(
...     config=Config(),
...     logger=logging.getLogger(__name__),
...     model=Model(...),
...     dataset=TrainDataset(...),
...     multilabel=True,
... )
>>> multilabel_eval.evaluate()
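Conceptually, the evaluation loop looks roughly as follows (a hedged sketch: the record fields, the predict method, and the metrics object are illustrative assumptions, not the actual interface):
>>> import numpy as np
>>> y_true, logits = [], []
>>> for record in dataset:                            # iterate over all benchmark records
...     y_true.append(record["labels"])               # expected label vector (assumed field name)
...     logits.append(model.predict(record["text"]))  # per-class scores (assumed method)
>>> results = metrics.compute(np.array(y_true), np.array(logits))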
Hybrid Benchmark
- class src.benchmark.hybrid.HybridBenchmark(config: Config, logger: Logger, request_handler: RequestHandler, unsupervised_model_ids: list[str] | str, supervised_model_id: str, unsupervised_model_type: ModelType, taxonomy_reference: str = 'http://stad.gent/id/concepts/gent_words', checkpoint_dir: str = 'data', nested_mlflow_run: bool = False)[source]
Bases:
BenchmarkWrapper
This is the wrapper class for hybrid model benchmarking; for more information, check out the base class.
>>> benchmark = HybridBenchmark(
...     config=Config(),
...     logger=logging.getLogger(__name__),
...     request_handler=RequestHandler(),
...     unsupervised_model_ids=["...", ...],
...     supervised_model_id="...",
...     unsupervised_model_type=ModelType(...),
...     taxonomy_reference="...",
... )
>>> benchmark()
- property default_description
This property provides a getter for the default description used for MLflow logging.
- Example usage:
>>> benchmark = HybridBenchmark(...)
>>> description = benchmark.default_description
- Returns:
string description for mlflow run
- property default_mlflow_tags
This property provides a getter for the default MLflow tags determined by the selected class.
- Example usage:
>>> benchmark = HybridBenchmark(...)
>>> mlflow_tags = benchmark.default_mlflow_tags
- Returns:
tags for mlflow
Metrics
- class src.benchmark.metrics.Metrics(config: BenchmarkConfig, logger: Logger, model_id: str, base_folder: str, classes: list[str], average: str = 'weighted')[source]
Bases:
object
This class is used to compute metrics for a benchmarking run. It is capable of generating multiple different artifacts that can be used for further model performance analysis.
- compute(y_true: array, logits: array, suffix: str | None = None, save: bool = True) DataFrame[source]
This function brings all previous calculations together. Paired with the config, certain artifacts and metrics can be enabled or disabled; a usage sketch follows the Returns section below.
- Metrics that can be used:
Hamming score
F1 score
Precision
Recall
- Artifacts that can be generated:
Confusion matrix
Classification report
Precision-recall plot
Overview plot
- Parameters:
y_true – the values that are expected to be predicted
logits – the logits for the predicted values
suffix – a suffix that can be used for custom naming of the artifacts
save – flag to save the artifacts (currently unused; the config handles this)
- Returns:
pandas dataframe containing all the metric values
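A hedged usage sketch (the config, model id, base folder, class names, and logits below are illustrative placeholders, not values from this project):
>>> import logging
>>> import numpy as np
>>> metrics = Metrics(
...     config=BenchmarkConfig(),
...     logger=logging.getLogger(__name__),
...     model_id="model-a",
...     base_folder="data",
...     classes=["c1", "c2", "c3", "c4"],
... )
>>> y_true = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 0]])
>>> logits = np.random.rand(3, 4)   # raw per-class scores for 3 samples
>>> df = metrics.compute(y_true, logits, suffix="validation")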
- f1_score(y_true: array, y_pred: array) array[source]
This function computes the F1 score.
- F1 score formula:
>>> F1 = 2 * (precision * recall) / (precision + recall)
The F1 score is a statistical measure of a test’s accuracy, particularly in the context of binary classification. It is the harmonic mean of precision and recall, which are two important metrics for evaluating the performance of binary classifiers.
- Parameters:
y_true – matrix with the expected labels
y_pred – matrix with the predicted labels
- Returns:
the F1 scores
- static hamming_score(y_true: array, y_pred: array) array[source]
This function contains the code to compute the Hamming score; a small numpy illustration follows this entry.
- Hamming score formula:
>>> Hamming score = (Σ (y_true_i == y_pred_i)) / (Σ 1)
The Hamming score is a useful metric for evaluating the performance of multi-label classification models, which are models that predict multiple labels for each instance. In multi-label classification, the Hamming score is more sensitive to errors than accuracy, as it considers both false positives and false negatives.
- Parameters:
y_true – matrix with the expected labels
y_pred – matrix with the predicted labels
- Returns:
the Hamming scores
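A minimal numpy illustration of the formula above (not the class's internal code; the arrays are placeholders):
>>> import numpy as np
>>> y_true = np.array([[1, 0, 1], [0, 1, 0]])
>>> y_pred = np.array([[1, 0, 0], [0, 1, 0]])
>>> round(float((y_true == y_pred).mean()), 3)   # fraction of label positions predicted correctly
0.833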
- precision_score(y_true: array, y_pred: array) array[source]
This function enables the computation of the precision score.
- The precision formula:
>>> Precision = TP / (TP + FP)
>>> where:
>>> TP is the number of true positives
>>> FP is the number of false positives
In machine learning, precision, also known as positive predictive value (PPV), is the proportion of predicted positives that are actually positive. It is calculated as the number of true positives divided by the total number of predicted positives. Precision is a binary classification metric, meaning it is only relevant when a model is predicting one of two classes.
- Parameters:
y_true – matrix with the expected labels
y_pred – matrix with the predicted labels
- Returns:
the precision scores
- recall_score(y_true: array, y_pred: array) array[source]
This function computes the recall score for the given input; a worked example combining precision, recall, and F1 follows this entry.
- The recall formula:
>>> Recall = TP / (TP + FN)
>>> where:
>>> TP is the number of true positives
>>> FN is the number of false negatives
In machine learning, recall, also known as sensitivity, true positive rate (TPR), or completeness, is the proportion of actual positives that are correctly identified as such by the model. It is calculated as the number of true positives divided by the total number of actual positives. Recall is a binary classification metric, meaning it is only relevant when a model is predicting one of two classes.
- Parameters:
y_true – matrix with the expected labels
y_pred – matrix with the predicted labels
- Returns:
the recall scores
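A worked numpy example tying the precision, recall, and F1 formulas above together (illustrative only, not the class's internal code):
>>> import numpy as np
>>> y_true = np.array([1, 1, 1, 0, 0, 1])
>>> y_pred = np.array([1, 0, 1, 1, 0, 1])
>>> tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # 3 true positives
>>> fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # 1 false positive
>>> fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # 1 false negative
>>> precision = tp / (tp + fp)                        # 0.75
>>> recall = tp / (tp + fn)                           # 0.75
>>> round(2 * precision * recall / (precision + recall), 2)
0.75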
Zeroshot Benchmark
- class src.benchmark.zeroshot.ZeroshotBenchmark(config: Config, logger: Logger, request_handler: RequestHandler, model_ids: list[str] | str, taxonomy_reference: str = 'http://stad.gent/id/concepts/gent_words', checkpoint_dir: str = 'data', nested_mlflow_run: bool = False)[source]
Bases:
BenchmarkWrapper
This is the wrapper class for zero-shot model benchmarking; for more information, check out the base class. A generic zero-shot classification sketch follows the example below.
>>> benchmark = ZeroshotBenchmark(
...     config=Config(),
...     logger=logging.getLogger(__name__),
...     request_handler=RequestHandler(),
...     model_ids=["...", ...],
...     taxonomy_reference="...",
... )
>>> benchmark()
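For context, zero-shot classification in general can be sketched with the Hugging Face transformers pipeline (an assumption about the flavour of models being benchmarked, not necessarily this class's implementation; the model name, text, and labels are placeholders):
>>> from transformers import pipeline
>>> classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
>>> classifier("The city is organising a clean-up in the park.",
...            candidate_labels=["environment", "mobility", "culture"],
...            multi_label=True)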
- property default_description
This property provides a getter for the default description used for MLflow logging.
- Example usage:
>>> benchmark = ZeroshotBenchmark(...)
>>> description = benchmark.default_description
- Returns:
string description for mlflow run
- property default_mlflow_tags
This property provides a getter for the default MLflow tags determined by the selected class.
- Example usage:
>>> benchmark = ZeroshotBenchmark(...)
>>> mlflow_tags = benchmark.default_mlflow_tags
- Returns:
tags for mlflow