Multilabel datasets

This submodule contains all the current implementations of the multilabel dataset concept. Each class below implements a different behaviour for presenting the data to the model.

Base dataset

class src.dataset.multilabel.base.MultilabelTrainingDataset(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]

Bases: TrainDataset

This class is the basic implementation of the multilabel dataset concept; as the name suggests, these are datasets that support functionality for processing multilabel data.

It has extended functionality compared to the single-label dataset in order to manage the multiple labels.

This class derives from ‘TrainDataset’, which in turn derives from torch.utils.data.Dataset.

Typical usage:
>>> from src.dataset.builder import DatasetBuilder
>>> dataset_builder = DatasetBuilder(...)
>>> ds = MultilabelTrainingDataset(
        config=Config(),
        logger=logging.getLogger(__name__),
        taxonomy=Taxonomy(...),
        dataset=dataset_builder.train_dataset
    )
>>> # getting the first custom formatted data example
>>> print(ds[0])
_get_label(idx: int) list[int][source]

This function implements the abstract logic for building the labels; it can be overridden and adapted without problems (as long as the default input/output signature is kept).

Parameters:

idx – the integer value of the input index

Returns:

a list of integer values where the labels should be predicted

_get_text(idx: int) str[source]

This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.

Parameters:

idx – the integer value of the input index

Returns:

the text for the given index as a string
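The override pattern these two hooks describe can be sketched with plain classes. Everything below is illustrative only: BaseDataset, ShortTitleDataset, and the record fields ("text", "short_title", "labels") are hypothetical stand-ins, not the project's real classes or schema.

```python
# Hedged sketch of the documented override pattern: subclasses keep the
# (idx) -> text and (idx) -> labels signatures and change only what is returned.
class BaseDataset:
    def __init__(self, dataset):
        self.dataset = dataset

    def _get_text(self, idx):
        # default behaviour: return the full text of the record
        return self.dataset[idx]["text"]

    def _get_label(self, idx):
        # default behaviour: return the record's label vector
        return self.dataset[idx]["labels"]


class ShortTitleDataset(BaseDataset):
    # [Adapted implementation] only return the short title, signature unchanged
    def _get_text(self, idx):
        return self.dataset[idx]["short_title"]


data = [{"text": "full decision text", "short_title": "a short title", "labels": [0, 1]}]
ds = ShortTitleDataset(data)
print(ds._get_text(0))   # a short title
print(ds._get_label(0))  # [0, 1]
```

Because the signatures are preserved, any training loop written against the base class works unchanged with every adaptation.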

property binarized_label_dictionary: dict[str, int]

This property allows users to retrieve the blank dictionary that can be used for well-formatted label binarization.

Returns:

a dictionary containing the binarized candid labels
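The role of such a blank dictionary can be illustrated with a minimal binarization sketch. The candidate labels and the helper below are assumptions for illustration, not the project's actual `binarized_label_dictionary` implementation:

```python
# Hypothetical sketch: start from a blank mapping of every candidate label
# to 0 (the "blank dictionary"), then flip the labels present in a record to 1.
candid_labels = ["crime", "tax", "health"]  # illustrative candidate labels

def binarize(record_labels):
    blank = {label: 0 for label in candid_labels}  # blank dictionary template
    for label in record_labels:
        if label in blank:
            blank[label] = 1
    return blank

print(binarize(["tax", "health"]))  # {'crime': 0, 'tax': 1, 'health': 1}
```

Starting every record from the same blank template guarantees that all binarized vectors share one key order, which is what "well-formatted" binarization requires.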

property candid_labels: list[str]

This property provides the functionality to set and retrieve the candid_labels. The candid_labels are the labels that are used for model prediction.

Returns:

list of strings representing the current labels

get_specific_record(idx: int, label_level: int) dict[str, str | list][source]
property max_label_depth: int

This property can be used to set or retrieve the max depth (for the label-handling logic).

Returns:

the integer value for the max depth

property taxonomy: Taxonomy

This property can be used to set or retrieve the taxonomy that is/will be used for processing and finding of labels.

Returns:

an instance of Taxonomy

Level 2 dataset

class src.dataset.multilabel.secondlevel.MultiLabelSecondLevelFullText(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]

Bases: MultilabelTrainingDataset

[Adaptation from base class] This implementation uses level-2 labels

PARENT DOCS: —

This class is the basic implementation of the multilabel dataset concept; as the name suggests, these are datasets that support functionality for processing multilabel data.

It has extended functionality compared to the single-label dataset in order to manage the multiple labels.

This class derives from ‘TrainDataset’, which in turn derives from torch.utils.data.Dataset.

Typical usage:
>>> from src.dataset.builder import DatasetBuilder
>>> dataset_builder = DatasetBuilder(...)
>>> ds = MultilabelTrainingDataset(
        config=Config(),
        logger=logging.getLogger(__name__),
        taxonomy=Taxonomy(...),
        dataset=dataset_builder.train_dataset
    )
>>> # getting the first custom formatted data example
>>> print(ds[0])
_get_label(idx: int) list[int][source]

[Adapted implementation] get label returns only second level labels

This function implements the abstract logic for building the labels; it can be overridden and adapted without problems (as long as the default input/output signature is kept).

Parameters:

idx – the integer value of the input index

Returns:

a list of integer values where the labels should be predicted

_get_text(idx: int) str[source]

[Adapted implementation] get text for second level dataset.

Summary stats dataset

class src.dataset.multilabel.summary_statistic_dataset.SummaryStatisticDataset(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]

Bases: MultilabelTrainingDataset, ABC

[Adapted from base class] This implementation uses level-2 labels

PARENT DOCS: —

This class is the basic implementation of the multilabel dataset concept; as the name suggests, these are datasets that support functionality for processing multilabel data.

It has extended functionality compared to the single-label dataset in order to manage the multiple labels.

This class derives from ‘TrainDataset’, which in turn derives from torch.utils.data.Dataset.

Typical usage:
>>> from src.dataset.builder import DatasetBuilder
>>> dataset_builder = DatasetBuilder(...)
>>> ds = MultilabelTrainingDataset(
        config=Config(),
        logger=logging.getLogger(__name__),
        taxonomy=Taxonomy(...),
        dataset=dataset_builder.train_dataset
    )
>>> # getting the first custom formatted data example
>>> print(ds[0])
_get_label(idx: int, label_level: int) list[int][source]

[Adapted implementation] Overridden from the base class; the label responds with a string value instead.

This function implements the abstract logic for building the labels; it can be overridden and adapted without problems (as long as the default input/output signature is kept).

Parameters:

idx – the integer value of the input index

Returns:

a list of integer values where the labels should be predicted

get_specific_record(idx: int, label_level: int) dict[str, str | list][source]

This function implements the functionality to retrieve which label is available at which level.

Parameters:
  • idx – the index to take as integer value

  • label_level – the label level as integer value

Returns:

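A per-level lookup in the spirit of `get_specific_record(idx, label_level)` can be sketched as follows. The nested label paths and the shape of the returned dictionary are assumptions made for illustration, not the project's actual record format:

```python
# Illustrative sketch: each record carries hierarchical label paths; a
# specific level is selected by indexing into every path that is deep enough.
records = [
    {"text": "decision text", "label_paths": [["economy", "tax"], ["justice", "crime"]]},
]

def get_specific_record(idx, label_level):
    record = records[idx]
    # keep only the label at the requested depth from each path
    labels = [path[label_level] for path in record["label_paths"]
              if label_level < len(path)]
    return {"text": record["text"], "labels": labels}

print(get_specific_record(0, 1))  # {'text': 'decision text', 'labels': ['tax', 'crime']}
```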
Toplevel article based dataset

class src.dataset.multilabel.toplevel_article_based.MultiLabelTopLevelArticleBased(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]

Bases: MultiLabelTopLevelFullText

[Adaptation from base class] This implementation responds with Articles only

PARENT DOCS: —

This class is the basic implementation of the multilabel dataset concept; as the name suggests, these are datasets that support functionality for processing multilabel data.

It has extended functionality compared to the single-label dataset in order to manage the multiple labels.

This class derives from ‘TrainDataset’, which in turn derives from torch.utils.data.Dataset.

Typical usage:
>>> from src.dataset.builder import DatasetBuilder
>>> dataset_builder = DatasetBuilder(...)
>>> ds = MultilabelTrainingDataset(
        config=Config(),
        logger=logging.getLogger(__name__),
        taxonomy=Taxonomy(...),
        dataset=dataset_builder.train_dataset
    )
>>> # getting the first custom formatted data example
>>> print(ds[0])
_get_text(idx: int) str[source]

[Adapted implementation] get text returns only the short title and the article

This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.

Parameters:

idx – the integer value of the input index

Returns:

the text for the given index as a string

Toplevel article split dataset

class src.dataset.multilabel.toplevel_article_split.MultiLabelTopLevelArticleSplit(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]

Bases: MultiLabelTopLevelFullText

[Adaptation from base class] This adaptation splits decisions based on the articles

PARENT DOCS: —

This class is the basic implementation of the multilabel dataset concept; as the name suggests, these are datasets that support functionality for processing multilabel data.

It has extended functionality compared to the single-label dataset in order to manage the multiple labels.

This class derives from ‘TrainDataset’, which in turn derives from torch.utils.data.Dataset.

Typical usage:
>>> from src.dataset.builder import DatasetBuilder
>>> dataset_builder = DatasetBuilder(...)
>>> ds = MultilabelTrainingDataset(
        config=Config(),
        logger=logging.getLogger(__name__),
        taxonomy=Taxonomy(...),
        dataset=dataset_builder.train_dataset
    )
>>> # getting the first custom formatted data example
>>> print(ds[0])
_get_text(idx: int) str[source]

[Adapted implementation] get text returns only the articles

This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.

Parameters:

idx – the integer value of the input index

Returns:

the text for the given index as a string

_remap_dataset() None[source]

This function remaps the input dataset to an article-based dataset, splitting the documents into separate articles.
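The remapping `_remap_dataset` describes can be sketched as follows; the field names ("articles", "labels", "text") are assumptions for illustration, not the project's actual record schema:

```python
# Hedged sketch: a decision with N articles becomes N records, each carrying
# one article as its text and sharing the parent decision's label vector.
documents = [
    {"articles": ["article 1 text", "article 2 text"], "labels": [1, 0]},
    {"articles": ["article 1 text"], "labels": [0, 1]},
]

def remap_to_articles(docs):
    remapped = []
    for doc in docs:
        for article in doc["articles"]:
            remapped.append({"text": article, "labels": doc["labels"]})
    return remapped

print(len(remap_to_articles(documents)))  # 3
```

Note that this makes the dataset length the total article count rather than the document count, so indices after remapping no longer correspond to documents.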

Toplevel description dataset

class src.dataset.multilabel.toplevel_description_based.MultiLabelTopLevelDescriptionBased(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]

Bases: MultiLabelTopLevelFullText

[Adaptation from base class] This implementation responds with description only

PARENT DOCS: —

_get_text(idx: int) str[source]

[Adapted implementation] get text returns only the description

This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.

Parameters:

idx – the integer value of the input index

Returns:

the text for the given index as a string

Toplevel general dataset

class src.dataset.multilabel.toplevel_general.MultiLabelTopLevelFullText(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]

Bases: MultilabelTrainingDataset

[Adaptation from base class] This implementation responds with the full text

PARENT DOCS: —

This class is the basic implementation of the multilabel dataset concept; as the name suggests, these are datasets that support functionality for processing multilabel data.

It has extended functionality compared to the single-label dataset in order to manage the multiple labels.

This class derives from ‘TrainDataset’, which in turn derives from torch.utils.data.Dataset.

Typical usage:
>>> from src.dataset.builder import DatasetBuilder
>>> dataset_builder = DatasetBuilder(...)
>>> ds = MultilabelTrainingDataset(
        config=Config(),
        logger=logging.getLogger(__name__),
        taxonomy=Taxonomy(...),
        dataset=dataset_builder.train_dataset
    )
>>> # getting the first custom formatted data example
>>> print(ds[0])
_get_text(idx: int) str[source]

[Adapted implementation] get text returns the full text

This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.

Parameters:

idx – the integer value of the input index

Returns:

the text for the given index as a string

Toplevel motivation dataset

class src.dataset.multilabel.toplevel_motivation_based.MultiLabelTopLevelMotivationBased(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]

Bases: MultiLabelTopLevelFullText

[Adaptation from base class] This implementation responds with motivation only

PARENT DOCS: —

This class is the basic implementation of the multilabel dataset concept; as the name suggests, these are datasets that support functionality for processing multilabel data.

It has extended functionality compared to the single-label dataset in order to manage the multiple labels.

This class derives from ‘TrainDataset’, which in turn derives from torch.utils.data.Dataset.

Typical usage:
>>> from src.dataset.builder import DatasetBuilder
>>> dataset_builder = DatasetBuilder(...)
>>> ds = MultilabelTrainingDataset(
        config=Config(),
        logger=logging.getLogger(__name__),
        taxonomy=Taxonomy(...),
        dataset=dataset_builder.train_dataset
    )
>>> # getting the first custom formatted data example
>>> print(ds[0])
_get_text(idx: int) str[source]

[Adapted implementation] get text returns only the motivation

This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.

Parameters:

idx – the integer value of the input index

Returns:

the text for the given index as a string

Toplevel short title dataset

class src.dataset.multilabel.toplevel_shortitle_based.MultiLabelTopLevelShortTitleBased(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]

Bases: MultiLabelTopLevelFullText

[Adaptation from base class] This implementation responds with short title only

PARENT DOCS: —

This class is the basic implementation of the multilabel dataset concept; as the name suggests, these are datasets that support functionality for processing multilabel data.

It has extended functionality compared to the single-label dataset in order to manage the multiple labels.

This class derives from ‘TrainDataset’, which in turn derives from torch.utils.data.Dataset.

Typical usage:
>>> from src.dataset.builder import DatasetBuilder
>>> dataset_builder = DatasetBuilder(...)
>>> ds = MultilabelTrainingDataset(
        config=Config(),
        logger=logging.getLogger(__name__),
        taxonomy=Taxonomy(...),
        dataset=dataset_builder.train_dataset
    )
>>> # getting the first custom formatted data example
>>> print(ds[0])
_get_text(idx: int) str[source]

[Adapted implementation] returns only the short title

This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.

Parameters:

idx – the integer value of the input index

Returns:

the text for the given index as a string