Multilabel datasets
This submodule contains all current implementations of the multilabel dataset concept. Each class below implements a different behaviour for presenting the data to the model.
Base dataset
- class src.dataset.multilabel.base.MultilabelTrainingDataset(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]
Bases:
TrainDataset

This class is the basic implementation of the multilabel dataset concept. As the name suggests, these are datasets that support functionality to process multilabel data.
It has extended functionality compared to the single-label dataset in order to manage the multiple labels.
This class derives from ‘TrainDataset’, which in turn is derived from torch.utils.data.Dataset.
- Typical usage:
>>> import logging
>>> from src.dataset.builder import DatasetBuilder
>>> dataset_builder = DatasetBuilder(...)
>>> ds = MultilabelTrainingDataset(
...     config=Config(),
...     logger=logging.getLogger(__name__),
...     taxonomy=Taxonomy(...),
...     dataset=dataset_builder.train_dataset,
... )
>>> # getting the first custom formatted data example
>>> print(ds[0])
- _get_label(idx: int) list[int][source]
This function implements the abstract logic for building the labels. It can be overridden and adapted without problems, as long as the default input/output signature is kept.
- Parameters:
idx – the integer value of the input index
- Returns:
a list of integer values indicating which labels should be predicted
- _get_text(idx: int) str[source]
This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.
- Parameters:
idx – the integer value of the input index
- Returns:
the text of the example at the given index
- property binarized_label_dictionary: dict[str, int]
This property allows users to retrieve the blank dictionary that can be used for well-formatted label binarization.
- Returns:
a blank dictionary keyed by the candid labels, ready for binarization
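The intended workflow can be sketched as follows. The label names and the dict comprehension standing in for the property are illustrative assumptions; in the real class the blank dictionary would come from `binarized_label_dictionary`.

```python
candid_labels = ["finance", "health", "law"]

# Stand-in for the property: every candidate label mapped to 0.
blank = {label: 0 for label in candid_labels}

# Mark the labels assigned to one example…
for assigned in ("law", "finance"):
    blank[assigned] = 1

# …then read off the binarized vector in a stable label order.
binarized = [blank[label] for label in candid_labels]
print(binarized)  # [1, 0, 1]
```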
- property candid_labels: list[str]
This property provides the functionality to set and retrieve the candid_labels, i.e. the labels that are used for model prediction.
- Returns:
list of strings representing the current labels
- property max_label_depth: int
This property can be used to set or retrieve the max depth used by the label-handling logic.
- Returns:
the integer value for the max depth
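As a rough illustration of how a max depth can bound hierarchical labels, consider truncating a dotted label path. The "." separator and the helper function are assumptions for this sketch, not the class's actual logic.

```python
def truncate_to_depth(label: str, max_label_depth: int) -> str:
    # Keep at most `max_label_depth` levels of a dotted label path.
    return ".".join(label.split(".")[:max_label_depth])

print(truncate_to_depth("economy.finance.banking", 2))  # economy.finance
print(truncate_to_depth("economy", 3))                  # economy
```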
Level 2 dataset
- class src.dataset.multilabel.secondlevel.MultiLabelSecondLevelFullText(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]
Bases:
MultilabelTrainingDataset

[Adaptation from base class] This implementation uses level-2 labels.
PARENT DOCS: see MultilabelTrainingDataset.
- _get_label(idx: int) list[int][source]
[Adapted implementation] _get_label returns only second-level labels.
This function implements the abstract logic for building the labels. It can be overridden and adapted without problems, as long as the default input/output signature is kept.
- Parameters:
idx – the integer value of the input index
- Returns:
a list of integer values indicating which labels should be predicted
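The "second level only" selection can be sketched as below. The dotted label-path format and the helper name are assumptions for illustration; the real class works on the Taxonomy's label structure.

```python
def second_level(label_path: str):
    # Return the level-2 component of a dotted label path,
    # or None when the label has no second level.
    parts = label_path.split(".")
    return parts[1] if len(parts) > 1 else None

print(second_level("economy.finance.banking"))  # finance
print(second_level("economy"))                  # None
```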
Summary stats dataset
- class src.dataset.multilabel.summary_statistic_dataset.SummaryStatisticDataset(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]
Bases:
MultilabelTrainingDataset, ABC

[Adapted from base class] This implementation uses level-2 labels.
PARENT DOCS: see MultilabelTrainingDataset.
- _get_label(idx: int, label_level: int) list[int][source]
[Adapted implementation] Overridden from the base class; the label is returned as a string value instead.
This function implements the abstract logic for building the labels. It can be overridden and adapted without problems, as long as the default input/output signature is kept.
- Parameters:
  - idx – the integer value of the input index
  - label_level – the label hierarchy level to build the labels for
- Returns:
a list of integer values indicating which labels should be predicted
Toplevel article based dataset
- class src.dataset.multilabel.toplevel_article_based.MultiLabelTopLevelArticleBased(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]
Bases:
MultiLabelTopLevelFullText

[Adaptation from base class] This implementation responds with articles only.
PARENT DOCS: see MultilabelTrainingDataset.
- _get_text(idx: int) str[source]
[Adapted implementation] _get_text returns only the short title and the article.
This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.
- Parameters:
idx – the integer value of the input index
- Returns:
the text of the example at the given index
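The adapted text builder can be sketched as below. The field names `short_title` and `article`, and the newline join, are assumptions about the dataset dictionaries, not the confirmed implementation.

```python
def get_text(example: dict) -> str:
    # Sketch: concatenate the short title and the article text.
    return f"{example['short_title']}\n{example['article']}"

print(get_text({"short_title": "Case 42", "article": "Article text…"}))
```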
Toplevel article split dataset
- class src.dataset.multilabel.toplevel_article_split.MultiLabelTopLevelArticleSplit(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]
Bases:
MultiLabelTopLevelFullText

[Adaptation from base class] This adaptation splits decisions based on their articles.
PARENT DOCS: see MultilabelTrainingDataset.
- _get_text(idx: int) str[source]
[Adapted implementation] _get_text returns only the articles.
This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.
- Parameters:
idx – the integer value of the input index
- Returns:
the text of the example at the given index
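Splitting one decision into one training example per article can be sketched as follows. The `articles` list field and the labels being shared across the resulting examples are assumptions for illustration.

```python
def split_by_article(decision: dict) -> list:
    # One example per article; each inherits the decision's labels.
    return [
        {"text": article, "labels": decision["labels"]}
        for article in decision["articles"]
    ]

rows = split_by_article({"articles": ["Art. 1 …", "Art. 2 …"], "labels": "law"})
print(len(rows))  # 2
```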
Toplevel description dataset
- class src.dataset.multilabel.toplevel_description_based.MultiLabelTopLevelDescriptionBased(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]
Bases:
MultiLabelTopLevelFullText

[Adaptation from base class] This implementation responds with the description only.
PARENT DOCS: —
- _get_text(idx: int) str[source]
[Adapted implementation] _get_text returns only the description.
This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.
- Parameters:
idx – the integer value of the input index
- Returns:
the text of the example at the given index
Toplevel general dataset
- class src.dataset.multilabel.toplevel_general.MultiLabelTopLevelFullText(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]
Bases:
MultilabelTrainingDataset

[Adaptation from base class] This implementation responds with the full text.
PARENT DOCS: see MultilabelTrainingDataset.
- _get_text(idx: int) str[source]
[Adapted implementation] _get_text returns the full text.
This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.
- Parameters:
idx – the integer value of the input index
- Returns:
the text of the example at the given index
Toplevel motivation dataset
- class src.dataset.multilabel.toplevel_motivation_based.MultiLabelTopLevelMotivationBased(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]
Bases:
MultiLabelTopLevelFullText

[Adaptation from base class] This implementation responds with the motivation only.
PARENT DOCS: see MultilabelTrainingDataset.
- _get_text(idx: int) str[source]
[Adapted implementation] _get_text returns only the motivation.
This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.
- Parameters:
idx – the integer value of the input index
- Returns:
the text of the example at the given index
Toplevel short title dataset
- class src.dataset.multilabel.toplevel_shortitle_based.MultiLabelTopLevelShortTitleBased(config: Config, logger: Logger, taxonomy: Taxonomy, dataset: list[dict[str, str]], tokenizer: AutoTokenizer = None, _device: device = device(type='cpu'), sub_node: str = None)[source]
Bases:
MultiLabelTopLevelFullText

[Adaptation from base class] This implementation responds with the short title only.
PARENT DOCS: see MultilabelTrainingDataset.
- _get_text(idx: int) str[source]
[Adapted implementation] _get_text returns only the short title.
This function implements the abstract logic to retrieve the text from the provided dataset. It is possible to override this function and define custom behaviour to create the text input for the model.
- Parameters:
idx – the integer value of the input index
- Returns:
the text of the example at the given index