Single label datasets

This submodule contains all current implementations of single-label datasets. Each class below presents the data to the model in a different way.

Base dataset

class src.dataset.singlelabel.base.TrainingDataset(config: Config, logger: Logger, taxonomy: Taxonomy, data: Any, tokenizer: AutoTokenizer = None, device: str = 'cpu')[source]

Bases: TrainDataset

This class is the basic implementation of the single-label dataset concept. It derives from ‘TrainDataset’, which in turn derives from torch.utils.data.Dataset.

Typical usage:

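>>> import logging
>>> from src.dataset.builder import DatasetBuilder
>>> from src.dataset.singlelabel.base import TrainingDataset
>>> # Config and Taxonomy are project classes; their import paths are not shown here
>>> dataset_builder = DatasetBuilder(...)
>>> ds = TrainingDataset(
...     config=Config(),
...     logger=logging.getLogger(__name__),
...     taxonomy=Taxonomy(...),
...     data=dataset_builder.train_dataset,
... )
>>> # get the first custom-formatted data example
>>> print(ds[0])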

abstract __get_label(idx: int) → Tensor | str

abstract __get_text(idx: int) → dict[str, Tensor] | str
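The two abstract hooks above follow a template-method pattern: the base class assembles each example and the subclasses decide how text and label are produced. A minimal sketch of that pattern follows. Everything in it is an assumption for illustration: `SketchTrainingDataset` and `SketchConcrete` are hypothetical stand-ins, the list-of-dicts record layout is invented, and single-underscore hook names are used (as in SingleTopLevel) because double-underscore names are name-mangled inside a class body.

```python
from abc import ABC, abstractmethod


class SketchTrainingDataset(ABC):
    """Hypothetical stand-in for TrainingDataset (not the project's class)."""

    def __init__(self, data: list[dict[str, str]]) -> None:
        # Assumed record layout: one dict per example.
        self.data = data

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> dict[str, str]:
        # Assemble one example from the two hooks.
        return {"text": self._get_text(idx), "label": self._get_label(idx)}

    @abstractmethod
    def _get_text(self, idx: int) -> str:
        """Return the model input for record idx."""

    @abstractmethod
    def _get_label(self, idx: int) -> str:
        """Return the label for record idx."""


class SketchConcrete(SketchTrainingDataset):
    """Concrete subclass: supplies both hooks, so it can be instantiated."""

    def _get_text(self, idx: int) -> str:
        return self.data[idx]["text"]

    def _get_label(self, idx: int) -> str:
        return self.data[idx]["label"]


ds = SketchConcrete([{"text": "road works", "label": "mobility"}])
first = ds[0]
```

Because both hooks are abstract, instantiating the sketch base class directly raises `TypeError`; only subclasses that implement both can be created.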

_abc_impl = <_abc._abc_data object>

iter_records()[source]

Iterates over all the items in the dataset.

Returns: an iterator over the items in the dataset
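One plausible way to write such an iterator is a generator that walks the indices and yields each item. The sketch below is an assumption, not the project's implementation: `SketchDataset` is a hypothetical stand-in that only relies on `__len__` and `__getitem__`.

```python
from typing import Any, Iterator


class SketchDataset:
    """Hypothetical stand-in; only __len__ and __getitem__ are assumed."""

    def __init__(self, records: list[Any]) -> None:
        self._records = records

    def __len__(self) -> int:
        return len(self._records)

    def __getitem__(self, idx: int) -> Any:
        return self._records[idx]

    def iter_records(self) -> Iterator[Any]:
        # Yield items lazily, one index at a time.
        for idx in range(len(self)):
            yield self[idx]


records = list(SketchDataset(["a", "b", "c"]).iter_records())
```

A generator keeps memory use flat, since items are produced one at a time instead of being collected into a list first.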

Basic dataset

class src.dataset.singlelabel.basic.BasicDataset(config: Config, logger: Logger, taxonomy: Taxonomy, tokenizer: AutoTokenizer, data: pd.DataFrame, device: str = 'cpu')[source]

Bases: TrainingDataset

[unused]

_abc_impl = <_abc._abc_data object>

Toplevel dataset

class src.dataset.singlelabel.single_toplevel.SingleTopLevel(config: Config, logger: Logger, taxonomy: Taxonomy, data: dict[str, str], tokenizer: AutoTokenizer = None, device: str = 'cpu')[source]

Bases: TrainingDataset

[Adapted from base class] This implementation returns the full text.

PARENT DOCS: —

This class is the basic implementation of the single-label dataset concept. It derives from ‘TrainDataset’, which in turn derives from torch.utils.data.Dataset.

Typical usage:

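>>> import logging
>>> from src.dataset.builder import DatasetBuilder
>>> from src.dataset.singlelabel.base import TrainingDataset
>>> # Config and Taxonomy are project classes; their import paths are not shown here
>>> dataset_builder = DatasetBuilder(...)
>>> ds = TrainingDataset(
...     config=Config(),
...     logger=logging.getLogger(__name__),
...     taxonomy=Taxonomy(...),
...     data=dataset_builder.train_dataset,
... )
>>> # get the first custom-formatted data example
>>> print(ds[0])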

_abc_impl = <_abc._abc_data object>

_get_label(idx: int) → str[source]

This function implements the abstract logic for building the labels. It can be overridden and adapted freely, as long as the default input/output signature is kept.

Parameters: idx – the integer value of the input index
Returns: the label

_get_text(idx: int) → str[source]

This function implements the abstract logic for retrieving the text from the provided dataset. It can be overridden to define custom behaviour for creating the text input for the model.

Parameters: idx – the integer value of the input index
Returns: the text for the given index
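Overriding `_get_text` while keeping its `int → str` signature might look like the sketch below. It is illustrative only: `SketchTopLevel` is a hypothetical stand-in for SingleTopLevel, `TitleAndText` is an invented subclass, and the `title`/`text`/`label` record keys are assumptions about the data layout.

```python
class SketchTopLevel:
    """Hypothetical stand-in for SingleTopLevel; the record layout is assumed."""

    def __init__(self, data: list[dict[str, str]]) -> None:
        self.data = data

    def _get_text(self, idx: int) -> str:
        # Default behaviour: return the full text.
        return self.data[idx]["text"]

    def _get_label(self, idx: int) -> str:
        return self.data[idx]["label"]

    def __getitem__(self, idx: int) -> dict[str, str]:
        return {"text": self._get_text(idx), "label": self._get_label(idx)}


class TitleAndText(SketchTopLevel):
    """Overrides only the text hook; label handling is inherited."""

    def _get_text(self, idx: int) -> str:
        # Same signature (int -> str), different input construction.
        record = self.data[idx]
        return f"{record['title']}: {record['text']}"


ds = TitleAndText(
    [{"title": "Notice", "text": "road closed", "label": "mobility"}]
)
example = ds[0]
```

Since the signature is unchanged, the inherited `__getitem__` keeps working without modification.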