Single label datasets
This submodule contains all current implementations of single-label datasets. Each class below presents the data to the model in a different way.
Base dataset
- class src.dataset.singlelabel.base.TrainingDataset(config: Config, logger: Logger, taxonomy: Taxonomy, data: Any, tokenizer: AutoTokenizer = None, device: str = 'cpu')[source]
Bases: TrainDataset
This class is the basic implementation of the single-label dataset concept. It is based on ‘TrainDataset’, which is itself derived from torch.utils.data.Dataset.
- Typical usage:
>>> from src.dataset.builder import DatasetBuilder
>>> dataset_builder = DatasetBuilder(...)
>>> ds = TrainingDataset(
...     config=Config(),
...     logger=logging.logger,
...     taxonomy=Taxonomy(...),
...     data=dataset_builder.train_dataset
... )
>>> # getting the first custom formatted data example
>>> print(ds[0])
- abstract __get_label(idx: int) Tensor | str
- abstract __get_text(idx: int) dict[str, Tensor] | str
- _abc_impl = <_abc._abc_data object>
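The two abstract hooks above suggest a template-method layout: the base class handles item access and leaves text and label retrieval to subclasses. The following is a minimal sketch of that pattern, not the project's code. `SketchBaseDataset`, `FullTextDataset`, and the record layout (a list of dicts with `text` and `label` keys) are assumptions made for illustration, and the sketch relies only on the standard library rather than on `torch.utils.data.Dataset`.

```python
from abc import ABC, abstractmethod


class SketchBaseDataset(ABC):
    """Hypothetical stand-in for the base dataset (not the real class)."""

    def __init__(self, data):
        # Assumed layout: a list of {"text": ..., "label": ...} records.
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Item access is composed from the two subclass hooks.
        return {"text": self._get_text(idx), "label": self._get_label(idx)}

    @abstractmethod
    def _get_text(self, idx):
        """Return the model input for the example at position idx."""

    @abstractmethod
    def _get_label(self, idx):
        """Return the label for the example at position idx."""


class FullTextDataset(SketchBaseDataset):
    """Hypothetical subclass that returns the stored text unchanged."""

    def _get_text(self, idx):
        return self.data[idx]["text"]

    def _get_label(self, idx):
        return self.data[idx]["label"]


records = [{"text": "Road works on Main Street", "label": "mobility"}]
ds = FullTextDataset(records)
first = ds[0]
```

Because the hooks are declared abstract, instantiating the base class directly raises a `TypeError`; only subclasses that implement both hooks can be constructed.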
Basic dataset
Toplevel dataset
- class src.dataset.singlelabel.single_toplevel.SingleTopLevel(config: Config, logger: Logger, taxonomy: Taxonomy, data: dict[str, str], tokenizer: AutoTokenizer = None, device: str = 'cpu')[source]
Bases: TrainingDataset
[Adapted from base class] This implementation returns the full text.
PARENT DOCS: —
This class is the basic implementation of the single-label dataset concept. It is based on ‘TrainDataset’, which is itself derived from torch.utils.data.Dataset.
- Typical usage:
>>> from src.dataset.builder import DatasetBuilder
>>> dataset_builder = DatasetBuilder(...)
>>> ds = TrainingDataset(
...     config=Config(),
...     logger=logging.logger,
...     taxonomy=Taxonomy(...),
...     data=dataset_builder.train_dataset
... )
>>> # getting the first custom formatted data example
>>> print(ds[0])
- _abc_impl = <_abc._abc_data object>
- _get_label(idx: int) str[source]
This function implements the abstract logic for building the labels. It can be overridden and adapted freely, as long as the default input and output signature is kept.
- Parameters:
idx – the integer value of the input index
- Returns:
the label
- _get_text(idx: int) str[source]
This function implements the abstract logic for retrieving the text from the provided dataset. It can be overridden to define custom behaviour for creating the text input for the model.
- Parameters:
idx – the integer value of the input index
- Returns:
the text for the given index, as a string
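Overriding `_get_text` while keeping its `idx: int` to `str` signature could look like the sketch below. It is illustrative only: `TextOnlyDataset` is a simplified stand-in for the real class, and `TruncatedText`, `max_chars`, and the list-of-strings data layout are hypothetical names and assumptions, not part of the documented API.

```python
class TextOnlyDataset:
    """Hypothetical, simplified stand-in that returns the full text."""

    def __init__(self, texts):
        # Assumed layout: a plain list of raw strings.
        self.texts = texts

    def _get_text(self, idx: int) -> str:
        return self.texts[idx]


class TruncatedText(TextOnlyDataset):
    """Override that keeps the signature but shortens the returned text."""

    def __init__(self, texts, max_chars: int = 20):
        super().__init__(texts)
        self.max_chars = max_chars

    def _get_text(self, idx: int) -> str:
        # Reuse the parent logic, then apply the custom behaviour.
        return super()._get_text(idx)[: self.max_chars]


ds = TruncatedText(["A long municipal decision about parking"], max_chars=6)
short = ds._get_text(0)
```

Calling the parent implementation through `super()` keeps the custom step separate from the retrieval logic, so the subclass stays valid if the way text is stored changes.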