neuralmonkey.dataset.dataset module
Implementation of the dataset class.
-
class neuralmonkey.dataset.dataset.Dataset(name: str, series: Dict[str, List], series_outputs: Dict[str, str], preprocessors: List[Tuple[str, str, Callable]] = None) → None
Bases: collections.abc.Sized
Base Dataset class.
This class serves as a collection of data series for particular encoders and decoders in the model. If it is not given a parent dataset, it also manages the vocabularies inferred from the data.
A data series is either a list of strings or a numpy array.
-
__init__(name: str, series: Dict[str, List], series_outputs: Dict[str, str], preprocessors: List[Tuple[str, str, Callable]] = None) → None
Create a dataset from the provided series of data.
Parameters:
- name – The name for the dataset
- series – Dictionary from the series name to the actual data.
- series_outputs – Output files for target series.
- preprocessors – The definition of the preprocessors.
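The shapes of the constructor arguments can be illustrated with plain Python data. This is only a sketch: it does not import Neural Monkey, the `lowercase` function, the series names, and the output path are made up for the example, and the reading of each preprocessor tuple as (input series, output series, function) is an assumption based on the type annotation, not something this page states.

```python
from typing import Callable, Dict, List, Tuple


def lowercase(sentence: str) -> str:
    """Hypothetical preprocessor: lowercase one item of a series."""
    return sentence.lower()


# series: series name -> actual data (here, lists of strings).
series: Dict[str, List] = {
    "source": ["Hello World", "Good Morning"],
    "target": ["Hallo Welt", "Guten Morgen"],
}

# series_outputs: target series name -> output file (hypothetical path).
series_outputs: Dict[str, str] = {"target": "out/target.txt"}

# preprocessors: ASSUMED layout (input series, output series, function).
preprocessors: List[Tuple[str, str, Callable]] = [
    ("source", "source_lc", lowercase),
]

# Manual illustration of what applying such a definition could amount to:
# each preprocessor derives a new series from an existing one.
for src_id, tgt_id, function in preprocessors:
    series[tgt_id] = [function(item) for item in series[src_id]]
```

Under that assumption, the derived series ends up alongside the original ones, so later lookups by name would treat it like any other series.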
-
add_series(name: str, series: List[Any]) → None
-
batch_dataset(batch_size: int) → Iterable[Dataset]
Split the dataset into batched datasets.
Parameters: batch_size – The size of a batch.
Returns: Generator yielding batched datasets.
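The idea of batching a whole dataset can be sketched with plain dictionaries standing in for `Dataset` objects. The helper name `batch_series_dict` is hypothetical, and the behaviour shown (the same slice taken from every series, with a shorter final batch) is an assumption, not a description of the library's implementation.

```python
from typing import Dict, Iterator, List


def batch_series_dict(series: Dict[str, List],
                      batch_size: int) -> Iterator[Dict[str, List]]:
    """Yield dictionaries holding the same slice of every series."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # All series are assumed to have the same length.
    length = len(next(iter(series.values()))) if series else 0
    for start in range(0, length, batch_size):
        yield {name: data[start:start + batch_size]
               for name, data in series.items()}


batches = list(batch_series_dict(
    {"source": ["a", "b", "c"], "target": ["x", "y", "z"]}, 2))
```

Because the function is a generator, batches are produced lazily; wrapping the call in `list` as above is only needed when all batches are wanted at once.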
-
batch_serie(serie_name: str, batch_size: int) → Iterable[Iterable]
Split a data series into batches.
Parameters:
- serie_name – The name of the series
- batch_size – The size of a batch
Returns: Generator yielding batches of the data from the series.
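Batching a single series with a generator might look like the following sketch. The function `batch_one_series` is a hypothetical stand-in that takes the data directly instead of a series name; emitting a shorter final batch is an assumption.

```python
from typing import Iterable, Iterator, List


def batch_one_series(data: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    buffer: List = []
    for item in data:
        buffer.append(item)
        if len(buffer) == batch_size:
            yield buffer
            buffer = []
    # Emit the remaining items, if any, as a shorter last batch.
    if buffer:
        yield buffer


series_batches = list(batch_one_series(iter(range(5)), 2))
```

Buffering items one at a time means the input does not need to support slicing or `len`, so the sketch also works for lazily generated series.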
-
get_series(name: str) → Iterable
Get the data series with a given name.
Parameters: name – The name of the series to fetch.
Returns: The data series.
Raises: KeyError if the series does not exist.
-
has_series(name: str) → bool
Check if the dataset contains a series of a given name.
Parameters: name – Series name
Returns: True if the dataset contains the series, False otherwise.
-
maybe_get_series(name: str) → Union[Iterable, NoneType]
Get the data series with a given name.
Parameters: name – The name of the series to fetch.
Returns: The data series or None if it does not exist.
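The difference between the three lookup methods can be shown with a small stand-in class. `TinySeriesStore` is hypothetical and backed by a plain dictionary; it only mirrors the documented contracts (a boolean check, a KeyError on a missing series, None on a missing series) and is not the library's code.

```python
from typing import Dict, Iterable, List, Optional


class TinySeriesStore:
    """Hypothetical stand-in mirroring the documented lookup contracts."""

    def __init__(self, series: Dict[str, List]) -> None:
        self._series = dict(series)

    def has_series(self, name: str) -> bool:
        return name in self._series

    def get_series(self, name: str) -> Iterable:
        # A plain dictionary lookup raises KeyError for a missing name.
        return self._series[name]

    def maybe_get_series(self, name: str) -> Optional[Iterable]:
        # dict.get returns None instead of raising.
        return self._series.get(name)


store = TinySeriesStore({"source": ["a", "b"]})
present = store.get_series("source")
missing = store.maybe_get_series("target")
```

The None-returning variant suits optional series, such as a target series that is absent at inference time, while the raising variant makes a missing required series fail early.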
-
series_ids
-
shuffle() → None
Shuffle the dataset randomly.
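When a dataset holds several parallel series, a shuffle is only useful if corresponding items stay together. The sketch below assumes that this is the intent and applies one shared permutation to every series; the helper `shuffle_aligned` and its `seed` parameter are hypothetical and not part of the documented API.

```python
import random
from typing import Dict, List, Optional


def shuffle_aligned(series: Dict[str, List],
                    seed: Optional[int] = None) -> None:
    """Reorder every series in the dict with one shared permutation."""
    if not series:
        return
    # All series are assumed to have the same length.
    length = len(next(iter(series.values())))
    order = list(range(length))
    random.Random(seed).shuffle(order)
    for name, data in list(series.items()):
        series[name] = [data[i] for i in order]


data = {"source": ["a", "b", "c", "d"], "target": ["A", "B", "C", "D"]}
shuffle_aligned(data, seed=0)
```

After the call, item i of "source" still corresponds to item i of "target", whatever order the permutation produced.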
-
subset(start: int, length: int) → neuralmonkey.dataset.dataset.Dataset
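This page gives no description for `subset`. Judging only from the signature, it plausibly returns a new dataset covering `length` items beginning at index `start`. The sketch below illustrates that reading with plain dictionaries; the helper `subset_series` is hypothetical and the semantics are an assumption.

```python
from typing import Dict, List


def subset_series(series: Dict[str, List],
                  start: int, length: int) -> Dict[str, List]:
    """Return a new dict with items start .. start + length - 1 of each series."""
    if start < 0 or length < 0:
        raise ValueError("start and length must be non-negative")
    return {name: data[start:start + length]
            for name, data in series.items()}


full = {"source": ["a", "b", "c", "d"], "target": ["A", "B", "C", "D"]}
part = subset_series(full, 1, 2)
```

Slicing produces new lists, so the original series are left unchanged in this sketch.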
-