neuralmonkey.dataset.dataset module

Implementation of the dataset class.

class neuralmonkey.dataset.dataset.Dataset(name: str, series: Dict[str, List], series_outputs: Dict[str, str], preprocessors: List[Tuple[str, str, Callable]] = None) → None

Bases: collections.abc.Sized

Base Dataset class.

This class serves as a collection of data series for particular encoders and decoders in the model. If it is not given a parent dataset, it also manages the vocabularies inferred from the data.

A data series is either a list of strings or a numpy array.

__init__(name: str, series: Dict[str, List], series_outputs: Dict[str, str], preprocessors: List[Tuple[str, str, Callable]] = None) → None

Create a dataset from the provided series of data.

Parameters:
  • name – The name of the dataset.
  • series – Dictionary mapping each series name to the actual data.
  • series_outputs – Output files for target series.
  • preprocessors – The definition of the preprocessors.
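The shape of the `series` and `preprocessors` arguments can be pictured with a small stand-in. This is a minimal sketch, not Neural Monkey's implementation: `apply_preprocessors` is a hypothetical helper, and it assumes each preprocessor tuple has the form (source series name, new series name, function), which the signature alone does not confirm.

```python
from typing import Callable, Dict, List, Tuple


def apply_preprocessors(
        series: Dict[str, List],
        preprocessors: List[Tuple[str, str, Callable]]) -> Dict[str, List]:
    """Hypothetical helper: derive new series from existing ones.

    Assumes each tuple is (source name, new name, function); this layout
    is an assumption, not something the documentation states.
    """
    result = dict(series)  # shallow copy so the input dict is not modified
    for source_name, new_name, function in preprocessors:
        result[new_name] = [function(item) for item in result[source_name]]
    return result


# A data series is a list of strings (or a numpy array), keyed by name.
series = {"source": ["Hello world", "Good morning"]}
preprocessors = [("source", "source_lower", str.lower)]
processed = apply_preprocessors(series, preprocessors)
```

Under these assumptions, `processed` holds both the original `source` series and a derived `source_lower` series, while the input dictionary is left unchanged.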
add_series(name: str, series: List[Any]) → None
batch_dataset(batch_size: int) → Iterable[Dataset]

Split the dataset into batched datasets.

Parameters: batch_size – The size of a batch.
Returns: Generator yielding batched datasets.
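The batching behaviour described above can be sketched on a plain dictionary of series. `batch_series_dict` is a hypothetical stand-in, not the library's method; it assumes that all series have the same length and that the final batch may be shorter than `batch_size`, neither of which the documentation states.

```python
from typing import Dict, Iterator, List


def batch_series_dict(series: Dict[str, List],
                      batch_size: int) -> Iterator[Dict[str, List]]:
    """Hypothetical stand-in: yield aligned slices of every series."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Assumption: every series has the same number of items.
    length = len(next(iter(series.values()), []))
    for start in range(0, length, batch_size):
        yield {name: data[start:start + batch_size]
               for name, data in series.items()}


data = {"source": ["a", "b", "c", "d", "e"],
        "target": ["A", "B", "C", "D", "E"]}
batches = list(batch_series_dict(data, batch_size=2))
```

Because the function is a generator, batches are produced lazily; wrapping the call in `list` is only needed when all batches are wanted at once.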
batch_serie(serie_name: str, batch_size: int) → Iterable[Iterable]

Split a data series into batches.

Parameters:
  • serie_name – The name of the series.
  • batch_size – The size of a batch.
Returns: Generator yielding batches of the data from the series.

get_series(name: str) → Iterable

Get the data series with a given name.

Parameters: name – The name of the series to fetch.
Returns: The data series.
Raises: KeyError if the series does not exist.
has_series(name: str) → bool

Check if the dataset contains a series of a given name.

Parameters: name – Series name.
Returns: True if the dataset contains the series, False otherwise.
maybe_get_series(name: str) → Union[Iterable, NoneType]

Get the data series with a given name, if it exists.

Parameters: name – The name of the series to fetch.
Returns: The data series or None if it does not exist.
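The difference between the three lookup methods can be illustrated with a toy class. `TinyDataset` is a hypothetical stand-in that mirrors only the documented behaviour (KeyError versus None for a missing series); it is not the library's class, and the `_series` attribute is an assumption.

```python
from typing import Dict, Iterable, List, Optional


class TinyDataset:
    """Hypothetical stand-in that mirrors only the documented lookups."""

    def __init__(self, series: Dict[str, List]) -> None:
        self._series = series

    def has_series(self, name: str) -> bool:
        return name in self._series

    def get_series(self, name: str) -> Iterable:
        # Raises KeyError for an unknown name, as documented.
        return self._series[name]

    def maybe_get_series(self, name: str) -> Optional[Iterable]:
        # Returns None instead of raising for an unknown name.
        return self._series.get(name)


dataset = TinyDataset({"source": ["a", "b"]})
target = dataset.maybe_get_series("target")  # missing series, no exception
```

With this split, `maybe_get_series` suits optional series, while `get_series` makes a missing required series fail immediately.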
series_ids
shuffle() → None

Shuffle the dataset randomly.
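One way to shuffle several parallel series while keeping them aligned is to apply a single permutation to all of them. The sketch below assumes that this is the intent; the documentation does not say how `shuffle` is implemented. `shuffle_in_unison` and its `seed` parameter are hypothetical, and unlike the documented method, which returns None, this helper returns a new dictionary.

```python
import random
from typing import Dict, List


def shuffle_in_unison(series: Dict[str, List],
                      seed: int = 0) -> Dict[str, List]:
    """Hypothetical helper: apply one random permutation to every series."""
    # Assumption: every series has the same number of items.
    length = len(next(iter(series.values()), []))
    order = list(range(length))
    random.Random(seed).shuffle(order)  # one permutation shared by all series
    return {name: [data[i] for i in order]
            for name, data in series.items()}


data = {"source": ["a", "b", "c", "d"], "target": ["A", "B", "C", "D"]}
shuffled = shuffle_in_unison(data)
```

Since the same index order is used for every series, item i of `source` still corresponds to item i of `target` after shuffling.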

subset(start: int, length: int) → neuralmonkey.dataset.dataset.Dataset