neuralmonkey.dataset.lazy_dataset module

A lazy dataset that does not load all of the data into memory.

class neuralmonkey.dataset.lazy_dataset.LazyDataset(name: str, series_paths_and_readers: Dict[str, Tuple[List[str], Callable[[List[str]], Any]]], series_outputs: Dict[str, str], preprocessors: List[Tuple[str, str, Callable]] = None) → None

Bases: neuralmonkey.dataset.dataset.Dataset

Implements the lazy dataset.

The main difference between this implementation and the default one is that the contents of the files are not fully loaded into memory. Instead, every time the function get_series is called, a new file handle is created and a generator which yields lines from the file is returned.
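The behaviour described above can be illustrated with a minimal, stand-alone sketch. It is not the Neural Monkey implementation: the class TinyLazyDataset, its constructor argument, and the helper demo are hypothetical names introduced here, and readers and preprocessors are left out on purpose.

```python
import os
import tempfile
from typing import Dict, Iterator, List


class TinyLazyDataset:
    """Hypothetical stand-in for illustration; not the Neural Monkey class."""

    def __init__(self, series_paths: Dict[str, str]) -> None:
        # Only the paths are stored; no file content is read here.
        self.series_paths = series_paths

    def get_series(self, name: str) -> Iterator[str]:
        # The lookup happens at call time, so an unknown name raises KeyError.
        path = self.series_paths[name]
        return self._read_lines(path)

    @staticmethod
    def _read_lines(path: str) -> Iterator[str]:
        # A new file handle is opened for every generator that is consumed.
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")


def demo() -> List[List[str]]:
    """Read the same series twice to show that each call starts over."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "source.txt")
        with open(path, "w", encoding="utf-8") as out:
            out.write("first line\nsecond line\n")
        dataset = TinyLazyDataset({"source": path})
        first_pass = list(dataset.get_series("source"))
        second_pass = list(dataset.get_series("source"))
        return [first_pass, second_pass]
```

Because each call returns a fresh generator, the series can be iterated repeatedly without keeping its contents in memory.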

__init__(name: str, series_paths_and_readers: Dict[str, Tuple[List[str], Callable[[List[str]], Any]]], series_outputs: Dict[str, str], preprocessors: List[Tuple[str, str, Callable]] = None) → None

Create a new instance of the lazy dataset.

The lazy dataset first initializes the parent Dataset object with empty series.

Parameters:
  • name – The name of the dataset
  • series_paths_and_readers – Dictionary that maps each series ID to a list of files and a reader.
  • series_outputs – A mapping of series IDs to output files.
  • preprocessors – The preprocessors to apply to the data series. Each preprocessor is defined by source series ID, resulting preprocessed series ID, and a function that is applied on the source series.
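The following sketch shows values that match the types in the constructor signature. The reader whitespace_reader, the preprocessor lowercase_tokens, the helper build_arguments, and all series IDs and file names are hypothetical. Whether a preprocessor function receives one item or the whole series is not stated here, so the per-item form is an assumption.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple


def whitespace_reader(files: List[str]) -> Iterator[List[str]]:
    """Hypothetical reader: yield each line of each file as a token list."""
    for path in files:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


def lowercase_tokens(tokens: List[str]) -> List[str]:
    """Hypothetical preprocessor; per-item application is an assumption."""
    return [token.lower() for token in tokens]


def build_arguments(source_file: str, output_file: str) -> Dict[str, Any]:
    """Assemble keyword arguments shaped like the documented parameters."""
    # Series ID -> (list of files, reader callable).
    series_paths_and_readers: Dict[
        str, Tuple[List[str], Callable[[List[str]], Any]]
    ] = {"source": ([source_file], whitespace_reader)}
    # Series ID -> output file.
    series_outputs: Dict[str, str] = {"target": output_file}
    # (source series ID, preprocessed series ID, function).
    preprocessors: List[Tuple[str, str, Callable]] = [
        ("source", "source_lower", lowercase_tokens)
    ]
    return {
        "name": "example",
        "series_paths_and_readers": series_paths_and_readers,
        "series_outputs": series_outputs,
        "preprocessors": preprocessors,
    }
```

With Neural Monkey installed, the resulting dictionary could presumably be passed as LazyDataset(**build_arguments("train.txt", "out.txt")); this call is not verified here.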
add_lazy_series(name: str) → None
add_series(name: str, series: List) → None
get_series(name: str) → Iterable

Get the data series with a given name.

This function opens a new file handle and returns a generator which yields preprocessed lines from the file.

Parameters: name – The name of the series to fetch.
Returns: The data series.
Raises: KeyError if the series does not exist.
has_series(name: str) → bool

Check if the dataset contains a series of a given name.

Parameters: name – Series name
Returns: True if the dataset contains the series, False otherwise.
maybe_get_series(name: str) → Union[Iterable, NoneType]

Get the data series with a given name or None if it does not exist.

This function opens a new file handle and returns a generator which yields preprocessed lines from the file.

Parameters: name – The name of the series to fetch.
Returns: The data series or None if it does not exist.
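The contrast between get_series, which raises KeyError for an unknown series, and maybe_get_series, which returns None, can be sketched with a small stand-in. SeriesLookup and count_items are hypothetical names, and the stand-in keeps its data in a plain dictionary rather than in files, so it shows only the lookup contract and not the lazy reading.

```python
from typing import Dict, Iterable, List, Optional


class SeriesLookup:
    """Hypothetical stand-in that mirrors the documented lookup methods."""

    def __init__(self, series: Dict[str, List[str]]) -> None:
        self._series = series

    def has_series(self, name: str) -> bool:
        return name in self._series

    def get_series(self, name: str) -> Iterable[str]:
        # Strict lookup: an unknown series is an error.
        if not self.has_series(name):
            raise KeyError(name)
        return iter(self._series[name])

    def maybe_get_series(self, name: str) -> Optional[Iterable[str]]:
        # Lenient lookup: an unknown series yields None.
        if not self.has_series(name):
            return None
        return self.get_series(name)


def count_items(dataset: SeriesLookup, name: str) -> int:
    """Count the items of an optional series, treating a missing one as empty."""
    series = dataset.maybe_get_series(name)
    if series is None:
        return 0
    return sum(1 for _ in series)
```

The lenient form suits optional series, such as a target series that is absent at inference time, while the strict form fails early when a required series is missing.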
series_ids
shuffle() → None

Do nothing, because shuffling is impossible when the data is not held in memory.

TODO: this is related to the __len__ method.

subset(start: int, length: int) → neuralmonkey.dataset.dataset.Dataset