neuralmonkey.dataset.lazy_dataset module

Lazy dataset which does not load the whole data into memory.

class neuralmonkey.dataset.lazy_dataset.LazyDataset(name: str, series_paths_and_readers: Dict[str, Tuple[List[str], Callable[[List[str]], Any]]], series_outputs: Dict[str, str], preprocessors: List[Tuple[str, str, Callable]] = None) → None

Bases: neuralmonkey.dataset.dataset.Dataset

Implements the lazy dataset.

The main difference between this implementation and the default one is that the contents of the file are not fully loaded into memory. Instead, every time the function get_series is called, a new file handle is created and a generator which yields lines from the file is returned.
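
The lazy-loading idea described above can be illustrated with a minimal sketch in plain Python. This is not the neuralmonkey implementation; the helper name lazy_lines and the temporary file are assumptions made only for the example.

```python
import os
import tempfile
from typing import Iterator


def lazy_lines(path: str) -> Iterator[str]:
    """Yield lines from `path` one at a time (hypothetical helper)."""
    # The file is opened only when iteration starts, so nothing is read
    # until the caller asks for the first line.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Write a small temporary file to read from.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\nsecond line\n")
    demo_path = tmp.name

# Each call builds a fresh generator with its own file handle,
# so the data can be iterated again after one pass is exhausted.
first_pass = list(lazy_lines(demo_path))
second_pass = list(lazy_lines(demo_path))

os.remove(demo_path)
```

Because a generator can be consumed only once, requesting the series again is the way to iterate over the data a second time.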

__init__(name: str, series_paths_and_readers: Dict[str, Tuple[List[str], Callable[[List[str]], Any]]], series_outputs: Dict[str, str], preprocessors: List[Tuple[str, str, Callable]] = None) → None

Create a new instance of the lazy dataset.

Parameters:
  • name – The name of the dataset.
  • series_paths_and_readers – The mapping of series name to its file.
  • series_outputs – Dictionary mapping series names to their output file.
  • preprocessors – The preprocessors to apply to the read lines.
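
As a rough sketch, arguments matching the type annotations in the signature could be built as follows. The reader function plain_text_reader, the file paths, and the choice to yield token lists are all assumptions for illustration; the sketch does not call LazyDataset itself.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple


def plain_text_reader(paths: List[str]) -> Iterator[List[str]]:
    """Hypothetical reader: yield each line of each file as a token list."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


# Series name -> (list of input files, reader applied to that list).
# The paths below are placeholders, not files shipped with the library.
series_paths_and_readers: Dict[
    str, Tuple[List[str], Callable[[List[str]], Any]]
] = {
    "source": (["data/train.en"], plain_text_reader),
    "target": (["data/train.de"], plain_text_reader),
}

# Series name -> output file path (also a placeholder).
series_outputs: Dict[str, str] = {"target": "out/train.de.hyp"}
```

Since the reader is a generator function, none of the files are opened until iteration over the reader's result begins.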
add_series(name: str, series: Iterable[Any]) → None
get_series(name: str) → Iterable

Get the data series with a given name.

This function opens a new file handle and returns a generator which yields preprocessed lines from the file.

Parameters:name – The name of the series to fetch.
Returns:The data series.
Raises:KeyError if the series does not exist.
has_series(name: str) → bool

Check if the dataset contains a series of a given name.

Parameters:name – Series name
Returns:True if the dataset contains the series, False otherwise.
maybe_get_series(name: str) → Union[Iterable, NoneType]

Get the data series with a given name or None if it does not exist.

This function opens a new file handle and returns a generator which yields preprocessed lines from the file.

Parameters:name – The name of the series to fetch.
Returns:The data series or None if it does not exist.
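
The difference between get_series and maybe_get_series for a missing series can be sketched with a toy stand-in class. ToyLazySeries is hypothetical and only mirrors the documented behaviour (KeyError versus None); it is not the library's code.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional


class ToyLazySeries:
    """Hypothetical stand-in, not the neuralmonkey implementation."""

    def __init__(
        self, factories: Dict[str, Callable[[], Iterator[str]]]
    ) -> None:
        # Each factory builds a fresh iterator when called.
        self._factories = factories

    def has_series(self, name: str) -> bool:
        return name in self._factories

    def get_series(self, name: str) -> Iterable[str]:
        # The dictionary lookup raises KeyError for an unknown series.
        return self._factories[name]()

    def maybe_get_series(self, name: str) -> Optional[Iterable[str]]:
        if not self.has_series(name):
            return None
        return self.get_series(name)


toy = ToyLazySeries({"source": lambda: iter(["a b", "c d"])})

present = list(toy.get_series("source"))
missing = toy.maybe_get_series("target")

try:
    toy.get_series("target")
    raised = False
except KeyError:
    raised = True
```

Callers that treat a series as optional can use maybe_get_series and test for None instead of catching the exception.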
series_ids
shuffle() → None

Do nothing; shuffling a dataset that is not held in memory is impossible.

TODO: this is related to the __len__ method.

subset(start: int, length: int) → neuralmonkey.dataset.dataset.Dataset