neuralmonkey.dataset.lazy_dataset module
Lazy dataset which does not load the whole data into memory.
-
class
neuralmonkey.dataset.lazy_dataset.
LazyDataset
(name: str, series_paths_and_readers: Dict[str, Tuple[List[str], Callable[[List[str]], Any]]], series_outputs: Dict[str, str], preprocessors: List[Tuple[str, str, Callable]] = None) → None
Bases: neuralmonkey.dataset.dataset.Dataset
Implements the lazy dataset.
The main difference between this implementation and the default one is that the contents of the file are not fully loaded into memory. Instead, every time the function
get_series
is called, a new file handle is created and a generator which yields lines from the file is returned.
-
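The lazy behaviour described above can be illustrated with a minimal sketch. This is not the Neural Monkey implementation: `LazyLinesSketch` and `demo` are hypothetical names, and the sketch assumes one plain-text file per series. It only shows that each call opens a fresh file handle and returns a generator, so the series can be iterated repeatedly without being held in memory.

```python
import os
import tempfile
from typing import Dict, Iterator, List, Tuple


class LazyLinesSketch:
    """Toy stand-in (hypothetical): remembers file paths, never file contents."""

    def __init__(self, paths: Dict[str, str]) -> None:
        self._paths = paths

    def get_series(self, name: str) -> Iterator[str]:
        # Unknown series names raise KeyError here, before any file is opened.
        path = self._paths[name]
        return self._read_lines(path)

    @staticmethod
    def _read_lines(path: str) -> Iterator[str]:
        # A generator function: the file is opened when iteration starts,
        # and every call creates its own independent file handle.
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")


def demo() -> Tuple[List[str], List[str]]:
    """Read the same series twice to show that each call starts from the top."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "source.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("first line\nsecond line\n")

        data = LazyLinesSketch({"source": path})
        first_pass = list(data.get_series("source"))
        second_pass = list(data.get_series("source"))
    return first_pass, second_pass
```

Because a generator is exhausted after one pass, returning a new one on every call is what makes repeated epochs over the same file possible.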
__init__
(name: str, series_paths_and_readers: Dict[str, Tuple[List[str], Callable[[List[str]], Any]]], series_outputs: Dict[str, str], preprocessors: List[Tuple[str, str, Callable]] = None) → None
Create a new instance of the lazy dataset.
Parameters:
- name – The name of the dataset.
- series_paths_and_readers – The mapping of a series name to its files and the reader used to load them.
- series_outputs – Dictionary mapping series names to their output file.
- preprocessors – The preprocessors to apply to the read lines.
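As a hedged illustration of the argument shapes in the signature, the sketch below builds values that match the documented types without calling the library. The reader `plain_text_reader`, the helper `build_arguments`, the series names and the file names are all hypothetical. The `preprocessors` argument is left out because this page does not describe the layout of its tuples.

```python
import os
import tempfile
from typing import Any, Callable, Dict, Iterator, List, Tuple

# Type of the series_paths_and_readers argument, copied from the signature.
PathsAndReaders = Dict[str, Tuple[List[str], Callable[[List[str]], Any]]]


def plain_text_reader(paths: List[str]) -> Iterator[List[str]]:
    """Hypothetical reader: yield whitespace-separated tokens, line by line."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


def build_arguments(source_path: str,
                    output_path: str) -> Tuple[PathsAndReaders, Dict[str, str]]:
    """Build values shaped like the two dictionary arguments."""
    # Series name -> (list of files, reader called with that list of files).
    series_paths_and_readers: PathsAndReaders = {
        "source": ([source_path], plain_text_reader),
    }
    # Series name -> output file (assumed here to be a plain path string).
    series_outputs = {"target": output_path}
    return series_paths_and_readers, series_outputs


def demo() -> List[List[str]]:
    """Write a small file and read it back through the mapping."""
    with tempfile.TemporaryDirectory() as tmp:
        source_path = os.path.join(tmp, "train.en")
        with open(source_path, "w", encoding="utf-8") as handle:
            handle.write("hello world\ngood morning\n")

        paths_and_readers, _ = build_arguments(
            source_path, os.path.join(tmp, "out.txt"))
        paths, reader = paths_and_readers["source"]
        return list(reader(paths))
```

Pairing each file list with its reader lets different series use different file formats while the dataset itself stays format-agnostic.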
-
add_series
(name: str, series: Iterable[Any]) → None
-
get_series
(name: str) → Iterable
Get the data series with a given name.
This function opens a new file handle and returns a generator which yields preprocessed lines from the file.
Parameters: name – The name of the series to fetch.
Returns: The data series.
Raises: KeyError if the series does not exist.
-
has_series
(name: str) → bool
Check if the dataset contains a series of a given name.
Parameters: name – Series name.
Returns: True if the dataset contains the series, False otherwise.
-
maybe_get_series
(name: str) → Union[Iterable, NoneType]
Get the data series with a given name or None if it does not exist.
This function opens a new file handle and returns a generator which yields preprocessed lines from the file.
Parameters: name – The name of the series to fetch.
Returns: The data series or None if it does not exist.
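The difference between the three lookup methods can be sketched as follows. `SeriesLookupSketch` is a hypothetical in-memory stand-in, not the library class. It only mirrors the documented contracts: `get_series` raises `KeyError` for a missing series, `maybe_get_series` returns `None`, and `has_series` returns a boolean.

```python
from typing import Dict, Iterable, List, Optional


class SeriesLookupSketch:
    """Toy stand-in (hypothetical) that mirrors the documented lookup contracts."""

    def __init__(self, series: Dict[str, List[str]]) -> None:
        self._series = series

    def has_series(self, name: str) -> bool:
        return name in self._series

    def get_series(self, name: str) -> Iterable[str]:
        # Indexing the dictionary raises KeyError for an unknown name.
        return iter(self._series[name])

    def maybe_get_series(self, name: str) -> Optional[Iterable[str]]:
        # Same lookup, but a missing series is reported as None.
        if not self.has_series(name):
            return None
        return self.get_series(name)


def describe(data: SeriesLookupSketch, name: str) -> str:
    """Show the caller-side handling each method implies."""
    series = data.maybe_get_series(name)
    if series is None:
        return "missing"
    return " ".join(series)
```

A caller that treats a series as optional (for example, target sentences at inference time) would use `maybe_get_series`, while `get_series` suits series that must be present.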
-
series_ids
-
shuffle
() → None
Do nothing, because a dataset that is not held in memory cannot be shuffled.
TODO: this is related to the
__len__
method.
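A small sketch of why shuffling does not apply to lazily generated data, using only the standard library. `lazy_numbers` and `can_shuffle_in_place` are hypothetical helpers. The link to the TODO above is an assumption: an in-place shuffle needs the length of the data and random access to it, and a generator offers neither.

```python
import random
from typing import Any, Iterator


def lazy_numbers() -> Iterator[int]:
    """Stand-in for a lazily read series: values are produced on demand."""
    yield from range(5)


def can_shuffle_in_place(data: Any) -> bool:
    """Return True if random.shuffle accepts the object, False otherwise."""
    try:
        # random.shuffle needs len() and item assignment on its argument.
        random.shuffle(data)
    except TypeError:
        return False
    return True
```

Materialising the generator with `list(...)` makes shuffling possible again, but only by loading the whole series into memory, which is what the lazy dataset is meant to avoid.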
-
subset
(start: int, length: int) → neuralmonkey.dataset.dataset.Dataset
-