neuralmonkey.dataset module

Implementation of the dataset class.

class neuralmonkey.dataset.BatchingScheme(batch_size: int, batch_bucket_span: int = None, token_level_batching: bool = False, bucketing_ignore_series: List[str] = None, use_leftover_buckets: bool = True) → None

Bases: object

__init__(batch_size: int, batch_bucket_span: int = None, token_level_batching: bool = False, bucketing_ignore_series: List[str] = None, use_leftover_buckets: bool = True) → None

Construct the batching scheme.

batch_size

Number of examples in one mini-batch.

batch_bucket_span

The span of the bucket for bucketed batching.

token_level_batching

Count the batch_size in tokens in the batch rather than in examples.

bucketing_ignore_series

Series to ignore during bucketing.

use_leftover_buckets

Whether to use the leftover bucket contents at the end of the epoch instead of discarding them.
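
For illustration, a batching scheme that buckets examples by length and counts the batch size in tokens might be constructed as follows; the concrete values are illustrative, not recommended defaults, and the ignored series name is hypothetical:

    from neuralmonkey.dataset import BatchingScheme

    # Count the batch size in tokens, bucket examples into length spans of 10,
    # and keep using leftover bucket contents at the end of the epoch.
    scheme = BatchingScheme(
        batch_size=2048,
        batch_bucket_span=10,
        token_level_batching=True,
        bucketing_ignore_series=["source_ids"],
        use_leftover_buckets=True)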

class neuralmonkey.dataset.Dataset(name: str, iterators: Dict[str, Callable[[], Iterator]], outputs: Dict[str, Tuple[str, Callable[[str, Any], NoneType]]] = None, buffer_size: Tuple[int, int] = None, shuffled: bool = False) → None

Bases: object

Buffered and batched dataset.

This class serves as collection of data series for particular encoders and decoders in the model.

A dataset has a number of data series, i.e. sequences of data items (of any type) that should all have the same length. The series are kept in a buffer and can be loaded lazily.

Using the batches method, the dataset yields batches through which the model accesses the data.

__init__(name: str, iterators: Dict[str, Callable[[], Iterator]], outputs: Dict[str, Tuple[str, Callable[[str, Any], NoneType]]] = None, buffer_size: Tuple[int, int] = None, shuffled: bool = False) → None

Construct a new instance of the dataset class.

Do not call this method from the configuration directly. Instead, use the from_files function of this module.

The dataset iterators are provided through factory functions, which return the opened iterators when called with no arguments.

Parameters:
  • name – The name for the dataset.
  • iterators – A mapping from series names to iterator factory functions.
  • lazy – If False, load the data from iterators to a list and store the list in memory.
  • buffer_size – Use this tuple as a minimum and maximum buffer size for pre-loading data. This should be (a few times) larger than the batch size used for mini-batching. When the buffer size gets under the lower threshold, it is refilled with the new data and optionally reshuffled. If the buffer size is None, all data is loaded into memory.
  • shuffled – Whether to shuffle the buffer during batching.
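
Although configurations normally go through the from_files or load functions, a direct construction from in-memory data may help to illustrate the iterator factories; the series names and sentences below are hypothetical:

    from neuralmonkey.dataset import Dataset

    source_sentences = [["hello", "world"], ["good", "morning"]]
    target_sentences = [["ahoj", "světe"], ["dobré", "ráno"]]

    # Each series is supplied as a zero-argument factory returning a fresh iterator.
    dataset = Dataset(
        name="toy_data",
        iterators={
            "source": lambda: iter(source_sentences),
            "target": lambda: iter(target_sentences)},
        shuffled=False)
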
batches(scheme: neuralmonkey.dataset.BatchingScheme) → Iterator[Dataset]

Split the dataset into batches.

Parameters: scheme – The BatchingScheme configuration object.
Returns: Generator yielding the batches.
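
Continuing the sketch above, iterating over mini-batches could look like this; the batch size is illustrative:

    from neuralmonkey.dataset import BatchingScheme

    scheme = BatchingScheme(batch_size=2)

    # Each yielded batch is itself a Dataset restricted to the examples of that batch.
    for batch in dataset.batches(scheme):
        for sentence in batch.get_series("source"):
            print(sentence)
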
get_series(name: str) → Iterator

Get the data series with a given name.

Parameters: name – The name of the series to fetch.
Returns: A freshly initialized iterator over the data series.
Raises: KeyError if the series does not exist.
has_series(name: str) → bool

Check if the dataset contains a series of a given name.

Parameters: name – Series name.
Returns: True if the dataset contains the series, False otherwise.
maybe_get_series(name: str) → Union[Iterator, NoneType]

Get the data series with a given name, if it exists.

Parameters: name – The name of the series to fetch.
Returns: The data series or None if it does not exist.
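
A short sketch of the three series accessors, again using the hypothetical toy dataset from above:

    if dataset.has_series("source"):
        # get_series raises KeyError for unknown series names.
        for sentence in dataset.get_series("source"):
            print(sentence)

    # maybe_get_series returns None instead of raising for a missing series.
    alignments = dataset.maybe_get_series("alignment")
    if alignments is None:
        print("no alignment series in this dataset")
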
series
subset(start: int, length: int) → neuralmonkey.dataset.Dataset

Create a subset of the dataset.

The sub-dataset will inherit the laziness, buffer size, and shuffling settings from the parent dataset.

Parameters:
  • start – Index of the first data instance in the dataset.
  • length – Number of instances to include in the subset.
Returns:

A subset Dataset object.
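
Continuing the toy example, a sub-dataset with only the first instance could be created as follows; with real data the length would typically be much larger:

    # Take one instance starting at index 0; laziness, buffer size and shuffling
    # are inherited from the parent dataset.
    first_example = dataset.subset(start=0, length=1)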

neuralmonkey.dataset.from_files(*args, **kwargs)
neuralmonkey.dataset.load(name: str, series: List[str], data: List[Union[str, List[str], Tuple[Union[str, List[str]], Callable[[List[str]], Any]], Tuple[Callable, str], Callable[[Dataset], Iterator[DataType]]]], outputs: List[Union[Tuple[str, str], Tuple[str, str, Callable[[str, Any], NoneType]]]] = None, buffer_size: int = None, shuffled: bool = False) → neuralmonkey.dataset.Dataset

Create a dataset using specification from the configuration.

The dataset provides iterators over data series. The dataset has a buffer, which lazily pre-fetches a given number of data examples. In case the dataset is not lazy (the buffer size is None), the iterators are built on top of in-memory arrays. Otherwise, the iterators operate on the data sources directly.

Parameters:
  • name – The name of the dataset.
  • series – A list of names of data series the dataset contains.
  • data – The specification of the data sources for each series.
  • outputs – A list of output specifications.
  • buffer_size – The size of the buffer. If set, the dataset will be loaded lazily into the buffer (useful for large datasets). The buffer size specifies the number of sequences to pre-load. This is useful for pseudo-shuffling of large data on-the-fly. Ideally, this should be (much) larger than the batch size. Note that the buffer is refilled back up to the specified size whenever it drops below half of buffer_size.
  • shuffled – Whether to shuffle the dataset buffer (done upon refill).
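
A hedged sketch of calling load directly from Python; in practice the function is usually referenced from the configuration, and the series names and file paths below are hypothetical:

    from neuralmonkey.dataset import load

    train_data = load(
        name="train",
        series=["source", "target"],
        data=["data/train.src", "data/train.tgt"],
        buffer_size=10000,
        shuffled=True)
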
neuralmonkey.dataset.load_dataset_from_files(name: str, lazy: bool = False, preprocessors: List[Tuple[str, str, Callable]] = None, **kwargs) → neuralmonkey.dataset.Dataset

Load a dataset from the files specified by the provided arguments.

Paths to the data are provided in the form of a dictionary.

Keyword Arguments:
 
  • name – The name of the dataset to use. If None (default), the name will be inferred from the file names.
  • lazy – Boolean flag specifying whether to use lazy loading (useful for large files). Note that the lazy dataset cannot be shuffled. Defaults to False.
  • preprocessors – A list of preprocessor specifications, each a tuple of the source series name, the name of the resulting series, and a callable applied to the input sentences.
  • kwargs – Dataset keyword argument specs. These parameters should begin with the ‘s_’ prefix and may end with the ‘_out’ suffix. For example, a data series ‘source’ which contains the source sentences should be initialized with the ‘s_source’ parameter, which specifies the path and optionally the reader of the source file. If runners generate data of the ‘target’ series, the output file should be specified with the ‘s_target_out’ parameter. Series identifiers should not contain underscores. Dataset-level preprocessors are defined with the ‘pre_’ prefix followed by a new series name; for such a pre-processed series, the value is expected to be a callable taking the dataset and returning the new series (see the sketch after this entry).
Returns:

The newly created dataset.
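
A hedged sketch of the keyword-argument convention described above; the file paths and the lowercasing preprocessor are hypothetical:

    from neuralmonkey.dataset import load_dataset_from_files

    def lowercase_source(dataset):
        # Dataset-level preprocessor: derive a new series from the "source" series.
        return ([token.lower() for token in sentence]
                for sentence in dataset.get_series("source"))

    dataset = load_dataset_from_files(
        name="train",
        lazy=False,
        s_source="data/train.src",         # input file for the "source" series
        s_target="data/train.tgt",         # input file for the "target" series
        s_target_out="out/train.hyp",      # output file for the "target" series
        pre_sourcelower=lowercase_source)  # pre-processed series "sourcelower"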