neuralmonkey.dataset module

Implementation of the dataset class.

class neuralmonkey.dataset.BatchingScheme(batch_size: int, batch_bucket_span: int = None, token_level_batching: bool = False, bucketing_ignore_series: List[str] = None, use_leftover_buckets: bool = True) → None

Bases: object

__init__(batch_size: int, batch_bucket_span: int = None, token_level_batching: bool = False, bucketing_ignore_series: List[str] = None, use_leftover_buckets: bool = True) → None

Construct the batching scheme.

batch_size

Number of examples in one mini-batch.

batch_bucket_span

The span of the bucket for bucketed batching.

token_level_batching

Count the batch_size in tokens in the batch rather than in examples.

bucketing_ignore_series

Series to ignore during bucketing.

use_leftover_buckets

Whether to use the leftover bucket contents at the end of the epoch instead of discarding them.
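
For illustration, a batching scheme that buckets examples by length and counts the batch size in tokens might be constructed as follows; the concrete values are illustrative, not recommended defaults, and the ignored series name is hypothetical:

    from neuralmonkey.dataset import BatchingScheme

    # Count the batch size in tokens, bucket examples into length spans of 10,
    # and keep using leftover bucket contents at the end of the epoch.
    scheme = BatchingScheme(
        batch_size=2048,
        batch_bucket_span=10,
        token_level_batching=True,
        bucketing_ignore_series=["source_ids"],
        use_leftover_buckets=True)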

class neuralmonkey.dataset.Dataset(name: str, iterators: Dict[str, Callable[[], Iterator]], outputs: Dict[str, Tuple[str, Callable[[str, Any], NoneType]]] = None, buffer_size: Tuple[int, int] = None, shuffled: bool = False) → None

Bases: object

Buffered and batched dataset.

This class serves as collection of data series for particular encoders and decoders in the model.

A dataset has a number of data series, i.e. sequences of data items (of any type) that should all have the same length. The series are kept in a buffer and can be loaded lazily.

Using the batches method, the dataset yields batches through which the model accesses the data.

__init__(name: str, iterators: Dict[str, Callable[[], Iterator]], outputs: Dict[str, Tuple[str, Callable[[str, Any], NoneType]]] = None, buffer_size: Tuple[int, int] = None, shuffled: bool = False) → None

Construct a new instance of the dataset class.

Do not call this method from the configuration directly. Instead, use the from_files function of this module.

The dataset iterators are provided through factory functions, which return the opened iterators when called with no arguments.

Parameters:
  • name – The name for the dataset.
  • iterators – A mapping from series names to iterator factory functions.
  • lazy – If False, load the data from iterators to a list and store the list in memory.
  • buffer_size – Use this tuple as a minimum and maximum buffer size for pre-loading data. This should be (a few times) larger than the batch size used for mini-batching. When the buffer size gets under the lower threshold, it is refilled with the new data and optionally reshuffled. If the buffer size is None, all data is loaded into memory.
  • shuffled – Whether to shuffle the buffer during batching.
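
Although configurations normally go through the from_files or load functions, a direct construction from in-memory data may help to illustrate the iterator factories; the series names and sentences below are hypothetical:

    from neuralmonkey.dataset import Dataset

    source_sentences = [["hello", "world"], ["good", "morning"]]
    target_sentences = [["ahoj", "světe"], ["dobré", "ráno"]]

    # Each series is supplied as a zero-argument factory returning a fresh iterator.
    dataset = Dataset(
        name="toy_data",
        iterators={
            "source": lambda: iter(source_sentences),
            "target": lambda: iter(target_sentences)},
        shuffled=False)
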
batches(scheme: neuralmonkey.dataset.BatchingScheme) → Iterator[Dataset]

Split the dataset into batches.

Parameters: scheme – The BatchingScheme configuration object.
Returns: Generator yielding the batches.
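
Continuing the sketch above, iterating over mini-batches could look like this; the batch size is illustrative:

    from neuralmonkey.dataset import BatchingScheme

    scheme = BatchingScheme(batch_size=2)

    # Each yielded batch is itself a Dataset restricted to the examples of that batch.
    for batch in dataset.batches(scheme):
        for sentence in batch.get_series("source"):
            print(sentence)
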
get_series(name: str) → Iterator

Get the data series with a given name.

Parameters: name – The name of the series to fetch.
Returns: A freshly initialized iterator over the data series.
Raises: KeyError if the series does not exist.
has_series(name: str) → bool

Check if the dataset contains a series of a given name.

Parameters: name – Series name.
Returns: True if the dataset contains the series, False otherwise.
maybe_get_series(name: str) → Union[Iterator, NoneType]

Get the data series with a given name, if it exists.

Parameters: name – The name of the series to fetch.
Returns: The data series or None if it does not exist.
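
A short sketch of the three series accessors, again using the hypothetical toy dataset from above:

    if dataset.has_series("source"):
        # get_series raises KeyError for unknown series names.
        for sentence in dataset.get_series("source"):
            print(sentence)

    # maybe_get_series returns None instead of raising for a missing series.
    alignments = dataset.maybe_get_series("alignment")
    if alignments is None:
        print("no alignment series in this dataset")
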
series
subset(start: int, length: int) → neuralmonkey.dataset.Dataset

Create a subset of the dataset.

The sub-dataset will inherit the laziness, buffer size, and shuffling settings from the parent dataset.

Parameters:
  • start – Index of the first data instance in the dataset.
  • length – Number of instances to include in the subset.
Returns:

A subset Dataset object.
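
Continuing the toy example, a sub-dataset with only the first instance could be created as follows; with real data the length would typically be much larger:

    # Take one instance starting at index 0; laziness, buffer size and shuffling
    # are inherited from the parent dataset.
    first_example = dataset.subset(start=0, length=1)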

neuralmonkey.dataset.from_files(*args, **kwargs)
neuralmonkey.dataset.load(name: str, series: List[str], data: List[Union[str, List[str], Tuple[Union[str, List[str]], Callable[[List[str]], Any]], Tuple[Callable, str], Callable[[Dataset], Iterator[DataType]]]], outputs: List[Union[Tuple[str, str], Tuple[str, str, Callable[[str, Any], NoneType]]]] = None, buffer_size: int = None, shuffled: bool = False) → neuralmonkey.dataset.Dataset

Create a dataset using specification from the configuration.

The dataset provides iterators over data series. The dataset has a buffer, which lazily pre-fetches a given number of data examples. In case the dataset is not lazy (the buffer size is None), the iterators are built on top of in-memory arrays. Otherwise, the iterators operate on the data sources directly.

Parameters:
  • name – The name of the dataset.
  • series – A list of names of data series the dataset contains.
  • data – The specification of the data sources for each series.
  • outputs – A list of output specifications.
  • buffer_size – The size of the buffer. If set, the dataset will be loaded lazily into the buffer (useful for large datasets). The buffer size specifies the number of sequences to pre-load. This is useful for pseudo-shuffling of large data on-the-fly. Ideally, this should be (much) larger than the batch size. Note that the buffer is refilled back up to the specified size whenever it drops below half of buffer_size.
  • shuffled – Whether to shuffle the dataset buffer (done upon refill).
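
A hedged sketch of calling load directly from Python; in practice the function is usually referenced from the configuration, and the series names and file paths below are hypothetical:

    from neuralmonkey.dataset import load

    train_data = load(
        name="train",
        series=["source", "target"],
        data=["data/train.src", "data/train.tgt"],
        buffer_size=10000,
        shuffled=True)
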
neuralmonkey.dataset.load_dataset_from_files(name: str, lazy: bool = False, preprocessors: List[Tuple[str, str, Callable]] = None, **kwargs) → neuralmonkey.dataset.Dataset

Load a dataset from the files specified by the provided arguments.

Paths to the data are provided in the form of a dictionary.

Keyword Arguments:
 
  • name – The name of the dataset to use. If None (default), the name will be inferred from the file names.
  • lazy – Boolean flag specifying whether to use lazy loading (useful for large files). Note that the lazy dataset cannot be shuffled. Defaults to False.
  • preprocessors – A list of preprocessor specifications, each a tuple of the source series name, the name of the resulting series, and a callable applied to the input sentences.
  • kwargs – Dataset keyword argument specs. These parameters should begin with the ‘s_’ prefix and may end with the ‘_out’ suffix. For example, a data series ‘source’ which contains the source sentences should be initialized with the ‘s_source’ parameter, which specifies the path and optionally the reader of the source file. If runners generate data of the ‘target’ series, the output file should be specified with the ‘s_target_out’ parameter. Series identifiers should not contain underscores. Dataset-level preprocessors are defined with the ‘pre_’ prefix followed by a new series name; for such a pre-processed series, the value is expected to be a callable taking the dataset and returning the new series (see the sketch after this entry).
Returns:

The newly created dataset.
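
A hedged sketch of the keyword-argument convention described above; the file paths and the lowercasing preprocessor are hypothetical:

    from neuralmonkey.dataset import load_dataset_from_files

    def lowercase_source(dataset):
        # Dataset-level preprocessor: derive a new series from the "source" series.
        return ([token.lower() for token in sentence]
                for sentence in dataset.get_series("source"))

    dataset = load_dataset_from_files(
        name="train",
        lazy=False,
        s_source="data/train.src",         # input file for the "source" series
        s_target="data/train.tgt",         # input file for the "target" series
        s_target_out="out/train.hyp",      # output file for the "target" series
        pre_sourcelower=lowercase_source)  # pre-processed series "sourcelower"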