neuralmonkey.dataset.helpers module

Helper functions for building datasets.

neuralmonkey.dataset.helpers.from_files(name: str, lazy: bool = False, preprocessors: List[Tuple[str, str, Callable]] = None, **kwargs) → neuralmonkey.dataset.dataset.Dataset

Load a dataset from the files specified by the provided arguments.

Paths to the data are provided in the form of a dictionary.

Keyword Arguments:
 
  • name – The name of the dataset to use. If None (default), the name will be inferred from the file names.
  • lazy – Boolean flag specifying whether to use lazy loading (useful for large files). Note that the lazy dataset cannot be shuffled. Defaults to False.
  • preprocessors – A list of (str, str, callable) tuples used for preprocessing the input sentences.
  • kwargs – Dataset keyword argument specs. These parameters should begin with the ‘s_’ prefix and may end with the ‘_out’ suffix. For example, a data series ‘source’, which specifies the source sentences, should be initialized with the ‘s_source’ parameter, which specifies the path and optionally the reader of the source file. If runners generate data of the ‘target’ series, the output file should be initialized with the ‘s_target_out’ parameter. Series identifiers should not contain underscores. Dataset-level preprocessors are defined with the ‘pre_’ prefix followed by a new series name. For a pre-processed series, the value is expected to be a callable that takes the dataset and returns a new series.
Returns:

The newly created dataset.

Raises:

Exception when no input files are provided.
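
The keyword-argument naming convention described above can be illustrated with a small, self-contained sketch. It is not Neural Monkey's implementation: the helper name split_series_kwargs, the file paths, and the choice to reject unrecognized keys are assumptions made purely for illustration.

```python
from typing import Any, Dict, Tuple


def split_series_kwargs(
        **kwargs: Any) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: group kwargs by the documented naming scheme."""
    inputs: Dict[str, Any] = {}        # 's_<series>'      -> input spec
    outputs: Dict[str, Any] = {}       # 's_<series>_out'  -> output file
    preprocessed: Dict[str, Any] = {}  # 'pre_<series>'    -> callable

    for key, value in kwargs.items():
        if key.startswith("s_") and key.endswith("_out"):
            outputs[key[len("s_"):-len("_out")]] = value
        elif key.startswith("s_"):
            inputs[key[len("s_"):]] = value
        elif key.startswith("pre_"):
            preprocessed[key[len("pre_"):]] = value
        else:
            # Assumption of this sketch: unknown keys are treated as errors.
            raise ValueError("Unrecognized dataset argument: " + key)

    if not inputs:
        # Mirrors the documented behaviour of raising when no inputs are given.
        raise Exception("No input files are provided.")

    return inputs, outputs, preprocessed


# Hypothetical paths; the 'lowercased' series is derived from the dataset.
inputs, outputs, preprocessed = split_series_kwargs(
    s_source="data/train.en",
    s_target="data/train.de",
    s_target_out="out/train.de",
    pre_lowercased=lambda dataset: [],
)
```

Because the prefix and suffix are separated from the series name by underscores, a series name that itself contains an underscore would make the key harder to read at a glance, which is one plausible reason for the rule that series identifiers should not contain underscores.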

neuralmonkey.dataset.helpers.load_dataset_from_files(name: str, lazy: bool = False, preprocessors: List[Tuple[str, str, Callable]] = None, **kwargs) → neuralmonkey.dataset.dataset.Dataset

Load a dataset from the files specified by the provided arguments.

Paths to the data are provided in the form of a dictionary.

Keyword Arguments:
 
  • name – The name of the dataset to use. If None (default), the name will be inferred from the file names.
  • lazy – Boolean flag specifying whether to use lazy loading (useful for large files). Note that the lazy dataset cannot be shuffled. Defaults to False.
  • preprocessors – A list of (str, str, callable) tuples used for preprocessing the input sentences.
  • kwargs – Dataset keyword argument specs. These parameters should begin with the ‘s_’ prefix and may end with the ‘_out’ suffix. For example, a data series ‘source’, which specifies the source sentences, should be initialized with the ‘s_source’ parameter, which specifies the path and optionally the reader of the source file. If runners generate data of the ‘target’ series, the output file should be initialized with the ‘s_target_out’ parameter. Series identifiers should not contain underscores. Dataset-level preprocessors are defined with the ‘pre_’ prefix followed by a new series name. For a pre-processed series, the value is expected to be a callable that takes the dataset and returning a new series.
Returns:

The newly created dataset.

Raises:

Exception when no input files are provided.
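
The note that a lazy dataset cannot be shuffled can be illustrated in plain Python. The sketch below does not use Neural Monkey; iter_sentences is a hypothetical helper standing in for a lazily read series, and the sample lines are made up.

```python
import random
from typing import Iterable, Iterator, List


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Hypothetical lazy reader: yields one tokenized sentence at a time."""
    for line in lines:
        yield line.split()


lines = ["a small example", "of lazy loading", "with three lines"]

# Eager loading keeps every sentence in memory, so it can be shuffled in place.
eager = list(iter_sentences(lines))
random.shuffle(eager)

# Lazy loading produces items on demand; a generator has no length and no
# random access, so random.shuffle rejects it with a TypeError.
lazy = iter_sentences(lines)
try:
    random.shuffle(lazy)
    lazy_shuffled = True
except TypeError:
    lazy_shuffled = False
```

The trade-off is the usual one: the lazy form never holds the whole file in memory, at the cost of giving up operations that need to see all items at once.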