neuralmonkey package

Subpackages

Submodules

neuralmonkey.checking module

This module serves as a library of API checks used as assertions during the construction of the computational graph.

exception neuralmonkey.checking.CheckingException

Bases: Exception

neuralmonkey.checking.assert_same_shape(tensor_a: tensorflow.python.framework.ops.Tensor, tensor_b: tensorflow.python.framework.ops.Tensor) → None

Check if two tensors have the same shape.

neuralmonkey.checking.assert_shape(tensor: tensorflow.python.framework.ops.Tensor, expected_shape: typing.List[typing.Union[int, NoneType]]) → None

Check shape of a tensor.

Parameters:
  • tensor – Tensor to be checked.
  • expected_shape – Expected shape where None means the same as in TF and -1 means the dimension is not checked.
neuralmonkey.checking.assert_type(obj, name, value, expected_type, can_be_none=False)
neuralmonkey.checking.check_dataset_and_coders(dataset, runners)
neuralmonkey.checking.missing_attributes(obj, attributes)
neuralmonkey.checking.type_to_str(type_obj)
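
A minimal sketch of how these checks might be used while building the graph; the placeholder shapes are illustrative and a failed check presumably raises CheckingException:

import tensorflow as tf
from neuralmonkey.checking import assert_shape, assert_same_shape

logits = tf.placeholder(tf.float32, shape=[None, 20, 1000])
other = tf.placeholder(tf.float32, shape=[None, 20, 1000])

# None means the same as in TF, -1 means the dimension is not checked.
assert_shape(logits, [None, -1, 1000])

# Passes here because both placeholders have identical static shapes.
assert_same_shape(logits, other)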

neuralmonkey.dataset module

Implementation of the dataset class.

class neuralmonkey.dataset.Dataset(name: str, series: typing.Dict[str, typing.List], series_outputs: typing.Dict[str, str]) → None

Bases: collections.abc.Sized

This class serves as a collection of data series for the particular encoders and decoders in the model. If it is not given a parent dataset, it also manages the vocabularies inferred from the data.

A data series is either a list of strings or a numpy array.

add_series(name: str, series: typing.List[typing.Any]) → None
batch_dataset(batch_size: int) → typing.Iterable[neuralmonkey.dataset.Dataset]

Split the dataset into a list of batched datasets.

Parameters:batch_size – The size of a batch.
Returns:Generator yielding batched datasets.
batch_serie(serie_name: str, batch_size: int) → typing.Iterable[typing.Iterable]

Split a data series into batches.

Parameters:
  • serie_name – The name of the series
  • batch_size – The size of a batch
Returns:

Generator yielding batches of the data from the series.

get_series(name: str, allow_none: bool = False) → typing.Iterable

Get the data series with a given name.

Parameters:
  • name – The name of the series to fetch.
  • allow_none – If True, return None if the series does not exist.
Returns:

The data series.

Raises:

KeyError if the series does not exist and allow_none is False

has_series(name: str) → bool

Check if the dataset contains a series of a given name.

Parameters:name – Series name
Returns:True if the dataset contains the series, False otherwise.
series_ids
shuffle() → None

Shuffle the dataset randomly.
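
A minimal sketch of constructing and querying an in-memory dataset, based on the constructor and methods above; the series names and data are purely illustrative:

from neuralmonkey.dataset import Dataset

# Two toy series; real experiments would load these from files.
dataset = Dataset(
    name="toy",
    series={"source": [["hello", "world"], ["how", "are", "you"]],
            "target": [["ahoj", "svete"], ["jak", "se", "mas"]]},
    series_outputs={})

print(dataset.has_series("source"))        # True
print(list(dataset.get_series("target")))  # the tokenized target sentences

# Each batch is itself a Dataset instance.
for batch in dataset.batch_dataset(batch_size=1):
    print(len(batch))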

class neuralmonkey.dataset.LazyDataset(name: str, series_paths_and_readers: typing.Dict[str, typing.Tuple[typing.List[str], typing.Callable[[typing.List[str]], typing.Any]]], series_outputs: typing.Dict[str, str], preprocessors: typing.List[typing.Tuple[str, str, typing.Callable]] = None) → None

Bases: neuralmonkey.dataset.Dataset

Implements the lazy dataset.

The main difference between this implementation and the default one is that the contents of the file are not fully loaded into memory. Instead, every time the function get_series is called, a new file handle is created and a generator which yields lines from the file is returned.

add_series(name: str, series: typing.Iterable[typing.Any]) → None
get_series(name: str, allow_none: bool = False) → typing.Iterable

Get the data series with a given name.

This function opens a new file handle and returns a generator which yields preprocessed lines from the file.

Parameters:
  • name – The name of the series to fetch.
  • allow_none – If True, return None if the series does not exist.
Returns:

The data series.

Raises:

KeyError if the series does not exist and allow_none is False

has_series(name: str) → bool

Check if the dataset contains a series of a given name.

Parameters:name – Series name
Returns:True if the dataset contains the series, False otherwise.
series_ids
shuffle()

Does nothing, as an in-memory shuffle of a lazy dataset is not possible.

TODO: this is related to the __len__ method.

neuralmonkey.dataset.load_dataset_from_files(name: str = None, lazy: bool = False, preprocessors: typing.List[typing.Tuple[str, str, typing.Callable]] = None, **kwargs) → neuralmonkey.dataset.Dataset

Load a dataset from the files specified by the provided arguments. Paths to the data are provided in the form of a dictionary.

Keyword Arguments:
 
  • name – The name of the dataset to use. If None (default), the name will be inferred from the file names.
  • lazy – Boolean flag specifying whether to use lazy loading (useful for large files). Note that the lazy dataset cannot be shuffled. Defaults to False.
  • preprocessors – A list of callables used for preprocessing of the input series.
  • kwargs – Dataset keyword argument specs. These parameters should begin with the ‘s_‘ prefix and may end with the ‘_out’ suffix. For example, a data series ‘source’ which specifies the source sentences should be initialized with the ‘s_source’ parameter, which specifies the path and optionally the reader of the source file. If runners generate data of the ‘target’ series, the output file should be initialized with the ‘s_target_out’ parameter. Series identifiers should not contain underscores. Dataset-level preprocessors are defined with the ‘pre_‘ prefix followed by a new series name. In the case of pre-processed series, a callable taking the dataset and returning a new series is expected as a value.
Returns:

The newly created dataset.

Raises:

Exception when no input files are provided.
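
A hedged sketch of the keyword-argument convention described above; the file paths are placeholders:

from neuralmonkey.dataset import load_dataset_from_files

dataset = load_dataset_from_files(
    name="val",
    lazy=False,                  # True would stream large files instead
    s_source="data/val.src",     # the 'source' series is read from this file
    s_target_out="out/val.tgt")  # runner outputs for 'target' are written here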

neuralmonkey.decoding_function module

Module which implements decoding functions using multiple attentions for RNN decoders.

See http://arxiv.org/abs/1606.07481

class neuralmonkey.decoding_function.Attention(attention_states, scope, input_weights=None, attention_fertility=None)

Bases: object

attention(query_state)

Put attention masks on att_states_reshaped using hidden_features and query.

get_logits(y)
class neuralmonkey.decoding_function.CoverageAttention(attention_states, scope, input_weights=None, attention_fertility=5)

Bases: neuralmonkey.decoding_function.Attention

get_logits(y)

neuralmonkey.decorators module

neuralmonkey.decorators.tensor(func)

neuralmonkey.functions module

neuralmonkey.functions.inverse_sigmoid_decay(param, rate, min_value=0.0, max_value=1.0, name=None, dtype=tf.float32)

Inverse sigmoid decay: k/(k+exp(x/k)).

The result will be scaled to the range (min_value, max_value).

Parameters:
  • param – The parameter x from the formula.
  • rate – Non-negative k from the formula.
neuralmonkey.functions.piecewise_function(param, values, changepoints, name=None, dtype=tf.float32)

A piecewise function.

Parameters:
  • param – The function parameter.
  • values – List of function values (numbers or tensors).
  • changepoints – Sorted list of points where the function changes from one value to the next. Must be one item shorter than values.
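
A small sketch of both functions used as schedules driven by the global step; the constants are illustrative and passing the step tensor as param is an assumption:

import tensorflow as tf
from neuralmonkey.functions import inverse_sigmoid_decay, piecewise_function

step = tf.cast(tf.train.get_or_create_global_step(), tf.float32)

# k / (k + exp(x / k)) scaled into (0.1, 1.0), with k = 1000 and x = step.
sampling_prob = inverse_sigmoid_decay(step, 1000.0, min_value=0.1, max_value=1.0)

# 0.001 until step 10000, 0.0005 until step 50000, 0.0001 afterwards.
learning_rate = piecewise_function(
    step, values=[0.001, 0.0005, 0.0001], changepoints=[10000, 50000])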

neuralmonkey.learning_utils module

neuralmonkey.learning_utils.evaluation(evaluators, dataset, runners, execution_results, result_data)

Evaluate the model outputs.

Parameters:
  • evaluators – List of tuples of series and evaluation functions.
  • dataset – Dataset against which the evaluation is done.
  • runners – List of runners (contains series ids and loss names).
  • execution_results – Execution results that include the loss values.
  • result_data – Dictionary from series names to list of outputs.
Returns:

Dictionary of evaluation names and their values, which includes the metrics applied to the respective series as well as the loss values from the run.

neuralmonkey.learning_utils.print_final_evaluation(name: str, eval_result: typing.Dict[str, float]) → None

Print final evaluation from a test dataset.

neuralmonkey.learning_utils.run_on_dataset(tf_manager: neuralmonkey.tf_manager.TensorFlowManager, runners: typing.List[neuralmonkey.runners.base_runner.BaseRunner], dataset: neuralmonkey.dataset.Dataset, postprocess: typing.Union[typing.List[typing.Tuple[str, typing.Callable]], NoneType], write_out: bool = False, batch_size: typing.Union[int, NoneType] = None) → typing.Tuple[typing.List[neuralmonkey.runners.base_runner.ExecutionResult], typing.Dict[str, typing.List[typing.Any]]]

Apply the model on a dataset and optionally write outputs to files.

Parameters:
  • tf_manager – TensorFlow manager with initialized sessions.
  • runners – List of runners to apply to the dataset.
  • dataset – The dataset on which the model will be executed.
  • evaluators – List of evaluators that are used for the model evaluation if the target data are provided.
  • postprocess – A list of pairs of a series name and a postprocessing callable applied to the outputs, or None.
  • write_out – Flag whether the outputs should be printed to a file defined in the dataset object.
  • extra_fetches – Extra tensors to evaluate for each batch.
Returns:

Tuple of the resulting sentences/numpy arrays and the evaluation results, if available, as a dictionary mapping each function to its value.

neuralmonkey.learning_utils.training_loop(tf_manager: neuralmonkey.tf_manager.TensorFlowManager, epochs: int, trainer: neuralmonkey.trainers.generic_trainer.GenericTrainer, batch_size: int, train_dataset: neuralmonkey.dataset.Dataset, val_dataset: neuralmonkey.dataset.Dataset, log_directory: str, evaluators: typing.List[typing.Union[typing.Tuple[str, typing.Any], typing.Tuple[str, str, typing.Any]]], runners: typing.List[neuralmonkey.runners.base_runner.BaseRunner], test_datasets: typing.Union[typing.List[neuralmonkey.dataset.Dataset], NoneType] = None, link_best_vars='/tmp/variables.data.best', vars_prefix='/tmp/variables.data', logging_period: int = 20, validation_period: int = 500, val_preview_input_series: typing.Union[typing.List[str], NoneType] = None, val_preview_output_series: typing.Union[typing.List[str], NoneType] = None, val_preview_num_examples: int = 15, train_start_offset: int = 0, runners_batch_size: typing.Union[int, NoneType] = None, initial_variables: typing.Union[str, typing.List[str], NoneType] = None, postprocess: typing.Union[typing.List[typing.Tuple[str, typing.Callable]], NoneType] = None, minimize_metric: bool = False)

Perform the training loop for the given graph and data.

Parameters:
  • tf_manager – TensorFlowManager with initialized sessions.
  • epochs – Number of epochs for which the algorithm will learn.
  • trainer – The trainer object containing the TensorFlow code for computing the loss and the optimization operation.
  • train_dataset
  • val_dataset
  • postprocess – Function that takes the output sentence as produced by the decoder and transforms it into a tokenized sentence.
  • log_directory – Directory where the TensorBoard log will be generated. If None, nothing will be done.
  • evaluators – List of evaluators. The last evaluator is used as the main one. An evaluator is a tuple of the name of the generated series, the name of the dataset series against which it is evaluated, and the evaluation function. If only one series name is provided, the generated and the dataset series are assumed to have the same name (see the sketch below).
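
A minimal illustration of the evaluators format; the series names are placeholders and my_metric is a hypothetical evaluation function standing in for one of Neural Monkey's evaluator objects:

# Hypothetical metric: fraction of hypotheses identical to their references.
def my_metric(hypotheses, references):
    matches = sum(1 for h, r in zip(hypotheses, references) if h == r)
    return matches / len(references)

evaluators = [
    # (generated series, dataset series it is evaluated against, function)
    ("translation", "target", my_metric),
    # Shorthand: the generated and the dataset series are both named "target".
    ("target", my_metric),
]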

neuralmonkey.logging module

class neuralmonkey.logging.Logging

Bases: object

static debug(message, label=None)
debug_disabled = ['']
debug_enabled = ['none']
static log(message, color='yellow')

Logs a message with a colored timestamp.

log_file = None
static log_print(text: str) → None

Prints a string both to the console and to the log file if it is defined.

static print_header(title)

Prints the title of the experiment and the set of arguments it uses.

static set_log_file(path)

Sets up the file where the logging will be done.

strict_mode = None
static warn(message)

Logs a warning.

neuralmonkey.logging.debug(message, label=None)
neuralmonkey.logging.log(message, color='yellow')

Logs a message with a colored timestamp.

neuralmonkey.logging.log_print(text: str) → None

Prints a string both to the console and to the log file if it is defined.

neuralmonkey.logging.warn(message)

Logs a warning.
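
A short usage sketch of the module-level helpers above; the log file path is illustrative and the alternative color name is an assumption:

from neuralmonkey.logging import Logging, log, debug, warn

Logging.set_log_file("/tmp/experiment.log")

log("Starting the experiment")           # timestamped, yellow by default
log("Validation finished", color="red")  # color name assumed to be supported
debug("Computed attention weights", label="attention")
warn("Falling back to greedy decoding")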

neuralmonkey.run module

neuralmonkey.run.default_variable_file(output_dir)
neuralmonkey.run.initialize_for_running(output_dir, tf_manager, variable_files) → None

Restore either the default variables or those from the configuration.

Parameters:
  • output_dir – Training output directory.
  • tf_manager – TensorFlow manager.
  • variable_files – Files with variables to be restored or None if the default variables should be used.
neuralmonkey.run.main() → None

neuralmonkey.server module

neuralmonkey.server.main() → None
neuralmonkey.server.post_request()

neuralmonkey.tf_manager module

TensorFlow Manager

TensorFlow manager is a helper object in Neural Monkey which manages TensorFlow sessions, execution of the computation graph, and saving and restoring of model variables.

class neuralmonkey.tf_manager.TensorFlowManager(num_sessions, num_threads, save_n_best=1, variable_files=None, gpu_allow_growth=True, per_process_gpu_memory_fraction=1.0, report_gpu_memory_consumption=False, enable_tf_debug=False)

Bases: object

Interface between the computational graph, the data, and the TF sessions.

sessions

List of active TensorFlow sessions.

execute(dataset: neuralmonkey.dataset.Dataset, execution_scripts, train=False, compute_losses=True, summaries=True, batch_size=None) → typing.List[neuralmonkey.runners.base_runner.ExecutionResult]
initialize_model_parts(runners) → None

Initialize the variables of the model parts from their checkpoints.

restore(variable_files: typing.Union[str, typing.List[str]]) → None
save(variable_files: typing.Union[str, typing.List[str]]) → None
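
A minimal sketch of creating a manager and saving or restoring variables, following the constructor and method signatures above; the file path is illustrative:

from neuralmonkey.tf_manager import TensorFlowManager

tf_manager = TensorFlowManager(num_sessions=1, num_threads=4, save_n_best=3)

# ... build the model, then pass tf_manager to training_loop or run_on_dataset ...

tf_manager.save("/tmp/variables.data")
tf_manager.restore("/tmp/variables.data")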

neuralmonkey.tf_utils module

Small helper functions for TensorFlow.

neuralmonkey.tf_utils.gpu_memusage() → str

Return ‘’ or a string showing current GPU memory usage.

nvidia-smi result parsing based on https://github.com/wookayin/gpustat

neuralmonkey.tf_utils.has_gpu() → bool

Check if TensorFlow can access GPU.

The test is based on
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/platform/test.py

...but we are interested only in CUDA GPU devices.

Returns:True, if TF can access the GPU
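
A quick sketch of checking for a GPU before reporting memory usage:

from neuralmonkey.tf_utils import has_gpu, gpu_memusage

if has_gpu():
    print("GPU memory usage:", gpu_memusage())
else:
    print("No CUDA GPU is visible to TensorFlow.")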

neuralmonkey.train module

This is a training script for sequence to sequence learning.

neuralmonkey.train.create_config() → neuralmonkey.config.configuration.Configuration
neuralmonkey.train.main() → None

neuralmonkey.vocabulary module

This module implements the Vocabulary class and the helper functions that can be used to obtain a Vocabulary instance.

class neuralmonkey.vocabulary.Vocabulary(tokenized_text: typing.List[str] = None, unk_sample_prob: float = 0.0) → None

Bases: collections.abc.Sized

add_tokenized_text(tokenized_text: typing.List[str]) → None

Add words from a list to the vocabulary.

Parameters:tokenized_text – The list of words to add.
add_word(word: str) → None

Add a word to the vocabulary.

Parameters:word – The word to add. If it’s already there, increment the count.
get_unk_sampled_word_index(word)

Return index of the specified word with sampling of unknown words.

This method returns the index of the specified word in the vocabulary. If the frequency of the word in the vocabulary is 1 (the word was only seen once in the whole training dataset), with probability of self.unk_sample_prob, generate the index of the unknown token instead.

Parameters:word – The word to look up.
Returns:Index of the word, index of the unknown token if sampled, or index of the unknown token if the word is not present in the vocabulary.
get_word_index(word: str) → int

Return index of the specified word.

Parameters:word – The word to look up.
Returns:Index of the word or index of the unknown token if the word is not present in the vocabulary.
log_sample(size: int = 5)

Logs a sample of the vocabulary.

Parameters:size – How many sample words to log.
save_to_file(path: str, overwrite: bool = False) → None

Save the vocabulary to a file.

Parameters:
  • path – The path to save the file to.
  • overwrite – Flag whether to overwrite existing file. Defaults to False.
Raises:
  • FileExistsError if the file exists and the overwrite flag is disabled.
sentences_to_tensor(sentences: typing.List[typing.List[str]], max_len: typing.Union[int, NoneType] = None, pad_to_max_len: bool = True, train_mode: bool = False, add_start_symbol: bool = False, add_end_symbol: bool = False) → typing.Tuple[numpy.ndarray, numpy.ndarray]

Generate the tensor representation for the provided sentences.

Parameters:
  • sentences – List of sentences as lists of tokens.
  • max_len – If specified, all sentences will be truncated to this length.
  • pad_to_max_len – If True, the tensor will be padded to max_len, even if all of the sentences are shorter. If False, the shape of the tensor will be determined by the maximum length of the sentences in the batch.
  • train_mode – Flag whether we are training or not (enables/disables unk sampling).
  • add_start_symbol – If True, the <s> token will be added to the beginning of each sentence vector. Enabling this option extends the maximum length by one.
  • add_end_symbol – If True, the </s> token will be added to the end of each sentence vector, provided that the sentence is shorter than max_len. If not, the end token is not added. Unlike add_start_symbol, enabling this option does not alter the maximum length.
Returns:

A tuple of a sentence tensor and a padding weight vector.

The shape of the tensor representing the sentences is either (batch_max_len, batch_size) or (batch_max_len+1, batch_size), depending on the value of the add_start_symbol argument. batch_max_len is the length of the longest sentence in the batch (including the optional </s> token), limited by max_len (if specified).

The shape of the padding vector is the same as that of the sentence tensor.
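
A small sketch of the resulting shapes; the vocabulary contents and sentences are placeholders:

from neuralmonkey.vocabulary import Vocabulary

vocabulary = Vocabulary()
vocabulary.add_tokenized_text(["hello", "world", "how", "are", "you"])

sentences = [["hello", "world"], ["how", "are", "you"]]
tensor, paddings = vocabulary.sentences_to_tensor(
    sentences, max_len=10, pad_to_max_len=False, add_end_symbol=True)

# With pad_to_max_len=False the first dimension is the length of the longest
# sentence in the batch (including the </s> token), the second the batch size.
print(tensor.shape, paddings.shape)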

trunkate(size: int) → None

Truncate the vocabulary to the requested size by discarding infrequent tokens.

Parameters:size – The final size of the vocabulary
vectors_to_sentences(vectors: typing.List[numpy.ndarray]) → typing.List[typing.List[str]]

Convert vectors of indexes of vocabulary items to lists of words.

Parameters:vectors – List of vectors of vocabulary indices.
Returns:List of lists of words.
neuralmonkey.vocabulary.from_bpe(path: str, encoding: str = 'utf-8') → neuralmonkey.vocabulary.Vocabulary

Loads a vocabulary from a Byte-Pair Encoding merge list.

NOTE: The frequencies of words in this vocabulary are not computed from data. Instead, they correspond to the number of times the subword units occurred in the BPE merge list. This means that smaller subword units tend to be assigned larger frequencies, so truncation of the vocabulary is still possible, but should be done with a great deal of thought.

Parameters:
  • path – File name to load the vocabulary from.
  • encoding – The encoding of the merge file (defaults to UTF-8)
neuralmonkey.vocabulary.from_dataset(datasets: typing.List[neuralmonkey.dataset.Dataset], series_ids: typing.List[str], max_size: int, save_file: str = None, overwrite: bool = False, unk_sample_prob: float = 0.5) → neuralmonkey.vocabulary.Vocabulary

Loads vocabulary from a dataset with an option to save it.

Parameters:
  • datasets – A list of datasets from which to create the vocabulary
  • series_ids – A list of ids of series of the datasets that should be used for producing the vocabulary
  • max_size – The maximum size of the vocabulary
  • save_file – A file to save the vocabulary to. If None (default), the vocabulary will not be saved.
  • unk_sample_prob – The probability with which to sample unks out of words with frequency 1. Defaults to 0.5.
Returns:

The new Vocabulary instance.
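
A hedged sketch of building and saving a vocabulary from an in-memory dataset; the data and the save path are illustrative:

from neuralmonkey.dataset import Dataset
from neuralmonkey.vocabulary import from_dataset

train_data = Dataset(
    name="train",
    series={"source": [["hello", "world"], ["how", "are", "you"]]},
    series_outputs={})

vocabulary = from_dataset(
    datasets=[train_data], series_ids=["source"], max_size=5000,
    save_file="/tmp/vocab.pickle", overwrite=True)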

neuralmonkey.vocabulary.from_file(path: str) → neuralmonkey.vocabulary.Vocabulary

Loads a vocabulary from a pickled file.

Parameters:path – The path to the pickle file
Returns:The newly created vocabulary.
neuralmonkey.vocabulary.from_wordlist(path: str, encoding: str = 'utf-8') → neuralmonkey.vocabulary.Vocabulary

Loads a vocabulary from a wordlist.

Parameters:
  • path – The path to the wordlist file
  • encoding – The encoding of the wordlist file (defaults to UTF-8)
Returns:

The new Vocabulary instance.

neuralmonkey.vocabulary.initialize_vocabulary(directory: str, name: str, datasets: typing.List[neuralmonkey.dataset.Dataset] = None, series_ids: typing.List[str] = None, max_size: int = None) → neuralmonkey.vocabulary.Vocabulary

This function is supposed to initialize the vocabulary when called from the configuration file. It first checks whether a vocabulary is already stored at the provided path and if not, it tries to generate it from the provided datasets.

Parameters:
  • directory – Directory where the vocabulary should be stored.
  • name – Name of the vocabulary, which is also the name of the file it is stored in.
  • datasets – A list of datasets from which the vocabulary can be created.
  • series_ids – A list of ids of series of the datasets that should be used for producing the vocabulary.
  • max_size – The maximum size of the vocabulary
Returns:

The new vocabulary
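
A minimal sketch of the load-or-create behaviour described above; the directory, name, and dataset are placeholders:

from neuralmonkey.dataset import Dataset
from neuralmonkey.vocabulary import initialize_vocabulary

train_data = Dataset(
    name="train",
    series={"source": [["hello", "world"]]},
    series_outputs={})

# Loads "source_vocab" from the directory if it is already stored there,
# otherwise creates it from the given datasets and series.
vocabulary = initialize_vocabulary(
    directory="experiments/vocabularies",
    name="source_vocab",
    datasets=[train_data],
    series_ids=["source"],
    max_size=20000)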

Module contents

The neuralmonkey package is the root package of this project.