neuralmonkey package¶
Subpackages¶
- neuralmonkey.config package
- neuralmonkey.decoders package
- Submodules
- neuralmonkey.decoders.beam_search_decoder module
- neuralmonkey.decoders.classifier module
- neuralmonkey.decoders.ctc_decoder module
- neuralmonkey.decoders.decoder module
- neuralmonkey.decoders.encoder_projection module
- neuralmonkey.decoders.output_projection module
- neuralmonkey.decoders.sequence_labeler module
- neuralmonkey.decoders.sequence_regressor module
- neuralmonkey.decoders.word_alignment_decoder module
- Module contents
- neuralmonkey.encoders package
- Submodules
- neuralmonkey.encoders.attentive module
- neuralmonkey.encoders.cnn_encoder module
- neuralmonkey.encoders.encoder_wrapper module
- neuralmonkey.encoders.facebook_conv module
- neuralmonkey.encoders.imagenet_encoder module
- neuralmonkey.encoders.numpy_encoder module
- neuralmonkey.encoders.raw_rnn_encoder module
- neuralmonkey.encoders.recurrent module
- neuralmonkey.encoders.sentence_cnn_encoder module
- neuralmonkey.encoders.sequence_cnn_encoder module
- Module contents
- neuralmonkey.evaluators package
- Submodules
- neuralmonkey.evaluators.accuracy module
- neuralmonkey.evaluators.average module
- neuralmonkey.evaluators.beer module
- neuralmonkey.evaluators.bleu module
- neuralmonkey.evaluators.bleu_ref module
- neuralmonkey.evaluators.chrf module
- neuralmonkey.evaluators.edit_distance module
- neuralmonkey.evaluators.f1_bio module
- neuralmonkey.evaluators.gleu module
- neuralmonkey.evaluators.mse module
- neuralmonkey.evaluators.multeval module
- neuralmonkey.evaluators.ter module
- neuralmonkey.evaluators.wer module
- Module contents
- neuralmonkey.model package
- neuralmonkey.nn package
- neuralmonkey.processors package
- neuralmonkey.readers package
- neuralmonkey.runners package
- Submodules
- neuralmonkey.runners.base_runner module
- neuralmonkey.runners.beamsearch_runner module
- neuralmonkey.runners.label_runner module
- neuralmonkey.runners.logits_runner module
- neuralmonkey.runners.perplexity_runner module
- neuralmonkey.runners.plain_runner module
- neuralmonkey.runners.regression_runner module
- neuralmonkey.runners.representation_runner module
- neuralmonkey.runners.runner module
- neuralmonkey.runners.word_alignment_runner module
- Module contents
- neuralmonkey.tests package
- Submodules
- neuralmonkey.tests.test_bleu module
- neuralmonkey.tests.test_config module
- neuralmonkey.tests.test_dataset module
- neuralmonkey.tests.test_decoder module
- neuralmonkey.tests.test_encoders_init module
- neuralmonkey.tests.test_eval_wrappers module
- neuralmonkey.tests.test_functions module
- neuralmonkey.tests.test_model_part module
- neuralmonkey.tests.test_nn_utils module
- neuralmonkey.tests.test_readers module
- neuralmonkey.tests.test_ter module
- neuralmonkey.tests.test_vocabulary module
- Module contents
- neuralmonkey.trainers package
Submodules¶
neuralmonkey.checking module¶
This module serves as a library of API checks used as assertions during construction of the computational graph.
-
exception
neuralmonkey.checking.
CheckingException
¶ Bases:
Exception
-
neuralmonkey.checking.
assert_same_shape
(tensor_a: tensorflow.python.framework.ops.Tensor, tensor_b: tensorflow.python.framework.ops.Tensor) → None¶ Check if two tensors have the same shape.
-
neuralmonkey.checking.
assert_shape
(tensor: tensorflow.python.framework.ops.Tensor, expected_shape: typing.List[typing.Union[int, NoneType]]) → None¶ Check shape of a tensor.
Parameters: - tensor – Tensor to be checked.
- expected_shape – Expected shape, where None has the same meaning as in TF and -1 means the dimension is not checked.
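The wildcard semantics described above can be sketched in plain Python (this is an illustrative stand-in, not Neural Monkey's actual implementation, which raises CheckingException on mismatch):

```python
# Sketch of the shape-check semantics: in the expected shape, -1 skips the
# check for that dimension, while None must match an unknown (None)
# dimension, as in TensorFlow shape notation.
def shapes_match(actual, expected):
    if len(actual) != len(expected):
        return False
    for act, exp in zip(actual, expected):
        if exp == -1:      # -1: do not check this dimension
            continue
        if act != exp:     # None must match None, ints must match exactly
            return False
    return True

print(shapes_match([32, None, 512], [-1, None, 512]))  # True
print(shapes_match([32, 10, 512], [32, 10, 256]))      # False
```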
-
neuralmonkey.checking.
check_dataset_and_coders
(dataset: neuralmonkey.dataset.Dataset, runners: typing.Iterable[neuralmonkey.runners.base_runner.BaseRunner]) → None¶
neuralmonkey.dataset module¶
Implementation of the dataset class.
-
class
neuralmonkey.dataset.
Dataset
(name: str, series: typing.Dict[str, typing.List], series_outputs: typing.Dict[str, str]) → None¶ Bases:
collections.abc.Sized
This class serves as a collection of data series for the particular encoders and decoders in the model. If it is not given a parent dataset, it also manages the vocabularies inferred from the data.
A data series is either a list of strings or a numpy array.
-
add_series
(name: str, series: typing.List[typing.Any]) → None¶
-
batch_dataset
(batch_size: int) → typing.Iterable[neuralmonkey.dataset.Dataset]¶ Split the dataset into a list of batched datasets.
Parameters: batch_size – The size of a batch. Returns: Generator yielding batched datasets.
-
batch_serie
(serie_name: str, batch_size: int) → typing.Iterable[typing.Iterable]¶ Split a data serie into batches.
Parameters: - serie_name – The name of the series
- batch_size – The size of a batch
Returns: Generator yielding batches of the data from the serie.
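The chunking behaviour described for batch_serie can be sketched as a plain generator (a simplified stand-in; the real method reads the series from the Dataset's internal storage rather than taking a list directly):

```python
# Minimal sketch of splitting a data series into batches: yield consecutive
# chunks of at most batch_size items; the final batch may be smaller.
def batch_serie(series, batch_size):
    batch = []
    for item in series:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                      # final, possibly smaller batch
        yield batch

batches = list(batch_serie(["a", "b", "c", "d", "e"], 2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```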
-
get_series
(name: str, allow_none: bool = False) → typing.Iterable¶ Get the data series with a given name.
Parameters: - name – The name of the series to fetch.
- allow_none – If True, return None if the series does not exist.
Returns: The data series.
Raises: KeyError if the series does not exist and allow_none is False
-
has_series
(name: str) → bool¶ Check if the dataset contains a series of a given name.
Parameters: name – Series name Returns: True if the dataset contains the series, False otherwise.
-
series_ids
¶
-
shuffle
() → None¶ Shuffle the dataset randomly.
-
subset
(start: int, length: int) → neuralmonkey.dataset.Dataset¶
-
-
class
neuralmonkey.dataset.
LazyDataset
(name: str, series_paths_and_readers: typing.Dict[str, typing.Tuple[typing.List[str], typing.Callable[[typing.List[str]], typing.Any]]], series_outputs: typing.Dict[str, str], preprocessors: typing.List[typing.Tuple[str, str, typing.Callable]] = None) → None¶ Bases:
neuralmonkey.dataset.Dataset
Implements the lazy dataset.
The main difference between this implementation and the default one is that the contents of the files are not fully loaded into memory. Instead, every time the
get_series
function is called, a new file handle is created and a generator that yields lines from the file is returned.
-
add_series
(name: str, series: typing.Iterable[typing.Any]) → None¶
-
get_series
(name: str, allow_none: bool = False) → typing.Iterable¶ Get the data series with a given name.
This function opens a new file handle and returns a generator which yields preprocessed lines from the file.
Parameters: - name – The name of the series to fetch.
- allow_none – If True, return None if the series does not exist.
Returns: The data series.
Raises: KeyError if the series does not exist and allow_none is False
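The lazy behaviour described above can be sketched without the library (an illustrative stand-in: `lazy_series` and the whitespace tokenizer are assumptions standing in for the configured reader and preprocessor):

```python
import os
import tempfile

# Each call opens a fresh file handle and returns a generator of
# preprocessed lines, so the file is never fully loaded into memory.
def lazy_series(path, preprocess=str.split):
    def generator():
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield preprocess(line.rstrip("\n"))
    return generator()

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello world\nsecond line\n")

series = lazy_series(tmp.name)
print(next(series))  # ['hello', 'world']
print(next(series))  # ['second', 'line']
os.remove(tmp.name)
```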
-
has_series
(name: str) → bool¶ Check if the dataset contains a series of a given name.
Parameters: name – Series name Returns: True if the dataset contains the series, False otherwise.
-
series_ids
¶
-
shuffle
() → None¶ Does nothing, since shuffling a dataset that is not loaded in memory is impossible.
TODO: this is related to the
__len__
method.
-
subset
(start: int, length: int) → neuralmonkey.dataset.Dataset¶
-
-
neuralmonkey.dataset.
load_dataset_from_files
(name: str = None, lazy: bool = False, preprocessors: typing.List[typing.Tuple[str, str, typing.Callable]] = None, **kwargs) → neuralmonkey.dataset.Dataset¶ Load a dataset from the files specified by the provided arguments. Paths to the data are provided in the form of a dictionary.
Keyword Arguments: - name – The name of the dataset to use. If None (default), the name will be inferred from the file names.
- lazy – Boolean flag specifying whether to use lazy loading (useful for large files). Note that the lazy dataset cannot be shuffled. Defaults to False.
- preprocessors – Preprocessors to apply to the data series (each given as a tuple of the source series name, the target series name, and a callable).
- kwargs – Dataset keyword argument specs. These parameters should begin with an ‘s_’ prefix and may end with an ‘_out’ suffix. For example, a data series ‘source’ which specifies the source sentences should be initialized with the ‘s_source’ parameter, which specifies the path and optionally the reader of the source file. If runners generate data of the ‘target’ series, the output file should be initialized with the ‘s_target_out’ parameter. Series identifiers should not contain underscores. Dataset-level preprocessors are defined with a ‘pre_’ prefix followed by a new series name. In the case of pre-processed series, a callable taking the dataset and returning a new series is expected as the value.
Returns: The newly created dataset.
Raises: Exception when no input files are provided.
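The naming convention for the keyword arguments can be sketched as a small parser (assumed semantics reconstructed from the description above; `parse_dataset_kwargs` is a hypothetical helper, not part of the library):

```python
# 's_<name>' gives an input series path, 's_<name>_out' an output path, and
# 'pre_<name>' a dataset-level preprocessor. Series identifiers should not
# contain underscores, which keeps this parsing unambiguous.
def parse_dataset_kwargs(**kwargs):
    series, outputs, preprocessors = {}, {}, {}
    for key, value in kwargs.items():
        if key.startswith("pre_"):
            preprocessors[key[len("pre_"):]] = value
        elif key.startswith("s_") and key.endswith("_out"):
            outputs[key[len("s_"):-len("_out")]] = value
        elif key.startswith("s_"):
            series[key[len("s_"):]] = value
    return series, outputs, preprocessors

series, outputs, pre = parse_dataset_kwargs(
    s_source="data/train.src", s_target_out="out/train.hyp")
print(series)   # {'source': 'data/train.src'}
print(outputs)  # {'target': 'out/train.hyp'}
```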
neuralmonkey.decoding_function module¶
Module which implements decoding functions using multiple attentions for RNN decoders.
See http://arxiv.org/abs/1606.07481
The attention mechanisms used in Neural Monkey are inherited from the
BaseAttention
class defined in this module.
Each attention object has an attention function which operates on the
attention_states
tensor. The attention function receives the query tensor, the decoder's previous state and input, and its inner state, which can carry an arbitrarily structured piece of information. The default structure for this is the
AttentionLoopState
, which contains a growing array of attention
distributions and context vectors in time. This is why the
initial_loop_state
function is defined in the BaseAttention
class.
Mainly for illustration purposes, the attention objects can keep their histories: a dictionary populated with the attention distributions in time for every decoder that used this attention object. This is needed because, for example, the recurrent decoder can be run twice for each sentence, once in training mode, in which it receives the reference tokens on its input, and once in running mode, in which it receives its own previous outputs. The histories object is constructed after decoding; its construction should be triggered manually from the decoder by calling the finalize_loop method.
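The loop-state and histories mechanism can be illustrated in plain Python (no TensorFlow; `SketchAttention` and its `step` method are illustrative stand-ins, not the library's classes):

```python
from collections import namedtuple

# The loop state grows during decoding; finalize_loop stores the collected
# attention distributions under the decoder's key in the histories dict.
AttentionLoopState = namedtuple("AttentionLoopState", ["contexts", "weights"])

class SketchAttention:
    def __init__(self):
        self.histories = {}

    def initial_loop_state(self):
        return AttentionLoopState(contexts=[], weights=[])

    def step(self, loop_state, context, distribution):
        return AttentionLoopState(loop_state.contexts + [context],
                                  loop_state.weights + [distribution])

    def finalize_loop(self, key, last_loop_state):
        self.histories[key] = last_loop_state.weights

att = SketchAttention()
state = att.initial_loop_state()
state = att.step(state, context=[0.1, 0.2], distribution=[0.7, 0.3])
att.finalize_loop("decoder_train", state)
print(att.histories)  # {'decoder_train': [[0.7, 0.3]]}
```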
-
class
neuralmonkey.decoding_function.
Attention
(attention_states: tensorflow.python.framework.ops.Tensor, scope: str, attention_state_size: int = None, input_weights: tensorflow.python.framework.ops.Tensor = None, attention_fertility: int = None) → None¶ Bases:
neuralmonkey.decoding_function.BaseAttention
-
attention
(decoder_state: tensorflow.python.framework.ops.Tensor, decoder_prev_state: tensorflow.python.framework.ops.Tensor, _, loop_state: neuralmonkey.decoding_function.AttentionLoopState, step: tensorflow.python.framework.ops.Tensor) → typing.Tuple[tensorflow.python.framework.ops.Tensor, neuralmonkey.decoding_function.AttentionLoopState]¶ Put attention masks on att_states_reshaped using hidden_features and query.
-
finalize_loop
(key: str, last_loop_state: neuralmonkey.decoding_function.AttentionLoopState) → None¶
-
get_logits
(y, _)¶
-
initial_loop_state
() → neuralmonkey.decoding_function.AttentionLoopState¶
-
-
class
neuralmonkey.decoding_function.
AttentionLoopState
(contexts, weights)¶ Bases:
tuple
-
contexts
¶ Alias for field number 0
-
weights
¶ Alias for field number 1
-
-
class
neuralmonkey.decoding_function.
BaseAttention
(scope: str, attention_states: tensorflow.python.framework.ops.Tensor, attention_state_size: int, input_weights: tensorflow.python.framework.ops.Tensor = None) → None¶ Bases:
object
-
attention
(decoder_state: tensorflow.python.framework.ops.Tensor, decoder_prev_state: tensorflow.python.framework.ops.Tensor, decoder_input: tensorflow.python.framework.ops.Tensor, loop_state: typing.Any, step: tensorflow.python.framework.ops.Tensor) → typing.Tuple[tensorflow.python.framework.ops.Tensor, typing.Any]¶ Get context vector for given decoder state.
-
finalize_loop
(key: str, last_loop_state: typing.Any) → None¶
-
histories
¶
-
initial_loop_state
() → typing.Any¶ Get initial loop state for the attention object.
-
-
class
neuralmonkey.decoding_function.
CoverageAttention
(attention_states: tensorflow.python.framework.ops.Tensor, scope: str, input_weights: tensorflow.python.framework.ops.Tensor = None, attention_fertility: int = 5) → None¶ Bases:
neuralmonkey.decoding_function.Attention
-
get_logits
(y, weights_in_time)¶
-
-
class
neuralmonkey.decoding_function.
RecurrentAttention
(scope: str, attention_states: tensorflow.python.framework.ops.Tensor, input_weights: tensorflow.python.framework.ops.Tensor, attention_state_size: int, **kwargs) → None¶ Bases:
neuralmonkey.decoding_function.BaseAttention
From the article Recurrent Neural Machine Translation <https://arxiv.org/pdf/1607.08725v1.pdf>
At time i of the decoder with state s_i-1 and encoder states h_j, we run a bidirectional RNN with the initial state set to
c_0 = tanh(V*s_i-1 + b_0)
We then run the GRU network (the paper uses only the forward direction; we run it bidirectionally) and obtain N+1 hidden states c_0 ... c_N.
To compute the context vector, the authors try either the last state or the mean of all the states. The last state was better in their experiments, so that is what we use.
-
attention
(decoder_state: tensorflow.python.framework.ops.Tensor, decoder_prev_state: tensorflow.python.framework.ops.Tensor, _, loop_state: typing.Any, step: tensorflow.python.framework.ops.Tensor) → typing.Tuple[tensorflow.python.framework.ops.Tensor, typing.Any]¶
-
finalize_loop
(key: str, last_loop_state: typing.Any) → None¶
-
initial_loop_state
() → typing.List¶
-
-
neuralmonkey.decoding_function.
empty_attention_loop_state
() → neuralmonkey.decoding_function.AttentionLoopState¶ Create an empty attention loop state.
The attention loop state is a technical object for storing the attention distributions and the context vectors in time. It is used with the
tf.while_loop
dynamic implementation of the decoder. This function returns an empty attention loop state, which means there are two empty arrays: one for the attention distributions in time and one for the attention context vectors in time.
neuralmonkey.functions module¶
-
neuralmonkey.functions.
inverse_sigmoid_decay
(param, rate, min_value: float = 0.0, max_value: float = 1.0, name: typing.Union[str, NoneType] = None, dtype=tf.float32) → tensorflow.python.framework.ops.Tensor¶ Inverse sigmoid decay: k/(k+exp(x/k)).
The result will be scaled to the range (min_value, max_value).
Parameters: - param – The parameter x from the formula.
- rate – Non-negative k from the formula.
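The decay formula can be sketched in plain Python (the library version returns a TensorFlow tensor; the linear scaling into (min_value, max_value) shown here is my reading of the description above):

```python
import math

# Inverse sigmoid decay k / (k + exp(x / k)), rescaled so the result lies
# in (min_value, max_value): near max_value for small x, near min_value
# for large x.
def inverse_sigmoid_decay(x, k, min_value=0.0, max_value=1.0):
    decay = k / (k + math.exp(x / k))
    return min_value + (max_value - min_value) * decay

print(inverse_sigmoid_decay(0.0, 1.0))   # 0.5
print(inverse_sigmoid_decay(10.0, 1.0))  # near zero for large x
```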
-
neuralmonkey.functions.
piecewise_function
(param, values, changepoints, name=None, dtype=tf.float32)¶ A piecewise function.
Parameters: - param – The function parameter.
- values – List of function values (numbers or tensors).
- changepoints – Sorted list of points where the function changes from one value to the next. Must be one item shorter than values.
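A plain-Python sketch of the piecewise lookup (the library version builds a TensorFlow graph; whether the value switches exactly at a changepoint or just after it is an assumption here):

```python
import bisect

# values has one more item than changepoints; the function returns
# values[i] when param falls in the i-th interval.
def piecewise_function(param, values, changepoints):
    assert len(values) == len(changepoints) + 1
    return values[bisect.bisect_right(changepoints, param)]

# learning-rate-style schedule: 1.0 up to step 100, then 0.1, then 0.01
print(piecewise_function(50, [1.0, 0.1, 0.01], [100, 200]))   # 1.0
print(piecewise_function(150, [1.0, 0.1, 0.01], [100, 200]))  # 0.1
```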
neuralmonkey.learning_utils module¶
-
neuralmonkey.learning_utils.
evaluation
(evaluators, dataset, runners, execution_results, result_data)¶ Evaluate the model outputs.
Parameters: - evaluators – List of tuples of series and evaluation functions.
- dataset – Dataset against which the evaluation is done.
- runners – List of runners (contains series ids and loss names).
- execution_results – Execution results that include the loss values.
- result_data – Dictionary from series names to list of outputs.
Returns: Dictionary of evaluation names and their values which includes the metrics applied on respective series loss and loss values from the run.
-
neuralmonkey.learning_utils.
print_final_evaluation
(name: str, eval_result: typing.Dict[str, float]) → None¶ Print final evaluation from a test dataset.
-
neuralmonkey.learning_utils.
run_on_dataset
(tf_manager: neuralmonkey.tf_manager.TensorFlowManager, runners: typing.List[neuralmonkey.runners.base_runner.BaseRunner], dataset: neuralmonkey.dataset.Dataset, postprocess: typing.Union[typing.List[typing.Tuple[str, typing.Callable]], NoneType], write_out: bool = False, batch_size: typing.Union[int, NoneType] = None, log_progress: int = 0) → typing.Tuple[typing.List[neuralmonkey.runners.base_runner.ExecutionResult], typing.Dict[str, typing.List[typing.Any]]]¶ Apply the model on a dataset and optionally write outputs to files.
Parameters: - tf_manager – TensorFlow manager with initialized sessions.
- runners – A function that runs the code
- dataset – The dataset on which the model will be executed.
- evaluators – List of evaluators that are used for the model evaluation if the target data are provided.
- postprocess – an object to use as postprocessing of the
- write_out – Flag whether the outputs should be printed to a file defined in the dataset object.
- batch_size – size of the minibatch
- log_progress – log progress every X seconds
- extra_fetches – Extra tensors to evaluate for each batch.
Returns: Tuple of resulting sentences/numpy arrays, and evaluation results if they are available which are dictionary function -> value.
-
neuralmonkey.learning_utils.
training_loop
(tf_manager: neuralmonkey.tf_manager.TensorFlowManager, epochs: int, trainer: neuralmonkey.trainers.generic_trainer.GenericTrainer, batch_size: int, log_directory: str, evaluators: typing.List[typing.Union[typing.Tuple[str, typing.Any], typing.Tuple[str, str, typing.Any]]], runners: typing.List[neuralmonkey.runners.base_runner.BaseRunner], train_dataset: neuralmonkey.dataset.Dataset, val_dataset: typing.Union[neuralmonkey.dataset.Dataset, typing.List[neuralmonkey.dataset.Dataset]], test_datasets: typing.Union[typing.List[neuralmonkey.dataset.Dataset], NoneType] = None, logging_period: typing.Union[str, int] = 20, validation_period: typing.Union[str, int] = 500, val_preview_input_series: typing.Union[typing.List[str], NoneType] = None, val_preview_output_series: typing.Union[typing.List[str], NoneType] = None, val_preview_num_examples: int = 15, train_start_offset: int = 0, runners_batch_size: typing.Union[int, NoneType] = None, initial_variables: typing.Union[str, typing.List[str], NoneType] = None, postprocess: typing.Union[typing.List[typing.Tuple[str, typing.Callable]], NoneType] = None) → None¶ Perform the training loop for the given graph and data. :param tf_manager: TensorFlowManager with initialized sessions. :param epochs: Number of epochs for which the algorithm will learn. :param trainer: The trainer object containing the TensorFlow code for computing
the loss and the optimization operation. Parameters: - batch_size – number of examples in one mini-batch
- log_directory – Directory where the TensorBoard log will be generated. If None, nothing will be done.
- evaluators – List of evaluators. The last evaluator is used as the main one. An evaluator is a tuple of the name of the generated series, the name of the dataset series against which the generated one is evaluated, and the evaluation function. If only one series name is provided, the generated and dataset series have the same name.
- runners – List of runners for logging and evaluation runs
- train_dataset – Dataset used for training
- val_dataset – Dataset used for validation. Can be a Dataset or a list of datasets. The last dataset is used as the main one for storing the best results. When using multiple datasets, it is recommended to name them for better TensorBoard visualization.
- test_datasets – List of datasets used for testing
- logging_period – after how many batches should the logging happen. It can also be defined as a time period in format like: 3s; 4m; 6h; 1d; 3m15s; 3seconds; 4minutes; 6hours; 1days
- validation_period – after how many batches should the validation happen. It can also be defined as a time period in same format as logging
- val_preview_input_series – which input series to preview in validation
- val_preview_output_series – which output series to preview in validation
- val_preview_num_examples – how many examples should be printed during validation
- train_start_offset – how many lines from the training dataset should be skipped. The training starts from the next batch.
- runners_batch_size – batch size of runners. It is the same as batch_size if not specified
- initial_variables – variables used for initialization, for example for continuation of training
- postprocess – A function which takes the dataset with its output series and generates additional series from them.
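The time-period strings accepted by logging_period and validation_period ("3s", "4m", "6h", "1d", "3m15s", "4minutes", ...) can be parsed as sketched below (a hypothetical helper, not Neural Monkey's actual parser; mapping long unit names by their first letter is an assumption):

```python
import re

# Convert period strings such as "3m15s" or "6hours" into seconds.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def period_to_seconds(period):
    seconds = 0
    for amount, unit in re.findall(r"(\d+)\s*([a-z]+)", period.lower()):
        seconds += int(amount) * _UNITS[unit[0]]  # "seconds" -> "s", etc.
    return seconds

print(period_to_seconds("3m15s"))   # 195
print(period_to_seconds("6hours"))  # 21600
```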
neuralmonkey.logging module¶
-
class
neuralmonkey.logging.
Logging
¶ Bases:
object
-
static
debug
(message: str, label: typing.Union[str, NoneType] = None)¶
-
debug_disabled
= ['']¶
-
debug_enabled
= ['none']¶
-
static
log
(message: str, color: str = 'yellow') → None¶ Log a message with a colored timestamp.
-
log_file
= None¶
-
static
log_print
(text: str) → None¶ Print a string both to the console and to the log file, if one is defined.
-
static
notice
(message: str) → None¶ Log a notice with a colored timestamp.
-
static
print_header
(title: str, path: str) → None¶ Prints the title of the experiment and the set of arguments it uses.
-
static
set_log_file
(path: str) → None¶ Sets up the file where the logging will be done.
-
strict_mode
= None¶
-
static
warn
(message: str) → None¶ Logs a warning.
-
-
neuralmonkey.logging.
debug
(message: str, label: typing.Union[str, NoneType] = None)¶
-
neuralmonkey.logging.
log
(message: str, color: str = 'yellow') → None¶ Log a message with a colored timestamp.
-
neuralmonkey.logging.
log_print
(text: str) → None¶ Print a string both to the console and to the log file, if one is defined.
-
neuralmonkey.logging.
notice
(message: str) → None¶ Log a notice with a colored timestamp.
-
neuralmonkey.logging.
warn
(message: str) → None¶ Logs a warning.
neuralmonkey.run module¶
-
neuralmonkey.run.
default_variable_file
(output_dir)¶
-
neuralmonkey.run.
initialize_for_running
(output_dir, tf_manager, variable_files) → None¶ Restore either the default variables or those specified in the configuration.
Parameters: - output_dir – Training output directory.
- tf_manager – TensorFlow manager.
- variable_files – Files with variables to be restored or None if the default variables should be used.
-
neuralmonkey.run.
main
() → None¶
neuralmonkey.tf_manager module¶
TensorFlow Manager¶
TensorFlow manager is a helper object in Neural Monkey which manages TensorFlow sessions, execution of the computation graph, and saving and restoring of model variables.
-
class
neuralmonkey.tf_manager.
TensorFlowManager
(num_sessions: int, num_threads: int, save_n_best: int = 1, minimize_metric: bool = False, variable_files: typing.Union[typing.List[str], NoneType] = None, gpu_allow_growth: bool = True, per_process_gpu_memory_fraction: float = 1.0, report_gpu_memory_consumption: bool = False, enable_tf_debug: bool = False) → None¶ Bases:
object
Interface between the computational graph, the data, and TF sessions.
-
sessions
¶ List of active TensorFlow sessions.
-
execute
(dataset: neuralmonkey.dataset.Dataset, execution_scripts, train=False, compute_losses=True, summaries=True, batch_size=None, log_progress: int = 0) → typing.List[neuralmonkey.runners.base_runner.ExecutionResult]¶
-
init_saving
(vars_prefix: str) → None¶
-
initialize_model_parts
(runners, save=False) → None¶ Initialize the variables of model parts from their checkpoints.
-
restore
(variable_files: typing.Union[str, typing.List[str]]) → None¶
-
restore_best_vars
() → None¶
-
save
(variable_files: typing.Union[str, typing.List[str]]) → None¶
-
validation_hook
(score: float, epoch: int, batch: int) → None¶
-
neuralmonkey.tf_utils module¶
Small helper functions for TensorFlow.
-
neuralmonkey.tf_utils.
gpu_memusage
() → str¶ Return ‘’ or a string showing current GPU memory usage.
nvidia-smi result parsing based on https://github.com/wookayin/gpustat
-
neuralmonkey.tf_utils.
has_gpu
() → bool¶ Check if TensorFlow can access GPU.
The test is based on https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/platform/test.py, but we are interested only in CUDA GPU devices.
Returns: True, if TF can access the GPU
neuralmonkey.train module¶
This is a training script for sequence to sequence learning.
-
neuralmonkey.train.
create_config
() → neuralmonkey.config.configuration.Configuration¶
-
neuralmonkey.train.
main
() → None¶
neuralmonkey.vocabulary module¶
This module implements the Vocabulary class and the helper functions that can be used to obtain a Vocabulary instance.
-
class
neuralmonkey.vocabulary.
Vocabulary
(tokenized_text: typing.List[str] = None, unk_sample_prob: float = 0.0) → None¶ Bases:
collections.abc.Sized
-
add_tokenized_text
(tokenized_text: typing.List[str]) → None¶ Add words from a list to the vocabulary.
Parameters: tokenized_text – The list of words to add.
-
add_word
(word: str, occurences: int = 1) → None¶ Add a word to the vocabulary.
Parameters: - word – The word to add. If it’s already there, increment the count.
- occurences – Increment the count of the word by this number of occurrences.
-
get_unk_sampled_word_index
(word)¶ Return index of the specified word with sampling of unknown words.
This method returns the index of the specified word in the vocabulary. If the frequency of the word in the vocabulary is 1 (the word was only seen once in the whole training dataset), with probability of self.unk_sample_prob, generate the index of the unknown token instead.
Parameters: word – The word to look up. Returns: Index of the word, index of the unknown token if sampled, or index of the unknown token if the word is not present in the vocabulary.
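The unk-sampling rule described above can be sketched in plain Python (an illustrative stand-in working with tokens; the real Vocabulary returns integer indices):

```python
import random

# Words seen exactly once in training are replaced by the unknown token
# with probability unk_sample_prob; unseen words are always unknown.
def unk_sampled(word, counts, unk_sample_prob, rng=random):
    if word not in counts:
        return "<unk>"
    if counts[word] == 1 and rng.random() < unk_sample_prob:
        return "<unk>"
    return word

counts = {"the": 120, "artichoke": 1}
print(unk_sampled("the", counts, 0.5))  # 'the'
print(unk_sampled("zzz", counts, 0.5))  # '<unk>'
```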
-
get_word_index
(word: str) → int¶ Return index of the specified word.
Parameters: word – The word to look up. Returns: Index of the word or index of the unknown token if the word is not present in the vocabulary.
-
log_sample
(size: int = 5)¶ Log a sample of the vocabulary.
Parameters: size – How many sample words to log.
-
save_wordlist
(path: str, overwrite: bool = False, save_frequencies: bool = False, encoding: str = 'utf-8') → None¶ Save the vocabulary as a wordlist. The file is ordered by the ids of words. This function is used mainly for embedding visualization.
Parameters: - path – The path to save the file to.
- overwrite – Flag whether to overwrite existing file. Defaults to False.
- save_frequencies – Flag whether frequencies should be stored. This parameter adds a header to the output file.
Raises: FileExistsError if the file exists and the overwrite flag is disabled.
-
sentences_to_tensor
(sentences: typing.List[typing.List[str]], max_len: int = None, pad_to_max_len: bool = True, train_mode: bool = False, add_start_symbol: bool = False, add_end_symbol: bool = False) → typing.Tuple[numpy.ndarray, numpy.ndarray]¶ Generate the tensor representation for the provided sentences.
Parameters: - sentences – List of sentences as lists of tokens.
- max_len – If specified, all sentences will be truncated to this length.
- pad_to_max_len – If True, the tensor will be padded to max_len, even if all of the sentences are shorter. If False, the shape of the tensor will be determined by the maximum length of the sentences in the batch.
- train_mode – Flag whether we are training or not (enables/disables unk sampling).
- add_start_symbol – If True, the <s> token will be added to the beginning of each sentence vector. Enabling this option extends the maximum length by one.
- add_end_symbol – If True, the </s> token will be added to the end of each sentence vector, provided that the sentence is shorter than max_len. If not, the end token is not added. Unlike add_start_symbol, enabling this option does not alter the maximum length.
Returns: A tuple of a sentence tensor and a padding weight vector.
The shape of the tensor representing the sentences is either (batch_max_len, batch_size) or (batch_max_len+1, batch_size), depending on the value of the add_start_symbol argument. batch_max_len is the length of the longest sentence in the batch (including the optional </s> token), limited by max_len (if specified).
The shape of the padding vector is the same as of the sentence vector.
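The truncation and padding behaviour can be sketched in plain Python (a simplified stand-in: it keeps tokens instead of vocabulary indices and is batch-major for readability, whereas the real method returns time-major numpy arrays):

```python
# Sentences are truncated to max_len, padded to the batch maximum, and a
# parallel weight matrix marks real tokens with 1 and padding with 0.
PAD = "<pad>"

def sentences_to_padded_batch(sentences, max_len=None):
    truncated = [s[:max_len] if max_len else list(s) for s in sentences]
    batch_max = max(len(s) for s in truncated)
    tensor = [s + [PAD] * (batch_max - len(s)) for s in truncated]
    weights = [[1] * len(s) + [0] * (batch_max - len(s)) for s in truncated]
    return tensor, weights

tensor, weights = sentences_to_padded_batch([["a", "b", "c"], ["d"]])
print(tensor)   # [['a', 'b', 'c'], ['d', '<pad>', '<pad>']]
print(weights)  # [[1, 1, 1], [1, 0, 0]]
```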
-
truncate
(size: int) → None¶ Truncate the vocabulary to the requested size by discarding infrequent tokens.
Parameters: size – The final size of the vocabulary
-
truncate_by_min_freq
(min_freq: int) → None¶ Truncate the vocabulary only keeping words with a minimum frequency.
Parameters: min_freq – The minimum frequency of included words.
-
vectors_to_sentences
(vectors: typing.List[numpy.ndarray]) → typing.List[typing.List[str]]¶ Convert vectors of indexes of vocabulary items to lists of words.
Parameters: vectors – List of vectors of vocabulary indices. Returns: List of lists of words.
-
-
neuralmonkey.vocabulary.
from_bpe
(path: str, encoding: str = 'utf-8') → neuralmonkey.vocabulary.Vocabulary¶ Load a vocabulary from a byte-pair encoding merge list.
NOTE: The frequencies of words in this vocabulary are not computed from data. Instead, they correspond to the number of times the subword units occurred in the BPE merge list. This means that smaller subword units will tend to have larger frequencies assigned, so the vocabulary can be truncated, but not without a great deal of thought.
Parameters: - path – File name to load the vocabulary from.
- encoding – The encoding of the merge file (defaults to UTF-8)
-
neuralmonkey.vocabulary.
from_dataset
(datasets: typing.List[neuralmonkey.dataset.Dataset], series_ids: typing.List[str], max_size: int, save_file: str = None, overwrite: bool = False, min_freq: typing.Union[int, NoneType] = None, unk_sample_prob: float = 0.5) → neuralmonkey.vocabulary.Vocabulary¶ Loads vocabulary from a dataset with an option to save it.
Parameters: - datasets – A list of datasets from which to create the vocabulary
- series_ids – A list of ids of series of the datasets that should be used producing the vocabulary
- max_size – The maximum size of the vocabulary
- save_file – A file to save the vocabulary to. If None (default), the vocabulary will not be saved.
- overwrite – Overwrite existing file.
- min_freq – Do not include words with frequency smaller than this.
- unk_sample_prob – The probability with which to sample unks out of words with frequency 1. Defaults to 0.5.
Returns: The new Vocabulary instance.
-
neuralmonkey.vocabulary.
from_file
(*args, **kwargs) → neuralmonkey.vocabulary.Vocabulary¶
-
neuralmonkey.vocabulary.
from_wordlist
(path: str, encoding: str = 'utf-8', contains_header: bool = True, contains_frequencies: bool = True) → neuralmonkey.vocabulary.Vocabulary¶ Load a vocabulary from a wordlist. The file can contain either a plain list of words with no header, or words and their counts separated by a tab, with a header on the first line.
Parameters: - path – The path to the wordlist file
- encoding – The encoding of the merge file (defaults to UTF-8)
- contains_header – Whether the file has a header on the first line
- contains_frequencies – Whether the file contains frequencies in the second column
Returns: The new Vocabulary instance.
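Reading the two wordlist layouts described above can be sketched as follows (`read_wordlist` is an illustrative stand-in returning a plain dict, not a Vocabulary instance):

```python
import os
import tempfile

# Read either one word per line, or a header line followed by
# word<TAB>count pairs, depending on the flags.
def read_wordlist(path, contains_header=True, contains_frequencies=True,
                  encoding="utf-8"):
    words = {}
    with open(path, encoding=encoding) as handle:
        if contains_header:
            next(handle)                  # skip the header line
        for line in handle:
            if contains_frequencies:
                word, count = line.rstrip("\n").split("\t")
                words[word] = int(count)
            else:
                words[line.rstrip("\n")] = 1
    return words

with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("word\tcount\nthe\t120\ncat\t7\n")
print(read_wordlist(tmp.name))  # {'the': 120, 'cat': 7}
os.remove(tmp.name)
```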
-
neuralmonkey.vocabulary.
initialize_vocabulary
(directory: str, name: str, datasets: typing.List[neuralmonkey.dataset.Dataset] = None, series_ids: typing.List[str] = None, max_size: int = None) → neuralmonkey.vocabulary.Vocabulary¶ Initialize a vocabulary when called from the configuration file. It first checks whether the vocabulary is already present at the provided path and, if not, tries to generate it from the provided datasets.
Parameters: - directory – Directory where the vocabulary should be stored.
- name – Name of the vocabulary, which is also the name of the file in which it is stored.
- datasets – A list of datasets from which the vocabulary can be created.
- series_ids – A list of ids of series of the datasets that should be used for producing the vocabulary.
- max_size – The maximum size of the vocabulary
Returns: The new vocabulary
Module contents¶
The neuralmonkey package is the root package of this project.