neuralmonkey package¶
Subpackages¶
- neuralmonkey.config package
- neuralmonkey.decoders package
- Submodules
- neuralmonkey.decoders.beam_search_decoder module
- neuralmonkey.decoders.classifier module
- neuralmonkey.decoders.ctc_decoder module
- neuralmonkey.decoders.decoder module
- neuralmonkey.decoders.encoder_projection module
- neuralmonkey.decoders.output_projection module
- neuralmonkey.decoders.sequence_labeler module
- neuralmonkey.decoders.sequence_regressor module
- neuralmonkey.decoders.word_alignment_decoder module
- Module contents
- neuralmonkey.encoders package
- Submodules
- neuralmonkey.encoders.attentive module
- neuralmonkey.encoders.cnn_encoder module
- neuralmonkey.encoders.encoder_wrapper module
- neuralmonkey.encoders.facebook_conv module
- neuralmonkey.encoders.imagenet_encoder module
- neuralmonkey.encoders.numpy_encoder module
- neuralmonkey.encoders.raw_rnn_encoder module
- neuralmonkey.encoders.recurrent module
- neuralmonkey.encoders.sentence_cnn_encoder module
- neuralmonkey.encoders.sequence_cnn_encoder module
- Module contents
- neuralmonkey.evaluators package
- Submodules
- neuralmonkey.evaluators.accuracy module
- neuralmonkey.evaluators.average module
- neuralmonkey.evaluators.beer module
- neuralmonkey.evaluators.bleu module
- neuralmonkey.evaluators.bleu_ref module
- neuralmonkey.evaluators.chrf module
- neuralmonkey.evaluators.edit_distance module
- neuralmonkey.evaluators.f1_bio module
- neuralmonkey.evaluators.gleu module
- neuralmonkey.evaluators.mse module
- neuralmonkey.evaluators.multeval module
- neuralmonkey.evaluators.ter module
- neuralmonkey.evaluators.wer module
- Module contents
- neuralmonkey.model package
- neuralmonkey.nn package
- neuralmonkey.processors package
- neuralmonkey.readers package
- neuralmonkey.runners package
- Submodules
- neuralmonkey.runners.base_runner module
- neuralmonkey.runners.beamsearch_runner module
- neuralmonkey.runners.label_runner module
- neuralmonkey.runners.logits_runner module
- neuralmonkey.runners.perplexity_runner module
- neuralmonkey.runners.plain_runner module
- neuralmonkey.runners.regression_runner module
- neuralmonkey.runners.representation_runner module
- neuralmonkey.runners.runner module
- neuralmonkey.runners.word_alignment_runner module
- Module contents
- neuralmonkey.tests package
- Submodules
- neuralmonkey.tests.test_bleu module
- neuralmonkey.tests.test_config module
- neuralmonkey.tests.test_dataset module
- neuralmonkey.tests.test_decoder module
- neuralmonkey.tests.test_encoders_init module
- neuralmonkey.tests.test_eval_wrappers module
- neuralmonkey.tests.test_functions module
- neuralmonkey.tests.test_model_part module
- neuralmonkey.tests.test_nn_utils module
- neuralmonkey.tests.test_readers module
- neuralmonkey.tests.test_ter module
- neuralmonkey.tests.test_vocabulary module
- Module contents
- neuralmonkey.trainers package
Submodules¶
neuralmonkey.checking module¶
This module serves as a library of API checks used as assertions during construction of the computational graph.
-
exception
neuralmonkey.checking.
CheckingException
¶ Bases:
Exception
-
neuralmonkey.checking.
assert_same_shape
(tensor_a: tensorflow.python.framework.ops.Tensor, tensor_b: tensorflow.python.framework.ops.Tensor) → None¶ Check if two tensors have the same shape.
-
neuralmonkey.checking.
assert_shape
(tensor: tensorflow.python.framework.ops.Tensor, expected_shape: typing.List[typing.Union[int, NoneType]]) → None¶ Check shape of a tensor.
Parameters: - tensor – Tensor to be checked.
- expected_shape – Expected shape, where None has the same meaning as in TF and -1 means the dimension is not checked.
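The wildcard semantics described above can be sketched in plain Python (this is an illustrative stand-in, not Neural Monkey's actual implementation, which raises CheckingException on mismatch):

```python
# Sketch of the shape-check semantics: in the expected shape, -1 skips the
# check for that dimension, while None must match an unknown (None)
# dimension, as in TensorFlow shape notation.
def shapes_match(actual, expected):
    if len(actual) != len(expected):
        return False
    for act, exp in zip(actual, expected):
        if exp == -1:      # -1: do not check this dimension
            continue
        if act != exp:     # None must match None, ints must match exactly
            return False
    return True

print(shapes_match([32, None, 512], [-1, None, 512]))  # True
print(shapes_match([32, 10, 512], [32, 10, 256]))      # False
```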
-
neuralmonkey.checking.
check_dataset_and_coders
(dataset: neuralmonkey.dataset.Dataset, runners: typing.Iterable[neuralmonkey.runners.base_runner.BaseRunner]) → None¶
neuralmonkey.dataset module¶
Implementation of the dataset class.
-
class
neuralmonkey.dataset.
Dataset
(name: str, series: typing.Dict[str, typing.List], series_outputs: typing.Dict[str, str]) → None¶ Bases:
collections.abc.Sized
This class serves as a collection of data series for the particular encoders and decoders in the model. If it is not given a parent dataset, it also manages the vocabularies inferred from the data.
A data series is either a list of strings or a numpy array.
-
add_series
(name: str, series: typing.List[typing.Any]) → None¶
-
batch_dataset
(batch_size: int) → typing.Iterable[neuralmonkey.dataset.Dataset]¶ Split the dataset into a list of batched datasets.
Parameters: batch_size – The size of a batch. Returns: Generator yielding batched datasets.
-
batch_serie
(serie_name: str, batch_size: int) → typing.Iterable[typing.Iterable]¶ Split a data serie into batches.
Parameters: - serie_name – The name of the series
- batch_size – The size of a batch
Returns: Generator yielding batches of the data from the serie.
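The chunking behaviour described for batch_serie can be sketched as a plain generator (a simplified stand-in; the real method reads the series from the Dataset's internal storage rather than taking a list directly):

```python
# Minimal sketch of splitting a data series into batches: yield consecutive
# chunks of at most batch_size items; the final batch may be smaller.
def batch_serie(series, batch_size):
    batch = []
    for item in series:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                      # final, possibly smaller batch
        yield batch

batches = list(batch_serie(["a", "b", "c", "d", "e"], 2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```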
-
get_series
(name: str, allow_none: bool = False) → typing.Iterable¶ Get the data series with a given name.
Parameters: - name – The name of the series to fetch.
- allow_none – If True, return None if the series does not exist.
Returns: The data series.
Raises: KeyError if the series does not exist and allow_none is False
-
has_series
(name: str) → bool¶ Check if the dataset contains a series of a given name.
Parameters: name – Series name Returns: True if the dataset contains the series, False otherwise.
-
series_ids
¶
-
shuffle
() → None¶ Shuffle the dataset randomly.
-
subset
(start: int, length: int) → neuralmonkey.dataset.Dataset¶
-
-
class
neuralmonkey.dataset.
LazyDataset
(name: str, series_paths_and_readers: typing.Dict[str, typing.Tuple[typing.List[str], typing.Callable[[typing.List[str]], typing.Any]]], series_outputs: typing.Dict[str, str], preprocessors: typing.List[typing.Tuple[str, str, typing.Callable]] = None) → None¶ Bases:
neuralmonkey.dataset.Dataset
Implements the lazy dataset.
The main difference between this implementation and the default one is that the contents of the files are not fully loaded into memory. Instead, every time the
get_series
function is called, a new file handle is created and a generator that yields lines from the file is returned.
-
add_series
(name: str, series: typing.Iterable[typing.Any]) → None¶
-
get_series
(name: str, allow_none: bool = False) → typing.Iterable¶ Get the data series with a given name.
This function opens a new file handle and returns a generator which yields preprocessed lines from the file.
Parameters: - name – The name of the series to fetch.
- allow_none – If True, return None if the series does not exist.
Returns: The data series.
Raises: KeyError if the series does not exist and allow_none is False
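The lazy behaviour described above can be sketched without the library (an illustrative stand-in: `lazy_series` and the whitespace tokenizer are assumptions standing in for the configured reader and preprocessor):

```python
import os
import tempfile

# Each call opens a fresh file handle and returns a generator of
# preprocessed lines, so the file is never fully loaded into memory.
def lazy_series(path, preprocess=str.split):
    def generator():
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield preprocess(line.rstrip("\n"))
    return generator()

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello world\nsecond line\n")

series = lazy_series(tmp.name)
print(next(series))  # ['hello', 'world']
print(next(series))  # ['second', 'line']
os.remove(tmp.name)
```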
-
has_series
(name: str) → bool¶ Check if the dataset contains a series of a given name.
Parameters: name – Series name Returns: True if the dataset contains the series, False otherwise.
-
series_ids
¶
-
shuffle
() → None¶ Does nothing, since shuffling a dataset that is not loaded in memory is impossible.
TODO: this is related to the
__len__
method.
-
subset
(start: int, length: int) → neuralmonkey.dataset.Dataset¶
-
-
neuralmonkey.dataset.
load_dataset_from_files
(name: str = None, lazy: bool = False, preprocessors: typing.List[typing.Tuple[str, str, typing.Callable]] = None, **kwargs) → neuralmonkey.dataset.Dataset¶ Load a dataset from the files specified by the provided arguments. Paths to the data are provided in the form of a dictionary.
Keyword Arguments: - name – The name of the dataset to use. If None (default), the name will be inferred from the file names.
- lazy – Boolean flag specifying whether to use lazy loading (useful for large files). Note that the lazy dataset cannot be shuffled. Defaults to False.
- preprocessors – Preprocessors to apply to the data series (each given as a tuple of the source series name, the target series name, and a callable).
- kwargs – Dataset keyword argument specs. These parameters should begin with an ‘s_’ prefix and may end with an ‘_out’ suffix. For example, a data series ‘source’ which specifies the source sentences should be initialized with the ‘s_source’ parameter, which specifies the path and optionally the reader of the source file. If runners generate data of the ‘target’ series, the output file should be initialized with the ‘s_target_out’ parameter. Series identifiers should not contain underscores. Dataset-level preprocessors are defined with a ‘pre_’ prefix followed by a new series name. In the case of pre-processed series, a callable taking the dataset and returning a new series is expected as the value.
Returns: The newly created dataset.
Raises: Exception when no input files are provided.
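The naming convention for the keyword arguments can be sketched as a small parser (assumed semantics reconstructed from the description above; `parse_dataset_kwargs` is a hypothetical helper, not part of the library):

```python
# 's_<name>' gives an input series path, 's_<name>_out' an output path, and
# 'pre_<name>' a dataset-level preprocessor. Series identifiers should not
# contain underscores, which keeps this parsing unambiguous.
def parse_dataset_kwargs(**kwargs):
    series, outputs, preprocessors = {}, {}, {}
    for key, value in kwargs.items():
        if key.startswith("pre_"):
            preprocessors[key[len("pre_"):]] = value
        elif key.startswith("s_") and key.endswith("_out"):
            outputs[key[len("s_"):-len("_out")]] = value
        elif key.startswith("s_"):
            series[key[len("s_"):]] = value
    return series, outputs, preprocessors

series, outputs, pre = parse_dataset_kwargs(
    s_source="data/train.src", s_target_out="out/train.hyp")
print(series)   # {'source': 'data/train.src'}
print(outputs)  # {'target': 'out/train.hyp'}
```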
neuralmonkey.decoding_function module¶
Module which implements decoding functions using multiple attentions for RNN decoders.
See http://arxiv.org/abs/1606.07481
The attention mechanisms used in Neural Monkey are inherited from the
BaseAttention
class defined in this module.
Each attention object has an attention function which operates on the
attention_states
tensor. The attention function receives the query tensor, the decoder's previous state and input, and its inner state, which can carry an arbitrarily structured piece of information. The default structure for this is the
AttentionLoopState
, which contains a growing array of attention
distributions and context vectors in time. This is why the
initial_loop_state
function is defined in the BaseAttention
class.
Mainly for illustration purposes, the attention objects can keep their histories: a dictionary populated with the attention distributions in time for every decoder that used this attention object. This is needed because, for example, the recurrent decoder can be run twice for each sentence, once in training mode, in which it receives the reference tokens on its input, and once in running mode, in which it receives its own previous outputs. The histories object is constructed after decoding; its construction should be triggered manually from the decoder by calling the finalize_loop method.
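The loop-state and histories mechanism can be illustrated in plain Python (no TensorFlow; `SketchAttention` and its `step` method are illustrative stand-ins, not the library's classes):

```python
from collections import namedtuple

# The loop state grows during decoding; finalize_loop stores the collected
# attention distributions under the decoder's key in the histories dict.
AttentionLoopState = namedtuple("AttentionLoopState", ["contexts", "weights"])

class SketchAttention:
    def __init__(self):
        self.histories = {}

    def initial_loop_state(self):
        return AttentionLoopState(contexts=[], weights=[])

    def step(self, loop_state, context, distribution):
        return AttentionLoopState(loop_state.contexts + [context],
                                  loop_state.weights + [distribution])

    def finalize_loop(self, key, last_loop_state):
        self.histories[key] = last_loop_state.weights

att = SketchAttention()
state = att.initial_loop_state()
state = att.step(state, context=[0.1, 0.2], distribution=[0.7, 0.3])
att.finalize_loop("decoder_train", state)
print(att.histories)  # {'decoder_train': [[0.7, 0.3]]}
```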
-
class
neuralmonkey.decoding_function.
Attention
(attention_states: tensorflow.python.framework.ops.Tensor, scope: str, attention_state_size: int = None, input_weights: tensorflow.python.framework.ops.Tensor = None, attention_fertility: int = None) → None¶ Bases:
neuralmonkey.decoding_function.BaseAttention
-
attention
(decoder_state: tensorflow.python.framework.ops.Tensor, decoder_prev_state: tensorflow.python.framework.ops.Tensor, _, loop_state: neuralmonkey.decoding_function.AttentionLoopState, step: tensorflow.python.framework.ops.Tensor) → typing.Tuple[tensorflow.python.framework.ops.Tensor, neuralmonkey.decoding_function.AttentionLoopState]¶ Put attention masks on att_states_reshaped using hidden_features and query.
-
finalize_loop
(key: str, last_loop_state: neuralmonkey.decoding_function.AttentionLoopState) → None¶
-
get_logits
(y, _)¶
-
initial_loop_state
() → neuralmonkey.decoding_function.AttentionLoopState¶
-
-
class
neuralmonkey.decoding_function.
AttentionLoopState
(contexts, weights)¶ Bases:
tuple
-
contexts
¶ Alias for field number 0
-
weights
¶ Alias for field number 1
-
-
class
neuralmonkey.decoding_function.
BaseAttention
(scope: str, attention_states: tensorflow.python.framework.ops.Tensor, attention_state_size: int, input_weights: tensorflow.python.framework.ops.Tensor = None) → None¶ Bases:
object
-
attention
(decoder_state: tensorflow.python.framework.ops.Tensor, decoder_prev_state: tensorflow.python.framework.ops.Tensor, decoder_input: tensorflow.python.framework.ops.Tensor, loop_state: typing.Any, step: tensorflow.python.framework.ops.Tensor) → typing.Tuple[tensorflow.python.framework.ops.Tensor, typing.Any]¶ Get context vector for given decoder state.
-
finalize_loop
(key: str, last_loop_state: typing.Any) → None¶
-
histories
¶
-
initial_loop_state
() → typing.Any¶ Get initial loop state for the attention object.
-
-
class
neuralmonkey.decoding_function.
CoverageAttention
(attention_states: tensorflow.python.framework.ops.Tensor, scope: str, input_weights: tensorflow.python.framework.ops.Tensor = None, attention_fertility: int = 5) → None¶ Bases:
neuralmonkey.decoding_function.Attention
-
get_logits
(y, weights_in_time)¶
-
-
class
neuralmonkey.decoding_function.
RecurrentAttention
(scope: str, attention_states: tensorflow.python.framework.ops.Tensor, input_weights: tensorflow.python.framework.ops.Tensor, attention_state_size: int, **kwargs) → None¶ Bases:
neuralmonkey.decoding_function.BaseAttention
From the article Recurrent Neural Machine Translation <https://arxiv.org/pdf/1607.08725v1.pdf>
At time i of the decoder with state s_i-1 and encoder states h_j, we run a bidirectional RNN with the initial state set to
c_0 = tanh(V*s_i-1 + b_0)
We then run the GRU network (the paper uses only the forward direction; we run it bidirectionally) and obtain N+1 hidden states c_0 ... c_N.
To compute the context vector, the authors try either the last state or the mean of all the states. The last state was better in their experiments, so that is what we use.
-
attention
(decoder_state: tensorflow.python.framework.ops.Tensor, decoder_prev_state: tensorflow.python.framework.ops.Tensor, _, loop_state: typing.Any, step: tensorflow.python.framework.ops.Tensor) → typing.Tuple[tensorflow.python.framework.ops.Tensor, typing.Any]¶
-
finalize_loop
(key: str, last_loop_state: typing.Any) → None¶
-
initial_loop_state
() → typing.List¶
-
-
neuralmonkey.decoding_function.
empty_attention_loop_state
() → neuralmonkey.decoding_function.AttentionLoopState¶ Create an empty attention loop state.
The attention loop state is a technical object for storing the attention distributions and the context vectors in time. It is used with the
tf.while_loop
dynamic implementation of the decoder. This function returns an empty attention loop state, which means there are two empty arrays: one for the attention distributions in time and one for the attention context vectors in time.
neuralmonkey.functions module¶
-
neuralmonkey.functions.
inverse_sigmoid_decay
(param, rate, min_value: float = 0.0, max_value: float = 1.0, name: typing.Union[str, NoneType] = None, dtype=tf.float32) → tensorflow.python.framework.ops.Tensor¶ Inverse sigmoid decay: k/(k+exp(x/k)).
The result will be scaled to the range (min_value, max_value).
Parameters: - param – The parameter x from the formula.
- rate – Non-negative k from the formula.
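The decay formula can be sketched in plain Python (the library version returns a TensorFlow tensor; the linear scaling into (min_value, max_value) shown here is my reading of the description above):

```python
import math

# Inverse sigmoid decay k / (k + exp(x / k)), rescaled so the result lies
# in (min_value, max_value): near max_value for small x, near min_value
# for large x.
def inverse_sigmoid_decay(x, k, min_value=0.0, max_value=1.0):
    decay = k / (k + math.exp(x / k))
    return min_value + (max_value - min_value) * decay

print(inverse_sigmoid_decay(0.0, 1.0))   # 0.5
print(inverse_sigmoid_decay(10.0, 1.0))  # near zero for large x
```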
-
neuralmonkey.functions.
piecewise_function
(param, values, changepoints, name=None, dtype=tf.float32)¶ A piecewise function.
Parameters: - param – The function parameter.
- values – List of function values (numbers or tensors).
- changepoints – Sorted list of points where the function changes from one value to the next. Must be one item shorter than values.
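A plain-Python sketch of the piecewise lookup (the library version builds a TensorFlow graph; whether the value switches exactly at a changepoint or just after it is an assumption here):

```python
import bisect

# values has one more item than changepoints; the function returns
# values[i] when param falls in the i-th interval.
def piecewise_function(param, values, changepoints):
    assert len(values) == len(changepoints) + 1
    return values[bisect.bisect_right(changepoints, param)]

# learning-rate-style schedule: 1.0 up to step 100, then 0.1, then 0.01
print(piecewise_function(50, [1.0, 0.1, 0.01], [100, 200]))   # 1.0
print(piecewise_function(150, [1.0, 0.1, 0.01], [100, 200]))  # 0.1
```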
neuralmonkey.learning_utils module¶
-
neuralmonkey.learning_utils.
evaluation
(evaluators, dataset, runners, execution_results, result_data)¶ Evaluate the model outputs.
Parameters: - evaluators – List of tuples of series and evaluation functions.
- dataset – Dataset against which the evaluation is done.
- runners – List of runners (contains series ids and loss names).
- execution_results – Execution results that include the loss values.
- result_data – Dictionary from series names to list of outputs.
Returns: Dictionary of evaluation names and their values which includes the metrics applied on respective series loss and loss values from the run.
-
neuralmonkey.learning_utils.
print_final_evaluation
(name: str, eval_result: typing.Dict[str, float]) → None¶ Print final evaluation from a test dataset.
-
neuralmonkey.learning_utils.
run_on_dataset
(tf_manager: neuralmonkey.tf_manager.TensorFlowManager, runners: typing.List[neuralmonkey.runners.base_runner.BaseRunner], dataset: neuralmonkey.dataset.Dataset, postprocess: typing.Union[typing.List[typing.Tuple[str, typing.Callable]], NoneType], write_out: bool = False, batch_size: typing.Union[int, NoneType] = None, log_progress: int = 0) → typing.Tuple[typing.List[neuralmonkey.runners.base_runner.ExecutionResult], typing.Dict[str, typing.List[typing.Any]]]¶ Apply the model on a dataset and optionally write outputs to files.
Parameters: - tf_manager – TensorFlow manager with initialized sessions.
- runners – A function that runs the code
- dataset – The dataset on which the model will be executed.
- evaluators – List of evaluators that are used for the model evaluation if the target data are provided.
- postprocess – an object to use as postprocessing of the
- write_out – Flag whether the outputs should be printed to a file defined in the dataset object.
- batch_size – size of the minibatch
- log_progress – log progress every X seconds
- extra_fetches – Extra tensors to evaluate for each batch.
Returns: Tuple of resulting sentences/numpy arrays, and evaluation results if they are available which are dictionary function -> value.
-
neuralmonkey.learning_utils.
training_loop
(tf_manager: neuralmonkey.tf_manager.TensorFlowManager, epochs: int, trainer: neuralmonkey.trainers.generic_trainer.GenericTrainer, batch_size: int, log_directory: str, evaluators: typing.List[typing.Union[typing.Tuple[str, typing.Any], typing.Tuple[str, str, typing.Any]]], runners: typing.List[neuralmonkey.runners.base_runner.BaseRunner], train_dataset: neuralmonkey.dataset.Dataset, val_dataset: typing.Union[neuralmonkey.dataset.Dataset, typing.List[neuralmonkey.dataset.Dataset]], test_datasets: typing.Union[typing.List[neuralmonkey.dataset.Dataset], NoneType] = None, logging_period: typing.Union[str, int] = 20, validation_period: typing.Union[str, int] = 500, val_preview_input_series: typing.Union[typing.List[str], NoneType] = None, val_preview_output_series: typing.Union[typing.List[str], NoneType] = None, val_preview_num_examples: int = 15, train_start_offset: int = 0, runners_batch_size: typing.Union[int, NoneType] = None, initial_variables: typing.Union[str, typing.List[str], NoneType] = None, postprocess: typing.Union[typing.List[typing.Tuple[str, typing.Callable]], NoneType] = None) → None¶ Perform the training loop for the given graph and data. :param tf_manager: TensorFlowManager with initialized sessions. :param epochs: Number of epochs for which the algorithm will learn. :param trainer: The trainer object containing the TensorFlow code for computing
the loss and the optimization operation. Parameters: - batch_size – number of examples in one mini-batch
- log_directory – Directory where the TensorBoard log will be generated. If None, nothing will be done.
- evaluators – List of evaluators. The last evaluator is used as the main one. An evaluator is a tuple of the name of the generated series, the name of the dataset series against which the generated one is evaluated, and the evaluation function. If only one series name is provided, the generated and dataset series have the same name.
- runners – List of runners for logging and evaluation runs
- train_dataset – Dataset used for training
- val_dataset – Dataset used for validation. Can be a Dataset or a list of datasets. The last dataset is used as the main one for storing the best results. When using multiple datasets, it is recommended to name them for better TensorBoard visualization.
- test_datasets – List of datasets used for testing
- logging_period – after how many batches should the logging happen. It can also be defined as a time period in format like: 3s; 4m; 6h; 1d; 3m15s; 3seconds; 4minutes; 6hours; 1days
- validation_period – after how many batches should the validation happen. It can also be defined as a time period in same format as logging
- val_preview_input_series – which input series to preview in validation
- val_preview_output_series – which output series to preview in validation
- val_preview_num_examples – how many examples should be printed during validation
- train_start_offset – how many lines from the training dataset should be skipped. The training starts from the next batch.
- runners_batch_size – batch size of runners. It is the same as batch_size if not specified
- initial_variables – variables used for initialization, for example for continuation of training
- postprocess – A function which takes the dataset with its output series and generates additional series from them.
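The time-period strings accepted by logging_period and validation_period ("3s", "4m", "6h", "1d", "3m15s", "4minutes", ...) can be parsed as sketched below (a hypothetical helper, not Neural Monkey's actual parser; mapping long unit names by their first letter is an assumption):

```python
import re

# Convert period strings such as "3m15s" or "6hours" into seconds.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def period_to_seconds(period):
    seconds = 0
    for amount, unit in re.findall(r"(\d+)\s*([a-z]+)", period.lower()):
        seconds += int(amount) * _UNITS[unit[0]]  # "seconds" -> "s", etc.
    return seconds

print(period_to_seconds("3m15s"))   # 195
print(period_to_seconds("6hours"))  # 21600
```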
neuralmonkey.logging module¶
-
class
neuralmonkey.logging.
Logging
¶ Bases:
object
-
static
debug
(message: str, label: typing.Union[str, NoneType] = None)¶
-
debug_disabled
= ['']¶
-
debug_enabled
= ['none']¶
-
static
log
(message: str, color: str = 'yellow') → None¶ Log a message with a colored timestamp.
-
log_file
= None¶
-
static
log_print
(text: str) → None¶ Print a string both to the console and to the log file, if one is defined.
-
static
notice
(message: str) → None¶ Log a notice with a colored timestamp.
-
static
print_header
(title: str, path: str) → None¶ Prints the title of the experiment and the set of arguments it uses.
-
static
set_log_file
(path: str) → None¶ Sets up the file where the logging will be done.
-
strict_mode
= None¶
-
static
warn
(message: str) → None¶ Logs a warning.
-
-
neuralmonkey.logging.
debug
(message: str, label: typing.Union[str, NoneType] = None)¶
-
neuralmonkey.logging.
log
(message: str, color: str = 'yellow') → None¶ Log a message with a colored timestamp.
-
neuralmonkey.logging.
log_print
(text: str) → None¶ Print a string both to the console and to the log file, if one is defined.
-
neuralmonkey.logging.
notice
(message: str) → None¶ Log a notice with a colored timestamp.
-
neuralmonkey.logging.
warn
(message: str) → None¶ Logs a warning.
neuralmonkey.run module¶
-
neuralmonkey.run.
default_variable_file
(output_dir)¶
-
neuralmonkey.run.
initialize_for_running
(output_dir, tf_manager, variable_files) → None¶ Restore either the default variables or those specified in the configuration.
Parameters: - output_dir – Training output directory.
- tf_manager – TensorFlow manager.
- variable_files – Files with variables to be restored or None if the default variables should be used.
-
neuralmonkey.run.
main
() → None¶
neuralmonkey.tf_manager module¶
TensorFlow Manager¶
TensorFlow manager is a helper object in Neural Monkey which manages TensorFlow sessions, execution of the computation graph, and saving and restoring of model variables.
-
class
neuralmonkey.tf_manager.
TensorFlowManager
(num_sessions: int, num_threads: int, save_n_best: int = 1, minimize_metric: bool = False, variable_files: typing.Union[typing.List[str], NoneType] = None, gpu_allow_growth: bool = True, per_process_gpu_memory_fraction: float = 1.0, report_gpu_memory_consumption: bool = False, enable_tf_debug: bool = False) → None¶ Bases:
object
Interface between the computational graph, the data, and TF sessions.
-
sessions
¶ List of active TensorFlow sessions.
-
execute
(dataset: neuralmonkey.dataset.Dataset, execution_scripts, train=False, compute_losses=True, summaries=True, batch_size=None, log_progress: int = 0) → typing.List[neuralmonkey.runners.base_runner.ExecutionResult]¶
-
init_saving
(vars_prefix: str) → None¶
-
initialize_model_parts
(runners, save=False) → None¶ Initialize the variables of model parts from their checkpoints.
-
restore
(variable_files: typing.Union[str, typing.List[str]]) → None¶
-
restore_best_vars
() → None¶
-
save
(variable_files: typing.Union[str, typing.List[str]]) → None¶
-
validation_hook
(score: float, epoch: int, batch: int) → None¶
-
neuralmonkey.tf_utils module¶
Small helper functions for TensorFlow.
-
neuralmonkey.tf_utils.
gpu_memusage
() → str¶ Return ‘’ or a string showing current GPU memory usage.
nvidia-smi result parsing based on https://github.com/wookayin/gpustat
-
neuralmonkey.tf_utils.
has_gpu
() → bool¶ Check if TensorFlow can access GPU.
The test is based on https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/platform/test.py, but we are interested only in CUDA GPU devices.
Returns: True, if TF can access the GPU
neuralmonkey.train module¶
This is a training script for sequence to sequence learning.
-
neuralmonkey.train.
create_config
() → neuralmonkey.config.configuration.Configuration¶
-
neuralmonkey.train.
main
() → None¶
neuralmonkey.vocabulary module¶
This module implements the Vocabulary class and the helper functions that can be used to obtain a Vocabulary instance.
-
class
neuralmonkey.vocabulary.
Vocabulary
(tokenized_text: typing.List[str] = None, unk_sample_prob: float = 0.0) → None¶ Bases:
collections.abc.Sized
-
add_tokenized_text
(tokenized_text: typing.List[str]) → None¶ Add words from a list to the vocabulary.
Parameters: tokenized_text – The list of words to add.
-
add_word
(word: str, occurences: int = 1) → None¶ Add a word to the vocabulary.
Parameters: - word – The word to add. If it’s already there, increment the count.
- occurences – Increment the count of the word by this number of occurrences.
-
get_unk_sampled_word_index
(word)¶ Return index of the specified word with sampling of unknown words.
This method returns the index of the specified word in the vocabulary. If the frequency of the word in the vocabulary is 1 (the word was only seen once in the whole training dataset), with probability of self.unk_sample_prob, generate the index of the unknown token instead.
Parameters: word – The word to look up. Returns: Index of the word, index of the unknown token if sampled, or index of the unknown token if the word is not present in the vocabulary.
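The unk-sampling rule described above can be sketched in plain Python (an illustrative stand-in working with tokens; the real Vocabulary returns integer indices):

```python
import random

# Words seen exactly once in training are replaced by the unknown token
# with probability unk_sample_prob; unseen words are always unknown.
def unk_sampled(word, counts, unk_sample_prob, rng=random):
    if word not in counts:
        return "<unk>"
    if counts[word] == 1 and rng.random() < unk_sample_prob:
        return "<unk>"
    return word

counts = {"the": 120, "artichoke": 1}
print(unk_sampled("the", counts, 0.5))  # 'the'
print(unk_sampled("zzz", counts, 0.5))  # '<unk>'
```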
-
get_word_index
(word: str) → int¶ Return index of the specified word.
Parameters: word – The word to look up. Returns: Index of the word or index of the unknown token if the word is not present in the vocabulary.
-
log_sample
(size: int = 5)¶ Log a sample of the vocabulary.
Parameters: size – How many sample words to log.
-
save_wordlist
(path: str, overwrite: bool = False, save_frequencies: bool = False, encoding: str = 'utf-8') → None¶ Save the vocabulary as a wordlist. The file is ordered by the ids of words. This function is used mainly for embedding visualization.
Parameters: - path – The path to save the file to.
- overwrite – Flag whether to overwrite existing file. Defaults to False.
- save_frequencies – Flag whether frequencies should be stored. This parameter adds a header to the output file.
Raises: FileExistsError if the file exists and the overwrite flag is disabled.
-
sentences_to_tensor
(sentences: typing.List[typing.List[str]], max_len: int = None, pad_to_max_len: bool = True, train_mode: bool = False, add_start_symbol: bool = False, add_end_symbol: bool = False) → typing.Tuple[numpy.ndarray, numpy.ndarray]¶ Generate the tensor representation for the provided sentences.
Parameters: - sentences – List of sentences as lists of tokens.
- max_len – If specified, all sentences will be truncated to this length.
- pad_to_max_len – If True, the tensor will be padded to max_len, even if all of the sentences are shorter. If False, the shape of the tensor will be determined by the maximum length of the sentences in the batch.
- train_mode – Flag whether we are training or not (enables/disables unk sampling).
- add_start_symbol – If True, the <s> token will be added to the beginning of each sentence vector. Enabling this option extends the maximum length by one.
- add_end_symbol – If True, the </s> token will be added to the end of each sentence vector, provided that the sentence is shorter than max_len. If not, the end token is not added. Unlike add_start_symbol, enabling this option does not alter the maximum length.
Returns: A tuple of a sentence tensor and a padding weight vector.
The shape of the tensor representing the sentences is either (batch_max_len, batch_size) or (batch_max_len+1, batch_size), depending on the value of the add_start_symbol argument. batch_max_len is the length of the longest sentence in the batch (including the optional </s> token), limited by max_len (if specified).
The shape of the padding vector is the same as of the sentence vector.
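The truncation and padding behaviour can be sketched in plain Python (a simplified stand-in: it keeps tokens instead of vocabulary indices and is batch-major for readability, whereas the real method returns time-major numpy arrays):

```python
# Sentences are truncated to max_len, padded to the batch maximum, and a
# parallel weight matrix marks real tokens with 1 and padding with 0.
PAD = "<pad>"

def sentences_to_padded_batch(sentences, max_len=None):
    truncated = [s[:max_len] if max_len else list(s) for s in sentences]
    batch_max = max(len(s) for s in truncated)
    tensor = [s + [PAD] * (batch_max - len(s)) for s in truncated]
    weights = [[1] * len(s) + [0] * (batch_max - len(s)) for s in truncated]
    return tensor, weights

tensor, weights = sentences_to_padded_batch([["a", "b", "c"], ["d"]])
print(tensor)   # [['a', 'b', 'c'], ['d', '<pad>', '<pad>']]
print(weights)  # [[1, 1, 1], [1, 0, 0]]
```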
-
truncate
(size: int) → None¶ Truncate the vocabulary to the requested size by discarding infrequent tokens.
Parameters: size – The final size of the vocabulary
-
truncate_by_min_freq
(min_freq: int) → None¶ Truncate the vocabulary only keeping words with a minimum frequency.
Parameters: min_freq – The minimum frequency of included words.
-
vectors_to_sentences
(vectors: typing.List[numpy.ndarray]) → typing.List[typing.List[str]]¶ Convert vectors of indexes of vocabulary items to lists of words.
Parameters: vectors – List of vectors of vocabulary indices. Returns: List of lists of words.
-
-
neuralmonkey.vocabulary.
from_bpe
(path: str, encoding: str = 'utf-8') → neuralmonkey.vocabulary.Vocabulary¶ Load a vocabulary from a byte-pair encoding merge list.
NOTE: The frequencies of words in this vocabulary are not computed from data. Instead, they correspond to the number of times the subword units occurred in the BPE merge list. This means that smaller subword units will tend to have larger frequencies assigned, so the vocabulary can be truncated, but not without a great deal of thought.
Parameters: - path – File name to load the vocabulary from.
- encoding – The encoding of the merge file (defaults to UTF-8)
-
neuralmonkey.vocabulary.
from_dataset
(datasets: typing.List[neuralmonkey.dataset.Dataset], series_ids: typing.List[str], max_size: int, save_file: str = None, overwrite: bool = False, min_freq: typing.Union[int, NoneType] = None, unk_sample_prob: float = 0.5) → neuralmonkey.vocabulary.Vocabulary¶ Loads vocabulary from a dataset with an option to save it.
Parameters: - datasets – A list of datasets from which to create the vocabulary
- series_ids – A list of ids of series of the datasets that should be used producing the vocabulary
- max_size – The maximum size of the vocabulary
- save_file – A file to save the vocabulary to. If None (default), the vocabulary will not be saved.
- overwrite – Overwrite existing file.
- min_freq – Do not include words with frequency smaller than this.
- unk_sample_prob – The probability with which to sample unks out of words with frequency 1. Defaults to 0.5.
Returns: The new Vocabulary instance.
-
neuralmonkey.vocabulary.
from_file
(*args, **kwargs) → neuralmonkey.vocabulary.Vocabulary¶
-
neuralmonkey.vocabulary.
from_wordlist
(path: str, encoding: str = 'utf-8', contains_header: bool = True, contains_frequencies: bool = True) → neuralmonkey.vocabulary.Vocabulary¶ Load a vocabulary from a wordlist. The file can contain either a plain list of words with no header, or words and their counts separated by a tab, with a header on the first line.
Parameters: - path – The path to the wordlist file
- encoding – The encoding of the merge file (defaults to UTF-8)
- contains_header – Whether the file has a header on the first line
- contains_frequencies – Whether the file contains frequencies in the second column
Returns: The new Vocabulary instance.
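Reading the two wordlist layouts described above can be sketched as follows (`read_wordlist` is an illustrative stand-in returning a plain dict, not a Vocabulary instance):

```python
import os
import tempfile

# Read either one word per line, or a header line followed by
# word<TAB>count pairs, depending on the flags.
def read_wordlist(path, contains_header=True, contains_frequencies=True,
                  encoding="utf-8"):
    words = {}
    with open(path, encoding=encoding) as handle:
        if contains_header:
            next(handle)                  # skip the header line
        for line in handle:
            if contains_frequencies:
                word, count = line.rstrip("\n").split("\t")
                words[word] = int(count)
            else:
                words[line.rstrip("\n")] = 1
    return words

with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("word\tcount\nthe\t120\ncat\t7\n")
print(read_wordlist(tmp.name))  # {'the': 120, 'cat': 7}
os.remove(tmp.name)
```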
-
neuralmonkey.vocabulary.
initialize_vocabulary
(directory: str, name: str, datasets: typing.List[neuralmonkey.dataset.Dataset] = None, series_ids: typing.List[str] = None, max_size: int = None) → neuralmonkey.vocabulary.Vocabulary¶ Initialize a vocabulary when called from the configuration file. It first checks whether the vocabulary is already present at the provided path and, if not, tries to generate it from the provided datasets.
Parameters: - directory – Directory where the vocabulary should be stored.
- name – Name of the vocabulary, which is also the name of the file in which it is stored.
- datasets – A list of datasets from which the vocabulary can be created.
- series_ids – A list of ids of series of the datasets that should be used for producing the vocabulary.
- max_size – The maximum size of the vocabulary
Returns: The new vocabulary
Module contents¶
The neuralmonkey package is the root package of this project.