Vocabulary class module.
This module implements the Vocabulary class and the helper functions that can be used to obtain a Vocabulary instance.
Vocabulary(tokenized_text: List[str] = None, unk_sample_prob: float = 0.0) → None¶
__init__(tokenized_text: List[str] = None, unk_sample_prob: float = 0.0) → None¶
Create a new instance of a vocabulary.
- tokenized_text – The initial list of words to add.
- unk_sample_prob – The probability with which to sample the unknown token in place of words with frequency 1.
add_characters(word: str) → None¶
add_tokenized_text(tokenized_text: List[str]) → None¶
Add words from a list to the vocabulary.
Parameters: tokenized_text – The list of words to add.
add_word(word: str, occurences: int = 1) → None¶
Add a word to the vocabulary.
- word – The word to add. If it’s already there, increment the count.
- occurences – Increment the count of the word by this number of occurrences.
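A minimal sketch of the counting behavior described above, using a hypothetical `SimpleVocab` stand-in rather than the actual neuralmonkey implementation:

```python
from collections import Counter

class SimpleVocab:
    """Toy stand-in illustrating count-incrementing add_word."""

    def __init__(self):
        self.counts = Counter()

    def add_word(self, word, occurences=1):
        # Adding a word that is already present only increments its count.
        self.counts[word] += occurences

vocab = SimpleVocab()
vocab.add_word("hello")
vocab.add_word("hello", occurences=2)  # count for "hello" is now 3
```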
Return index of the specified word with sampling of unknown words.
This method returns the index of the specified word in the vocabulary. If the word's frequency in the vocabulary is 1 (i.e. the word was seen only once in the whole training dataset), the index of the unknown token is returned instead with probability self.unk_sample_prob.
Parameters: word – The word to look up. Returns: Index of the word, or the index of the unknown token if the word was sampled out or is not present in the vocabulary.
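The sampling rule above can be sketched as follows. This is an illustrative function with assumed names (`UNK_INDEX`, `sampled_index`), not the actual neuralmonkey method:

```python
import random

UNK_INDEX = 0  # assumed index of the unknown token

def sampled_index(word, word_to_index, frequencies,
                  unk_sample_prob=0.5, rng=random):
    """Sketch of unk-sampled lookup: words seen only once are
    replaced by the unknown token with probability unk_sample_prob."""
    if word not in word_to_index:
        return UNK_INDEX
    if frequencies[word] == 1 and rng.random() < unk_sample_prob:
        return UNK_INDEX
    return word_to_index[word]
```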
get_word_index(word: str) → int¶
Return index of the specified word.
Parameters: word – The word to look up. Returns: Index of the word or index of the unknown token if the word is not present in the vocabulary.
log_sample(size: int = 5) → None¶
Log a sample of the vocabulary.
Parameters: size – How many sample words to log.
save_wordlist(path: str, overwrite: bool = False, save_frequencies: bool = False, encoding: str = 'utf-8') → None¶
Save the vocabulary as a wordlist.
The file is ordered by the ids of words. This function is used mainly for embedding visualization.
- path – The path to save the file to.
- overwrite – Flag whether to overwrite an existing file. Defaults to False.
- save_frequencies – Flag whether frequencies should be stored. If True, a header is added to the output file.
- encoding – The encoding of the output file (defaults to UTF-8).
Raises: FileExistsError if the file exists and the overwrite flag is not set.
sentences_to_tensor(sentences: List[List[str]], max_len: int = None, pad_to_max_len: bool = True, train_mode: bool = False, add_start_symbol: bool = False, add_end_symbol: bool = False) → Tuple[numpy.ndarray, numpy.ndarray]¶
Generate the tensor representation for the provided sentences.
- sentences – List of sentences as lists of tokens.
- max_len – If specified, all sentences will be truncated to this length.
- pad_to_max_len – If True, the tensor will be padded to max_len, even if all of the sentences are shorter. If False, the shape of the tensor will be determined by the maximum length of the sentences in the batch.
- train_mode – Flag whether we are training or not (enables/disables unk sampling).
- add_start_symbol – If True, the <s> token will be added to the beginning of each sentence vector. Enabling this option extends the maximum length by one.
- add_end_symbol – If True, the </s> token will be added to the end of each sentence vector, provided that the sentence is shorter than max_len. If not, the end token is not added. Unlike add_start_symbol, enabling this option does not alter the maximum length.
A tuple of a sentence tensor and a padding weight vector.
The shape of the tensor representing the sentences is either (batch_max_len, batch_size) or (batch_max_len+1, batch_size), depending on the value of the add_start_symbol argument. batch_max_len is the length of the longest sentence in the batch (including the optional </s> token), limited by max_len (if specified).
The shape of the padding vector is the same as of the sentence vector.
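The shape and padding rules above can be illustrated with a small NumPy sketch. The function name, the special-token indices, and the plain-dict vocabulary are assumptions for the example; the real method operates on a `Vocabulary` instance and also handles unk sampling in train mode:

```python
import numpy as np

PAD, START, END, UNK = 0, 1, 2, 3  # assumed special-token indices

def to_tensor(sentences, word_to_index, max_len,
              add_start_symbol=False, add_end_symbol=False):
    """Sketch of the time-major (length, batch) encoding described above."""
    encoded = []
    for sent in sentences:
        ids = [word_to_index.get(w, UNK) for w in sent[:max_len]]
        if add_end_symbol and len(ids) < max_len:
            ids.append(END)  # added only if the sentence fits in max_len
        encoded.append(ids)
    length = max(len(ids) for ids in encoded)
    if add_start_symbol:
        length += 1  # <s> extends the maximum length by one
        encoded = [[START] + ids for ids in encoded]
    tensor = np.full((length, len(sentences)), PAD, dtype=np.int64)
    weights = np.zeros((length, len(sentences)), dtype=np.float32)
    for col, ids in enumerate(encoded):
        tensor[:len(ids), col] = ids
        weights[:len(ids), col] = 1.0
    return tensor, weights
```

Note the time-major layout: the batch dimension is the second axis, and the padding weights are 1.0 on real tokens and 0.0 on padding.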
truncate(size: int) → None¶
Truncate the vocabulary to the requested size.
The infrequent tokens are discarded.
Parameters: size – The final size of the vocabulary
truncate_by_min_freq(min_freq: int) → None¶
Truncate the vocabulary, keeping only words with at least the minimum frequency.
Parameters: min_freq – The minimum frequency of included words.
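Both truncation modes can be sketched over a plain frequency counter. This is an illustrative helper, not the actual implementation (which, in particular, would also preserve special tokens):

```python
from collections import Counter

def truncate_counts(counts, size=None, min_freq=None):
    """Sketch: keep the `size` most frequent words, or only words
    with frequency >= min_freq."""
    items = counts.most_common()
    if min_freq is not None:
        items = [(w, c) for w, c in items if c >= min_freq]
    if size is not None:
        items = items[:size]  # the infrequent tail is discarded
    return Counter(dict(items))
```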
vectors_to_sentences(vectors: Union[List[numpy.ndarray], numpy.ndarray]) → List[List[str]]¶
Convert vectors of indexes of vocabulary items to lists of words.
Parameters: vectors – List of vectors of vocabulary indices. Returns: List of lists of words.
from_dataset(datasets: List[neuralmonkey.dataset.Dataset], series_ids: List[str], max_size: int, save_file: str = None, overwrite: bool = False, min_freq: Union[int, NoneType] = None, unk_sample_prob: float = 0.5) → neuralmonkey.vocabulary.Vocabulary¶
Load a vocabulary from a dataset with an option to save it.
- datasets – A list of datasets from which to create the vocabulary
- series_ids – A list of ids of series of the datasets that should be used for producing the vocabulary.
- max_size – The maximum size of the vocabulary
- save_file – A file to save the vocabulary to. If None (default), the vocabulary will not be saved.
- overwrite – Overwrite existing file.
- min_freq – Do not include words with frequency smaller than this.
- unk_sample_prob – The probability with which to sample unks out of words with frequency 1. Defaults to 0.5.
The new Vocabulary instance.
from_file(*args, **kwargs) → neuralmonkey.vocabulary.Vocabulary¶
from_nematus_json(path: str, max_size: int = None, pad_to_max_size: bool = False) → neuralmonkey.vocabulary.Vocabulary¶
Load vocabulary from Nematus JSON format.
The JSON format is a flat dictionary that maps words to their index in the vocabulary.
- path – Path to the file.
- max_size – Maximum vocabulary size including the 'unk' and 'eos' symbols, but not including the <pad> and <s> symbols.
- pad_to_max_size – If specified, the vocabulary is padded with dummy symbols up to the specified maximum size.
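Since the format is a flat word-to-index dictionary, loading it can be sketched as below. The function name and the dummy-symbol naming are assumptions for illustration:

```python
import json

def load_nematus_vocab(path, max_size=None, pad_to_max_size=False):
    """Sketch of reading the flat word->index JSON described above."""
    with open(path, encoding="utf-8") as f:
        mapping = json.load(f)
    # Order words by their stored index.
    words = [w for w, _ in sorted(mapping.items(), key=lambda kv: kv[1])]
    if max_size is not None:
        words = words[:max_size]
        if pad_to_max_size:
            # Fill up with dummy symbols to reach max_size exactly.
            words += ["<dummy_%d>" % i for i in range(max_size - len(words))]
    return words
```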
from_t2t_vocabulary(path: str, encoding: str = 'utf-8') → neuralmonkey.vocabulary.Vocabulary¶
Load a vocabulary generated during tensor2tensor training.
- path – The path to the vocabulary file.
- encoding – The encoding of the vocabulary file (defaults to UTF-8).
The new Vocabulary instance.
from_wordlist(path: str, encoding: str = 'utf-8', contains_header: bool = True, contains_frequencies: bool = True) → neuralmonkey.vocabulary.Vocabulary¶
Load a vocabulary from a wordlist.
The file can either contain a plain list of words with no header, or it can contain words and their counts separated by tabs, with a header on the first line.
- path – The path to the wordlist file
- encoding – The encoding of the wordlist file (defaults to UTF-8)
- contains_header – Flag whether the file has a header on the first line.
- contains_frequencies – Flag whether the file contains frequencies in the second column.
The new Vocabulary instance.
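Parsing the two wordlist layouts can be sketched as follows (an illustrative helper with an assumed name, not the actual loader):

```python
def read_wordlist(path, encoding="utf-8",
                  contains_header=True, contains_frequencies=True):
    """Sketch of reading the two wordlist layouts described above:
    either plain words, or tab-separated word/count pairs with a header."""
    words, freqs = [], []
    with open(path, encoding=encoding) as f:
        if contains_header:
            next(f)  # skip the header line
        for line in f:
            if contains_frequencies:
                word, count = line.rstrip("\n").split("\t")
                freqs.append(int(count))
            else:
                word = line.rstrip("\n")
            words.append(word)
    return words, freqs
```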
initialize_vocabulary(directory: str, name: str, datasets: List[neuralmonkey.dataset.Dataset] = None, series_ids: List[str] = None, max_size: int = None) → neuralmonkey.vocabulary.Vocabulary¶
Initialize a vocabulary.
This function is supposed to initialize the vocabulary when called from the configuration file. It first checks whether the vocabulary is already stored at the provided path, and if not, it tries to generate it from the provided datasets.
- directory – Directory where the vocabulary should be stored.
- name – Name of the vocabulary, which is also the name of the file it is stored in.
- datasets – A list of datasets from which the vocabulary can be created.
- series_ids – A list of ids of series of the datasets that should be used for producing the vocabulary.
- max_size – The maximum size of the vocabulary
The new Vocabulary instance.
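The load-or-generate flow above can be sketched as below; `build_fn` is a hypothetical stand-in for building the vocabulary from the datasets, and the plain word-per-line file stands in for the real wordlist format:

```python
import os

def initialize_vocab(directory, name, build_fn):
    """Sketch of the check-then-create flow described above."""
    path = os.path.join(directory, name)
    if os.path.exists(path):
        # Vocabulary file already exists: load it instead of rebuilding.
        with open(path, encoding="utf-8") as f:
            return f.read().split()
    vocab = build_fn()
    os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab))
    return vocab
```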
is_special_token(word: str) → bool¶
Check whether word is a special token (such as <pad> or <s>).
Parameters: word – The word to check. Returns: True if the word is special, False otherwise.