neuralmonkey.processors.wordpiece module
Loose reimplementation of the t2t tokenizer.
Original code: https://github.com/tensorflow/tensor2tensor/blob/v1.5.5/tensor2tensor/data_generators/tokenizer.py
Provides a WordpiecePreprocessor, a higher-order function which takes a vocabulary object and returns a preprocessor, and a WordpiecePostprocessor.
Note that the latter is not a higher-order function and can be used directly, without creating a new section in the configuration.
neuralmonkey.processors.wordpiece.WordpiecePostprocessor(sentences: List[List[str]]) → List[List[str]]
neuralmonkey.processors.wordpiece.WordpiecePreprocessor(vocabulary: neuralmonkey.vocabulary.Vocabulary) → Callable[[List[str]], List[str]]
neuralmonkey.processors.wordpiece.escape_token(token: str, alphabet: Set[str]) → str

Escape the token in the t2t fashion.

Underscores mark the end of a token, so they must be escaped. Additionally, out-of-alphabet (OOA) characters are escaped using their Unicode code points.
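The escaping scheme can be sketched as follows. This is a hedged reimplementation based on the referenced t2t tokenizer (backslash doubling, underscore → "\u", OOA characters → "\&lt;code point&gt;;"), not necessarily the exact neuralmonkey code:

```python
from typing import Set

def escape_token(token: str, alphabet: Set[str]) -> str:
    """Escape a token in the t2t fashion (sketch)."""
    # Escape backslashes first so the escape sequences stay unambiguous,
    # then turn underscores into the "\u" escape.
    token = token.replace("\\", "\\\\").replace("_", "\\u")
    # Out-of-alphabet characters (and newlines) become "\<code point>;".
    chars = [c if c in alphabet and c != "\n" else "\\%d;" % ord(c)
             for c in token]
    # A trailing underscore marks the end of the token.
    return "".join(chars) + "_"
```

For example, escaping "abc" against an alphabet containing those letters simply appends the end-of-token underscore, yielding "abc_".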
neuralmonkey.processors.wordpiece.get_wordpiece_preprocessor(vocabulary: neuralmonkey.vocabulary.Vocabulary) → Callable[[List[str]], List[str]]
neuralmonkey.processors.wordpiece.unescape_token(escaped_token: str) → str

Inverse function for escape_token.
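A matching sketch of the inverse, using a regular expression over the three escape forms. Again, this follows the referenced t2t code and may differ in detail from the actual neuralmonkey source:

```python
import re

# The three escape forms produced by the escaping step, tried in order.
_UNESCAPE_RE = re.compile(r"\\u|\\\\|\\([0-9]+);")

def unescape_token(escaped_token: str) -> str:
    """Invert escape_token; the trailing "_" is assumed already stripped."""
    def _sub(match):
        if match.group(1) is not None:
            # "\<code point>;" -> the character with that code point.
            return chr(int(match.group(1)))
        # "\u" -> "_", "\\" -> "\".
        return "_" if match.group(0) == "\\u" else "\\"
    return _UNESCAPE_RE.sub(_sub, escaped_token)
```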
neuralmonkey.processors.wordpiece.wordpiece_decode(sentence: List[str]) → List[str]

Postprocess the wordpieces into a sentence.

First, retokenize the sentence: join the pieces and split around underscores. Second, unescape the tokens, discarding any empty tokens encountered.
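The two steps can be sketched like this; `_unescape` is a local stand-in for the module's unescape_token, included only to keep the sketch self-contained:

```python
import re
from typing import List

_UNESCAPE_RE = re.compile(r"\\u|\\\\|\\([0-9]+);")

def _unescape(escaped: str) -> str:
    # Minimal inverse of the t2t escaping (stand-in for unescape_token).
    def _sub(match):
        if match.group(1) is not None:
            return chr(int(match.group(1)))
        return "_" if match.group(0) == "\\u" else "\\"
    return _UNESCAPE_RE.sub(_sub, escaped)

def wordpiece_decode(sentence: List[str]) -> List[str]:
    """Postprocess wordpieces back into a token sequence (sketch)."""
    # 1. Retokenize: join the pieces and split around the "_" end-of-token
    #    markers appended by the escaping step.
    retokenized = "".join(sentence).split("_")
    # 2. Unescape each token, throwing away empty tokens from the split.
    return [_unescape(tok) for tok in retokenized if tok]
```

For instance, the pieces ["hell", "o_", "wor", "ld_"] join to "hello_world_", which splits back into the two original tokens.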
neuralmonkey.processors.wordpiece.wordpiece_decode_batch(sentences: List[List[str]]) → List[List[str]]
neuralmonkey.processors.wordpiece.wordpiece_encode(sentence: List[str], vocabulary: neuralmonkey.vocabulary.Vocabulary) → List[str]

Convert tokens to subtokens using a vocabulary of subtokens.

A greedy implementation, as in the t2t tokenizer referenced above: we search left to right for the longest subtoken available in the vocabulary.
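The greedy longest-match search can be sketched as follows. For this sketch, `vocab` is a plain set of subtoken strings standing in for the neuralmonkey Vocabulary object, and each input token is assumed to already carry the trailing "_" added by the escaping step:

```python
from typing import List, Set

def wordpiece_encode(sentence: List[str], vocab: Set[str]) -> List[str]:
    """Greedily split tokens into the longest known subtokens (sketch)."""
    subtokens: List[str] = []
    for token in sentence:
        start = 0
        while start < len(token):
            # Scan from the longest possible candidate down to one character.
            for end in range(len(token), start, -1):
                if token[start:end] in vocab:
                    subtokens.append(token[start:end])
                    start = end
                    break
            else:
                # The real implementation guarantees every single character
                # is in the subtoken vocabulary, so this cannot happen there.
                raise ValueError("cannot encode token: %r" % token)
    return subtokens
```

With a vocabulary containing "hell", "o_", "wor", and "ld_", the tokens "hello_" and "world_" split into those four subtokens: at each position the longest matching entry wins.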