neuralmonkey.processors.wordpiece module

Loose reimplementation of the t2t tokenizer.

Original code: https://github.com/tensorflow/tensor2tensor/blob/v1.5.5/tensor2tensor/data_generators/tokenizer.py

Provides a WordpiecePreprocessor, a higher-order function which takes a vocabulary object and returns a preprocessor, and a WordpiecePostprocessor.

Note that the latter is not a higher-order function and can be used directly, without creating a new section in the configuration.

neuralmonkey.processors.wordpiece.WordpiecePostprocessor(sentences: List[List[str]]) → List[List[str]]
neuralmonkey.processors.wordpiece.WordpiecePreprocessor(vocabulary: neuralmonkey.vocabulary.Vocabulary) → Callable[[List[str]], List[str]]
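
For illustration, a minimal usage sketch of the two entry points; the vocab object is assumed to be an already built wordpiece neuralmonkey.vocabulary.Vocabulary instance:

    from neuralmonkey.processors.wordpiece import (
        WordpiecePostprocessor, WordpiecePreprocessor)

    # vocab: a wordpiece Vocabulary instance (assumed to be given)
    preprocess = WordpiecePreprocessor(vocab)

    # Tokens in, wordpieces out.
    pieces = preprocess(["underscores", "everywhere"])

    # The postprocessor needs no vocabulary and works on whole batches
    # of sentences, so it can be used directly.
    decoded = WordpiecePostprocessor([pieces])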
neuralmonkey.processors.wordpiece.escape_token(token: str, alphabet: Set[str]) → str

Escapes the token in the t2t fashion.

Underscores are regarded as end-of-token markers, so they must be escaped. Additionally, out-of-alphabet (OOA) characters are escaped using their Unicode code points.
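
A minimal sketch of the escaping scheme; the exact escape sequences are an assumption based on the referenced t2t code:

    from typing import Set

    def escape_token(token: str, alphabet: Set[str]) -> str:
        # "_" marks the end of a token, so literal underscores (and
        # backslashes) get dedicated escape sequences.
        escaped = token.replace("\\", "\\\\").replace("_", "\\u")
        # Out-of-alphabet characters are written as "\<code point>;".
        chars = [c if c in alphabet and c != "\n" else "\\{};".format(ord(c))
                 for c in escaped]
        # Append the end-of-token marker.
        return "".join(chars) + "_"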

neuralmonkey.processors.wordpiece.get_wordpiece_preprocessor(vocabulary: neuralmonkey.vocabulary.Vocabulary) → Callable[[List[str]], List[str]]
neuralmonkey.processors.wordpiece.unescape_token(escaped_token: str) → str

Inverse function for escape_token.
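
A sketch of the inverse transformation, assuming the escape scheme sketched above:

    import re

    # Matches the three escape forms produced by escape_token.
    _UNESCAPE_REGEX = re.compile(r"\\u|\\\\|\\([0-9]+);")

    def unescape_token(escaped_token: str) -> str:
        # Drop the end-of-token underscore, if present.
        token = (escaped_token[:-1] if escaped_token.endswith("_")
                 else escaped_token)

        def _sub(match):
            if match.group(1) is not None:
                # "\<code point>;" encodes an out-of-alphabet character.
                return chr(int(match.group(1)))
            # "\u" decodes back to "_", "\\" back to "\".
            return "_" if match.group(0) == "\\u" else "\\"

        return _UNESCAPE_REGEX.sub(_sub, token)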

neuralmonkey.processors.wordpiece.wordpiece_decode(sentence: List[str]) → List[str]

Postprocess the wordpieces into a sentence.

First, retokenize the sentence: join the wordpieces and split around underscores. Second, unescape the tokens, discarding any empty tokens encountered.
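
A sketch of this decoding step, using unescape_token as defined above:

    from typing import List

    def wordpiece_decode(sentence: List[str]) -> List[str]:
        # Join the wordpieces, then split on the end-of-token underscores.
        retokenized = "".join(sentence).split("_")
        # Unescape each token, discarding the empty strings produced by
        # trailing or consecutive underscores.
        return [unescape_token(tok) for tok in retokenized if tok]

wordpiece_decode_batch then simply applies this function to every sentence in the batch.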

neuralmonkey.processors.wordpiece.wordpiece_decode_batch(sentences: List[List[str]]) → List[List[str]]
neuralmonkey.processors.wordpiece.wordpiece_encode(sentence: List[str], vocabulary: neuralmonkey.vocabulary.Vocabulary) → List[str]

Convert tokens to subtokens using a vocabulary of subtokens.

A greedy implementation, as in the t2t code referenced above.

Moving through the escaped token from left to right, we repeatedly take the longest prefix that is present in the subtoken vocabulary.
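
A sketch of the greedy segmentation, using escape_token from above; the alphabet attribute and the "in" membership test are assumptions about the Vocabulary interface:

    from typing import List

    def wordpiece_encode(sentence: List[str], vocabulary) -> List[str]:
        pieces = []
        for token in sentence:
            # vocabulary.alphabet is a hypothetical attribute holding
            # the set of in-alphabet characters.
            escaped = escape_token(token, vocabulary.alphabet)
            start = 0
            while start < len(escaped):
                # Greedily take the longest prefix of the remaining
                # string that is a known subtoken.
                for end in range(len(escaped), start, -1):
                    subtoken = escaped[start:end]
                    if subtoken in vocabulary:
                        pieces.append(subtoken)
                        start = end
                        break
                else:
                    # With a character-complete vocabulary this cannot
                    # happen; fail loudly otherwise.
                    raise ValueError(
                        "Uncovered substring: " + escaped[start:])
        return pieces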