neuralmonkey.readers.plain_text_reader module

neuralmonkey.readers.plain_text_reader.T2TReader(files: List[str]) → Iterable[List[str]]
neuralmonkey.readers.plain_text_reader.UtfPlainTextReader(files: List[str]) → Iterable[List[str]]
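
These two module-level names are ready-made reader instances: in the module source, UtfPlainTextReader is the space-splitting reader returned by tokenized_text_reader() and T2TReader the tokenizing reader returned by t2t_tokenized_text_reader(), both with default arguments. A minimal usage sketch (the file path is illustrative):

    from neuralmonkey.readers.plain_text_reader import UtfPlainTextReader

    # Iterate over a tokenized corpus file; each yielded item is the
    # list of tokens of one input line.
    for tokens in UtfPlainTextReader(["data/train.tok.en"]):
        print(tokens)
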
neuralmonkey.readers.plain_text_reader.column_separated_reader(column: int, delimiter: str = '\t', quotechar: str = None, encoding: str = 'utf-8') → Callable[[List[str]], Iterable[List[str]]]

Get a reader for delimiter-separated tokenized text.

Parameters: column – the number of the column to be returned, starting with 1 for the first column.
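
A minimal sketch of building and consuming such a reader; the file name and column index are illustrative:

    from neuralmonkey.readers.plain_text_reader import column_separated_reader

    # Build a reader that returns the second column of a tab-separated
    # file; columns are numbered from 1.
    read_column = column_separated_reader(column=2, delimiter="\t")

    # The factory returns a callable that takes a list of file paths
    # and yields one token list per input line.
    for tokens in read_column(["data/parallel.tsv"]):
        print(tokens)
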
neuralmonkey.readers.plain_text_reader.csv_reader(column: int)

Get a reader for comma-separated tokenized text; in the module source this is a convenience wrapper around column_separated_reader with a comma delimiter.

neuralmonkey.readers.plain_text_reader.string_reader(encoding: str = 'utf-8') → Callable[[List[str]], Iterable[str]]

Get a reader that yields the lines of the input files as decoded strings, without tokenization.

neuralmonkey.readers.plain_text_reader.t2t_tokenized_text_reader(encoding: str = 'utf-8') → Callable[[List[str]], Iterable[List[str]]]

Get a tokenizing reader for plain text.

Tokenization is inspired by the tensor2tensor tokenizer: https://github.com/tensorflow/tensor2tensor/blob/v1.5.5/tensor2tensor/data_generators/text_encoder.py

The text is split into groups of consecutive alphanumeric or non-alphanumeric characters, dropping single spaces inside the text. The goal is to preserve whitespace around unusual characters and whitespace in unusual positions (at the beginning and end of the text).
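
A minimal sketch of this splitting scheme, modeled on the tensor2tensor encoder linked above; the function name and the use of str.isalnum() as the character test are simplifying assumptions, not the module's exact implementation:

    from typing import List

    def t2t_tokenize(text: str) -> List[str]:
        # Split the text into maximal runs of alphanumeric or
        # non-alphanumeric characters.
        if not text:
            return []
        tokens = []
        is_alnum = [c.isalnum() for c in text]
        token_start = 0
        for pos in range(1, len(text)):
            if is_alnum[pos] != is_alnum[pos - 1]:
                token = text[token_start:pos]
                # Drop a single space between tokens, but keep leading
                # whitespace (token_start == 0) and multi-space runs.
                if token != " " or token_start == 0:
                    tokens.append(token)
                token_start = pos
        tokens.append(text[token_start:])  # the trailing run is always kept
        return tokens

    print(t2t_tokenize("Hello, world! "))  # ['Hello', ', ', 'world', '! ']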

neuralmonkey.readers.plain_text_reader.tokenized_text_reader(encoding: str = 'utf-8') → Callable[[List[str]], Iterable[List[str]]]

Get a reader for space-separated tokenized text.
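
A minimal sketch of the factory with a non-default encoding; the encoding and file name are illustrative:

    from neuralmonkey.readers.plain_text_reader import tokenized_text_reader

    # Build a reader for Latin-1 encoded files (the default is UTF-8).
    reader = tokenized_text_reader(encoding="iso-8859-1")
    for tokens in reader(["data/train.tok.de"]):
        # A line "ein kleines Haus" comes back as ['ein', 'kleines', 'Haus'].
        print(tokens)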

neuralmonkey.readers.plain_text_reader.tsv_reader(column: int)

Get a reader for tab-separated tokenized text; in the module source this is a convenience wrapper around column_separated_reader with a tab delimiter.
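
A minimal sketch using the two convenience wrappers; the file names are illustrative:

    from neuralmonkey.readers.plain_text_reader import csv_reader, tsv_reader

    # Both wrappers use 1-based column numbering, like
    # column_separated_reader itself.
    source_reader = csv_reader(column=1)
    target_reader = tsv_reader(column=2)

    for tokens in source_reader(["data/source.csv"]):
        print(tokens)
    for tokens in target_reader(["data/target.tsv"]):
        print(tokens)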