neuralmonkey.evaluators.gleu module

class neuralmonkey.evaluators.gleu.GLEUEvaluator(n: int = 4, deduplicate: bool = False, name: str = None) → None

Bases: neuralmonkey.evaluators.evaluator.Evaluator

Sentence-level evaluation metric that correlates well with corpus-level BLEU.

From “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation” by Wu et al. (https://arxiv.org/pdf/1609.08144v2.pdf)

GLEU is the minimum of recall and precision of all n-grams up to n in references and hypotheses.

N-gram counts are based on the methods of the BLEU evaluator.
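The definition above can be illustrated with a minimal, self-contained sketch that does not use the library. It pools all n-grams of orders 1 up to n, counts the clipped overlap between hypothesis and reference, and returns the minimum of precision and recall (the function names here are illustrative, not Neural Monkey's API):

```python
from collections import Counter
from typing import List


def pooled_ngrams(tokens: List[str], max_n: int) -> Counter:
    """Collect all n-grams of orders 1..max_n into one multiset."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(hypothesis: List[str],
                  reference: List[str],
                  max_n: int = 4) -> float:
    """GLEU for one sentence pair: min of n-gram precision and recall."""
    hyp_counts = pooled_ngrams(hypothesis, max_n)
    ref_counts = pooled_ngrams(reference, max_n)
    # Counter intersection keeps the element-wise minimum,
    # i.e. clipped match counts.
    matched = sum((hyp_counts & ref_counts).values())
    precision = matched / max(sum(hyp_counts.values()), 1)
    recall = matched / max(sum(ref_counts.values()), 1)
    return min(precision, recall)
```

For identical sentences both precision and recall are 1, so the score is 1.0; for sentences sharing no n-grams it is 0.0.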

__init__(n: int = 4, deduplicate: bool = False, name: str = None) → None

Initialize self. See help(type(self)) for accurate signature.

static gleu(hypotheses: List[List[str]], references: List[List[List[str]]], ngrams: int = 4, case_sensitive: bool = True) → float

Compute GLEU on a corpus with multiple references (no smoothing).

Parameters:
  • hypotheses – List of hypotheses
  • references – List of references. There can be more than one reference.
  • ngrams – Maximum order of n-grams. Default 4.
  • case_sensitive – Perform case-sensitive computation. Default True.
score_batch(hypotheses: List[List[str]], references: List[List[str]]) → float

Score a batch of hyp/ref pairs.

The default implementation of this method calls score_instance for each instance in the batch and returns the average score.

Parameters:
  • hypotheses – List of model predictions.
  • references – List of golden outputs.
Returns:

A float.
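The default averaging behaviour described above can be sketched without the library; `score_instance` here stands in for the evaluator's per-instance scoring and is passed in as a plain function for illustration:

```python
from typing import Callable, List


def score_batch(hypotheses: List[List[str]],
                references: List[List[str]],
                score_instance: Callable[[List[str], List[str]], float]
                ) -> float:
    """Score each hyp/ref pair and return the average score."""
    scores = [score_instance(hyp, ref)
              for hyp, ref in zip(hypotheses, references)]
    return sum(scores) / len(scores)
```

For example, with an exact-match instance scorer, a batch where one of two hypotheses matches its reference averages to 0.5.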

static total_precision_recall(hypotheses: List[List[str]], references_list: List[List[List[str]]], ngrams: int, case_sensitive: bool) → Tuple[float, float]

Compute a modified n-gram precision and recall on a sentence list.

Parameters:
  • hypotheses – List of output sentences as lists of words
  • references_list – List of lists of reference sentences (as lists of words)
  • ngrams – n-gram order
  • case_sensitive – Whether to perform case-sensitive computation
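A self-contained sketch of corpus-level precision and recall with multiple references follows. One plausible reading, assumed here rather than taken from the library, is to clip each hypothesis n-gram count by its maximum count over all references, and to use that same element-wise maximum as the recall denominator:

```python
from collections import Counter
from typing import List, Tuple


def extract_ngrams(tokens: List[str], max_n: int) -> Counter:
    """Pool all n-grams of orders 1..max_n into one multiset."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def total_precision_recall(
        hypotheses: List[List[str]],
        references_list: List[List[List[str]]],
        max_n: int,
        case_sensitive: bool) -> Tuple[float, float]:
    """Corpus-level clipped n-gram precision and recall (sketch)."""
    matched = hyp_total = ref_total = 0
    for hyp, refs in zip(hypotheses, references_list):
        if not case_sensitive:
            hyp = [t.lower() for t in hyp]
            refs = [[t.lower() for t in ref] for ref in refs]
        hyp_counts = extract_ngrams(hyp, max_n)
        # Counter union keeps the element-wise maximum over references.
        ref_counts: Counter = Counter()
        for ref in refs:
            ref_counts |= extract_ngrams(ref, max_n)
        # Counter intersection gives clipped match counts.
        matched += sum((hyp_counts & ref_counts).values())
        hyp_total += sum(hyp_counts.values())
        ref_total += sum(ref_counts.values())
    precision = matched / hyp_total if hyp_total else 0.0
    recall = matched / ref_total if ref_total else 0.0
    return precision, recall
```

When every hypothesis equals its single reference, both values are 1.0; case-insensitive mode lowercases tokens before counting.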