neuralmonkey.evaluators.gleu module

class neuralmonkey.evaluators.gleu.GLEUEvaluator(n: int = 4, deduplicate: bool = False, name: str = None) → None

Bases: neuralmonkey.evaluators.evaluator.Evaluator

Sentence-level evaluation metric that correlates with BLEU at the corpus level.

From “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation” by Wu et al. (

GLEU is the minimum of recall and precision of all n-grams up to n in references and hypotheses.

N-gram counts are computed using the BLEU evaluator's methods.
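
As a rough illustration of the definition above, the following minimal sketch computes GLEU for a single tokenized hypothesis against a single reference. The helper names ngram_counts and sentence_gleu are hypothetical and not part of Neural Monkey; the library's own counting and edge-case handling may differ.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], max_n: int) -> Counter:
    """Count all n-grams of orders 1..max_n in a token list."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(hypothesis: List[str],
                  reference: List[str],
                  max_n: int = 4) -> float:
    """Return min(precision, recall) over all n-grams up to max_n."""
    hyp_counts = ngram_counts(hypothesis, max_n)
    ref_counts = ngram_counts(reference, max_n)

    # Counter intersection keeps the minimum count of each shared n-gram.
    matches = sum((hyp_counts & ref_counts).values())
    hyp_total = sum(hyp_counts.values())
    ref_total = sum(ref_counts.values())

    # Assumption: an empty hypothesis or reference scores zero.
    if hyp_total == 0 or ref_total == 0:
        return 0.0

    precision = matches / hyp_total
    recall = matches / ref_total
    return min(precision, recall)
```

Because the score is the minimum of the two ratios, a hypothesis that is too short is penalized through recall and one that is too long through precision.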

__init__(n: int = 4, deduplicate: bool = False, name: str = None) → None

Initialize self. See help(type(self)) for accurate signature.

static gleu(hypotheses: List[List[str]], references: List[List[List[str]]], ngrams: int = 4, case_sensitive: bool = True) → float

Compute GLEU on a corpus with multiple references (no smoothing).

  • hypotheses – List of hypotheses
  • references – List of references. There can be more than one reference.
  • ngrams – Maximum order of n-grams. Default 4.
  • case_sensitive – Perform case-sensitive computation. Default True.

score_batch(hypotheses: List[List[str]], references: List[List[str]]) → float

Score a batch of hypothesis/reference pairs.

The default implementation of this method calls score_instance for each instance in the batch and returns the average score.

  • hypotheses – List of model predictions.
  • references – List of golden outputs.

Returns a float.
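
The averaging behaviour described for the default implementation can be sketched as follows. AveragingEvaluator and its exact-match score_instance are hypothetical stand-ins used only for illustration; they are not the Neural Monkey base class.

```python
from typing import List


class AveragingEvaluator:
    """Hypothetical stand-in for an evaluator with a per-instance score."""

    def score_instance(self,
                       hypothesis: List[str],
                       reference: List[str]) -> float:
        # Toy metric (assumption): 1.0 on exact match, 0.0 otherwise.
        return float(hypothesis == reference)

    def score_batch(self,
                    hypotheses: List[List[str]],
                    references: List[List[str]]) -> float:
        if len(hypotheses) != len(references):
            raise ValueError(
                "hypotheses and references must have the same length")
        # Assumption: an empty batch scores zero.
        if not hypotheses:
            return 0.0
        # Score every pair independently, then average.
        scores = [self.score_instance(hyp, ref)
                  for hyp, ref in zip(hypotheses, references)]
        return sum(scores) / len(scores)
```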

static total_precision_recall(hypotheses: List[List[str]], references_list: List[List[List[str]]], ngrams: int, case_sensitive: bool) → Tuple[float, float]

Compute a modified n-gram precision and recall on a sentence list.

  • hypotheses – List of output sentences as lists of words
  • references_list – List of lists of reference sentences (as lists of words)
  • ngrams – Maximum n-gram order
  • case_sensitive – Whether to perform case-sensitive computation
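
A minimal sketch of how such a corpus-level precision and recall could be accumulated with multiple references per sentence. The function corpus_precision_recall and the helper _counts are hypothetical names. Two choices here are assumptions rather than confirmed library behaviour: references_list[i] is taken to hold all references for hypotheses[i], and references are merged by taking the maximum count of each n-gram.

```python
from collections import Counter
from typing import List, Tuple


def _counts(tokens: List[str], n: int, lowercase: bool) -> Counter:
    """Count n-grams of a single order n, optionally lowercasing tokens."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


def corpus_precision_recall(
        hypotheses: List[List[str]],
        references_list: List[List[List[str]]],
        ngrams: int,
        case_sensitive: bool) -> Tuple[float, float]:
    matched = generated = expected = 0

    for hypothesis, references in zip(hypotheses, references_list):
        for n in range(1, ngrams + 1):
            hyp_counts = _counts(hypothesis, n, not case_sensitive)

            # Assumption: merge references with the per-n-gram maximum.
            ref_counts: Counter = Counter()
            for reference in references:
                ref_counts |= _counts(reference, n, not case_sensitive)

            # Clipped matches: an n-gram counts at most as often as it
            # appears in the merged references.
            matched += sum((hyp_counts & ref_counts).values())
            generated += sum(hyp_counts.values())
            expected += sum(ref_counts.values())

    # Assumption: return zeros when there is nothing to compare.
    if generated == 0 or expected == 0:
        return 0.0, 0.0
    return matched / generated, matched / expected
```

Under these assumptions, taking min(precision, recall) of the returned pair gives a corpus-level GLEU without smoothing.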