neuralmonkey.evaluators package

Submodules

neuralmonkey.evaluators.accuracy module

class neuralmonkey.evaluators.accuracy.AccuracyEvaluator(name: str = 'Accuracy') → None

Bases: object

static compare_scores(score1: float, score2: float) → int
class neuralmonkey.evaluators.accuracy.AccuracySeqLevelEvaluator(name: str = 'AccuracySeqLevel') → None

Bases: object

static compare_scores(score1: float, score2: float) → int
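Judging by their names, the two evaluators above score token-level and whole-sequence accuracy respectively. The following is a minimal stdlib sketch of those two computations, plus the sign-comparison convention that `compare_scores` typically encodes (higher accuracy is better); the function bodies are illustrative, not the library's code.

```python
from typing import List


def token_accuracy(decoded: List[List[str]], references: List[List[str]]) -> float:
    """Fraction of token positions where the decoded token matches the reference."""
    matches = total = 0
    for dec, ref in zip(decoded, references):
        matches += sum(d == r for d, r in zip(dec, ref))
        total += max(len(dec), len(ref))  # length mismatches count as errors
    return matches / total if total else 0.0


def seq_accuracy(decoded: List[List[str]], references: List[List[str]]) -> float:
    """Fraction of sequences that match their reference exactly."""
    if not decoded:
        return 0.0
    return sum(d == r for d, r in zip(decoded, references)) / len(decoded)


def compare_scores(score1: float, score2: float) -> int:
    """Sign convention (assumed): positive if score1 is better, since higher is better."""
    return (score1 > score2) - (score1 < score2)
```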

neuralmonkey.evaluators.average module

class neuralmonkey.evaluators.average.AverageEvaluator(name: str) → None

Bases: object

Just average the numeric output of a runner.

neuralmonkey.evaluators.beer module

class neuralmonkey.evaluators.beer.BeerWrapper(wrapper: str, name: str = 'BEER', encoding: str = 'utf-8') → None

Bases: object

Wrapper for BEER scorer.

Paper: http://aclweb.org/anthology/D14-1025 Code: https://github.com/stanojevic/beer

serialize_to_bytes(sentences: typing.List[typing.List[str]]) → bytes

neuralmonkey.evaluators.bleu module

class neuralmonkey.evaluators.bleu.BLEUEvaluator(n: int = 4, deduplicate: bool = False, name: typing.Union[str, NoneType] = None) → None

Bases: object

static bleu(hypotheses: typing.List[typing.List[str]], references: typing.List[typing.List[typing.List[str]]], ngrams: int = 4, case_sensitive: bool = True)

Compute BLEU on a corpus with multiple references.

The n-grams are uniformly weighted.

By default, smoothing is applied as in the reference implementation at: https://github.com/ufal/qtleap/blob/master/cuni_train/bin/mteval-v13a.pl#L831-L873

Parameters:
  • hypotheses – List of hypotheses
  • references – List of references. There can be more than one reference.
  • ngrams – Maximum order of n-grams. Default 4.
  • case_sensitive – Perform case-sensitive computation. Default True.
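As a compact illustration of what the `bleu` method computes, here is a self-contained sketch of corpus-level BLEU with uniform n-gram weights, per-sentence count clipping against multiple references, and the standard brevity penalty. Note that it omits the mteval-style smoothing the library applies by default, so it is a simplified re-implementation, not the library's code.

```python
import math
from collections import Counter
from typing import List


def _ngram_counts(sentence: List[str], n: int) -> Counter:
    return Counter(tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1))


def bleu(hypotheses: List[List[str]],
         references: List[List[List[str]]],
         ngrams: int = 4) -> float:
    """Corpus-level BLEU, uniform weights, no smoothing."""
    log_precision_sum = 0.0
    for n in range(1, ngrams + 1):
        matched = total = 0
        for hyp, refs in zip(hypotheses, references):
            hyp_counts = _ngram_counts(hyp, n)
            max_ref = Counter()
            for ref in refs:
                max_ref |= _ngram_counts(ref, n)  # Counter union keeps per-key maxima
            # Clip hypothesis counts by the best count over all references.
            matched += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total += sum(hyp_counts.values())
        if matched == 0:
            return 0.0
        log_precision_sum += math.log(matched / total)
    hyp_len = sum(len(h) for h in hypotheses)
    # Effective reference length: per sentence, the reference length closest
    # to the hypothesis length (shorter wins ties).
    ref_len = sum(min((len(r) for r in refs), key=lambda l: (abs(l - len(hyp)), l))
                  for hyp, refs in zip(hypotheses, references))
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_precision_sum / ngrams)
```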
static compare_scores(score1: float, score2: float) → int
static deduplicate_sentences(sentences: typing.List[typing.List[str]]) → typing.List[typing.List[str]]
static effective_reference_length(hypotheses: typing.List[typing.List[str]], references_list: typing.List[typing.List[typing.List[str]]]) → int

Compute the effective reference corpus length.

The effective reference corpus length is based on best match length.

Parameters:
  • hypotheses – List of output sentences as lists of words
  • references_list – List of lists of references (as lists of words)
static merge_max_counters(counters: typing.List[collections.Counter]) → collections.Counter

Merge counters using maximum values.
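Merging by maximum is the clipping step used when scoring against multiple references. A one-line sketch, relying on the fact that `Counter` union (`|`) keeps the per-key maximum:

```python
from collections import Counter
from functools import reduce
from typing import List


def merge_max_counters(counters: List[Counter]) -> Counter:
    """Merge counters, keeping the maximum count for each key."""
    return reduce(lambda a, b: a | b, counters, Counter())
```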

static minimum_reference_length(hypotheses: typing.List[typing.List[str]], references_list: typing.List[typing.List[str]]) → int

Compute the minimum reference corpus length.

The minimum reference corpus length is based on the shortest reference sentence length.

Parameters:
  • hypotheses – List of output sentences as lists of words
  • references_list – List of lists of references (as lists of words)
static modified_ngram_precision(hypotheses: typing.List[typing.List[str]], references_list: typing.List[typing.List[typing.List[str]]], n: int, case_sensitive: bool) → typing.Tuple[float, int]

Compute the modified n-gram precision on a list of sentences.

Parameters:
  • hypotheses – List of output sentences as lists of words
  • references_list – List of lists of reference sentences (as lists of words)
  • n – n-gram order
  • case_sensitive – Whether to perform case-sensitive computation
static ngram_counts(sentence: typing.List[str], n: int, lowercase: bool, delimiter: str = ' ') → collections.Counter

Get n-grams from a sentence.

Parameters:
  • sentence – Sentence as a list of words
  • n – n-gram order
  • lowercase – Convert n-grams to lowercase
  • delimiter – Delimiter used to join words into counter entries
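Following the documented signature, a plausible sketch of `ngram_counts` (the body is illustrative, not the library's code):

```python
from collections import Counter
from typing import List


def ngram_counts(sentence: List[str], n: int,
                 lowercase: bool, delimiter: str = " ") -> Counter:
    """Count n-grams in a tokenized sentence, keyed by delimiter-joined strings."""
    words = [w.lower() for w in sentence] if lowercase else sentence
    return Counter(delimiter.join(words[i:i + n])
                   for i in range(len(words) - n + 1))
```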

neuralmonkey.evaluators.bleu_ref module

class neuralmonkey.evaluators.bleu_ref.BLEUReferenceImplWrapper(wrapper, name='BLEU', encoding='utf-8')

Bases: object

Wrapper for TectoMT’s wrapper for reference NIST and BLEU scorer.

serialize_to_bytes(sentences: typing.List[typing.List[str]]) → bytes

neuralmonkey.evaluators.chrf module

class neuralmonkey.evaluators.chrf.ChrFEvaluator(n: int = 6, beta: float = 1, ignored_symbols: typing.Union[typing.List[str], NoneType] = None, name: typing.Union[str, NoneType] = None) → None

Bases: object

Compute ChrF score.

See http://www.statmt.org/wmt15/pdf/WMT49.pdf
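chrF is an F-score over character n-grams, averaged over orders 1..n, with beta weighting recall against precision. The sketch below shows that core computation on a single sentence pair; how the library handles `ignored_symbols`, whitespace, and corpus-level aggregation is not shown here.

```python
from collections import Counter


def _char_ngrams(text: str, n: int) -> Counter:
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(hypothesis: str, reference: str, n: int = 6, beta: float = 1.0) -> float:
    """Average character n-gram F-beta score over orders 1..n."""
    scores = []
    for order in range(1, n + 1):
        hyp, ref = _char_ngrams(hypothesis, order), _char_ngrams(reference, order)
        if not hyp or not ref:
            continue  # sentence shorter than this n-gram order
        overlap = sum((hyp & ref).values())  # clipped matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        b2 = beta * beta
        scores.append((1 + b2) * precision * recall / (b2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0
```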

neuralmonkey.evaluators.edit_distance module

class neuralmonkey.evaluators.edit_distance.EditDistanceEvaluator(name: str = 'Edit distance') → None

Bases: object

static compare_scores(score1: float, score2: float) → int
static ratio(str1: str, str2: str) → float
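A stdlib sketch of the two methods: Levenshtein distance via dynamic programming, and a similarity ratio in [0, 1] derived from it. The exact normalization the library uses is an assumption; here it is distance over the longer string.

```python
def edit_distance(str1: str, str2: str) -> int:
    """Levenshtein distance via a two-row dynamic program."""
    prev = list(range(len(str2) + 1))
    for i, c1 in enumerate(str1, 1):
        curr = [i]
        for j, c2 in enumerate(str2, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (c1 != c2)))      # substitution
        prev = curr
    return prev[-1]


def ratio(str1: str, str2: str) -> float:
    """Similarity in [0, 1]: 1.0 for identical strings."""
    if not str1 and not str2:
        return 1.0
    return 1.0 - edit_distance(str1, str2) / max(len(str1), len(str2))
```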

neuralmonkey.evaluators.f1_bio module

class neuralmonkey.evaluators.f1_bio.F1Evaluator(name: str = 'F1 measure') → None

Bases: object

F1 evaluator for BIO tagging, e.g. NP chunking.

Entities are annotated with B (beginning of an entity) and I (continuation of an entity); all remaining tokens are labeled O (outside any entity).

static chunk2set(seq: typing.List[str]) → typing.Set[str]
static f1_score(decoded: typing.List[str], reference: typing.List[str]) → float
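The following sketch shows the usual shape of span-level F1 for BIO tagging: decode each tag sequence into a set of entity spans, then score exact span matches. Here spans are (start, end) index pairs rather than whatever string encoding `chunk2set` actually uses; the logic is illustrative.

```python
from typing import List, Set, Tuple


def bio_chunks(seq: List[str]) -> Set[Tuple[int, int]]:
    """Extract entity spans (start, end) from a BIO tag sequence."""
    chunks, start = set(), None
    for i, tag in enumerate(seq):
        if tag == "B":                  # a new entity begins (closing any open one)
            if start is not None:
                chunks.add((start, i))
            start = i
        elif tag == "O":                # outside: close any open entity
            if start is not None:
                chunks.add((start, i))
            start = None
        # tag == "I" continues the current entity, if any
    if start is not None:
        chunks.add((start, len(seq)))
    return chunks


def f1_score(decoded: List[str], reference: List[str]) -> float:
    """F1 over exact-match entity spans."""
    dec, ref = bio_chunks(decoded), bio_chunks(reference)
    if not dec or not ref:
        return 0.0
    tp = len(dec & ref)
    precision, recall = tp / len(dec), tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```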

neuralmonkey.evaluators.gleu module

class neuralmonkey.evaluators.gleu.GLEUEvaluator(n: int = 4, deduplicate: bool = False, name: typing.Union[str, NoneType] = None) → None

Bases: object

Sentence-level evaluation metric that correlates with BLEU at the corpus level.

From “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation” by Wu et al. (https://arxiv.org/pdf/1609.08144v2.pdf)

GLEU is the minimum of recall and precision of all n-grams up to n in references and hypotheses.

N-gram counts are computed with the same methods as in the BLEU evaluator.

static gleu(hypotheses: typing.List[typing.List[str]], references: typing.List[typing.List[typing.List[str]]], ngrams: int = 4, case_sensitive: bool = True) → float

Compute GLEU on a corpus with multiple references (no smoothing).

Parameters:
  • hypotheses – List of hypotheses
  • references – List of references. There can be more than one reference.
  • ngrams – Maximum order of n-grams. Default 4.
  • case_sensitive – Perform case-sensitive computation. Default True.
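A compact sketch of the GLEU idea stated above: collect all n-grams up to order n for hypotheses and references, count clipped matches, and take the minimum of precision and recall. How the original GNMT formulation aggregates multiple references differs in detail; here reference counts are merged by per-key maximum, so treat this as an approximation.

```python
from collections import Counter
from typing import List


def _all_ngrams(sentence: List[str], max_n: int) -> Counter:
    """All n-grams of orders 1..max_n, pooled into one counter."""
    return Counter(tuple(sentence[i:i + n])
                   for n in range(1, max_n + 1)
                   for i in range(len(sentence) - n + 1))


def gleu(hypotheses: List[List[str]],
         references: List[List[List[str]]],
         ngrams: int = 4) -> float:
    """GLEU: min of pooled n-gram precision and recall (no smoothing)."""
    matched = hyp_total = ref_total = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_counts = _all_ngrams(hyp, ngrams)
        best = Counter()
        for ref in refs:
            best |= _all_ngrams(ref, ngrams)   # per-key maxima over references
        matched += sum((hyp_counts & best).values())  # clipped matches
        hyp_total += sum(hyp_counts.values())
        ref_total += sum(best.values())
    if hyp_total == 0 or ref_total == 0:
        return 0.0
    return min(matched / hyp_total, matched / ref_total)
```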
static total_precision_recall(hypotheses: typing.List[typing.List[str]], references_list: typing.List[typing.List[typing.List[str]]], ngrams: int, case_sensitive: bool) → typing.Tuple[float, float]

Compute a modified n-gram precision and recall on a sentence list.

Parameters:
  • hypotheses – List of output sentences as lists of words
  • references_list – List of lists of reference sentences (as lists of words)
  • ngrams – n-gram order
  • case_sensitive – Whether to perform case-sensitive computation

neuralmonkey.evaluators.mse module

class neuralmonkey.evaluators.mse.MeanSquaredErrorEvaluator(name: str = 'MeanSquaredError') → None

Bases: object

static compare_scores(score1: float, score2: float) → int

neuralmonkey.evaluators.multeval module

class neuralmonkey.evaluators.multeval.MultEvalWrapper(wrapper: str, name: str = 'MultEval', encoding: str = 'utf-8', metric: str = 'bleu', language: str = 'en') → None

Bases: object

Wrapper for mult-eval’s reference BLEU and METEOR scorer.

serialize_to_bytes(sentences: typing.List[typing.List[str]]) → bytes

neuralmonkey.evaluators.ter module

class neuralmonkey.evaluators.ter.TEREvaluator(name: str = 'TER') → None

Bases: object

Compute TER using the pyter library.

neuralmonkey.evaluators.wer module

class neuralmonkey.evaluators.wer.WEREvaluator(name: str = 'WER') → None

Bases: object

Compute WER (word error rate, used in speech recognition).
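WER is the word-level edit distance between hypothesis and reference, normalized by the reference length. A minimal single-pair sketch (the library's corpus-level aggregation is not shown):

```python
from typing import List


def wer(hypothesis: List[str], reference: List[str]) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    prev = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, 1):
        curr = [i]
        for j, ref_word in enumerate(reference, 1):
            curr.append(min(prev[j] + 1,                            # deletion
                            curr[j - 1] + 1,                        # insertion
                            prev[j - 1] + (hyp_word != ref_word)))  # substitution
        prev = curr
    if not reference:
        return float(bool(hypothesis))
    return prev[-1] / len(reference)
```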

Module contents