nltk.translate.bleu_score module

BLEU score implementation.

class SmoothingFunction

Bases: object

This is an implementation of the smoothing techniques for segment-level BLEU scores that were presented in Boxing Chen and Colin Cherry (2014) A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In WMT14. http://acl2014.org/acl2014/W14-33/pdf/W14-3346.pdf

This will initialize the parameters required for the various smoothing techniques, the default values are set to the numbers used in the experiments from Chen and Cherry (2014).

epsilon ( float ) – the epsilon value used in method 1

alpha ( int ) – the alpha value used in method 6

k ( int ) – the k value used in method 4

Smoothing method 0: No smoothing.

Smoothing method 1: Add epsilon counts to precision with 0 counts.

Smoothing method 2: Add 1 to both numerator and denominator from Chin-Yew Lin and Franz Josef Och (2004) ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation. In COLING 2004.

Smoothing method 3: NIST geometric sequence smoothing. The smoothing is computed by taking 1 / ( 2^k ), instead of 0, for each precision score whose matching n-gram count is null. k is 1 for the first ‘n’ value for which the n-gram match count is null, and increases by 1 for each subsequent order whose match count is also null.

For example, if the text contains:

one 2-gram match

and (consequently) two 1-gram matches

the n-gram count for each individual precision score would be:

n=1 => prec_count = 2 (two unigrams)

n=2 => prec_count = 1 (one bigram)

n=3 => prec_count = 1/2 (no trigram, taking ‘smoothed’ value of 1 / ( 2^k ), with k=1)

n=4 => prec_count = 1/4 (no fourgram, taking ‘smoothed’ value of 1 / ( 2^k ), with k=2)

Smoothing method 4: Shorter translations may have inflated precision values due to having smaller denominators; therefore, we give them proportionally smaller smoothed counts. Instead of scaling to 1/(2^k), Chen and Cherry (2014) suggest dividing by 1/ln(len(T)), where len(T) is the length of the translation.

Smoothing method 5: The matched counts for similar values of n should be similar. To calculate the n-gram matched count, it averages the n−1, n and n+1 gram matched counts.

Smoothing method 6: Interpolates the maximum likelihood estimate of the precision p_n with a prior estimate pi0. The prior is estimated by assuming that the ratio between p_n and p_{n-1} will be the same as that between p_{n-1} and p_{n-2}; from Gao and He (2013) Training MRF-Based Phrase Translation Models using Gradient Ascent. In NAACL.

Smoothing method 7: Interpolates methods 4 and 5.
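
A minimal usage sketch (the tokens below are illustrative placeholders, not the module's own doctest): construct a SmoothingFunction and pass one of its method0 through method7 callables to sentence_bleu via the smoothing_function argument.

>>> from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
>>> reference = ['the', 'cat', 'sat', 'on', 'the', 'mat']
>>> hypothesis = ['the', 'cat', 'is', 'on', 'the', 'mat']
>>> chencherry = SmoothingFunction()  # epsilon, alpha and k may also be passed explicitly
>>> sentence_bleu([reference], hypothesis, smoothing_function=chencherry.method1)  # doctest: +SKIP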

Calculate brevity penalty.

Because modified n-gram precision alone still rewards overly short sentences, the brevity penalty is used to adjust the overall BLEU score according to length.

An example from the paper: there are three references with lengths 12, 15 and 17, and a hypothesis of length 12; the brevity penalty is 1.

In case a hypothesis translation is shorter than the references, penalty is applied.

The length of the closest reference is used to compute the penalty. If the length of a hypothesis is 12, and the reference lengths are 13 and 2, the penalty is applied because the hypothesis length (12) is less than the closest reference length (13).

The brevity penalty doesn’t depend on reference order. More importantly, when two reference sentences are at the same distance, the shortest reference sentence length is used.

A test example from mteval-v13a.pl (starting at line 705) appears in the doctests reproduced later on this page.

hyp_len ( int ) – The length of the hypothesis for a single sentence OR the sum of all the hypotheses’ lengths for a corpus

closest_ref_len ( int ) – The length of the closest reference for a single hypothesis OR the sum of the closest reference lengths for every hypothesis in a corpus.

BLEU’s brevity penalty.

This function finds the reference that is the closest length to the hypothesis. The closest reference length is referred to as the variable r in the brevity penalty formula in Papineni et al. (2002).

references ( list ( list ( str ) ) ) – A list of reference translations.

hyp_len ( int ) – The length of the hypothesis.

The length of the reference that’s closest to the hypothesis.

Calculate a single corpus-level BLEU score (aka. system-level BLEU) for all the hypotheses and their respective references.

Instead of averaging the sentence level BLEU scores (i.e. macro-average precision), the original BLEU metric (Papineni et al. 2002) accounts for the micro-average precision (i.e. summing the numerators and denominators for each hypothesis-reference(s) pairs before the division).

The example below shows that corpus_bleu() differs from averaging sentence_bleu() over the hypotheses.
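
A sketch of the difference (the sentence pairs are illustrative and no exact scores are shown; the first call is the micro-averaged corpus score, the second the macro-average of per-sentence scores, and the two generally differ):

>>> from nltk.translate.bleu_score import corpus_bleu, sentence_bleu
>>> hyp1 = 'It is a guide to action which ensures that the military always obeys the commands of the party'.split()
>>> ref1a = 'It is a guide to action that ensures that the military will forever heed Party commands'.split()
>>> ref1b = 'It is the guiding principle which guarantees the military forces always being under the command of the Party'.split()
>>> ref1c = 'It is the practical guide for the army always to heed the directions of the party'.split()
>>> hyp2 = 'he read the book because he was interested in world history'.split()
>>> ref2a = 'he was interested in world history because he read the book'.split()
>>> list_of_references = [[ref1a, ref1b, ref1c], [ref2a]]
>>> hypotheses = [hyp1, hyp2]
>>> corpus_bleu(list_of_references, hypotheses)  # doctest: +SKIP
>>> (sentence_bleu([ref1a, ref1b, ref1c], hyp1) + sentence_bleu([ref2a], hyp2)) / 2  # doctest: +SKIP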

Custom weights may be supplied to fine-tune the BLEU score further. A tuple of float weights for unigrams, bigrams, trigrams and so on can be given.

>>> weights = (0.1, 0.3, 0.5, 0.1)
>>> corpus_bleu(list_of_references, hypotheses, weights=weights) # doctest: +ELLIPSIS
0.5818...

This particular set of weights gives extra value to trigrams. Furthermore, multiple weight tuples can be given, resulting in multiple BLEU scores.

>>> weights = [
...     (0.5, 0.5),
...     (0.333, 0.333, 0.334),
...     (0.25, 0.25, 0.25, 0.25),
...     (0.2, 0.2, 0.2, 0.2, 0.2)
... ]
>>> corpus_bleu(list_of_references, hypotheses, weights=weights) # doctest: +ELLIPSIS
[0.8242..., 0.7067..., 0.5920..., 0.4719...]

list_of_references ( list ( list ( list ( str ) ) ) ) – a corpus of lists of reference sentences, w.r.t. hypotheses

hypotheses ( list ( list ( str ) ) ) – a list of hypothesis sentences

weights ( tuple ( float ) / list ( tuple ( float ) ) ) – weights for unigrams, bigrams, trigrams and so on (one or a list of weights)

smoothing_function ( SmoothingFunction ) – an optional SmoothingFunction method used to smooth 0-count n-gram precisions

auto_reweigh ( bool ) – Option to re-normalize the weights uniformly.

The corpus-level BLEU score.

Calculate modified ngram precision.

The normal precision method may lead to some wrong translations with high-precision, e.g., the translation, in which a word of reference repeats several times, has very high precision.

This function only returns the Fraction object that contains the numerator and denominator necessary to calculate the corpus-level precision. To calculate the modified precision for a single pair of hypothesis and references, cast the Fraction object into a float.

The famous "the the the ..." example shows that you can get inflated BLEU precision by duplicating high-frequency words.

In the modified n-gram precision, a reference word will be considered exhausted after a matching hypothesis word is identified, e.g.

An example of a normal machine translation hypothesis appears in the doctests reproduced later on this page.

references ( list ( list ( str ) ) ) – A list of reference translations.

hypothesis ( list ( str ) ) – A hypothesis translation.

n ( int ) – The ngram order.

BLEU’s modified precision for the nth order ngram.

Calculate BLEU score (Bilingual Evaluation Understudy) from Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “BLEU: a method for automatic evaluation of machine translation.” In Proceedings of ACL. https://www.aclweb.org/anthology/P02-1040.pdf

If there is no ngrams overlap for any order of n-grams, BLEU returns the value 0. This is because the precision for the order of n-grams without overlap is 0, and the geometric mean in the final BLEU score computation multiplies the 0 with the precision of other n-grams. This results in 0 (independently of the precision of the other n-gram orders). The following example has zero 3-gram and 4-gram overlaps:
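
A sketch of such a case (illustrative tokens, not the module's own doctest); the 3-gram and 4-gram precisions are zero, so the returned score collapses to (effectively) 0 and a warning is emitted:

>>> from nltk.translate.bleu_score import sentence_bleu
>>> reference = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
>>> hypothesis = ['the', 'fast', 'brown', 'fox', 'leaped', 'over', 'a', 'lazy', 'dog']
>>> sentence_bleu([reference], hypothesis)  # doctest: +SKIP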

To avoid this harsh behaviour when no ngram overlaps are found, a smoothing function can be used.

The default BLEU calculates a score for up to 4-grams using uniform weights (this is called BLEU-4). To evaluate your translations with higher/lower order ngrams, use customized weights. E.g. when accounting for up to 5-grams with uniform weights (this is called BLEU-5) use:
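
For example (a sketch; references and hypothesis stand for your own tokenized sentences):

>>> weights = (1./5., 1./5., 1./5., 1./5., 1./5.)
>>> sentence_bleu(references, hypothesis, weights=weights)  # doctest: +SKIP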

Multiple BLEU scores can be computed at once, by supplying a list of weights. E.g. for computing BLEU-2, BLEU-3 and BLEU-4 in one computation, use:

>>> weights = [
...     (1./2., 1./2.),
...     (1./3., 1./3., 1./3.),
...     (1./4., 1./4., 1./4., 1./4.)
... ]
>>> sentence_bleu([reference1, reference2, reference3], hypothesis1, weights) # doctest: +ELLIPSIS
[0.7453..., 0.6240..., 0.5045...]

references ( list ( list ( str ) ) ) – reference sentences

hypothesis ( list ( str ) ) – a hypothesis sentence

weights ( tuple ( float ) / list ( tuple ( float ) ) ) – weights for unigrams, bigrams, trigrams and so on (one or a list of weights)

smoothing_function ( SmoothingFunction ) – an optional SmoothingFunction method used to smooth 0-count n-gram precisions

The sentence-level BLEU score. Returns a list if multiple weights were supplied.

float / list(float)

bleuEvaluationScore

Evaluate translation or summarization with BLEU similarity score

Since R2020a

Description

The BiLingual Evaluation Understudy (BLEU) scoring algorithm evaluates the similarity between a candidate document and a collection of reference documents. Use the BLEU score to evaluate the quality of document translation and summarization models.

score = bleuEvaluationScore( candidate , references ) returns the BLEU similarity score between the specified candidate document and the reference documents. The function computes n-gram overlaps between candidate and references for n-gram lengths one through four, with equal weighting. For more information, see BLEU Score .

score = bleuEvaluationScore( candidate , references , Name=Value ) specifies additional options using one or more name-value arguments.

Evaluate Summary

Create an array of tokenized documents and extract a summary using the extractSummary function.

Specify the reference documents as a tokenizedDocument array.

Calculate the BLEU score between the summary and the reference documents using the bleuEvaluationScore function.

This score indicates a fairly good similarity. A BLEU score close to one indicates strong similarity.

Specify N-Gram Weights

Calculate the BLEU score between the candidate document and the reference documents using the default options. The bleuEvaluationScore function, by default, uses n-grams of length one through four with equal weights.

Given that the summary document differs by only one word from one of the reference documents, this score might suggest lower similarity than expected. This behavior is because the function uses n-grams that are too long for such a short document.

To address this, use shorter n-grams by setting the 'NgramWeights' option to a shorter vector. Calculate the BLEU score again using only unigrams and bigrams by setting the 'NgramWeights' option to a two-element vector. Treat unigrams and bigrams equally by specifying equal weights.

This score suggests a better similarity than before.

Input Arguments

candidate — Candidate document tokenizedDocument scalar | string array | cell array of character vectors

Candidate document, specified as a tokenizedDocument scalar, a string array, or a cell array of character vectors. If candidate is not a tokenizedDocument scalar, then it must be a row vector representing a single document, where each element is a word.

references — Reference documents tokenizedDocument array | string array | cell array of character vectors

Reference documents, specified as a tokenizedDocument array, a string array, or a cell array of character vectors. If references is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To evaluate against multiple reference documents, use a tokenizedDocument array.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN , where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: bleuEvaluationScore(candidate,references,IgnoreCase=true) evaluates the BLEU similarity score, ignoring case

NgramWeights — N-gram weights [0.25 0.25 0.25 0.25] (default) | row vector of finite nonnegative values

N-gram weights, specified as a row vector of finite nonnegative values, where NgramWeights(i) corresponds to the weight for n-grams of length i . The length of the weight vector determines the range of n-gram lengths to use for the BLEU score evaluation. The function normalizes the n-gram weights to sum to one.

If the number of words in candidate is smaller than the number of elements in NgramWeights , then the resulting BLEU score is zero. To ensure that bleuEvaluationScore returns nonzero scores for very short documents, set NgramWeights to a vector with fewer elements than the number of words in candidate .

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

IgnoreCase — Option to ignore case 0 ( false ) (default) | 1 ( true )

Option to ignore case, specified as one of these values:

0 ( false ) – use case-sensitive comparisons between candidates and references.

1 ( true ) – compare candidates and references ignoring case.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical

Output Arguments

score — BLEU score scalar in the range [0,1] | NaN

BLEU score, returned as a scalar value in the range [0,1] or NaN .

A BLEU score close to zero indicates poor similarity between candidate and references . A BLEU score close to one indicates strong similarity. If candidate is identical to one of the reference documents, then score is 1. If candidate and references are both empty documents, then score is NaN . For more information, see BLEU Score .

The BiLingual Evaluation Understudy (BLEU) scoring algorithm [1] evaluates the similarity between a candidate document and a collection of reference documents. Use the BLEU score to evaluate the quality of document translation and summarization models.

To compute the BLEU score, the algorithm uses n-gram counts, clipped n-gram counts, modified n-gram precision scores, and a brevity penalty.

The clipped n-gram counts function $\mathrm{Count}_{\mathrm{clip}}$, if necessary, truncates the n-gram count for each n-gram so that it does not exceed the largest count observed in any single reference for that n-gram. The clipped counts function is given by

$$\mathrm{Count}_{\mathrm{clip}}(\text{n-gram}) = \min\bigl(\mathrm{Count}(\text{n-gram}),\ \mathrm{MaxRefCount}(\text{n-gram})\bigr),$$

where $\mathrm{Count}(\text{n-gram})$ denotes the n-gram count and $\mathrm{MaxRefCount}(\text{n-gram})$ is the largest n-gram count observed in a single reference document for that n-gram.

The modified n-gram precision scores are given by

$$p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{C' \in \{\text{Candidates}\}} \sum_{\text{n-gram}' \in C'} \mathrm{Count}(\text{n-gram}')},$$

where $n$ corresponds to the n-gram length and $\{\text{Candidates}\}$ is the set of sentences in the candidate documents.

Given a vector of n-gram weights $w$, the BLEU score is given by

$$\mathrm{bleuScore} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log \bar{p}_n\right),$$

where $N$ is the largest n-gram length, the entries in $\bar{p}$ correspond to the geometric averages of the modified n-gram precisions, and $\mathrm{BP}$ is the brevity penalty given by

$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

where $c$ is the length of the candidate document and $r$ is the length of the reference document with length closest to the candidate length.
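
As a worked illustration of these formulas (a toy example, not taken from the MathWorks documentation): take the candidate "the cat sat" and the single reference "the cat sat on the mat", with $N = 2$ and $w = [0.5, 0.5]$. Every candidate unigram and bigram occurs in the reference and none needs clipping, so $p_1 = 3/3 = 1$ and $p_2 = 2/2 = 1$. The candidate length is $c = 3$ and the closest reference length is $r = 6$, so $\mathrm{BP} = e^{1 - 6/3} = e^{-1} \approx 0.368$, and the score is $\mathrm{BP} \cdot \exp(0.5 \log 1 + 0.5 \log 1) \approx 0.368$: the perfect precisions are discounted because the candidate is much shorter than the reference.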

[1] Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: A Method for Automatic Evaluation of Machine Translation." In Proceedings of the 40th annual meeting on association for computational linguistics , pp. 311-318. Association for Computational Linguistics, 2002.

Version History

Introduced in R2020a

tokenizedDocument | rougeEvaluationScore | bm25Similarity | cosineSimilarity | textrankScores | lexrankScores | mmrScores | extractSummary

  • Sequence-to-Sequence Translation Using Attention

nltk.translate.bleu_score

Calculate brevity penalty.

Because modified n-gram precision alone still rewards overly short sentences, the brevity penalty is used to adjust the overall BLEU score according to length.

An example from the paper: there are three references with lengths 12, 15 and 17, and a hypothesis of length 12; the brevity penalty is 1.

>>> reference1 = list('aaaaaaaaaaaa')      # i.e. ['a'] * 12
>>> reference2 = list('aaaaaaaaaaaaaaa')   # i.e. ['a'] * 15
>>> reference3 = list('aaaaaaaaaaaaaaaaa') # i.e. ['a'] * 17
>>> hypothesis = list('aaaaaaaaaaaa')      # i.e. ['a'] * 12
>>> references = [reference1, reference2, reference3]
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len)
1.0

In case a hypothesis translation is shorter than the references, penalty is applied.

>>> references = [['a'] * 28, ['a'] * 28]
>>> hypothesis = ['a'] * 12
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len)
0.2635971381157267

The length of the closest reference is used to compute the penalty. If the length of a hypothesis is 12, and the reference lengths are 13 and 2, the penalty is applied because the hypothesis length (12) is less than the closest reference length (13).

>>> references = [['a'] * 13, ['a'] * 2]
>>> hypothesis = ['a'] * 12
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len) # doctest: +ELLIPSIS
0.9200...

The brevity penalty doesn't depend on reference order. More importantly, when two reference sentences are at the same distance, the shortest reference sentence length is used.

>>> references = [['a'] * 13, ['a'] * 11]
>>> hypothesis = ['a'] * 12
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> bp1 = brevity_penalty(closest_ref_len, hyp_len)
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(reversed(references), hyp_len)
>>> bp2 = brevity_penalty(closest_ref_len, hyp_len)
>>> bp1 == bp2 == 1
True

A test example from mteval-v13a.pl (starting from the line 705):

>>> references = [['a'] * 11, ['a'] * 8]
>>> hypothesis = ['a'] * 7
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len) # doctest: +ELLIPSIS
0.8668...
>>> references = [['a'] * 11, ['a'] * 8, ['a'] * 6, ['a'] * 7]
>>> hypothesis = ['a'] * 7
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len)
1.0

hyp_len ( int ) – The length of the hypothesis for a single sentence OR the sum of all the hypotheses' lengths for a corpus

closest_ref_len ( int ) – The length of the closest reference for a single hypothesis OR the sum of the closest reference lengths for every hypothesis in a corpus.

BLEU's brevity penalty.

float

Calculate a single corpus-level BLEU score (aka. system-level BLEU) for all the hypotheses and their respective references.

Instead of averaging the sentence level BLEU scores (i.e. macro-average precision), the original BLEU metric (Papineni et al. 2002) accounts for the micro-average precision (i.e. summing the numerators and denominators for each hypothesis-reference(s) pairs before the division).

As shown in the corpus_bleu() example earlier on this page, corpus_bleu() is therefore not the same as averaging sentence_bleu() over the hypotheses.

Calculate modified ngram precision.

The normal precision method may lead to some wrong translations with high-precision, e.g., the translation, in which a word of reference repeats several times, has very high precision.

This function only returns the Fraction object that contains the numerator and denominator necessary to calculate the corpus-level precision. To calculate the modified precision for a single pair of hypothesis and references, cast the Fraction object into a float.

The famous "the the the ..." example shows that you can get inflated BLEU precision by duplicating high-frequency words.

>>> reference1 = 'the cat is on the mat'.split()
>>> reference2 = 'there is a cat on the mat'.split()
>>> hypothesis1 = 'the the the the the the the'.split()
>>> references = [reference1, reference2]
>>> float(modified_precision(references, hypothesis1, n=1)) # doctest: +ELLIPSIS
0.2857...

In the modified n-gram precision, a reference word will be considered exhausted after a matching hypothesis word is identified, e.g.

>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will',
...               'forever', 'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> hypothesis = 'of the'.split()
>>> references = [reference1, reference2, reference3]
>>> float(modified_precision(references, hypothesis, n=1))
1.0
>>> float(modified_precision(references, hypothesis, n=2))
1.0

An example of a normal machine translation hypothesis:

>>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...                'ensures', 'that', 'the', 'military', 'always',
...                'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> hypothesis2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
...                'forever', 'hearing', 'the', 'activity', 'guidebook',
...                'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will',
...               'forever', 'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> references = [reference1, reference2, reference3]
>>> float(modified_precision(references, hypothesis1, n=1)) # doctest: +ELLIPSIS
0.9444...
>>> float(modified_precision(references, hypothesis2, n=1)) # doctest: +ELLIPSIS
0.5714...
>>> float(modified_precision(references, hypothesis1, n=2)) # doctest: +ELLIPSIS
0.5882352941176471
>>> float(modified_precision(references, hypothesis2, n=2)) # doctest: +ELLIPSIS
0.07692...

Calculate BLEU score (Bilingual Evaluation Understudy) from Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. "BLEU: a method for automatic evaluation of machine translation." In Proceedings of ACL. http://www.aclweb.org/anthology/P02-1040.pdf

If there is no n-gram overlap for any order of n-grams, BLEU returns the value 0. This is because the precision for the order of n-grams without overlap is 0, and the geometric mean in the final BLEU score computation multiplies the 0 with the precision of other n-grams. This results in 0 (independently of the precision of the other n-gram orders). This happens, for example, when a hypothesis shares no 3-grams or 4-grams with any of its references.

To avoid this harsh behaviour when no ngram overlaps are found, a smoothing function can be used.

The default BLEU calculates a score for up to 4-grams using uniform weights (this is called BLEU-4). To evaluate your translations with higher/lower order ngrams, use customized weights, e.g. a uniform five-element weight tuple, weights=(0.2, 0.2, 0.2, 0.2, 0.2), when accounting for up to 5-grams (this is called BLEU-5).

BLEU from scratch

11 minute read

Recently, I joined the Language, Information, and Learning at Yale lab, led by Professor Dragomir Radev. Although I’m still in what I would consider to be the incipient stages of ML/DL/NLP studies—meaning it will take time for me to be able to actively participate in and contribute to research and publications—I think it will be a great learning experience from which I can glean valuable insight into what research at Yale looks like.

One of the first projects I was introduced to at the lab is domain-independent table summarization. As the name implies, the goal is to train a model such that it can extract some meaningful insight from a table and produce a human-readable summary. Members of the lab seem to be making great progress on this project, and I’m excited to see where it will go. In the meantime, I decided to write a short post on BLEU , a metric that I came across while reading some of the survey papers related to this topic. Let’s dive into it.

Introduction

Before going into code and equations, a high-level overview of what BLEU is might be helpful here. BLEU, which stands for Bilingual Evaluation Understudy, is a metric that was introduced to quantitatively evaluate the quality of machine translations. The motivation is clear: as humans, we are able to get an intuitive sense of whether or not a given translation is accurate and of high quality; however, it is difficult to turn this linguistic intuition into a signal for training NLP models to produce better translations. This is where BLEU comes to the rescue.

The way BLEU works is simple. Given some candidate translation of a sentence and a group of reference sentences, we use a bag-of-words (BOW) approach to see how many BOW entries occur in both the translation and the reference sentences. BOW is a simple yet highly effective way of ensuring that the machine translation contains key phrases or words that the reference translations also contain. In other words, BLEU compares candidate translations with human-produced, annotated reference translations and counts how many hits there are in the candidate sentence. The more BOW hits there are, the better the translation.

Of course, there are many more details that go beyond this. For instance, BLEU is able to account for situations in which meaningless words are repeated throughout the machine translation to simply increase BOW hits. It can also penalize translations that are too short. By combining this BOW precision-based approach with some penalization terms, BLEU provides a robust means of evaluating machine translations. With this high-level overview in mind, let’s start implementing BLEU from scratch.

Preprocessing

First, let’s begin by defining some simple preprocessing and helper functions that we will be using throughout this tutorial. The first on the list is lower_n_split , which converts a given sentence into lowercase and splits it into tokens, which are, in this case, English words. We could make this more robust using regular expressions to remove punctuation, but for the purposes of this demonstration, let’s keep things simple.

I decided to use anonymous functions for the sake of simplicity and code readability. Next, let’s write a function that creates n-grams from a given sentence. This involves tokenizing the given sentence using lower_n_split , then looping through the tokens to create a bag of words.
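
A sketch of the two helpers (the post's exact code isn't preserved in this capture, so this is a reconstruction that matches the descriptions above):

lower_n_split = lambda sentence: sentence.lower().split()

def make_ngrams(sentence, n):
    # Tokenize, then slide a window of length n over the tokens.
    tokens = lower_n_split(sentence)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]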

And here is a quick sanity check of what we’ve done so far.
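
For instance (the expected output is shown as a comment, not captured from the original post):

print(make_ngrams("The quick brown fox", 2))
# ['the quick', 'quick brown', 'brown fox']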

Motivating BLEU

The BLEU score is based on a familiar concept in machine learning: precision . Formally, precision is defined as
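
$$\text{precision} = \frac{tp}{tp + fp},$$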

where $tp$ and $fp$ stand for true and false positives, respectively.

In the context of machine translations, we can consider positives as roughly corresponding to the notion of hits or matches. In other words, the positives are the bag of word n-grams we can construct from a given candidate translation. True positives are n-grams that appear in both the candidate and some reference translation; false positives are those that only appear in the candidate translation. Let’s use this intuition to build a simple precision-based metric.

Simple Precision

First, we need to create some n-grams from the candidate translation. Then, we iterate through the n-grams to see if they exist in any of the n-grams generated from reference translations. We count the total number of such hits, or true positives, and divide that quantity by the total number of n-grams produced from the candidate translation.
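
In code, a reconstruction of the post's simple_precision (following the description above; ca is a candidate sentence and refs is a list of reference sentences):

def simple_precision(ca, refs, n):
    # Count candidate n-grams that appear in at least one reference.
    ngrams = make_ngrams(ca, n)
    count = 0
    for ngram in ngrams:
        for ref in refs:
            if ngram in make_ngrams(ref, n):
                count += 1
                break
    return count / len(ngrams)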

Below are some candidate sentences and reference translations that we will be using as an example throughout this tutorial.
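
The post's exact sentences aren't preserved in this capture; as stand-ins we can use the classic example sentences from the BLEU paper (Papineni et al., 2002), which also appear in the nltk doctests above:

ca_1 = "It is a guide to action which ensures that the military always obeys the commands of the party"
ca_2 = "It is to insure the troops forever hearing the activity guidebook that party direct"

refs = [
    "It is a guide to action that ensures that the military will forever heed Party commands",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party",
    "It is the practical guide for the army always to heed the directions of the party",
]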

Comparing ca_1 with ca_2 , it is pretty clear that the former is the better translation. Let’s see if the simple precision metric is able to capture this intuition.
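
For example, with unigrams (the ordering is what we expect; these are not outputs captured from the original post):

print(simple_precision(ca_1, refs, 1))   # higher (about 0.94 with the stand-in sentences)
print(simple_precision(ca_2, refs, 1))   # lower (about 0.57)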

And indeed that seems to be the case!

Modified Precision

However, the simple precision-based metric has some huge problems. As an extreme example, consider the following bad_ca candidate translation.
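
The post's bad_ca repeats a single common word many times; per the discussion below, the unigram "it" appears 13 times in it, so something like:

bad_ca = " ".join(["it"] * 13)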

Obviously, bad_ca is a horrible translation, but the simple precision metric fails to flag it. This is because precision simply involves checking whether a hit occurs or not: it does not account for repeated bags of words. Hence, the original authors of BLEU introduce modified precision as a solution, which uses clipped counts. The gist of it is that, if some n-gram is repeated many times, we clip its count through the following formula:
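
$$\text{Count} = \min(m_w,\ m_\text{max})$$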

Here, $\text{Count}$ refers to the number of hits we assign to a certain n-gram. We sum this value over all distinct n-grams in the candidate sentence. Note that the distinctness requirement effectively weeds out repetitive translations such as the bad_ca we looked at earlier.

$m_w$ refers to the number of occurrences of an n-gram in the candidate sentence. For example, in bad_ca , the unigram "it" appears 13 times, and so $m_w = 13$. This value, however, is clipped by $m_\text{max}$, which is the maximum number of occurrences of that n-gram in any one of the reference sentences. In other words, for each reference, we count the number of occurrences of that n-gram and take the maximum value among them.

This can seem very confusing, but hopefully it’s clearer once you read the code. Here is my implementation using collections.Counter .
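
A reconstruction consistent with the description below (the post's original code isn't preserved in this capture):

from collections import Counter

def modified_precision(ca, refs, n):
    ngrams = make_ngrams(ca, n)
    ngram_counts = Counter(ngrams)                    # m_w for each candidate n-gram
    total = 0
    for ngram in set(ngrams):                         # distinct n-grams only
        # m_max: the largest count of this n-gram within any single reference.
        max_count = max(Counter(make_ngrams(ref, n))[ngram] for ref in refs)
        total += min(ngram_counts[ngram], max_count)  # clipped count
    return total / len(ngrams)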

Notice that we use a set in order to remove redundancies. max_count corresponds to $m_\text{max}$; ngram_counts[ngram] corresponds to $m_w$.

Using this modified metric, we can see that the bad_ca is now penalized quite a lot through the clipping mechanism.
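
For instance (an expected value given the stand-in sentences, not an output captured from the post):

print(modified_precision(bad_ca, refs, 1))   # 1/13, about 0.077, instead of the 1.0 from simple precision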

But there are still problems that modified precision doesn’t take into account. Consider the following example translation.
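
The post's exact sentence isn't preserved in this capture; a stand-in with the same property is a candidate that contains many of the reference words in a scrambled, ungrammatical order:

ca_3 = "the that military a is It guide ensures which to commands the of action obeys always party the"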

To us, it’s pretty obvious that ca_3 is a bad translation. Although some of the key words might be there, the order in which they are arranged violates English syntax. This is the limitation of using unigrams for precision analysis. To make sure that sentences are coherent and read fluently, we now have to introduce the notion of n-grams, where $n$ is larger than 1. This way, we can preserve some of the sequential structure of the reference sentences and make better comparisons.

The fact that unigrams are a poor way of evaluating translations becomes immediately clear once we plot the $n$ in n-grams against modified precision.

As you can see, the precision score decreases as $n$ gets higher. This makes sense: a larger $n$ simply means that the window of comparison is larger. Unless whole phrases co-occur in the translation and reference sentences (which is highly unlikely), precision will be low. People have generally found that a suitable $n$ value lies somewhere between 1 and 4. As we will see later, packages like nltk use what is known as the cumulative 4-gram BLEU score, or BLEU-4.

The good news is that our current implementation is already able to account for different $n$ values. This is because we wrote a handy little function, make_ngrams . By passing in different values to n , we can deal with different n-grams.
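
For example, instead of a plot we can simply print the trend (the original post shows this as a figure; the loop below is a stand-in):

for n in range(1, 5):
    print(n, modified_precision(ca_1, refs, n))
# The precision drops as n grows, since long phrases rarely co-occur verbatim.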

Brevity Penalty

Now we’re almost done. The last example to consider is the following translation:
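
The post's exact sentence isn't preserved here; any very short candidate that copies a fragment of a reference has the relevant property, for example:

ca_4 = "It is a guide to action"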

This is obviously a bad translation. However, due to the way modified precision is currently being calculated, this sentence will likely earn a high score. To prevent this from happening, we need to apply what is known as brevity penalty. As the name implies, this penalizes short candidate translations, thus ensuring that only sufficiently long machine translations are ascribed a high score.

Although this might seem confusing, the underlying mechanism is quite simple. The goal is to find the reference sentence whose length is closest to that of the candidate translation in question. If that reference sentence is longer than the candidate sentence, we apply some penalty; if the candidate sentence is longer, then we do not apply any penalization. The specific formula for penalization looks as follows:
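
$$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

where $c$ is the length of the candidate translation and $r$ is the length of the reference sentence closest in length to the candidate.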

The brevity penalty term is multiplied by the n-gram modified precision. Therefore, a value of 1 means that no penalization is applied.
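
A reconstruction of the penalty function under those assumptions (the function and variable names here are mine, not necessarily the post's):

import math

def brevity_penalty(ca, refs):
    ca_len = len(lower_n_split(ca))
    # Length of the reference whose length is closest to the candidate's.
    ref_len = min((len(lower_n_split(ref)) for ref in refs),
                  key=lambda ref_length: abs(ref_length - ca_len))
    if ca_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / ca_len)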

Let’s perform a quick sanity check to see whether the brevity penalty function works as expected.
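
For instance (expected values given the stand-in sentences above, not outputs captured from the post):

print(brevity_penalty(ca_1, refs))   # 1.0: ca_1 is not shorter than the closest reference
print(brevity_penalty(ca_4, refs))   # well below 1, since ca_4 has only 6 words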

Finally, it’s time to put all the pieces together. The formula for BLEU can be written as follows:
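
$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{k=n_\text{start}}^{n_\text{end}} w_k \log p_k\right)$$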

First, some notation clarifications. $n$ specifies the size of the bag of words, or the n-gram. $w_k$ denotes the weight we ascribe to the modified precision $p_k$ produced under that $k$-gram configuration. In other words, we calculate the weighted average of the log precisions, exponentiate that sum, and apply the brevity penalty. Although this can sound like a lot, really it’s just putting together all the pieces we have discussed so far. Let’s take a look at the code implementation.
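
A reconstruction consistent with the description below (the post's original code isn't preserved in this capture, so the details are assumptions; it builds on the helpers defined earlier):

def bleu_score(ca, refs, weights, n_start=1, n_end=4):
    # Modified precision for each n-gram order, assumed nonzero here.
    precisions = [modified_precision(ca, refs, n)
                  for n in range(n_start, n_end + 1)]
    return brevity_penalty(ca, refs) * math.exp(
        sum(w * math.log(p) for w, p in zip(weights, precisions))
    )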

The weighting happens in the zip part within the generator expression within the return statement. In this case, we apply weighting across $n$ that goes from n_start to n_end .

Now we’re done! Let’s test out our final implementation with ca_1 for $n$ from 1 to 4, all weighted equally.
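
That is (output omitted here):

print(bleu_score(ca_1, refs, weights=[0.25] * 4))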

The nltk package offers functions for BLEU calculation by default. For convenience purposes, let’s create a wrapper function. This wrapping isn’t really necessary, but it abstracts out many of the preprocessing steps, such as applying lower_n_split . This is because the nltk BLEU calculation function expects tokenized input, whereas ca_1 and refs are untokenized sentences.
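
A sketch of such a wrapper around nltk.translate.bleu_score.sentence_bleu (the wrapper's name is mine):

from nltk.translate.bleu_score import sentence_bleu

def nltk_bleu(ca, refs, weights=(0.25, 0.25, 0.25, 0.25)):
    # sentence_bleu expects token lists, so tokenize with lower_n_split first.
    return sentence_bleu([lower_n_split(ref) for ref in refs],
                         lower_n_split(ca),
                         weights=weights)

print(nltk_bleu(ca_1, refs))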

And we see that the result matches that derived from our own implementation!

In this post, we took a look at BLEU, a very common way of evaluating the fluency of machine translations. Studying the implementation of this metric was a meaningful and interesting process, not only because BLEU itself is widely used, but also because the motivation and intuition behind its construction was easily understandable and came very naturally to me. Each component of BLEU addresses some problem with simpler metrics, such as precision or modified precision. It also takes into account things like abnormally short or repetitive translations.

One area of interest for me these days is seq2seq models. Although RNN models have largely given way to transformers, I still think it’s a very interesting architecture worth diving into. I’ve also recently run into a combined LSTM-CNN approach for processing series data. I might write about these topics in a future post.

I hope you’ve enjoyed reading this post. Catch you up later!
