nltk.translate.bleu_score module
BLEU score implementation.
Bases: object
This is an implementation of the smoothing techniques for segment-level BLEU scores that was presented in Boxing Chen and Colin Cherry (2014) A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In WMT14. http://acl2014.org/acl2014/W14-33/pdf/W14-3346.pdf
This will initialize the parameters required for the various smoothing techniques; the default values are set to the numbers used in the experiments from Chen and Cherry (2014).
epsilon (float) – the epsilon value used in method 1
alpha (int) – the alpha value used in method 6
k (int) – the k value used in method 4
No smoothing.
Smoothing method 1: Add epsilon counts to precision with 0 counts.
Smoothing method 2: Add 1 to both numerator and denominator from Chin-Yew Lin and Franz Josef Och (2004) ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation. In COLING 2004.
Smoothing method 3: NIST geometric sequence smoothing. The smoothing is computed by taking 1 / ( 2^k ), instead of 0, for each precision score whose matching n-gram count is null. k is 1 for the first ‘n’ value for which the n-gram match count is null.
For example, if the text contains:
one 2-gram match
and (consequently) two 1-gram matches
the n-gram count for each individual precision score would be:
n=1 => prec_count = 2 (two unigrams)
n=2 => prec_count = 1 (one bigram)
n=3 => prec_count = 1/2 (no trigram, taking ‘smoothed’ value of 1 / ( 2^k ), with k=1)
n=4 => prec_count = 1/4 (no fourgram, taking ‘smoothed’ value of 1 / ( 2^k ), with k=2)
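The running-count behaviour above can be sketched in a few lines of plain Python (a self-contained illustration of method 3, not NLTK's internal implementation; `precisions` holds the raw per-order precision counts):

```python
def nist_geometric_smoothing(precisions):
    """Replace each zero precision with 1 / 2**k, where k counts how
    many zero values have been encountered so far (NIST method 3)."""
    smoothed, k = [], 0
    for p in precisions:
        if p == 0:
            k += 1
            smoothed.append(1 / 2 ** k)
        else:
            smoothed.append(p)
    return smoothed

# Counts from the example above: two unigrams, one bigram, no tri-/fourgrams.
print(nist_geometric_smoothing([2, 1, 0, 0]))  # [2, 1, 0.5, 0.25]
```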
Smoothing method 4: Shorter translations may have inflated precision values due to having smaller denominators; therefore, we give them proportionally smaller smoothed counts. Instead of scaling to 1/(2^k), Chen and Cherry suggest dividing by 1/ln(len(T)), where T is the length of the translation.
Smoothing method 5: The matched counts for similar values of n should be similar. To calculate the n-gram matched count, it averages the n−1, n and n+1 gram matched counts.
Smoothing method 6: Interpolates the maximum likelihood estimate of the precision p_n with a prior estimate π_0. The prior is estimated by assuming that the ratio between p_n and p_{n−1} will be the same as that between p_{n−1} and p_{n−2}; from Gao and He (2013) Training MRF-Based Phrase Translation Models using Gradient Ascent. In NAACL.
Smoothing method 7: Interpolates methods 4 and 5.
Calculate brevity penalty.
Because modified n-gram precision alone still favors overly short sentences, a brevity penalty is used to adjust the overall BLEU score according to length.
An example from the paper: there are three references with lengths 12, 15 and 17, and a concise hypothesis of length 12. The brevity penalty is 1.
In case a hypothesis translation is shorter than the references, penalty is applied.
The length of the closest reference is used to compute the penalty. If the length of a hypothesis is 12, and the reference lengths are 13 and 2, the penalty is applied because the hypothesis length (12) is less than the closest reference length (13).
The brevity penalty doesn’t depend on reference order. More importantly, when two reference sentences are at the same distance, the shortest reference sentence length is used.
A test example from mteval-v13a.pl (starting from line 705):
hyp_len ( int ) – The length of the hypothesis for a single sentence OR the sum of all the hypotheses’ lengths for a corpus
closest_ref_len (int) – The length of the closest reference for a single hypothesis OR the sum of the closest reference lengths for every hypothesis.
BLEU’s brevity penalty.
This function finds the reference whose length is closest to the hypothesis. The closest reference length is referred to as the variable r in the brevity penalty formula in Papineni et al. (2002).
references ( list ( list ( str ) ) ) – A list of reference translations.
hyp_len ( int ) – The length of the hypothesis.
The length of the reference that’s closest to the hypothesis.
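The behaviour of the two helpers can be sketched in plain Python from the definitions above (an illustration, not NLTK's exact code; note that ties are broken toward the shorter reference, as described above):

```python
import math

def closest_ref_length(references, hyp_len):
    # Pick the reference length with the smallest absolute distance to
    # hyp_len; on a tie, the key below prefers the shorter reference.
    return min((len(ref) for ref in references),
               key=lambda ref_len: (abs(ref_len - hyp_len), ref_len))

def brevity_penalty(closest_ref_len, hyp_len):
    if hyp_len > closest_ref_len:
        return 1.0
    if hyp_len == 0:
        return 0.0
    return math.exp(1 - closest_ref_len / hyp_len)

# Hypothesis of length 12 against references of lengths 13 and 2:
# 13 is closer, so the penalty applies.
bp = brevity_penalty(closest_ref_length([['a'] * 13, ['a'] * 2], 12), 12)
print(round(bp, 4))  # ≈ 0.92
```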
Calculate a single corpus-level BLEU score (aka. system-level BLEU) for all the hypotheses and their respective references.
Instead of averaging the sentence level BLEU scores (i.e. macro-average precision), the original BLEU metric (Papineni et al. 2002) accounts for the micro-average precision (i.e. summing the numerators and denominators for each hypothesis-reference(s) pairs before the division).
The example below shows that corpus_bleu() is different from averaging sentence_bleu() for hypotheses.
Custom weights may be supplied to fine-tune the BLEU score further. A tuple of float weights for unigrams, bigrams, trigrams and so on can be given.
>>> weights = (0.1, 0.3, 0.5, 0.1)
>>> corpus_bleu(list_of_references, hypotheses, weights=weights) # doctest: +ELLIPSIS
0.5818...
This particular weight gave extra value to trigrams. Furthermore, multiple weights can be given, resulting in multiple BLEU scores.
>>> weights = [
...     (0.5, 0.5),
...     (0.333, 0.333, 0.334),
...     (0.25, 0.25, 0.25, 0.25),
...     (0.2, 0.2, 0.2, 0.2, 0.2)
... ]
>>> corpus_bleu(list_of_references, hypotheses, weights=weights) # doctest: +ELLIPSIS
[0.8242..., 0.7067..., 0.5920..., 0.4719...]
list_of_references ( list ( list ( list ( str ) ) ) ) – a corpus of lists of reference sentences, w.r.t. hypotheses
hypotheses ( list ( list ( str ) ) ) – a list of hypothesis sentences
weights ( tuple ( float ) / list ( tuple ( float ) ) ) – weights for unigrams, bigrams, trigrams and so on (one or a list of weights)
smoothing_function ( SmoothingFunction ) –
auto_reweigh ( bool ) – Option to re-normalize the weights uniformly.
The corpus-level BLEU score.
Calculate modified ngram precision.
The standard precision measure can assign high precision to poor translations; e.g., a translation that simply repeats a frequent reference word many times achieves very high precision.
This function only returns the Fraction object that contains the numerator and denominator necessary to calculate the corpus-level precision. To calculate the modified precision for a single pair of hypothesis and references, cast the Fraction object into a float.
The famous “the the the …” example shows that precision can be inflated by duplicating high-frequency words.
In the modified n-gram precision, a reference word will be considered exhausted after a matching hypothesis word is identified, e.g.
An example of a normal machine translation hypothesis:
hypothesis ( list ( str ) ) – A hypothesis translation.
n ( int ) – The ngram order.
BLEU’s modified precision for the nth order ngram.
Calculate BLEU score (Bilingual Evaluation Understudy) from Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “BLEU: a method for automatic evaluation of machine translation.” In Proceedings of ACL. https://www.aclweb.org/anthology/P02-1040.pdf
If there is no ngrams overlap for any order of n-grams, BLEU returns the value 0. This is because the precision for the order of n-grams without overlap is 0, and the geometric mean in the final BLEU score computation multiplies the 0 with the precision of other n-grams. This results in 0 (independently of the precision of the other n-gram orders). The following example has zero 3-gram and 4-gram overlaps:
To avoid this harsh behaviour when no ngram overlaps are found, a smoothing function can be used.
The default BLEU calculates a score for up to 4-grams using uniform weights (this is called BLEU-4). To evaluate your translations with higher/lower order ngrams, use customized weights. E.g. when accounting for up to 5-grams with uniform weights (this is called BLEU-5) use:
Multiple BLEU scores can be computed at once, by supplying a list of weights. E.g. for computing BLEU-2, BLEU-3 and BLEU-4 in one computation, use: >>> weights = [ … (1./2., 1./2.), … (1./3., 1./3., 1./3.), … (1./4., 1./4., 1./4., 1./4.) … ] >>> sentence_bleu([reference1, reference2, reference3], hypothesis1, weights) # doctest: +ELLIPSIS [0.7453…, 0.6240…, 0.5045…]
references ( list ( list ( str ) ) ) – reference sentences
hypothesis ( list ( str ) ) – a hypothesis sentence
The sentence-level BLEU score. Returns a list if multiple weights were supplied.
float / list(float)
Help Center Help Center
- Help Center
- Trial Software
- Product Updates
- Documentation
bleuEvaluationScore
Evaluate translation or summarization with BLEU similarity score
Since R2020a
Description
The BiLingual Evaluation Understudy (BLEU) scoring algorithm evaluates the similarity between a candidate document and a collection of reference documents. Use the BLEU score to evaluate the quality of document translation and summarization models.
score = bleuEvaluationScore(candidate,references) returns the BLEU similarity score between the specified candidate document and the reference documents. The function computes n-gram overlaps between candidate and references for n-gram lengths one through four, with equal weighting. For more information, see BLEU Score.
score = bleuEvaluationScore(candidate,references,Name=Value) specifies additional options using one or more name-value arguments.
Evaluate Summary
Create an array of tokenized documents and extract a summary using the extractSummary function.
Specify the reference documents as a tokenizedDocument array.
Calculate the BLEU score between the summary and the reference documents using the bleuEvaluationScore function.
This score indicates a fairly good similarity. A BLEU score close to one indicates strong similarity.
Specify N-Gram Weights
Calculate the BLEU score between the candidate document and the reference documents using the default options. The bleuEvaluationScore function, by default, uses n-grams of length one through four with equal weights.
Given that the summary document differs by only one word from one of the reference documents, this score might suggest a lower similarity than expected. This behavior is due to the function using n-grams that are too large for the short document length.
To address this, use shorter n-grams by setting the 'NgramWeights' option to a shorter vector. Calculate the BLEU score again using only unigrams and bigrams by setting the 'NgramWeights' option to a two-element vector. Treat unigrams and bigrams equally by specifying equal weights.
This score suggests a better similarity than before.
Input Arguments
candidate — Candidate document
tokenizedDocument scalar | string array | cell array of character vectors
Candidate document, specified as a tokenizedDocument scalar, a string array, or a cell array of character vectors. If candidate is not a tokenizedDocument scalar, then it must be a row vector representing a single document, where each element is a word.
references — Reference documents tokenizedDocument array | string array | cell array of character vectors
Reference documents, specified as a tokenizedDocument array, a string array, or a cell array of character vectors. If references is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To evaluate against multiple reference documents, use a tokenizedDocument array.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN , where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: bleuEvaluationScore(candidate,references,IgnoreCase=true) evaluates the BLEU similarity score ignoring case.
NgramWeights — N-gram weights [0.25 0.25 0.25 0.25] (default) | row vector of finite nonnegative values
N-gram weights, specified as a row vector of finite nonnegative values, where NgramWeights(i) corresponds to the weight for n-grams of length i . The length of the weight vector determines the range of n-gram lengths to use for the BLEU score evaluation. The function normalizes the n-gram weights to sum to one.
If the number of words in candidate is smaller than the number of elements in NgramWeights, then the resulting BLEU score is zero. To ensure that bleuEvaluationScore returns nonzero scores for very short documents, set NgramWeights to a vector with fewer elements than the number of words in candidate.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
IgnoreCase — Option to ignore case 0 ( false ) (default) | 1 ( true )
Option to ignore case, specified as one of these values:
0 ( false ) – use case-sensitive comparisons between candidates and references.
1 ( true ) – compare candidates and references ignoring case.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical
Output Arguments
score — BLEU score
scalar
BLEU score, returned as a scalar value in the range [0,1] or NaN .
A BLEU score close to zero indicates poor similarity between candidate and references . A BLEU score close to one indicates strong similarity. If candidate is identical to one of the reference documents, then score is 1. If candidate and references are both empty documents, then score is NaN . For more information, see BLEU Score .
The BiLingual Evaluation Understudy (BLEU) scoring algorithm [1] evaluates the similarity between a candidate document and a collection of reference documents. Use the BLEU score to evaluate the quality of document translation and summarization models.
To compute the BLEU score, the algorithm uses n-gram counts, clipped n-gram counts , modified n-gram precision scores , and a brevity penalty .
The clipped n-gram counts function Count clip , if necessary, truncates the n-gram count for each n-gram so that it does not exceed the largest count observed in any single reference for that n-gram. The clipped counts function is given by
Count_clip(n-gram) = min(Count(n-gram), MaxRefCount(n-gram)),
where Count(n-gram) denotes the n-gram count and MaxRefCount(n-gram) is the largest n-gram count observed in a single reference document for that n-gram.
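The clipping rule can be sketched with Python's `collections.Counter` (an illustrative reconstruction of the formula, not the toolbox's internal code; shown here for unigrams):

```python
from collections import Counter

def clipped_unigram_counts(candidate, references):
    """Clip each candidate unigram count at the largest count of that
    unigram observed in any single reference (Count_clip)."""
    cand_counts = Counter(candidate)
    max_ref = Counter()  # MaxRefCount per unigram
    for ref in references:
        for gram, cnt in Counter(ref).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    return {g: min(c, max_ref[g]) for g, c in cand_counts.items()}

candidate = 'the the the the the the the'.split()
references = ['the cat is on the mat'.split(),
              'there is a cat on the mat'.split()]
print(clipped_unigram_counts(candidate, references))  # {'the': 2}
```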
The modified n-gram precision scores are given by
$$p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{C' \in \{\text{Candidates}\}} \sum_{\text{n-gram}' \in C'} \text{Count}(\text{n-gram}')},$$
where n corresponds to the n-gram length and { candidates } is the set of sentences in the candidate documents.
Given a vector of n-gram weights w , the BLEU score is given by
$$\text{bleuScore} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log \bar{p}_n\right),$$
where N is the largest n-gram length, the entries in $\bar{p}$ correspond to the geometric averages of the modified n-gram precisions, and BP is the brevity penalty given by
$$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$
where c is the length of the candidate document and r is the length of the reference document with length closest to the candidate length.
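Putting the formulas together, a small Python sketch (illustrative only, not the toolbox implementation; it assumes the modified precisions and the lengths c and r are already computed, and normalizes the weights to sum to one as the function does):

```python
import math

def bleu_from_precisions(precisions, weights, c, r):
    """Combine modified n-gram precisions, weights, and the brevity
    penalty into a BLEU score per the formulas above."""
    # Brevity penalty from the case formula: no penalty when c > r.
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Normalize the n-gram weights to sum to one.
    total = sum(weights)
    weights = [w / total for w in weights]
    return bp * math.exp(sum(w * math.log(p)
                             for w, p in zip(weights, precisions)))

# Four equally weighted n-gram precisions of 0.5, candidate longer
# than the closest reference, so no length penalty:
print(bleu_from_precisions([0.5] * 4, [0.25] * 4, c=10, r=9))  # ≈ 0.5
```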
[1] Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: A Method for Automatic Evaluation of Machine Translation." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
Version History
Introduced in R2020a
tokenizedDocument | rougeEvaluationScore | bm25Similarity | cosineSimilarity | textrankScores | lexrankScores | mmrScores | extractSummary
- Sequence-to-Sequence Translation Using Attention
nltk.translate.bleu_score
Calculate brevity penalty.
Because modified n-gram precision alone still favors overly short sentences, a brevity penalty is used to adjust the overall BLEU score according to length.
An example from the paper: there are three references with lengths 12, 15 and 17, and a concise hypothesis of length 12. The brevity penalty is 1.
>>> reference1 = list('aaaaaaaaaaaa')      # i.e. ['a'] * 12
>>> reference2 = list('aaaaaaaaaaaaaaa')   # i.e. ['a'] * 15
>>> reference3 = list('aaaaaaaaaaaaaaaaa') # i.e. ['a'] * 17
>>> hypothesis = list('aaaaaaaaaaaa')      # i.e. ['a'] * 12
>>> references = [reference1, reference2, reference3]
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len)
1.0
In case a hypothesis translation is shorter than the references, penalty is applied.
>>> references = [['a'] * 28, ['a'] * 28]
>>> hypothesis = ['a'] * 12
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len)
0.2635971381157267
The length of the closest reference is used to compute the penalty. If the length of a hypothesis is 12, and the reference lengths are 13 and 2, the penalty is applied because the hypothesis length (12) is less than the closest reference length (13).
>>> references = [['a'] * 13, ['a'] * 2]
>>> hypothesis = ['a'] * 12
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len) # doctest: +ELLIPSIS
0.9200...
The brevity penalty doesn't depend on reference order. More importantly, when two reference sentences are at the same distance, the shortest reference sentence length is used.
>>> references = [['a'] * 13, ['a'] * 11]
>>> hypothesis = ['a'] * 12
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> bp1 = brevity_penalty(closest_ref_len, hyp_len)
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(reversed(references), hyp_len)
>>> bp2 = brevity_penalty(closest_ref_len, hyp_len)
>>> bp1 == bp2 == 1
True
A test example from mteval-v13a.pl (starting from line 705):
>>> references = [['a'] * 11, ['a'] * 8]
>>> hypothesis = ['a'] * 7
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len) # doctest: +ELLIPSIS
0.8668...
>>> references = [['a'] * 11, ['a'] * 8, ['a'] * 6, ['a'] * 7]
>>> hypothesis = ['a'] * 7
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len)
1.0
:param hyp_len: The length of the hypothesis for a single sentence OR the sum of all the hypotheses' lengths for a corpus
:type hyp_len: int
:param closest_ref_len: The length of the closest reference for a single hypothesis OR the sum of the closest reference lengths for every hypothesis.
:type closest_ref_len: int
:return: BLEU's brevity penalty.
:rtype: float
Calculate a single corpus-level BLEU score (aka. system-level BLEU) for all the hypotheses and their respective references.
Instead of averaging the sentence level BLEU scores (i.e. macro-average precision), the original BLEU metric (Papineni et al. 2002) accounts for the micro-average precision (i.e. summing the numerators and denominators for each hypothesis-reference(s) pairs before the division).
The example below shows that corpus_bleu() is different from averaging sentence_bleu() for hypotheses.
Calculate modified ngram precision.
The standard precision measure can assign high precision to poor translations; e.g., a translation that simply repeats a frequent reference word many times achieves very high precision.
This function only returns the Fraction object that contains the numerator and denominator necessary to calculate the corpus-level precision. To calculate the modified precision for a single pair of hypothesis and references, cast the Fraction object into a float.
The famous "the the the ..." example shows that precision can be inflated by duplicating high-frequency words.
>>> reference1 = 'the cat is on the mat'.split()
>>> reference2 = 'there is a cat on the mat'.split()
>>> hypothesis1 = 'the the the the the the the'.split()
>>> references = [reference1, reference2]
>>> float(modified_precision(references, hypothesis1, n=1)) # doctest: +ELLIPSIS
0.2857...
In the modified n-gram precision, a reference word will be considered exhausted after a matching hypothesis word is identified, e.g.
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will',
...               'forever', 'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> hypothesis = 'of the'.split()
>>> references = [reference1, reference2, reference3]
>>> float(modified_precision(references, hypothesis, n=1))
1.0
>>> float(modified_precision(references, hypothesis, n=2))
1.0
An example of a normal machine translation hypothesis:
>>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...                'ensures', 'that', 'the', 'military', 'always',
...                'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> hypothesis2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
...                'forever', 'hearing', 'the', 'activity', 'guidebook',
...                'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will',
...               'forever', 'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> references = [reference1, reference2, reference3]
>>> float(modified_precision(references, hypothesis1, n=1)) # doctest: +ELLIPSIS
0.9444...
>>> float(modified_precision(references, hypothesis2, n=1)) # doctest: +ELLIPSIS
0.5714...
>>> float(modified_precision(references, hypothesis1, n=2))
0.5882352941176471
>>> float(modified_precision(references, hypothesis2, n=2)) # doctest: +ELLIPSIS
0.07692...
Calculate BLEU score (Bilingual Evaluation Understudy) from Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. "BLEU: a method for automatic evaluation of machine translation." In Proceedings of ACL. https://www.aclweb.org/anthology/P02-1040.pdf
If there is no ngrams overlap for any order of n-grams, BLEU returns the value 0. This is because the precision for the order of n-grams without overlap is 0, and the geometric mean in the final BLEU score computation multiplies the 0 with the precision of other n-grams. This results in 0 (independently of the precision of the other n-gram orders). The following example has zero 3-gram and 4-gram overlaps:
To avoid this harsh behaviour when no ngram overlaps are found, a smoothing function can be used.
The default BLEU calculates a score for up to 4-grams using uniform weights (this is called BLEU-4). To evaluate your translations with higher/lower order ngrams, use customized weights. E.g. when accounting for up to 5-grams with uniform weights (this is called BLEU-5) use:
BLEU from scratch
11 minute read
Recently, I joined the Language, Information, and Learning at Yale lab, led by Professor Dragomir Radev. Although I’m still in what I would consider to be the incipient stages of ML/DL/NLP studies—meaning it will take time for me to be able to actively participate in and contribute to research and publications—I think it will be a great learning experience from which I can glean valuable insight into what research at Yale looks like.
One of the first projects I was introduced to at the lab is domain-independent table summarization. As the name implies, the goal is to train a model such that it can extract some meaningful insight from a table and produce a human-readable summary. Members of the lab seem to be making great progress on this project, and I’m excited to see where it will go. In the meantime, I decided to write a short post on BLEU, a metric that I came across while reading some of the survey papers related to this topic. Let’s dive into it.
Introduction
Before going into code and equations, a high-level overview of what BLEU is might be helpful here. BLEU, which stands for Bilingual Evaluation Understudy, is a metric that was introduced to quantitatively evaluate the quality of machine translations. The motivation is clear: as humans, we are able to get an intuitive sense of whether or not a given translation is accurate and of high quality; however, it is difficult to translate this arbitrary linguistic intuition into a way to train NLP models to produce better translations. This is where BLEU comes to the rescue.
The way BLEU works is simple. Given some candidate translation of a sentence and a group of reference sentences, we use a bag-of-words (BOW) approach to count how many BOW occurrences co-occur in both the translation and the reference sentences. BOW is a simple yet highly effective way of ensuring that the machine translation contains key phrases or words that the reference translations also contain. In other words, BLEU compares candidate translations with human-produced, annotated reference translations and counts how many hits there are in the candidate sentence. The more BOW hits there are, the better the translation.
Of course, there are many more details that go beyond this. For instance, BLEU is able to account for situations in which meaningless words are repeated throughout the machine translation to simply increase BOW hits. It can also penalize translations that are too short. By combining this BOW precision-based approach with some penalization terms, BLEU provides a robust means of evaluating machine translations. With this high-level overview in mind, let’s start implementing BLEU from scratch.
Preprocessing
First, let’s begin by defining some simple preprocessing and helper functions that we will be using throughout this tutorial. The first on the list is lower_n_split , which converts a given sentence into lowercase and splits it into tokens, which are, in this case, English words. We could make this more robust by using regular expressions to remove punctuation, but for the purposes of this demonstration, let’s keep it simple.
I decided to use anonymous functions for the sake of simplicity and code readability. Next, let’s write a function that creates n-grams from a given sentence. This involves tokenizing the given sentence using lower_n_split , then looping through the tokens to create a bag of words.
And here is a quick sanity check of what we’ve done so far.
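Since the original snippets did not survive extraction, here is a reconstruction consistent with the description above (the names `lower_n_split` and `make_ngrams` are the post's; the exact bodies are my assumptions):

```python
# Lowercase a sentence and split it into word tokens (an anonymous
# function, per the post's stated preference).
lower_n_split = lambda text: text.lower().split()

def make_ngrams(text, n):
    """Tokenize `text` with lower_n_split, then slide a window of
    size n over the tokens to build the bag of n-grams."""
    tokens = lower_n_split(text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Quick sanity check:
print(make_ngrams("The quick brown fox", 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```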
Motivating BLEU
The BLEU score is based on a familiar concept in machine learning: precision . Formally, precision is defined as

$$\text{precision} = \frac{tp}{tp + fp}$$

where $tp$ and $fp$ stand for true and false positives, respectively.
In the context of machine translations, we can consider positives as roughly corresponding to the notion of hits or matches. In other words, the positives are the bag of word n-grams we can construct from a given candidate translation. True positives are n-grams that appear in both the candidate and some reference translation; false positives are those that only appear in the candidate translation. Let’s use this intuition to build a simple precision-based metric.
Simple Precision
First, we need to create some n-grams from the candidate translation. Then, we iterate through the n-grams to see if they exist in any of the n-grams generated from reference translations. We count the total number of such hits, or true positives, and divide that quantity by the total number of n-grams produced from the candidate translation.
Below are some candidate sentences and reference translations that we will be using as an example throughout this tutorial.
Comparing ca_1 with ca_2 , it is pretty clear that the former is the better translation. Let’s see if the simple precision metric is able to capture this intuition.
And indeed that seems to be the case!
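The original example sentences and the `simple_precision` snippet were lost in extraction; the sketch below is a reconstruction, using the classic Papineni et al. sentences as stand-ins for `ca_1`, `ca_2`, and the references (helper definitions repeated so the block is self-contained):

```python
lower_n_split = lambda text: text.lower().split()

def make_ngrams(text, n):
    tokens = lower_n_split(text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that appear in any reference."""
    cand_ngrams = make_ngrams(candidate, n)
    ref_ngrams = set()
    for ref in references:
        ref_ngrams.update(make_ngrams(ref, n))
    hits = sum(1 for gram in cand_ngrams if gram in ref_ngrams)
    return hits / len(cand_ngrams)

refs = ["It is a guide to action that ensures that the military "
        "will forever heed Party commands"]
ca_1 = ("It is a guide to action which ensures that the military "
        "always obeys the commands of the party")
ca_2 = ("It is to insure the troops forever hearing the activity "
        "guidebook that party direct")
# The better translation scores higher:
print(simple_precision(ca_1, refs) > simple_precision(ca_2, refs))  # True
```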
Modified Precision
However, the simple precision-based metric has some huge problems. As an extreme example, consider the following bad_ca candidate translation.
Obviously, bad_ca is a horrible translation, but the simple precision metric fails to flag it. This is because precision simply involves checking whether a hit occurs or not: it does not account for repeated bags of words. Hence, the original authors of BLEU introduced modified precision as a solution, which uses clipped counts. The gist of it is that, if some n-gram is repeated many times, we clip its count through the following formula:

$$\text{Count} = \min(m_w, m_\text{max})$$
Here, $\text{Count}$ refers to the number of hits we assign to a certain n-gram. We sum this value over all distinct n-grams in the candidate sentence. Note that the distinction requirement effectively weeds out repetitive translations such as bad_ca we looked at earlier.
$m_w$ refers to the number of occurrences of an n-gram in the candidate sentence. For example, in bad_ca , the unigram "it" appears 13 times, and so $m_w = 13$. This value, however, is clipped by $m_\text{max}$, which is the maximum number of occurrences of that n-gram in any one of the reference sentences. In other words, for each reference, we count the number of occurrences of that n-gram and take the maximum value among them.
This can seem very confusing, but hopefully it’s clearer once you read the code. Here is my implementation using collections.Counter .
Notice that we use a set in order to remove redundancies. max_count corresponds to $m_\text{max}$; ngram_counts[ngram] corresponds to $m_w$.
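The post's implementation did not survive extraction; a reconstruction matching the description (a set of distinct n-grams, `max_count` for $m_\text{max}$, `ngram_counts[ngram]` for $m_w$) might look like this (helpers repeated so the block is self-contained):

```python
from collections import Counter

lower_n_split = lambda text: text.lower().split()

def make_ngrams(text, n):
    tokens = lower_n_split(text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n=1):
    cand_ngrams = make_ngrams(candidate, n)
    ngram_counts = Counter(cand_ngrams)  # m_w per distinct n-gram
    total_clipped = 0
    for ngram in set(cand_ngrams):       # a set removes redundancies
        # m_max: the largest count of this n-gram in any one reference.
        max_count = max(Counter(make_ngrams(ref, n))[ngram]
                        for ref in references)
        total_clipped += min(ngram_counts[ngram], max_count)
    return total_clipped / len(cand_ngrams)

refs = ["the cat is on the mat", "there is a cat on the mat"]
bad_ca = "the the the the the the the"
print(modified_precision(bad_ca, refs))  # clipped: 2/7 ≈ 0.2857
```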
Using this modified metric, we can see that the bad_ca is now penalized quite a lot through the clipping mechanism.
But there are still problems that modified precision doesn’t take into account. Consider the following example translation.
To us, it’s pretty obvious that ca_3 is a bad translation. Although some of the key words might be there, the order in which they are arranged violates English syntax. This is the limitation of using unigrams for precision analysis. To make sure that sentences are coherent and read fluently, we now have to introduce the notion of n-grams, where $n$ is larger than 1. This way, we can preserve some of the sequential encoding in reference sentences and make better comparisons.
The fact that unigrams are a poor way of evaluating translations becomes immediately clear once we plot the $n$ in n-grams against modified precision.
As you can see, the precision score decreases as $n$ gets higher. This makes sense: a larger $n$ simply means that the window of comparison is larger. Unless whole phrases co-occur in the translation and reference sentences—which is highly unlikely—precision will be low. People have generally found that a suitable $n$ value lies somewhere between 1 and 4. As we will see later, packages like nltk use what is known as the cumulative 4-gram BLEU score, or BLEU-4.
The good news is that our current implementation is already able to account for different $n$ values. This is because we wrote a handy little function, make_ngrams . By passing in different values to n , we can deal with different n-grams.
Brevity Penalty
Now we’re almost done. The last example to consider is the following translation:
This is obviously a bad translation. However, due to the way modified precision is currently being calculated, this sentence will likely earn a high score. To prevent this from happening, we need to apply what is known as brevity penalty. As the name implies, this penalizes short candidate translations, thus ensuring that only sufficiently long machine translations are ascribed a high score.
Although this might seem confusing, the underlying mechanism is quite simple. The goal is to find the reference sentence whose length is closest to that of the candidate translation in question. If that reference sentence is longer than the candidate, we apply a penalty; if the candidate sentence is longer, we do not penalize it at all. The specific formula for the penalty looks as follows:
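Concretely, writing $c$ for the length of the candidate and $r$ for the length of the closest reference, the standard brevity penalty from the original BLEU formulation is:

$$
BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}
$$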
The brevity penalty term is multiplied to the n-gram modified precision. Therefore, a value of 1 means that no penalization is applied.
Let’s perform a quick sanity check to see whether the brevity penalty function works as expected.
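A minimal sketch of such a function, with the sanity check attached (the name brevity_penalty and whitespace tokenization are my own choices):

```python
import math

def brevity_penalty(ca, refs):
    c = len(ca.split())
    # length of the reference closest in length to the candidate
    # (ties broken toward the shorter reference)
    r = min((len(ref.split()) for ref in refs), key=lambda rl: (abs(rl - c), rl))
    return 1.0 if c > r else math.exp(1 - r / c)

# sanity check: a short candidate is penalized, a long one is not
print(brevity_penalty("the cat", ["the cat sat on"]))          # exp(-1) ≈ 0.368
print(brevity_penalty("the cat sat on the mat", ["the cat"]))  # 1.0
```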
Finally, it’s time to put all the pieces together. The formula for BLEU can be written as follows:
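In the standard formulation, with $BP$ denoting the brevity penalty:

$$
\text{BLEU} = BP \cdot \exp\left( \sum_{k=1}^{n} w_k \log p_k \right)
$$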
First, some notation clarifications. $n$ specifies the largest n-gram size considered. $w_k$ denotes the weight we ascribe to the modified precision—$p_k$—produced under the $k$-gram configuration. In other words, we calculate the weighted average of the log precisions, exponentiate that sum, and apply the brevity penalty. Although this can sound like a lot, it’s really just putting together all the pieces we have discussed so far. Let’s take a look at the code implementation.
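Here is one way the full implementation might look. The helpers mirror the earlier sketches (repeated here so the snippet is self-contained), and the parameter names n_start and n_end follow the discussion below; the function name bleu is my own:

```python
import math
from collections import Counter

def make_ngrams(sentence, n):
    tokens = sentence.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(ca, refs, n):
    ngram_counts = Counter(make_ngrams(ca, n))
    clipped = sum(
        min(count, max(make_ngrams(ref, n).count(ngram) for ref in refs))
        for ngram, count in ngram_counts.items()
    )
    return clipped / sum(ngram_counts.values())

def brevity_penalty(ca, refs):
    c = len(ca.split())
    r = min((len(ref.split()) for ref in refs), key=lambda rl: (abs(rl - c), rl))
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(ca, refs, n_start=1, n_end=4, weights=None):
    if weights is None:
        # weight each n-gram order equally by default
        weights = [1 / (n_end - n_start + 1)] * (n_end - n_start + 1)
    precisions = [modified_precision(ca, refs, n) for n in range(n_start, n_end + 1)]
    # note: math.log(0) raises an error, so a zero precision at any order
    # needs smoothing (see nltk's SmoothingFunction)
    return brevity_penalty(ca, refs) * math.exp(
        sum(w * math.log(p) for w, p in zip(weights, precisions))
    )
```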
The weighting happens in the zip call inside the generator expression in the return statement. In this case, we apply weights across values of $n$ ranging from n_start to n_end .
Now we’re done! Let’s test out our final implementation with ca_1 for $n$ from 1 to 4, all weighted equally.
The nltk package offers functions for BLEU calculation out of the box. For convenience, let’s create a wrapper function. This wrapping isn’t strictly necessary, but it abstracts away several preprocessing steps, such as applying lower_n_split ; the nltk BLEU function expects tokenized input, whereas ca_1 and refs are untokenized sentences.
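A sketch of such a wrapper, using a simple lower-and-split tokenization in place of the post's lower_n_split helper (the name nltk_bleu is my own):

```python
from nltk.translate.bleu_score import sentence_bleu

def nltk_bleu(ca, refs, weights=(0.25, 0.25, 0.25, 0.25)):
    # nltk expects tokenized input: a list of reference token lists
    # and a single hypothesis token list
    return sentence_bleu(
        [ref.lower().split() for ref in refs],
        ca.lower().split(),
        weights=weights,
    )
```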
And we see that the result matches that derived from our own implementation!
In this post, we took a look at BLEU, a very common way of evaluating the fluency of machine translations. Studying the implementation of this metric was a meaningful and interesting process, not only because BLEU itself is widely used, but also because the motivation and intuition behind its construction was easily understandable and came very naturally to me. Each component of BLEU addresses some problem with simpler metrics, such as precision or modified precision. It also takes into account things like abnormally short or repetitive translations.
One area of interest for me these days is seq2seq models. Although RNN models have largely given way to transformers, I still think it’s a very interesting architecture worth diving into. I’ve also recently run into a combined LSTM-CNN approach for processing series data. I might write about these topics in a future post.
I hope you’ve enjoyed reading this post. Catch you up later!