Automatic evaluation of Machine Translation (MT): BLEU, its shortcomings, and other evaluation metrics

Presented by (as a part of CS 712):
Aditya Joshi, Kashyap Popat, Shubham Gautam
IIT Bombay

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Part I: Introduction and formulation of BLEU

Presented by (as a part of CS 712):
Aditya Joshi, Kashyap Popat, Shubham Gautam
IIT Bombay

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
“Poetry is what gets lost in translation.”

Robert Frost, Poet (1874–1963)
Wrote the famous poem 'Stopping by Woods on a Snowy Evening', better known as 'Miles to go before I sleep'
Motivation
How do we judge a good translation? Can a machine do this?

Why should a machine do this? Because humans take time.
Outline
Part I:
• Evaluation
• Formulating BLEU Metric
• Understanding BLEU formula

Part II:
• Shortcomings of BLEU
• Shortcomings in context of English-Hindi MT

Reading:
Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
BLEU - I: The BLEU Score rises
Evaluation [1]

• Of NLP systems
• Of MT systems
Evaluation in NLP: Precision/Recall

Precision: How many of the returned results were correct?
Recall: What portion of the correct results were returned?

Adapting precision/recall to NLP tasks:

• Document Retrieval
Precision = |Documents relevant and retrieved| / |Documents retrieved|
Recall = |Documents relevant and retrieved| / |Documents relevant|

• Classification
Precision = |True Positives| / |True Positives + False Positives|
Recall = |True Positives| / |True Positives + False Negatives|
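As a quick illustration, a minimal Python sketch of the classification variants of these two formulas (the counts below are hypothetical, not from the slides):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical classifier output: 8 true positives, 2 false positives,
# 4 false negatives.
print(precision_recall(tp=8, fp=2, fn=4))  # (0.8, 0.666...)
```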
Evaluation in MT [1]

• Operational evaluation
– “Is MT system A operationally better than MT system B? Does MT system A cost less?”

• Typological evaluation
– “Have you ensured which linguistic phenomena the MT system covers?”

• Declarative evaluation
– “How does quality of output of system A fare with respect to that of B?”
Operational evaluation

• Cost-benefit is the focus
• To establish cost-per-unit figures and use this as a basis for comparison
• Essentially 'black box'
• Realism requirement
Example (cost-benefit comparison):
Word-based SA: cost of pre-processing (stemming etc.), cost of learning a classifier
Sense-based SA: cost of pre-processing (stemming etc.), cost of learning a classifier, cost of sense annotation
Typological evaluation

• Use a test suite of examples
• Ensure all relevant phenomena to be tested are covered
• Specific-to-language phenomena

For example: हर आख वाला लड़का म कराया
hari aankhon-wala ladka muskuraya
green eyes-with boy smiled
The boy with green eyes smiled
Declarative evaluation

• Assign scores to specific qualities of output
– Intelligibility: how good the output is as a well-formed target language entity
– Accuracy: how good the output is in terms of preserving content of the source text

For example: I am attending a lecture
म एक या यान बठा ह (Main ek vyaakhyan baitha hoon / I a lecture sit (present, first person) / I sit a lecture): accurate but not intelligible
म या यान ह (Main vyakhyan hoon / I lecture am / I am lecture): intelligible but not accurate
Evaluation bottleneck

• Typological evaluation is time-consuming
• Operational evaluation needs accurate modeling of cost-benefit
• Automatic MT evaluation is declarative: BLEU (Bilingual Evaluation Understudy)
Deriving BLEU [2]

• Incorporating precision
• Incorporating recall
How is translation performance measured?

“The closer a machine translation is to a professional human translation, the better it is.”

This needs:
• A corpus of good quality human reference translations
• A numerical “translation closeness” metric
Preliminaries

• Candidate Translation(s): translation returned by an MT system
• Reference Translation(s): 'perfect' translation by humans

Goal of BLEU: to correlate with human judgment
Formulating BLEU (Step 1): Precision

Source: I had lunch now

Candidate 1: मन अब खाना खाया
maine ab khana khaya (I now food ate / I ate food now)
matching unigrams: 3, matching bigrams: 1

Candidate 2: मन अभी लच एट
maine abhi lunch ate (I now lunch ate / I ate lunch(OOV) now(OOV))
matching unigrams: 2, matching bigrams: 1

Reference 1: मन अभी खाना खाया
maine abhi khana khaya (I now food ate / I ate food now)

Reference 2: मन अभी भोजन कया
maine abhi bhojan kiyaa (I now meal did / I did meal now)

Unigram precision: Candidate 1: 3/4 = 0.75, Candidate 2: 2/4 = 0.5
Similarly, bigram precision: Candidate 1: 0.33, Candidate 2: 0.33
Precision: Not good enough

Reference: मझपर तरा स र छाया
mujh-par tera suroor chhaaya (me-on your spell cast / Your spell was cast on me)

Candidate 1: मर तरा स र छाया
mere tera suroor chhaaya (my your spell cast / Your spell cast my)
matching unigrams: 3

Candidate 2: तरा तरा तरा स र
tera tera tera suroor (your your your spell)
matching unigrams: 4

Unigram precision: Candidate 1: 3/4 = 0.75, Candidate 2: 4/4 = 1
Formulating BLEU (Step 2): Modified Precision

• Clip the total count of each candidate word with its maximum reference count
• Count_clip(n-gram) = min(count, max_ref_count)

Reference: मझपर तरा स र छाया
mujh-par tera suroor chhaaya (me-on your spell cast / Your spell was cast on me)

Candidate 2: तरा तरा तरा स र
tera tera tera suroor (your your your spell)

• matching unigrams: (तरा: min(3, 1) = 1), (स र: min(1, 1) = 1)

Modified unigram precision: 2/4 = 0.5
Modified n-gram precision

For the entire test corpus, for a given n (formula from [2]):

p_n = \frac{\sum_{C \in Candidates} \sum_{ngram \in C} Count_{clip}(ngram)}{\sum_{C' \in Candidates} \sum_{ngram' \in C'} Count(ngram')}

The numerator counts the matching (clipped) n-grams in each candidate C; the denominator counts all n-grams in each candidate C'. Modified precision is thus computed over all candidates of the test corpus.
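A minimal Python sketch of this clipped (modified) n-gram precision, here for a single candidate with one or more references (function and variable names are mine, not from [2]):

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

# The 'tera tera tera suroor' example above (transliterated):
refs = [["mujh-par", "tera", "suroor", "chhaaya"]]
cand = ["tera", "tera", "tera", "suroor"]
print(modified_precision(cand, refs, 1))  # 2/4 = 0.5
```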
Calculating modified n-gram precision (1/2)

• 127 source sentences were translated by two human translators and three MT systems
• Translated sentences were evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (2/2)

• Decaying precision with increasing n
• Comparative ranking of the five systems

Combining precision for different values of n. (Graph from [2])
Formulation of BLEU: Recap

• Precision cannot be used as-is
• Modified precision considers 'clipped word count'
Recall for MT (1/2)

• Candidates shorter than references

Reference: या ल लब वा य क गणव ता को समझ पाएगा
kya blue lambe vaakya ki guNvatta ko samajh paaega
will blue long sentence-of quality (case-marker) understand able (III-person-male-singular)
Will BLEU be able to understand quality of long sentences?

Candidate: लब वा य
lambe vaakya (long sentence)

Modified unigram precision: 2/2 = 1
Modified bigram precision: 1/1 = 1
Recall for MT (2/2)

• Candidates longer than references

Reference 1: मन खाना खाया
maine khaana khaayaa (I food ate / I ate food)

Reference 2: मन भोजन कया
maine bhojan kiyaa (I meal did / I had meal)

Candidate 1: मन खाना भोजन कया
maine khaana bhojan kiya (I food meal did / I had food meal)
Modified unigram precision: 1

Candidate 2: मन खाना खाया
maine khaana khaayaa (I food ate / I ate food)
Modified unigram precision: 1
Formulating BLEU (Step 3): Incorporating recall

• Sentence length indicates 'best match'
• Brevity penalty (BP)
– Multiplicative factor
– Candidate translations that match reference translations in length must be ranked higher

Candidate 1: लब वा य
Candidate 2: या ल लब वा य क गणव ता समझ पाएगा
Formulating BLEU (Step 3): Brevity Penalty

BP = 1 if c > r, and BP = e^{(1 - r/c)} if c ≤ r

where r = reference sentence length and c = candidate sentence length. (Formula from [2]; the curve e^{(1-x)} for x = r/c was plotted using www.fooplot.com.)

Why is BP = 1 for c > r, i.e., why does BP leave out longer translations?
Translations longer than the reference are already penalized by modified precision.
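A one-line Python rendering of this penalty (my naming, following r and c above):

```python
import math

def brevity_penalty(c, r):
    """BLEU brevity penalty: 1 for candidates longer than the reference,
    exponential decay e^(1 - r/c) for shorter candidates."""
    return 1.0 if c > r else math.exp(1 - r / c)

print(brevity_penalty(c=2, r=9))  # short candidate, heavy penalty (~0.03)
print(brevity_penalty(c=9, r=7))  # longer than reference, BP = 1
```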
Validating the claim (formula from [2]):

BLEU score: Precision → modified n-gram precision; Recall → brevity penalty

BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
Understanding BLEU
Dissecting the formula; using a news headline example

Decay in precision: why log p_n? To accommodate the decay in precision values. (Graph and formula from [2])
Dissecting the Formula

Claim: BLEU should lie between 0 and 1
Reason: to intuitively satisfy “1 implies perfect translation”
Understanding the constituents of the formula to validate the claim:
• Brevity penalty
• Modified precision
• Weights w_n, set to 1/N
(Formula from [2])
Validation of range of BLEU

p_n: between 0 and 1 ⇒ log p_n: between −∞ and 0 ⇒ A = Σ w_n log p_n: between −∞ and 0 ⇒ e^A: between 0 and 1

BP: between 0 and 1 (as a function of r/c)

Hence BLEU = BP · e^A lies between 0 and 1. (Graphs drawn using www.fooplot.com)
Calculating BLEU: An actual example

Ref: भावना मक एक करण लान क लए एक योहार
bhaavanaatmak ekikaran laane ke liye ek tyohaar
emotional unity bring-to a festival
A festival to bring emotional unity

C1: लान क लए एक योहार भावना मक एक करण
laane-ke-liye ek tyohaar bhaavanaatmak ekikaran
bring-to a festival emotional unity (this is invalid Hindi ordering)

C2: को एक उ सव क बार म लाना भावना मक एक करण
ko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (this is invalid Hindi ordering)
Modified n-gram precision

Ref: भावना मक एक करण लान क लए एक योहार
C1: लान क लए एक योहार भावना मक एक करण (r = 7, c = 7)
C2: को एक उ सव क बार म लाना भावना मक एक करण (r = 7, c = 9)

C1:
n | Matching n-grams | Total n-grams | Modified n-gram precision
1 | 7 | 7 | 1
2 | 5 | 6 | 0.83

C2:
n | Matching n-grams | Total n-grams | Modified n-gram precision
1 | 4 | 9 | 0.44
2 | 1 | 8 | 0.125
Calculating BLEU score

w_n = 1/N = 1/2, r = 7
C1: c = 7, BP = exp(1 − 7/7) = 1
C2: c = 9, BP = 1 (since c > r)

C1:
n | w_n | p_n | log p_n | w_n · log p_n
1 | 1/2 | 1 | 0 | 0
2 | 1/2 | 0.83 | −0.186 | −0.093
Total: −0.093 → BLEU = 0.911

C2:
n | w_n | p_n | log p_n | w_n · log p_n
1 | 1/2 | 0.44 | −0.821 | −0.4105
2 | 1/2 | 0.125 | −2.07 | −1.035
Total: −1.445 → BLEU = 0.235
Hence the BLEU scores:

Ref: भावना मक एक करण लान क लए एक योहार
C1: लान क लए एक योहार भावना मक एक करण → 0.911
C2: को एक उ सव क बार म लाना भावना मक एक करण → 0.235
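The whole computation above fits in a few lines of Python; a sketch under the same setup (N = 2, uniform weights, single reference; transliterated tokens stand in for the Hindi words):

```python
import math
from collections import Counter

def ngrams(toks, n):
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU with one reference and uniform weights w_n = 1/N."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        p_n = sum(min(c, ref[g]) for g, c in cand.items()) / sum(cand.values())
        if p_n == 0:
            return 0.0  # any zero n-gram precision zeroes the score
        log_sum += (1 / max_n) * math.log(p_n)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_sum)

ref = "bhaavanaatmak ekikaran laane ke liye ek tyohaar".split()
c1 = "laane ke liye ek tyohaar bhaavanaatmak ekikaran".split()
c2 = "ko ek utsav ke baare mein laana bhaavanaatmak ekikaran".split()
# 0.913 and 0.236; the slides' 0.911 and 0.235 round p_n first.
print(round(bleu(c1, ref), 3), round(bleu(c2, ref), 3))
```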
BLEU vs human judgment [2]

Setup:
Source language: Chinese; target language: English

Five systems perform translation:
• 3 automatic MT systems
• 2 human translators

BLEU scores are obtained for each system.
Human judgment (on a scale of 5) is obtained for each system from:
• Group 1: ten monolingual speakers of the target language (English)
• Group 2: ten bilingual speakers of Chinese and English
BLEU vs human judgment

• Monolingual speakers: correlation coefficient 0.99
• Bilingual speakers: correlation coefficient 0.96

BLEU and human evaluation for S2 and S3. (Graph from [2])

Comparison of normalized values:
• High correlation between the monolingual group and the BLEU score
• The bilingual group were lenient on 'fluency' for H1
• The demarcation between S1-S3 and H1-H2 is captured by BLEU
(Graph from [2])
Conclusion

• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
– Considering constituents of the formula
– Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Part II: Shortcomings of BLEU (continued from 7th March 2013)

Presented by (as a part of CS 712):
Aditya Joshi, Kashyap Popat, Shubham Gautam
IIT Bombay

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
BLEU score revisited: Precision → modified n-gram precision; Recall → brevity penalty. (Formula from [2])
Outline

Part I:
• Evaluation
• Formulating BLEU Metric
• Understanding BLEU formula

Part II:
• Shortcomings of BLEU in general
• Shortcomings in context of English-Hindi MT

Reading:
Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of Bleu in Machine Translation Research. European ACL (EACL) 2006, 2006.
R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[A domesticated bird wants to learn to fly]

“I can do this. I just have to work out the physics. I have quadrated my vector angles. I have adjusted for wind shear. This is it. Let's fly. Just keep it simple: thrust, lift, drag and wait. Thrust, lift, drag, wait. [speeds up to get to the edge of the table] Thrust, lift, drag, wai-wai-wait!” [gets scared and tries to stop himself but instead falls from the table]

– Blu the Macaw, lines from Rio (2011)
Image Source: www.fanpop.com
Use of BLEU metric [3]

• Evaluating incremental system changes
– Neglecting actual examples
• Systems ranked on the basis of BLEU
• Does minimizing error rate with respect to BLEU indeed guarantee translation improvements?
Criticisms of BLEU [4]

• Admits too much variation [3]
• Admits too little variation [4] [3]
• Poor correlation with human judgment [3]
Admits too much variation [3]

• Permuting phrases
• Drawing different items from the reference set

Admits too much variation because:
• BLEU relies on n-gram matching only
• It puts very few constraints on how n-gram matches can be drawn from multiple reference translations
(The BLEU formula again: brevity penalty incorporating recall; modified n-gram precision; weights w_n set to 1/N.)
Modified n-gram precision in BLEU [2] (recap)

• Clip the total count of each candidate word with its maximum reference count
• Count_clip(n-gram) = min(count, max_ref_count)

Reference: मझपर तरा स र छाया
mujhpar teraa suroor chhaaya (me-on your spell has-been-cast / Your spell has been cast on me)

Candidate: तरा तरा तरा स र
tera tera tera suroor (your your your spell / Your your your spell)

matching unigrams: (तरा: min(3, 1) = 1), (स र: min(1, 1) = 1)
Modified unigram precision: 2/4 = 0.5
Permuting phrases [3]

• Reordering of unmatched phrases does not affect precision
• Bigram mismatch sites can be freely permuted
• Possible to randomly produce other hypothesis translations that have the same BLEU score

B1 B2 B3 B4 → B4 B2 B1 B3 (B_i: bigram-matched chunks; the boundaries between them are bigram mismatches)
Issues with precision (1/2)

Reference 1: राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gaye
King and queen to-jungle for-hunting went

Reference 2: राजा और उनक बीवी शकार करन जगल गय
raaja aur unki biwi shikaar karne jangal gaye
king and his wife to-do-hunting jungle went

(The king and the queen went to the jungle to hunt.)

Candidate: राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gaye
King and queen to-do-hunting to-jungle went
Matching bi-grams = 4/8

Candidate (permuted): राजा और रानी शकार करन जगल गय म
raaja aur raani shikaar karne gaye jungle mein
King and queen to-do-hunting went jungle to (grammatically incorrect)
Matching bi-grams = 4/8
Issues with precision (2/2)

Same references and original candidate as above; another permutation:

Candidate (permuted): शकार करन जगल राजा और रानी म गय
shikaar karne jungle raaja aur raani mein gaye
to-do hunting jungle raja and rani in went (grammatically incorrect)
Matching bi-grams = 4/8
Permuting phrases in general

For 'b' bigram matches in a candidate translation of length 'k', there are (k − b)! possible ways to generate similarly scored items using only the words in this translation.

In our example: (8 − 4)! = 24 candidate translations.

Derivation: in a sentence of length k,
total bigrams = k − 1
matched bigrams = b
number of mismatched bigrams = k − 1 − b
number of matched chunks = (k − 1 − b) + 1 = k − b
These (k − b) chunks can be reordered in (k − b)! ways.
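A small Python check of this argument on a toy pair (tokens a-h are stand-ins of mine, not from [3]): split the candidate at its bigram-mismatch sites and verify that every permutation of the resulting chunks leaves the unigram and bigram match counts unchanged. (In general, new bigrams could accidentally appear at chunk boundaries; for this toy pair none do.)

```python
from collections import Counter
from itertools import permutations

def ngrams(toks, n):
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def clipped_matches(cand, ref, n):
    cc, rc = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
    return sum(min(c, rc[g]) for g, c in cc.items())

ref = "a b c d e f g h".split()
cand = "a b x d e y g h".split()   # k = 8 tokens, b = 3 bigram matches

# Split the candidate into chunks at bigram-mismatch sites: k - b = 5 chunks.
ref_bigrams = set(ngrams(ref, 2))
chunks, current = [], [cand[0]]
for prev, word in zip(cand, cand[1:]):
    if (prev, word) in ref_bigrams:
        current.append(word)
    else:
        chunks.append(current)
        current = [word]
chunks.append(current)

scores = {tuple(clipped_matches([w for ch in p for w in ch], ref, n)
                for n in (1, 2))
          for p in permutations(chunks)}
print(len(chunks), scores)  # 5 chunks -> 5! = 120 reorderings, one score: {(6, 3)}
```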
Permuting phrases: Evaluation

• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translations
Drawing different items from the reference set

• If two systems 'recall' x words each from a reference sentence:
• Precision may remain the same
• Brevity penalty remains the same
• But translation quality need not remain the same
Drawing different items: Example

You may omit lexical items altogether.

Reference: घर क मग मतलब दाल होती ह
Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like is
(Chicken at home is like daal.)

Candidate 1: मग मतलब होती ह
Ki matlab jaisi hai (Of means like is)
Unigram precision: 1; Bigram precision: 2/3

Candidate 2: घर क मग दाल
Ghar murgi daal hoti (House chicken daal is)
Unigram precision: 1; Bigram precision: 2/3
Drawing different items: In general

• If there are x unigram matches and y bigram matches in both candidates C1 and C2:
– The BLEU score remains the same
– Which words are actually matched is not accounted for
– Synonyms also indicate 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]

• NIST MT Evaluation exercise
• Failure to correlate with human judgment
– In the 2005 exercise, the system ranked 1st in human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]

• The system with the high human judgment score is the outlier. (Graphs from [3])

First graph:
Outlier included | Pearson correlation (R²)
Yes | 0.14
No | 0.87

Second graph:
Outlier included | Pearson correlation (R²)
Yes | 0.002
No | 0.742
Failures in practice: Evaluation 2 (1/2) [3]

• French-English
• System 1: rule-based system using Systran
• Systems 2, 3: SMT systems using two sets of Europarl data (28 million words per language)
• 300 test sentences

Failures in practice: Evaluation 2 (2/2) [3]

• BLEU underestimates human judgment. (Graph from [3])
Admits too much and too little variation [4]

• Indicative translations
• Linguistic divergences and evaluation
Indicative translation

• Draft quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi):

• Structural transfer
– S V O to S O V
• Lexical transfer
– Looking up the corresponding word
– Adding gender, aspect, tense, etc. information

Indicative translation example: Rohan goes to school by bus

Structural transfer (SVO to SOV): (Rohan) (to school) (by bus) (goes)
Lexical transfer (word translation): (Rohan) (paathshaala) (bus) (jaana)
Lexical transfer (adding GNP/TAM information): (Rohan) (paathshaala) (bus se) (jaata hai)

Indicative output: Rohan paathshaala bus se jaata hai / Rohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]

• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is the “failure of MT” (defined using any measure) simply a “failure in relation to inappropriate goals” (translating like a human)?
Linguistic Divergences

• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence

• Change in lexical category
(In the examples below, E = English source, H = human reference translation, I = indicative MT output.)

E: I am feeling hungry
H: मझ भख लग रह ह
mujhe bhookh lag rahi hai (to-me hunger feeling is)
I: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon (I hungry feel doing am)
Unigram precision: 0; Bigram precision: 0; BLEU score: 0

Another example:
E: I am sweating
H: मझ पसीना आ रहा ह
Mujhe paseena aa raha hai (to-me sweat coming is)
Noun-noun compounds

E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस
Aamir khaan ki dus sarvottam performances (Aamir Khan-of ten best performances)
I: दस सव तम आ मर ख़ान पफॉम सस
Dus sarvottam aamir khaan performances (Ten best Aamir Khan performances)

Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences

E: Food, clothing and shelter are a man's basic needs.
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain (bread, clothing and house a man-of basic needs are)
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain (food, clothing and shelter a man-of basic needs are)

Unigram precision: 8/10; Bigram precision: 6/9
• Similar cases: 'maamaa'/'taau' (kinship terms)
Pleonastic divergence

• Words with no semantic content in the target language

E: It is raining
H: बा रश हो रह ह
baarish ho rahi hai (rain happening is)
I: यह बा रश हो रह ह
yeh baarish ho rahi hai (it rain happening is)
Unigram precision: 4/5; Bigram precision: 3/4

Another example:
E: One should not trouble the weak
H: दबल को परशान नह करना चा हए
durbalon ko pareshaan nahi karna chahiye (to-weak trouble not do should / Should not trouble the weak)
Stylistic differences

• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hain (Lok Sabha in 545 members are)
I: लोक सभा क पास 545 सद य ह
lok sabha ke paas 545 sadasya hain (Lok Sabha has/near 545 members are)

Unigram precision: 5/7; Bigram precision: 3/6
WSD errors

• Synonyms vs incorrect senses

E: I purchased a bat
H: मन एक ब ला खर दा (reference)
maine ek ballaa kharidaa (I a cricket-bat bought / I bought a cricket bat)
I: मन एक चमगादड़ खर दा
maine ek chamgaadaD kharidaa (I a bat (mammal) bought / I bought a bat (mammal))

E: The thieves were held
H: चोर को गर तार कया
choron ko giraftaar kiyaa (thieves arrest done / The thieves were arrested)
I1: चोर को पकड़ा
choron ko pakdaa (thieves caught / The thieves were caught)
I2: चोर को आयोिजत कया
choron ko aayojit kiya (thieves organized done / The thieves were organized)
Evaluation: Precision

• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU

• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'

Re-defining a match [4]:
• Allowing synonyms
• Allowing root forms of words
• Incorporating language-specific divergences

Limiting use of BLEU [3] [4]:
• Do not use BLEU for radically different systems
• Evaluate on the basis of the nature of indicative translation
Conclusion

• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics

Presented as a part of CS712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.

Lord Karthikeya: “I won! I went around the earth, the world, once.”
Lord Ganesha: “I won! I went around my parents. They are my world.”

Who among them performed better? Who won the race?

(Source: www.jaishreeganesha.com)
Outline

• Manual Evaluation
• Automatic Evaluation
– BLEU
– TER
– WER
– ROUGE
– METEOR
– NIST
– GTM
– Entailment-based MT evaluation
Preliminaries

• Candidate Translation(s): translation returned by an MT system
– Also called hypothesis translation
• Reference Translation(s): 'perfect' translation by humans
• Each metric is compared with BLEU on three dimensions:
– Handling incorrect words
– Handling incorrect word order
– Handling recall
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation: Assigning adequacy and fluency

• Evaluators use their own perception to rate
• Often adequacy/fluency scores correlate, which is undesirable

Adequacy (is the meaning translated correctly?):
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency (is the sentence grammatically valid?):
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

म एक या यान बठा ह (Main ek vyaakhyan baitha hoon / I a lecture sit (present, first person) / I sit a lecture): adequate but not fluent

म या यान ह (Main vyakhyan hoon / I lecture am / I am lecture): fluent but not adequate
BLEU [2]

• Proposed by IBM; a popular metric
• Uses modified n-gram precision and brevity penalty (for recall)
Translation Edit Rate (TER) [5]

• Introduced in the GALE MT task
• Central idea: the edits required to change a hypothesis translation into a reference translation
Source: Prime Minister of India will address the nation today

Reference translation: भारत क धान-म ी आज रा को सबो धत करग
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge

Candidate translation: धान-म ी भारत क आज को अ स करग आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj

Edit types: Shift, Deletion, Substitution, Insertion. Here the slide marks 4 edits against the 8-word reference.

Formula for TER:
TER = # Edits / Avg. number of reference words

• Cost of shift 'distance' is not incorporated
• Mis-capitalization is also considered an error

TER = 4/8 = 0.5
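A minimal Python sketch of the idea, assuming a simplified TER without the shift operation (it then reduces to word-level edit distance divided by reference length; real TER also searches over block shifts):

```python
def simple_ter(hyp, ref):
    """Word-level edit distance (substitution/insertion/deletion only)
    divided by reference length. Real TER additionally allows shifts."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[m][n] / n

ref = "bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge".split()
hyp = "pradhaan-mantri bhaarat ke aaj ko address karenge aaj".split()
print(simple_ter(hyp, ref))  # over-counts the shift as delete + insert
```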
HTER: Human TER

• Semi-supervised technique
• Human annotators create a new reference for each translation based on the system output
– They must consider fluency and adequacy while generating the closest target translation
TER vs BLEU

Handling incorrect words:
  TER: substitution
  BLEU: n-gram mismatch
Handling incorrect word order:
  TER: shift, or delete + insert, incorporates this error
  BLEU: n-gram mismatch
Handling recall:
  TER: missed words become deleted words
  BLEU: precision cannot detect 'missing' words; hence brevity penalty

(TER = # Edits / Avg. number of reference words)
Word Error Rate (WER) [9]

• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum substitutions, deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
• Also: position-independent word error rate (PER)
WER: Example

• Order of words is important
• Dynamic programming-based implementation to find the 'minimum' errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence

# errors: 3 (substituting 'looks'→'seems' and 'correct'→'right', deleting 'like'); WER = 3/6 = 0.5
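A self-contained Python sketch (my naming) computing WER with the usual rolling-row Levenshtein DP, plus the position-independent variant PER mentioned above, which compares only bags of words:

```python
from collections import Counter

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
    return d[-1]

ref = "this looks like the correct sentence".split()
hyp = "this seems the right sentence".split()
print(edit_distance(hyp, ref) / len(ref))  # WER = 3/6 = 0.5

def per(hyp, ref):
    """Position-independent error rate: word order is ignored."""
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    return (len(ref) - overlap) / len(ref)

print(per(hyp, ref))  # 3/6 = 0.5 here; a pure reordering would score 0
```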
WER vs BLEU

Handling incorrect words:
  WER: substitution
  BLEU: n-gram mismatch
Handling incorrect word order:
  WER: delete + insert
  BLEU: n-gram mismatch
Handling recall:
  WER: missed words become deleted words
  BLEU: precision cannot detect 'missing' words; hence brevity penalty
WER vs TER

WER:
  Shift is not taken into account (it surfaces as a deletion plus an insertion)
  Intuition is based on errors in unigram translation
TER:
  Shift is incorporated; however, it is not weighted by distance
  Intuition is based on the edits required by humans

(TER = # Edits / Avg. number of reference words)
ROUGE [6]

• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N

ROUGE-N incorporates recall: matching n-grams are counted against the n-grams of the reference, not of the candidate.

Source: Will BLEU be able to understand quality of long sentences?

Reference translation: या ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega

Candidate translation: लब वा य
Lambe vaakya

ROUGE-2 = 1/8 (one matching bigram out of the reference's 8), whereas the modified bigram precision is 1.
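A minimal sketch, assuming the single-reference form of ROUGE-N (matching n-grams divided by the reference's n-gram count; naming is mine, after [6]):

```python
from collections import Counter

def ngrams(toks, n):
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def rouge_n(candidate, reference, n):
    """Recall-oriented: clipped matches over the n-gram count of the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / sum(ref.values())

ref = "kya bloo lambe waakya ki gunvatta ko samajh paaega".split()
cand = "lambe waakya".split()
print(rouge_n(cand, ref, 2))  # 1/8 = 0.125, though bigram precision is 1/1
```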
Other ROUGEs

• ROUGE-L
– Considers the longest common subsequence
• ROUGE-W
– Weighted ROUGE-L: all common subsequences are considered, with weight based on length
• ROUGE-S
– Precision/recall by matching skip bigrams
ROUGE (suite of metrics) vs BLEU

Handling incorrect words:
  ROUGE: skip bigrams, ROUGE-N
  BLEU: n-gram mismatch
Handling incorrect word order:
  ROUGE: longest common subsequence
  BLEU: n-gram mismatch
Handling recall:
  ROUGE: ROUGE-N incorporates missing words
  BLEU: precision cannot detect 'missing' words; hence brevity penalty
METEOR [7]

• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy

METEOR's criticisms of BLEU:
• Brevity penalty is punitive
• Higher order n-grams may not indicate grammatical correctness of a sentence
• BLEU is often zero; should a score be zero?
METEOR: Process

Find an alignment between words:
Phase I: List all possible mappings based on matches
Phase II: Select the best possible alignment, i.e., the one with the least number of 'alignment crosses/overlaps'
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced; The intelligent and excited dude jumped
METEOR: The score

• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean
• Penalty: find 'as many chunks' as match

Reference: The bright boy sits on the black bench
Candidate: The intelligent guy sat on the dark bench

More accurate → fewer chunks → less penalty
Less accurate → more chunks → more penalty
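A sketch of the score as given in Banerjee and Lavie (2005) [7], taking the unigram alignment (match and chunk counts) as precomputed input; the counts in the example are my reading of the sentence pair above under exact matching:

```python
def meteor(matches, cand_len, ref_len, chunks):
    """METEOR sentence score given an alignment of `matches` unigrams
    grouped into `chunks` contiguous runs."""
    if matches == 0:
        return 0.0
    p, r = matches / cand_len, matches / ref_len
    f_mean = 10 * p * r / (r + 9 * p)        # recall weighted 9x over precision
    penalty = 0.5 * (chunks / matches) ** 3  # fragmentation penalty
    return f_mean * (1 - penalty)

# 'The bright boy sits on the black bench' vs
# 'The intelligent guy sat on the dark bench':
# 4 matched unigrams ('the', 'on', 'the', 'bench') in 3 chunks.
print(meteor(matches=4, cand_len=8, ref_len=8, chunks=3))  # ~0.394
```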
METEOR vs BLEU

Handling incorrect words:
  METEOR: alignment chunks; matching can be done using different techniques, so it is adaptable
  BLEU: n-gram mismatch
Handling incorrect word order:
  METEOR: chunks may be ordered in any manner; METEOR does not capture this
  BLEU: n-gram mismatch
Handling recall:
  METEOR: the idea of alignment incorporates missing-word handling
  BLEU: precision cannot detect 'missing' words; hence brevity penalty
NIST Metric: Introduction [8]

• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones)
– Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given

NIST Metric
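The formula slide itself did not survive extraction; the following is my reconstruction of the formulas from Doddington (2002) [8], not a quote from the slides:

```latex
% Information weight of an n-gram (counts taken over the reference data):
Info(w_1 \ldots w_n) = \log_2\!\left( \frac{count(w_1 \ldots w_{n-1})}{count(w_1 \ldots w_n)} \right)

% Score: average information of co-occurring n-grams, times a brevity factor;
% beta is chosen so the factor is 0.5 when L_sys is 2/3 of the average L_ref.
NIST = \sum_{n=1}^{N}
  \frac{\sum_{\text{co-occurring } w_1 \ldots w_n} Info(w_1 \ldots w_n)}
       {\sum_{w_1 \ldots w_n \,\in\, \text{sys output}} 1}
  \cdot \exp\!\left( \beta \log^2 \min\!\left( \frac{L_{sys}}{\bar{L}_{ref}},\, 1 \right) \right)
```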
NIST vs BLEU

Handling incorrect words:
  NIST: information gain gives an additional level of 'matching'
  BLEU: n-gram mismatch
Handling incorrect word order:
  NIST: no additional provision; n-gram mismatch (based on info gain)
  BLEU: n-gram mismatch
Handling recall:
  NIST: brevity penalty factor based on L_sys and L_ref
  BLEU: precision cannot detect 'missing' words; hence brevity penalty
GTM [10]

• General Text Matcher
• F-score: uses precision and recall
• Does not rely on 'human judgment' correlation
– What does a BLEU score of 0.006 mean?
• Comparison is easier

GTM scores: Precision and Recall
• MMS: Maximum Match Size
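The score formulas are likewise missing from the extracted slide; a sketch as I understand them from Turian et al. (2003) [10], where MMS(C, R) is the maximum match size between candidate C and reference R, with longer contiguous runs of matched words rewarded:

```latex
% GTM precision, recall and F-measure in terms of the maximum match size:
precision(C, R) = \frac{MMS(C, R)}{|C|} \qquad
recall(C, R)    = \frac{MMS(C, R)}{|R|} \qquad
F = \frac{2 \cdot precision \cdot recall}{precision + recall}
```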
GTM vs BLEU

Handling incorrect words:
  GTM: precision based on maximum match size
  BLEU: n-gram mismatch
Handling incorrect word order:
  GTM: by considering maximum runs
  BLEU: n-gram mismatch
Handling recall:
  GTM: recall based on maximum match size
  BLEU: precision cannot detect 'missing' words; hence brevity penalty
Entailment-based evaluation [12]

Why entailment?

E: I am feeling hungry
H (Ref): मझ भख लग रह ह
mujhe bhookh lag rahi hai
Candidate: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon

BLEU score: 0. Clearly, the candidate is entailed by the reference translation.
Entailment-based evaluation [12]

• Evaluation of MT output for adequacy is an entailment task
• A candidate translation (i.e., MT system output) should entail the reference translation and vice versa
The Stanford RTE system

Pipeline: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)
RTE for MT evaluation

Same pipeline: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)

Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
Experimental Setup: single reference translation; two MT datasets with English as the target language

NIST MT dataset:
• NIST06: The NIST MT-06 Arabic-to-English dataset
• NIST08A: The NIST MT-08 Arabic-to-English dataset
• NIST08C: The NIST MT-08 Chinese-to-English dataset
• NIST08U: The NIST MT-08 Urdu-to-English dataset

ACL SMT dataset:
• SMT06E: The NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: The NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: The ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup

4 systems are used:
• BLEU-4
• MT
• RTE
• RTE + MT

Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)

• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
– Considering constituents of the formula
– Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)

• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)

• Manual Evaluation: basic techniques and bottlenecks
• WER: error in words based on word order; also PER
• TER: based on the edits required to get the reference translation; also Human TER
• ROUGE: suite of metrics incorporating n-gram recall, skip bigrams, etc.
• METEOR: considers alignments; allows the addition of different types of matching
• NIST: information gain-based; comparatively weights content and function words
• GTM: modifies precision and recall based on maximum match size
• Entailment-based evaluation: the reference translation must entail the candidate translation and vice versa
References

[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of Bleu in Machine Translation Research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
Part I Introduction and formulation of BLEU
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
Poetry is what gets Poetry is what gets lost in translationlost in translationrdquordquo
Robert FrostPoet (1874 ndash 1963)Wrote the famous poem lsquoStopping by woods on a snowy eveningrsquo better known as lsquoMiles to go before I sleeprsquo
Motivation
How do we judge a good translationCan a machine do this
Why should a machine do this Because humans take time
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEUbull Shortcomings in context of English-Hindi MT
Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an assessment Machine Translation Volume 8 pages 1ndash27 1993
K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
Part I
Part II
BLEU - IThe BLEU Score rises
EvaluationEvaluation [1][1]
Of NLP systemsOf MT systems
Evaluation in NLP PrecisionRecall
Precision How many results returned were correct
RecallWhat portion of correct results were returned
Adapting precisionrecall to NLP tasks
Evaluation in NLP PrecisionRecall
bull Document Retrieval
Precision = |Documents relevant and retrieved||Documents retrieved|
Recall=|Documents relevant and retrieved|| Documents relevant|
bull Classification
Precision = |True Positives||True Positives + False Positives|
Recall=|True Positives|| True Positives + False Negatives|
Evaluation in MT [1]
bull Operational evaluationndash ldquoIs MT system A operationally better than MT system B
Does MT system A cost lessrdquo
bull Typological evaluationndash ldquoHave you ensured which linguistic phenomena the MT
system coversrdquo
bull Declarative evaluationndash ldquoHow does quality of output of system A fare with respect
to that of Brdquo
Operational evaluation
bull Cost-benefit is the focusbull To establish cost-per-unit figures and use this
as a basis for comparisonbull Essentially lsquoblack boxrsquobull Realism requirement
Word-based SA
Sense-based SA
Cost of pre-processing (Stemming etc)Cost of learning a classifier
Cost of pre-processing (Stemming etc)Cost of learning a classifierCost of sense annotation
Typological evaluation
bull Use a test suite of examplesbull Ensure all relevant phenomena to be tested
are covered
bull Specific-to-language phenomenaFor example हर आख वाला लड़का म कराया
hari aankhon-wala ladka muskurayagreen eyes-with boy smiledThe boy with green eyes smiled
Declarative evaluation
bull Assign scores to specific qualities of outputndash Intelligibility How good the output is as a well-
formed target language entityndash Accuracy How good the output is in terms of
preserving content of the source text
For example I am attending a lectureम एक या यान बठा ह Main ek vyaakhyan baitha hoonI a lecture sit (Present-first person)I sit a lecture Accurate but not intelligibleम या यान ह Main vyakhyan hoonI lecture amI am lecture Intelligible but not accurate
Evaluation bottleneck
bull Typological evaluation is time-consumingbull Operational evaluation needs accurate
modeling of cost-benefit
bull Automatic MT evaluation DeclarativeBLEU Bilingual Evaluation Understudy
Deriving BLEU Deriving BLEU [2][2]
Incorporating PrecisionIncorporating Recall
How is translation performance measured
The closer a machine translation is to a professional human translation the better it is
bull A corpus of good quality human reference translations
bull A numerical ldquotranslation closenessrdquo metric
Preliminaries
bull Candidate Translation(s) Translation returned by an MT system
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
Goal of BLEU To correlate with human judgment
Formulating BLEU (Step 1) PrecisionI had lunch now
Candidate 1 मन अब खाना खाया
maine ab khana khaya matching unigrams 3I now food ate matching bigrams 1I ate food now
Candidate 2 मन अभी लच एटmaine abhi lunch ate matching unigrams 2I now lunch ate
I ate lunch(OOV) now(OOV) matching bigrams 1Unigram precision Candidate 1 34 = 075 Candidate 2 24 = 05Similarly bigram precision Candidate 1 033 Candidate 2 = 033
Reference 1 मन अभी खाना खायाmaine abhi khana khayaI now food ateI ate food now
Reference 2 मन अभी भोजन कयाmaine abhi bhojan kiyaa
I now meal didI did meal now
Precision Not good enough
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 1 मर तरा स र छाया matching unigram 3
mere tera suroor chhaayamy your spell castYour spell cast my
Candidate 2 तरा तरा तरा स र matching unigrams 4tera tera tera surooryour your your spell
Unigram precision Candidate 1 34 = 075 Candidate 2 44 = 1
Formulating BLEU (Step 2) Modified Precision
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spell
bull matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Modified n-gram precision
For entire test corpus for a given n
n-gram Matching n-grams in C
n-gramrsquo All n-grams in CModified precision for n-grams Overall candidates of
test corpus
Formula from [2]
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms; allowing root forms of words; incorporating specific language divergences
  – Limiting the use of BLEU [3][4]: do not use BLEU for radically different systems; evaluate on the basis of the nature of the indicative translation
• Do not be overly reliant on BLEU; use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
Source: www.jaishreeganesha.com
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
Manual Evaluation
Automatic Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Preliminaries
• Candidate translation(s): translation returned by an MT system – also called a hypothesis translation
• Reference translation(s): 'perfect' translation by humans
• Each metric will be compared with BLEU on three axes: handling incorrect words, handling incorrect word order, and handling recall
Outline
Automatic Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Often adequacy and fluency scores correlate, which is undesirable
Adequacy: Is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency: Is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible
मैं एक व्याख्यान बैठा हूँ
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
I sit a lecture – adequate but not fluent
मैं व्याख्यान हूँ
Main vyakhyan hoon
I lecture am
I am lecture – fluent but not adequate
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
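To make the two ingredients concrete, here is a minimal sentence-level sketch in Python. The real metric in [2] is computed at corpus level, pooling clipped counts over all candidate sentences; this toy version is for intuition only.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(cand, refs, n):
    # Clip each candidate n-gram count by its maximum count in any reference.
    cand_counts = Counter(ngrams(cand, n))
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / max(1, sum(cand_counts.values()))

def bleu(cand, refs, N=4):
    c = len(cand)
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]  # closest ref length
    bp = 1.0 if c > r else math.exp(1 - r / c)                 # brevity penalty
    ps = [modified_precision(cand, refs, n) for n in range(1, N + 1)]
    if min(ps) == 0:   # any zero n-gram precision zeroes the whole score
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in ps) / N)    # uniform weights 1/N

print(bleu("the cat sat on the mat".split(), ["the cat sat on the mat".split()]))  # 1.0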
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation
Example: Prime Minister of India will address the nation today
Reference translation: भारत के प्रधान-मंत्री आज राष्ट्र को संबोधित करेंगे
(Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge)
Candidate translation: प्रधान-मंत्री भारत के आज को अड्रेस करेंगे आज
(Pradhaan-mantri bhaarat ke aaj ko address karenge aaj)
Edit types marked in the example: Shift, Deletion, Substitution, Insertion
Formula for TER:
TER = # of edits / average number of reference words
• The 'distance' of a shift is not incorporated in its cost
• Mis-capitalization is also considered an error
Here: TER = 4/8
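One plausible reading of the marked edits, assuming exactly one edit of each type as the slide's marking suggests:

Shift:        प्रधान-मंत्री moved to the front       → 1 edit
Deletion:     राष्ट्र dropped from the candidate     → 1 edit
Substitution: संबोधित replaced by अड्रेस             → 1 edit
Insertion:    an extra आज added at the end          → 1 edit

TER = 4 edits / 8 reference words = 0.5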
HTER: Human TER
• Semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
  – They must consider fluency and adequacy while generating the closest target translation
TER vs. BLEU
Handling incorrect words – TER: substitution. BLEU: n-gram mismatch.
Handling incorrect word order – TER: shift, or delete + insert, incorporates this error. BLEU: n-gram mismatch.
Handling recall – TER: missed words become deleted words. BLEU: precision cannot detect 'missing' words; hence the brevity penalty.
(TER = # of edits / average number of reference words)
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions needed to convert the generated text (hyp) into the reference text (ref), normalized by reference length
• A related measure: position-independent word error rate (PER)
WER: Example
• Order of words is important
• A dynamic-programming implementation finds the minimum number of errors
Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
# of errors: 3 (looks→seems and correct→right are substitutions; like has no counterpart), so WER = 3/6 = 0.5
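A minimal Python sketch of the dynamic program: the standard word-level Levenshtein distance, divided by the reference length. This is not tied to any particular toolkit.

def wer(hyp, ref):
    """Word error rate: word-level Levenshtein distance / reference length."""
    m, n = len(hyp), len(ref)
    # d[i][j] = minimum edits to turn hyp[:i] into ref[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete everything remaining in hyp
    for j in range(n + 1):
        d[0][j] = j          # insert everything remaining from ref
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n] / n

ref = "This looks like the correct sentence".split()
hyp = "This seems the right sentence".split()
print(wer(hyp, ref))  # 3 errors / 6 reference words = 0.5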
WER vs. BLEU
Handling incorrect words – WER: substitution. BLEU: n-gram mismatch.
Handling incorrect word order – WER: delete + insert. BLEU: n-gram mismatch.
Handling recall – WER: missed words become deleted words. BLEU: precision cannot detect 'missing' words; hence the brevity penalty.
WER vs. TER
WER: shift is not taken into account; the intuition is based on errors in word-level translation.
TER: shift is incorporated, though not weighted by distance; the intuition is based on the edits a human would require.
(TER = # of edits / average number of reference words)
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall. Will BLEU be able to understand the quality of long sentences?
Reference translation: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा
(Kya bloo lambe waakya ki guNvatta ko samajh paaega)
Candidate translation: लंबे वाक्य (Lambe vaakya)
ROUGE-2 = 1/8 (one matching bigram out of the eight bigrams in the reference)
Modified bigram precision = 1/1 = 1
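A small sketch of the recall computation, assuming simple whitespace tokenization (the Hindi example is transliterated for readability):

from collections import Counter

def rouge_n(cand, ref, n):
    # ROUGE-N = matching n-grams / n-grams in the REFERENCE (recall-oriented),
    # unlike BLEU's precision, which divides by n-grams in the candidate.
    grams = lambda toks: Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand_g, ref_g = grams(cand), grams(ref)
    matches = sum(min(c, ref_g[g]) for g, c in cand_g.items())
    return matches / max(1, sum(ref_g.values()))

ref = "kya bloo lambe waakya ki guNvatta ko samajh paaega".split()
cand = "lambe waakya".split()
print(rouge_n(cand, ref, 2))  # 1 matching bigram / 8 reference bigrams = 0.125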
Other ROUGEs
• ROUGE-L – considers the longest common subsequence
• ROUGE-W – weighted ROUGE-L: all common subsequences are considered, with weight based on length
• ROUGE-S – precision/recall by matching skip bigrams
ROUGE (suite of metrics) vs. BLEU
Handling incorrect words – ROUGE: skip bigrams, ROUGE-N. BLEU: n-gram mismatch.
Handling incorrect word order – ROUGE: longest common subsequence. BLEU: n-gram mismatch.
Handling recall – ROUGE: ROUGE-N incorporates missing words. BLEU: precision cannot detect 'missing' words; hence the brevity penalty.
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU.
Central idea: have a good unigram matching strategy.
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero – should a score ever be zero?
METEOR: Process
Phase I: Find an alignment between words – list all possible mappings based on matches
Phase II: Select the best possible alignment – the one with the least number of alignment crosses/overlaps
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.
Reference: The intelligent and excited boy jumped
           The excited and intelligent guy pounced
Candidate: The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated, then combined as a harmonic mean
• Penalty: find the chunks – contiguous runs of matched unigrams
  The bright boy sits on the black bench
  The intelligent guy sat on the dark bench
• More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
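Putting the pieces together, the score as defined in [7] is the following, with P and R the unigram precision and recall, m the number of matched unigrams, and c the number of chunks:

$$ F_{mean} = \frac{10PR}{R + 9P}, \qquad \text{Penalty} = 0.5\left(\frac{c}{m}\right)^{3}, \qquad \text{Score} = F_{mean}\,(1 - \text{Penalty}) $$

The recall-heavy weighting in F_mean and the cubed chunk ratio are design choices tuned in [7] for correlation with human judgment.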
METEOR vs. BLEU
Handling incorrect words – METEOR: alignment chunks; matching can be done using different techniques (adaptable). BLEU: n-gram mismatch.
Handling incorrect word order – METEOR: chunks may be ordered in any manner; METEOR does not capture this. BLEU: n-gram mismatch.
Handling recall – METEOR: the idea of alignment incorporates missing-word handling. BLEU: precision cannot detect 'missing' words; hence the brevity penalty.
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones)
  – Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
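The formula image on this slide did not survive extraction. Reconstructed from [8] (treat the exact rendering as a close paraphrase of the paper rather than the slide's original): the information weight of an n-gram is

$$ \text{Info}(w_1 \ldots w_n) = \log_2\!\left(\frac{\text{count}(w_1 \ldots w_{n-1})}{\text{count}(w_1 \ldots w_n)}\right) $$

and the score sums these weights over co-occurring n-grams, normalized by the number of n-grams in the system output, with a length factor:

$$ \text{Score} = \sum_{n=1}^{N} \left\{ \frac{\sum_{\text{co-occurring } w_1 \ldots w_n} \text{Info}(w_1 \ldots w_n)}{\sum_{w_1 \ldots w_n \,\in\, \text{sys output}} 1} \right\} \cdot \exp\!\left\{ \beta \log^2\!\left[\min\!\left(\tfrac{L_{sys}}{L_{ref}},\, 1\right)\right] \right\} $$

where β is chosen so the length factor equals 0.5 when L_sys/L_ref = 2/3.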
NIST vs. BLEU
Handling incorrect words – NIST: information gain adds a level of 'matching'. BLEU: n-gram mismatch.
Handling incorrect word order – NIST: no additional provision; n-gram mismatch (weighted by information gain). BLEU: n-gram mismatch.
Handling recall – NIST: brevity penalty factor based on L_sys and L_ref. BLEU: precision cannot detect 'missing' words; hence the brevity penalty.
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  – What does a BLEU score of 0.006 mean?
• Comparison across systems is easier
GTM scores: Precision and Recall
• MMS: Maximum Match Size
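A sketch of how MMS enters the score, following [10] (the run weighting there is configurable via an exponent e; this is one common formulation, not the only one):

$$ \text{Prec} = \frac{MMS(C, R)}{|C|}, \qquad \text{Rec} = \frac{MMS(C, R)}{|R|}, \qquad F = \frac{2 \cdot \text{Prec} \cdot \text{Rec}}{\text{Prec} + \text{Rec}} $$

where the match size rewards longer contiguous matched runs, e.g. $MMS = \left(\sum_{\text{run } r} |r|^{e}\right)^{1/e}$ with e > 1; e = 1 reduces to plain unigram matching.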
GTM vs. BLEU
Handling incorrect words – GTM: precision based on maximum match size. BLEU: n-gram mismatch.
Handling incorrect word order – GTM: by considering maximum runs. BLEU: n-gram mismatch.
Handling recall – GTM: recall based on maximum match size. BLEU: precision cannot detect 'missing' words; hence the brevity penalty.
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है (mujhe bhookh lag rahi hai)
Candidate: मैं भूखा महसूस कर रहा हूँ (main bhookha mehsoos kar raha hoon)
BLEU score: 0 – yet the candidate is clearly entailed by the reference translation
Entailment-based evaluation [12]
Evaluating MT output for adequacy is an entailment task:
a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system
Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)
RTE for MT evaluation
The same pipeline: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)
Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate (a sketch of this bidirectional check follows).
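A minimal Python sketch of the bidirectional check. Here entails_prob is a crude word-overlap placeholder standing in for a trained RTE system; it is not the Stanford RTE API, and a real implementation would plug in the analysis → alignment → entailment pipeline above.

def entails_prob(premise, hypothesis):
    # Placeholder heuristic: fraction of hypothesis words covered by the premise.
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / max(1, len(h))

def adequacy_score(candidate, reference):
    # Unlike standard (asymmetric) RTE, MT adequacy needs entailment BOTH ways:
    # candidate => reference AND reference => candidate.
    forward = entails_prob(candidate, reference)
    backward = entails_prob(reference, candidate)
    return min(forward, backward)

print(adequacy_score("I am feeling very hungry", "I am feeling hungry"))  # 0.8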
Experimental setup: single reference translation; two MT datasets with English as the target language.
NIST MT datasets:
  – NIST06: the NIST MT-06 Arabic-to-English dataset
  – NIST08A: the NIST MT-08 Arabic-to-English dataset
  – NIST08C: the NIST MT-08 Chinese-to-English dataset
  – NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
  – SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
  – SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
  – SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
  – SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental setup
Four scoring systems are compared: BLEU-4, MT, RTE, and RTE + MT.
Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
  – considering the constituents of the formula
  – calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
GTM: modifies precision and recall based on maximum match size
NIST: information-gain based; weights content and function words differently
METEOR: considers alignments; allows addition of different types of matching
ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
WER: error in words, based on word order; also PER
TER: based on the edits required to obtain the reference translation; also human TER (HTER)
Manual evaluation: basic techniques and bottlenecks
Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of Bleu in machine translation research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA), 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Poetry is what gets Poetry is what gets lost in translationlost in translationrdquordquo
Robert FrostPoet (1874 ndash 1963)Wrote the famous poem lsquoStopping by woods on a snowy eveningrsquo better known as lsquoMiles to go before I sleeprsquo
Motivation
How do we judge a good translationCan a machine do this
Why should a machine do this Because humans take time
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEUbull Shortcomings in context of English-Hindi MT
Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an assessment Machine Translation Volume 8 pages 1ndash27 1993
K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
Part I
Part II
BLEU - IThe BLEU Score rises
EvaluationEvaluation [1][1]
Of NLP systemsOf MT systems
Evaluation in NLP PrecisionRecall
Precision How many results returned were correct
RecallWhat portion of correct results were returned
Adapting precisionrecall to NLP tasks
Evaluation in NLP PrecisionRecall
bull Document Retrieval
Precision = |Documents relevant and retrieved||Documents retrieved|
Recall=|Documents relevant and retrieved|| Documents relevant|
bull Classification
Precision = |True Positives||True Positives + False Positives|
Recall=|True Positives|| True Positives + False Negatives|
Evaluation in MT [1]
bull Operational evaluationndash ldquoIs MT system A operationally better than MT system B
Does MT system A cost lessrdquo
bull Typological evaluationndash ldquoHave you ensured which linguistic phenomena the MT
system coversrdquo
bull Declarative evaluationndash ldquoHow does quality of output of system A fare with respect
to that of Brdquo
Operational evaluation
bull Cost-benefit is the focusbull To establish cost-per-unit figures and use this
as a basis for comparisonbull Essentially lsquoblack boxrsquobull Realism requirement
Word-based SA
Sense-based SA
Cost of pre-processing (Stemming etc)Cost of learning a classifier
Cost of pre-processing (Stemming etc)Cost of learning a classifierCost of sense annotation
Typological evaluation
bull Use a test suite of examplesbull Ensure all relevant phenomena to be tested
are covered
bull Specific-to-language phenomenaFor example हर आख वाला लड़का म कराया
hari aankhon-wala ladka muskurayagreen eyes-with boy smiledThe boy with green eyes smiled
Declarative evaluation
bull Assign scores to specific qualities of outputndash Intelligibility How good the output is as a well-
formed target language entityndash Accuracy How good the output is in terms of
preserving content of the source text
For example I am attending a lectureम एक या यान बठा ह Main ek vyaakhyan baitha hoonI a lecture sit (Present-first person)I sit a lecture Accurate but not intelligibleम या यान ह Main vyakhyan hoonI lecture amI am lecture Intelligible but not accurate
Evaluation bottleneck
bull Typological evaluation is time-consumingbull Operational evaluation needs accurate
modeling of cost-benefit
bull Automatic MT evaluation DeclarativeBLEU Bilingual Evaluation Understudy
Deriving BLEU Deriving BLEU [2][2]
Incorporating PrecisionIncorporating Recall
How is translation performance measured
The closer a machine translation is to a professional human translation the better it is
bull A corpus of good quality human reference translations
bull A numerical ldquotranslation closenessrdquo metric
Preliminaries
bull Candidate Translation(s) Translation returned by an MT system
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
Goal of BLEU To correlate with human judgment
Formulating BLEU (Step 1) PrecisionI had lunch now
Candidate 1 मन अब खाना खाया
maine ab khana khaya matching unigrams 3I now food ate matching bigrams 1I ate food now
Candidate 2 मन अभी लच एटmaine abhi lunch ate matching unigrams 2I now lunch ate
I ate lunch(OOV) now(OOV) matching bigrams 1Unigram precision Candidate 1 34 = 075 Candidate 2 24 = 05Similarly bigram precision Candidate 1 033 Candidate 2 = 033
Reference 1 मन अभी खाना खायाmaine abhi khana khayaI now food ateI ate food now
Reference 2 मन अभी भोजन कयाmaine abhi bhojan kiyaa
I now meal didI did meal now
Precision Not good enough
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 1 मर तरा स र छाया matching unigram 3
mere tera suroor chhaayamy your spell castYour spell cast my
Candidate 2 तरा तरा तरा स र matching unigrams 4tera tera tera surooryour your your spell
Unigram precision Candidate 1 34 = 075 Candidate 2 44 = 1
Formulating BLEU (Step 2) Modified Precision
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spell
bull matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Modified n-gram precision
For entire test corpus for a given n
n-gram Matching n-grams in C
n-gramrsquo All n-grams in CModified precision for n-grams Overall candidates of
test corpus
Formula from [2]
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Motivation
How do we judge a good translationCan a machine do this
Why should a machine do this Because humans take time
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEUbull Shortcomings in context of English-Hindi MT
Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an assessment Machine Translation Volume 8 pages 1ndash27 1993
K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
Part I
Part II
BLEU - IThe BLEU Score rises
EvaluationEvaluation [1][1]
Of NLP systemsOf MT systems
Evaluation in NLP PrecisionRecall
Precision How many results returned were correct
RecallWhat portion of correct results were returned
Adapting precisionrecall to NLP tasks
Evaluation in NLP PrecisionRecall
bull Document Retrieval
Precision = |Documents relevant and retrieved||Documents retrieved|
Recall=|Documents relevant and retrieved|| Documents relevant|
bull Classification
Precision = |True Positives||True Positives + False Positives|
Recall=|True Positives|| True Positives + False Negatives|
Evaluation in MT [1]
bull Operational evaluationndash ldquoIs MT system A operationally better than MT system B
Does MT system A cost lessrdquo
bull Typological evaluationndash ldquoHave you ensured which linguistic phenomena the MT
system coversrdquo
bull Declarative evaluationndash ldquoHow does quality of output of system A fare with respect
to that of Brdquo
Operational evaluation
bull Cost-benefit is the focusbull To establish cost-per-unit figures and use this
as a basis for comparisonbull Essentially lsquoblack boxrsquobull Realism requirement
Word-based SA
Sense-based SA
Cost of pre-processing (Stemming etc)Cost of learning a classifier
Cost of pre-processing (Stemming etc)Cost of learning a classifierCost of sense annotation
Typological evaluation
bull Use a test suite of examplesbull Ensure all relevant phenomena to be tested
are covered
bull Specific-to-language phenomenaFor example हर आख वाला लड़का म कराया
hari aankhon-wala ladka muskurayagreen eyes-with boy smiledThe boy with green eyes smiled
Declarative evaluation
bull Assign scores to specific qualities of outputndash Intelligibility How good the output is as a well-
formed target language entityndash Accuracy How good the output is in terms of
preserving content of the source text
For example I am attending a lectureम एक या यान बठा ह Main ek vyaakhyan baitha hoonI a lecture sit (Present-first person)I sit a lecture Accurate but not intelligibleम या यान ह Main vyakhyan hoonI lecture amI am lecture Intelligible but not accurate
Evaluation bottleneck
bull Typological evaluation is time-consumingbull Operational evaluation needs accurate
modeling of cost-benefit
bull Automatic MT evaluation DeclarativeBLEU Bilingual Evaluation Understudy
Deriving BLEU Deriving BLEU [2][2]
Incorporating PrecisionIncorporating Recall
How is translation performance measured
The closer a machine translation is to a professional human translation the better it is
bull A corpus of good quality human reference translations
bull A numerical ldquotranslation closenessrdquo metric
Preliminaries
bull Candidate Translation(s) Translation returned by an MT system
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
Goal of BLEU To correlate with human judgment
Formulating BLEU (Step 1) PrecisionI had lunch now
Candidate 1 मन अब खाना खाया
maine ab khana khaya matching unigrams 3I now food ate matching bigrams 1I ate food now
Candidate 2 मन अभी लच एटmaine abhi lunch ate matching unigrams 2I now lunch ate
I ate lunch(OOV) now(OOV) matching bigrams 1Unigram precision Candidate 1 34 = 075 Candidate 2 24 = 05Similarly bigram precision Candidate 1 033 Candidate 2 = 033
Reference 1 मन अभी खाना खायाmaine abhi khana khayaI now food ateI ate food now
Reference 2 मन अभी भोजन कयाmaine abhi bhojan kiyaa
I now meal didI did meal now
Precision Not good enough
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 1 मर तरा स र छाया matching unigram 3
mere tera suroor chhaayamy your spell castYour spell cast my
Candidate 2 तरा तरा तरा स र matching unigrams 4tera tera tera surooryour your your spell
Unigram precision Candidate 1 34 = 075 Candidate 2 44 = 1
Formulating BLEU (Step 2) Modified Precision
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spell
bull matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Modified n-gram precision
For entire test corpus for a given n
n-gram Matching n-grams in C
n-gramrsquo All n-grams in CModified precision for n-grams Overall candidates of
test corpus
Formula from [2]
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3): Incorporating recall
• Sentence length indicates the 'best match'
• Brevity penalty (BP):
– a multiplicative factor
– candidate translations that match reference translations in length must be ranked higher
Candidate 1: लब वा य
Candidate 2: या ल लब वा य क गणव तासमझ पाएगा

Formulating BLEU (Step 3): Brevity Penalty

$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases}$$

where r is the reference sentence length and c is the candidate sentence length (formula from [2]; the curve is e^(1 − x) with x = r/c, graph drawn using www.fooplot.com).

Why is BP = 1 for c > r? BP leaves out longer translations because translations longer than the reference are already penalized by modified precision.

Validating the claim (formula from [2]):
BLEU score: Precision → modified n-gram precision; Recall → brevity penalty

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

Formula from [2].
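A sentence-level sketch of the full score, reusing modified_precision() from the earlier snippet. This is only an illustration: production implementations compute p_n over the whole test corpus rather than per sentence, and smooth zero counts instead of returning 0.

```python
import math

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights w_n = 1/N."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        p_n = modified_precision(candidate, references, n)
        if p_n == 0.0:
            return 0.0            # any zero p_n drives the score to zero
        log_sum += (1.0 / max_n) * math.log(p_n)
    c = len(candidate)
    # best-match reference length: the reference length closest to c
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_sum)
```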
Understanding BLEU
Dissecting the formula; using a news-headline example.

Decay in precision
Why log p_n? To accommodate the decay in precision values with increasing n. (Graph from [2]; formula from [2])

Dissecting the formula
Claim: BLEU should lie between 0 and 1.
Reason: to intuitively satisfy "1 implies perfect translation".
Understanding the constituents of the formula to validate the claim:
• Brevity penalty BP
• Modified precision p_n
• Weights w_n, set to 1/N
(Formula from [2])

Validation of the range of BLEU
p_n lies between 0 and 1, so log p_n lies between −∞ and 0.
Hence A = Σ w_n log p_n lies between −∞ and 0, and e^A lies between 0 and 1.
BP also lies between 0 and 1.
Therefore BLEU = BP · e^A lies between 0 and 1.
(Graphs drawn using www.fooplot.com)
Calculating BLEU: An actual example
Ref: भावना मक एक करण लान क लए एक योहार (bhaavanaatmak ekikaran laane ke liye ek tyohaar / emotional unity bring-to a festival / A festival to bring emotional unity)
C1: लान क लए एक योहार भावना मक एक करण (laane ke liye ek tyohaar bhaavanaatmak ekikaran / bring-to a festival emotional unity — invalid Hindi ordering)
C2: को एक उ सव क बार म लाना भावना मक एक करण (ko ek utsav ke baare mein laana bhaavanaatmak ekikaran / for a festival-about to-bring emotional unity — invalid Hindi ordering)

Modified n-gram precision

C1 (r = 7, c = 7):
n | matching n-grams | total n-grams | modified n-gram precision
1 | 7 | 7 | 1
2 | 5 | 6 | 0.83

C2 (r = 7, c = 9):
n | matching n-grams | total n-grams | modified n-gram precision
1 | 4 | 9 | 0.44
2 | 1 | 8 | 0.125
Calculating the BLEU score (w_n = 1/N, N = 2, r = 7)

C1: c = 7, BP = exp(1 − 7/7) = 1
n | w_n | p_n | log p_n | w_n · log p_n
1 | 1/2 | 1 | 0 | 0
2 | 1/2 | 0.83 | −0.186 | −0.093
Total: −0.093 → BLEU = 1 · e^(−0.093) = 0.911

C2: c = 9, BP = 1 (since c > r)
n | w_n | p_n | log p_n | w_n · log p_n
1 | 1/2 | 0.44 | −0.821 | −0.4105
2 | 1/2 | 0.125 | −2.07 | −1.035
Total: −1.445 → BLEU = 1 · e^(−1.445) = 0.235
Hence the BLEU scores:
Ref: भावना मक एक करण लान क लए एक योहार
C1: लान क लए एक योहार भावना मक एक करण → 0.911
C2: को एक उ सव क बार म लाना भावना मक एक करण → 0.235
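The worked example can be re-checked with the bleu() sketch above, using the transliterated tokens (the small difference for C1 comes from the slide rounding p_2 to 0.83 before taking logs):

```python
ref = "bhaavanaatmak ekikaran laane ke liye ek tyohaar".split()
c1  = "laane ke liye ek tyohaar bhaavanaatmak ekikaran".split()
c2  = "ko ek utsav ke baare mein laana bhaavanaatmak ekikaran".split()

print(round(bleu(c1, [ref], max_n=2), 3))   # 0.913 (slide: 0.911)
print(round(bleu(c2, [ref], max_n=2), 3))   # 0.236 (slide: 0.235)
```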
BLEU vs human judgment [2]
Setup:
• Source language: Chinese; target language: English
• Five systems perform translation: 3 automatic MT systems and 2 human translators
• BLEU scores are obtained for each system
• Human judgment (on a scale of 5) is obtained for each system from two groups:
– Group 1: ten monolingual speakers of the target language (English)
– Group 2: ten bilingual speakers of Chinese and English

BLEU vs human judgment
• Monolingual speakers: correlation coefficient 0.99
• Bilingual speakers: correlation coefficient 0.96
(BLEU and human evaluation for S2 and S3; graph from [2])

Comparison of normalized values:
• High correlation between the monolingual group and the BLEU score
• The bilingual group was lenient on 'fluency' for H1
• The demarcation between S1–S3 and H1–H2 is captured by BLEU
(Graph from [2])
Conclusion
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
– considering the constituents of the formula
– calculating BLEU for a dummy example
• Compared BLEU with human judgment
Part II: Shortcomings of BLEU (continued from 7th March 2013)
Presented by (as a part of CS 712):
Aditya Joshi, Kashyap Popat, Shubham Gautam
IIT Bombay
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay

BLEU score revisited: Precision → modified n-gram precision; Recall → brevity penalty (formula from [2])
Outline
• Evaluation
• Formulating the BLEU metric
• Understanding the BLEU formula
• Shortcomings of BLEU in general
• Shortcomings in the context of English-Hindi MT
Part I
Part II
Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in Machine Translation Research. European ACL (EACL) 2006, 2006.
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
"I can do this. I just have to work out the physics. I have quadrated my vector angles. I have adjusted for wind shear. This is it. Let's fly. Just keep it simple: thrust, lift, drag and wait. Thrust, lift, drag, wait... [speeds up to get to the edge of the table] Thrust, lift, drag, wai-wai-wait!" [gets scared and tries to stop himself, but instead falls from the table]
– Blu the Macaw, lines from Rio (2011)
(Image source: www.fanpop.com)
Use of the BLEU metric [3]
• Evaluating incremental system changes, while neglecting actual examples
• Systems ranked on the basis of BLEU
• Does minimizing the error rate with respect to BLEU indeed guarantee translation improvements?

Criticisms of BLEU [3][4]
• Admits too much variation [3]
• Admits too little variation [3][4]
• Poor correlation with human judgment [3]

Admits too much variation [3]
• Permuting phrases
• Drawing different items from the reference set
Admits too much variation
• BLEU relies on n-gram matching only
• It puts very few constraints on how n-gram matches can be drawn from multiple reference translations
(Recall the formula: a brevity penalty incorporating recall, modified precision, and weights set to 1/N.)

Modified n-gram precision in BLEU [2] (recap)
• Clip the total count of each candidate word by its maximum reference count
• Count_clip(n-gram) = min(count, max_ref_count)
Reference: मझपर तरा स र छाया (mujhpar teraa suroor chhaaya / me-on your spell has-been-cast / Your spell has been cast on me)
Candidate 2: तरा तरा तरा स र (tera tera tera suroor / your your your spell)
Matching unigrams: (तरा: min(3, 1) = 1), (स र: min(1, 1) = 1)
Modified unigram precision: 2/4 = 0.5
Permuting phrases [3]
• Reordering of unmatched phrases does not affect precision
• Bigram mismatch sites can be freely permuted
⇒ It is possible to randomly produce other hypothesis translations that have the same BLEU score.
Illustration: B1 | B2 | B3 | B4 → B4 | B2 | B1 | B3, where each Bi is a chunk of matched bigrams and each boundary '|' is a bigram mismatch.
Issues with precision (1/2)
Source: The king and the queen went to the jungle to hunt.
Reference 1: राजा और रानी जगल को शकार क लए गय (raaja aur raani jangal ko shikaar ke liye gaye / king and queen to-jungle for-hunting went)
Reference 2: राजा और उनक बीवी शकार करन जगल गय (raaja aur unki biwi shikaar karne jangal gaye / king and his wife to-do-hunting jungle went)
Candidate: राजा और रानी शकार करन जगल म गय (raaja aur raani shikaar karne jungal mein chale gaye / king and queen to-do-hunting to-jungle went)
Matching bigrams = 4/8
Permuted candidate: राजा और रानी शकार करन जगल गय म (raaja aur raani shikaar karne gaye jungle mein — grammatically incorrect)
Matching bigrams = 4/8, unchanged

Issues with precision (2/2)
Same source and references; another permutation:
Permuted candidate: शकार करन जगल राजा और रानी म गय (shikaar karne jungle raaja aur raani mein gaye — grammatically incorrect)
Matching bigrams = 4/8, again unchanged
Permuting phrases in general
• For b bigram matches in a candidate translation of length k, there are (k − b)! possible ways to generate similarly scored items using only the words in this translation.
• In our example: (8 − 4)! = 24 candidate translations.

Derivation: in a sentence of length k, the total number of bigrams is k − 1; with b matched bigrams, the number of mismatched bigrams is k − 1 − b, so the number of matched chunks is k − 1 − b + 1 = k − b. These (k − b) chunks can be reordered in (k − b)! ways.
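A quick numerical check of this claim, reusing modified_precision() from the BLEU sketch. Token counts differ slightly from the slide (we use the 8-word transliteration, so 4 of 7 bigrams match), but the point survives: the reordered chunks give exactly the same bigram precision.

```python
from math import factorial

refs = ["raaja aur raani jangal ko shikaar ke liye gaye".split(),
        "raaja aur unki biwi shikaar karne jangal gaye".split()]
cand = "raaja aur raani shikaar karne jangal mein gaye".split()
# one of the (k - b)! = (8 - 4)! reorderings of the matched chunks
perm = "shikaar karne jangal raaja aur raani mein gaye".split()

print(factorial(8 - 4))                    # 24 possible reorderings
print(modified_precision(cand, refs, 2))   # 0.571... (4 of 7 bigrams)
print(modified_precision(perm, refs, 2))   # 0.571... -- identical score
```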
Permuting phrases: Evaluation
• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translations

Drawing different items from the reference set
• If two systems 'recall' x words each from a reference sentence:
– precision may remain the same
– the brevity penalty remains the same
– but translation quality need not remain the same
Drawing different items: Example
Reference: घर क मग मतलब दाल होती ह (ghar ki murgi matlab daal jaisi hoti hai / house-of chicken means daal-like is / Chicken at home is like daal)
Candidate 1: मग मतलब होती ह (murgi matlab hoti hai / chicken means is) — unigram precision: 1, bigram precision: 2/3
Candidate 2: घर क मग दाल (ghar ki murgi daal / house-of chicken daal) — unigram precision: 1, bigram precision: 2/3
Both candidates score identically, even though each omits different lexical items — you may omit lexical items altogether.

Drawing different items: In general
• If there are x unigram matches and y bigram matches in both candidates C1 and C2:
– the BLEU score remains the same
– which words are actually matched is not accounted for
– synonyms also count as a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT evaluation exercise
• Failure to correlate with human judgment: in the 2005 exercise, the system ranked 1st by human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency

Failures in practice: Evaluation 1 (2/2) [3]
• The system with the highest human judgment score is an outlier with respect to BLEU.

Adequacy:
Outlier included? | Pearson correlation (R²)
Yes | 0.14
No | 0.87

Fluency:
Outlier included? | Pearson correlation (R²)
Yes | 0.002
No | 0.742
(Graphs from [3])
Failures in practice: Evaluation 2 [3]
• French-English translation
• System 1: a rule-based system using Systran
• Systems 2 and 3: SMT systems trained on two sets of Europarl data (28 million words per language)
• 300 test sentences
• Result: BLEU underestimates the human judgment of the rule-based system (graph from [3])
Admits too much and too little variation [4]
• Indicative translations
• Linguistic divergences and evaluation

Indicative translation
• Draft-quality translations
• Meant for assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi)
• Structural transfer: S V O → S O V
• Lexical transfer: looking up the corresponding word; adding gender, aspect, tense, etc. information

Indicative translation example: Rohan goes to school by bus
Structural transfer (SVO → SOV): (Rohan) (to school) (by bus) (goes)
Lexical transfer (word translation): (Rohan) (paathshaala) (bus) (jaana)
Lexical transfer (adding GNP/TAM information): (Rohan) (paathshaala) (bus se) (jaata hai)
Indicative output: Rohan paathshaala bus se jaata hai (compare the natural order: Rohan bus se paathshaala jaata hai)
Questions in the context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, when the system produces indicative translations?
• Is "failure of MT" (defined using any measure) simply a "failure in relation to inappropriate goals" (translating like a human)?

Linguistic divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• A change in lexical category (E: English source, H: human reference, I: indicative translation)

E: I am feeling hungry
H: मझ भख लग रह ह (mujhe bhookh lag rahi hai / to-me hunger feeling is)
I: म भखा महसस कर रहा ह (main bhookha mehsoos kar raha hoon / I hungry feel doing am)
Unigram precision: 0; bigram precision: 0; BLEU score: 0

E: I am sweating
H: मझ पसीना आ रहा ह (mujhe paseena aa raha hai / to-me sweat coming is)
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस (aamir khaan ki dus sarvottam performances / Aamir Khan-of ten best performances)
I: दस सव तम आ मर ख़ान पफॉम सस (dus sarvottam aamir khaan performances / ten best Aamir Khan performances)
Unigram precision: 5/5; bigram precision: 2/4
Cultural differences
E: Food, clothing and shelter are a man's basic needs.
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह (roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain / bread, clothing and house are a man's basic needs)
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह (khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain / food, clothing and shelter are a man's basic needs)
Unigram precision: 8/10; bigram precision: 6/9
• Similarly: 'maamaa'/'taau' — distinct kinship terms that both map to English 'uncle'
Pleonastic divergence
• Words with no semantic content in the target language

E: It is raining
H: बा रश हो रह ह (baarish ho rahi hai / rain happening is)
I: यह बा रश हो रह ह (yeh baarish ho rahi hai / it rain happening is)
Unigram precision: 4/5; bigram precision: 3/4

E: One should not trouble the weak
H: दबल को परशान नह करना चा हए (durbalon ko pareshaan nahi karna chahiye / to-weak trouble not do should)
Stylistic differences
• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह (lok sabha mein 545 sadasya hain / Lok Sabha in 545 members are)
I: लोक सभा क पास 545 सद य ह (lok sabha ke paas 545 sadasya hain / Lok Sabha has-near 545 members are)
Unigram precision: 5/7; bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses

E: I purchased a bat
H (reference): मन एक ब ला खर दा (maine ek ballaa kharidaa / I bought a cricket bat)
I: मन एक चमगादड़ खर दा (maine ek chamgaadaD kharidaa / I bought a bat, the mammal) — incorrect sense

E: The thieves were held
H: चोर को गर तार कया (choron ko giraftaar kiyaa / the thieves were arrested)
I1: चोर को पकड़ा (choron ko pakdaa / the thieves were caught) — an acceptable synonym, yet penalized
I2: चोर को आयोिजत कया (choron ko aayojit kiya / the thieves were organized) — an incorrect sense, penalized just the same
Evaluation by precision: the net effect
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU

Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
– Re-defining a match [4]: allowing synonyms; allowing root forms of words; incorporating specific language divergences
– Limiting the use of BLEU [3][4]: do not use BLEU to compare radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences exist in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS 712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay

Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
(Source: www.jaishreeganesha.com)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate Translation(s): translation returned by an MT system (also called hypothesis translation)
• Reference Translation(s): 'perfect' translation by humans
• Each metric below is compared with BLEU on three counts: handling incorrect words, handling incorrect word order, and handling recall
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy (is the meaning translated correctly?): 5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency (is the sentence grammatically valid?): 5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

म एक या यान बठा ह (main ek vyaakhyan baitha hoon / I sit a lecture) — adequate but not fluent
म या यान ह (main vyakhyan hoon / I am lecture) — fluent but not adequate
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation

Source: Prime Minister of India will address the nation today
Reference translation: भारत क धान-म ी आज रा को सबो धत करग (bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge)
Candidate translation: धान-म ी भारत क आज को अ स करग आज (pradhaan-mantri bhaarat ke aaj ko address karenge aaj)
Edits: 1 shift, 1 deletion, 1 substitution, 1 insertion

TER = #edits / average number of reference words
• The 'distance' of a shift is not incorporated in its cost
• Mis-capitalization is also considered an error
Here, TER = 4/8 = 0.5
HTER: Human-targeted TER
• A semi-automatic (human-in-the-loop) technique
• Human annotators create a new reference for each translation, based on the system output
– They must preserve fluency and adequacy while generating the closest target translation
TER vs BLEU
• Handling incorrect words — TER: substitution; BLEU: n-gram mismatch
• Handling incorrect word order — TER: shift, or delete + insert, incorporates this error; BLEU: n-gram mismatch
• Handling recall — TER: missed words become deleted words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
(TER = #edits / average number of reference words)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant also exists: the position-independent word error rate (PER)
WER: Example
• The order of words is important
• A dynamic-programming implementation finds the minimum number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
#errors = 3 (two substitutions and one deletion), so WER = 3/6 = 0.5
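A standard dynamic-programming sketch of WER (variable names are ours):

```python
def wer(hyp, ref):
    """Word error rate: Levenshtein distance over tokens,
    divided by the reference length."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(hyp)][len(ref)] / len(ref)

hyp = "This seems the right sentence".split()
ref = "This looks like the correct sentence".split()
print(wer(hyp, ref))   # 3 errors / 6 reference words = 0.5
```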
WER vs BLEU
• Handling incorrect words — WER: substitution; BLEU: n-gram mismatch
• Handling incorrect word order — WER: delete + insert; BLEU: n-gram mismatch
• Handling recall — WER: missed words become deleted words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
WER vs TER
• WER: a shift is not taken into account (it costs a deletion plus an insertion); the intuition is based on word-level errors
• TER: a shift is incorporated as a single edit, though not weighted by distance; the intuition is based on the edits a human would need
(TER = #edits / average number of reference words)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S

ROUGE-N incorporates recall:
Source: Will BLEU be able to understand the quality of long sentences?
Reference translation: या ल लब वा य क गणव ता को समझ पाएगा (kya bloo lambe waakya ki guNvatta ko samajh paaega)
Candidate translation: लब वा य (lambe vaakya)
ROUGE-2 = 1/8 (one matching bigram out of the 8 reference bigrams), while the modified bigram precision is 1
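A minimal sketch of ROUGE-N as n-gram recall, reusing ngrams() from the BLEU snippet (single reference; the full package also handles multiple references, LCS, and skip bigrams):

```python
from collections import Counter

def rouge_n(candidate, reference, n):
    """ROUGE-N: matched n-grams over reference n-grams (recall)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, cand_counts[gram])
                  for gram, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

ref  = "kya bloo lambe waakya ki guNvatta ko samajh paaega".split()
cand = "lambe waakya".split()
print(rouge_n(cand, ref, 2))   # 1/8: one matching bigram, 8 in the reference
```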
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length
• ROUGE-S: precision/recall by matching skip bigrams
ROUGE (suite of metrics) vs BLEU
• Handling incorrect words — ROUGE: skip bigrams, ROUGE-N; BLEU: n-gram mismatch
• Handling incorrect word order — ROUGE: longest common subsequence; BLEU: n-gram mismatch
• Handling recall — ROUGE: ROUGE-N incorporates missing words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy

METEOR's criticisms of BLEU:
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero — should a score ever be zero?
METEOR: Process — find an alignment between words
Phase I: list all possible unigram mappings based on matches
Phase II: select the best possible alignment — the one with the least number of 'alignment crosses/overlaps'
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced / The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined as a harmonic mean
• Penalty: find 'as many chunks' as match, e.g.
Reference: The bright boy sits on the black bench
Candidate: The intelligent guy sat on the dark bench
More accurate ordering → fewer chunks → less penalty; less accurate → more chunks → more penalty
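Once the alignment statistics are in hand, the score itself is simple. A sketch following Banerjee and Lavie [7] (matches = matched unigrams; the 10:1 recall weighting and the 0.5 · (chunks/matches)³ penalty are the constants from the paper; the function signature is ours):

```python
def meteor_score(matches, cand_len, ref_len, chunks):
    """METEOR: recall-heavy harmonic mean of unigram precision and
    recall, discounted by a fragmentation penalty."""
    p = matches / cand_len
    r = matches / ref_len
    f_mean = 10 * p * r / (r + 9 * p)
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# Same matches, different fragmentation:
print(meteor_score(6, 8, 8, 1))   # one contiguous chunk: ~0.748
print(meteor_score(6, 8, 8, 6))   # six scattered chunks:  0.375
```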
METEOR vs BLEU
• Handling incorrect words — METEOR: alignment chunks; matching can be done using different techniques, so it is adaptable; BLEU: n-gram mismatch
• Handling incorrect word order — METEOR: chunks may be ordered in any manner; METEOR does not capture this; BLEU: n-gram mismatch
• Handling recall — METEOR: the idea of alignment incorporates missing-word handling; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
NIST Metric: Introduction [8]
• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
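The information weight NIST assigns to an n-gram match is computed from counts over the reference corpus [8] — an n-gram whose full form is much rarer than its (n−1)-gram prefix carries more information:

```latex
\mathrm{Info}(w_1 \ldots w_n) =
  \log_2\!\left(
    \frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}
  \right)
```

Matching 'Amendment Act' therefore contributes far more than matching 'of the', whose bigram count is close to the count of its prefix 'of'.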
NIST vs BLEU
• Handling incorrect words — NIST: information gain adds a level of 'matching'; BLEU: n-gram mismatch
• Handling incorrect word order — NIST: no additional provision beyond n-gram mismatch (weighted by information gain); BLEU: n-gram mismatch
• Handling recall — NIST: a brevity-penalty factor based on L_sys and L_ref; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment' (what does a BLEU score of 0.006 mean?)
• Comparison across systems is easier

GTM scores: precision and recall are computed from the MMS (Maximum Match Size)
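A heavily simplified sketch of the MMS idea from [10]. Here run_lengths are the sizes of contiguous aligned word blocks ('runs'); the exponent e is GTM's knob for rewarding longer runs (e = 1 reduces to plain unigram matching). The function name and interface are ours:

```python
def gtm_fscore(run_lengths, cand_len, ref_len, e=2.0):
    """GTM-style F-measure: MMS = (sum of |run|^e)^(1/e)."""
    mms = sum(length ** e for length in run_lengths) ** (1.0 / e)
    precision = mms / cand_len
    recall = mms / ref_len
    return 2 * precision * recall / (precision + recall)

# Same number of matched words; one long run scores higher when e > 1:
print(gtm_fscore([4], 6, 6))           # one 4-word run:      0.667
print(gtm_fscore([1, 1, 1, 1], 6, 6))  # four isolated words: 0.333
```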
GTM vs BLEU
• Handling incorrect words — GTM: precision based on the maximum match size; BLEU: n-gram mismatch
• Handling incorrect word order — GTM: by considering maximum runs; BLEU: n-gram mismatch
• Handling recall — GTM: recall based on the maximum match size; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मझ भख लग रह ह (mujhe bhookh lag rahi hai)
Candidate: म भखा महसस कर रहा ह (main bhookha mehsoos kar raha hoon)
BLEU score: 0 — yet the candidate is clearly entailed by the reference translation
Entailment-based evaluation [12]
• Evaluation of MT output for adequacy is essentially an entailment task:
• a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa
The Stanford RTE system
Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)
RTE for MT evaluation
The same pipeline — Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment — with one change:
Directionality: the standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
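The bidirectionality requirement is easy to state as a predicate. A sketch, where entails(text, hypothesis) stands for any RTE decision function (e.g. a wrapper around the Stanford RTE system — a hypothetical interface here):

```python
def fully_adequate(candidate, reference, entails):
    """Adequacy as bidirectional entailment [12]: the candidate must
    entail the reference and the reference must entail the candidate."""
    return entails(candidate, reference) and entails(reference, candidate)
```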
Experimental setup: a single reference translation; two MT dataset families with English as the target language.
NIST MT datasets: NIST06 (NIST MT-06, Arabic-to-English); NIST08A (NIST MT-08, Arabic-to-English); NIST08C (NIST MT-08, Chinese-to-English); NIST08U (NIST MT-08, Urdu-to-English)
ACL SMT datasets: SMT06E (NAACL 2006 SMT workshop, EUROPARL); SMT06C (NAACL 2006 SMT workshop, Czech-English); SMT07E (ACL 2007 SMT workshop, EUROPARL); SMT07C (ACL 2007 SMT workshop, Czech-English)

Four systems are compared: BLEU-4, MT, RTE, and RTE + MT.
Experiment 1: individual corpora. Experiment 2: combined corpora.
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by considering the constituents of the formula and calculating BLEU for a dummy example
• Compared BLEU with human judgment

Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences exist in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on the maximum match size
• NIST: information-gain based; weights content and function words differently
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: errors in words, sensitive to word order; also PER
• TER: based on the edits required to reach the reference translation; also human-targeted TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA), 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEUbull Shortcomings in context of English-Hindi MT
Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an assessment Machine Translation Volume 8 pages 1ndash27 1993
K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
Part I
Part II
BLEU - IThe BLEU Score rises
EvaluationEvaluation [1][1]
Of NLP systemsOf MT systems
Evaluation in NLP PrecisionRecall
Precision How many results returned were correct
RecallWhat portion of correct results were returned
Adapting precisionrecall to NLP tasks
Evaluation in NLP PrecisionRecall
bull Document Retrieval
Precision = |Documents relevant and retrieved||Documents retrieved|
Recall=|Documents relevant and retrieved|| Documents relevant|
bull Classification
Precision = |True Positives||True Positives + False Positives|
Recall=|True Positives|| True Positives + False Negatives|
Evaluation in MT [1]
bull Operational evaluationndash ldquoIs MT system A operationally better than MT system B
Does MT system A cost lessrdquo
bull Typological evaluationndash ldquoHave you ensured which linguistic phenomena the MT
system coversrdquo
bull Declarative evaluationndash ldquoHow does quality of output of system A fare with respect
to that of Brdquo
Operational evaluation
bull Cost-benefit is the focusbull To establish cost-per-unit figures and use this
as a basis for comparisonbull Essentially lsquoblack boxrsquobull Realism requirement
Word-based SA
Sense-based SA
Cost of pre-processing (Stemming etc)Cost of learning a classifier
Cost of pre-processing (Stemming etc)Cost of learning a classifierCost of sense annotation
Typological evaluation
bull Use a test suite of examplesbull Ensure all relevant phenomena to be tested
are covered
bull Specific-to-language phenomenaFor example हर आख वाला लड़का म कराया
hari aankhon-wala ladka muskurayagreen eyes-with boy smiledThe boy with green eyes smiled
Declarative evaluation
bull Assign scores to specific qualities of outputndash Intelligibility How good the output is as a well-
formed target language entityndash Accuracy How good the output is in terms of
preserving content of the source text
For example I am attending a lectureम एक या यान बठा ह Main ek vyaakhyan baitha hoonI a lecture sit (Present-first person)I sit a lecture Accurate but not intelligibleम या यान ह Main vyakhyan hoonI lecture amI am lecture Intelligible but not accurate
Evaluation bottleneck
bull Typological evaluation is time-consumingbull Operational evaluation needs accurate
modeling of cost-benefit
bull Automatic MT evaluation DeclarativeBLEU Bilingual Evaluation Understudy
Deriving BLEU Deriving BLEU [2][2]
Incorporating PrecisionIncorporating Recall
How is translation performance measured
The closer a machine translation is to a professional human translation the better it is
bull A corpus of good quality human reference translations
bull A numerical ldquotranslation closenessrdquo metric
Preliminaries
bull Candidate Translation(s) Translation returned by an MT system
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
Goal of BLEU To correlate with human judgment
Formulating BLEU (Step 1) PrecisionI had lunch now
Candidate 1 मन अब खाना खाया
maine ab khana khaya matching unigrams 3I now food ate matching bigrams 1I ate food now
Candidate 2 मन अभी लच एटmaine abhi lunch ate matching unigrams 2I now lunch ate
I ate lunch(OOV) now(OOV) matching bigrams 1Unigram precision Candidate 1 34 = 075 Candidate 2 24 = 05Similarly bigram precision Candidate 1 033 Candidate 2 = 033
Reference 1 मन अभी खाना खायाmaine abhi khana khayaI now food ateI ate food now
Reference 2 मन अभी भोजन कयाmaine abhi bhojan kiyaa
I now meal didI did meal now
Precision Not good enough
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 1 मर तरा स र छाया matching unigram 3
mere tera suroor chhaayamy your spell castYour spell cast my
Candidate 2 तरा तरा तरा स र matching unigrams 4tera tera tera surooryour your your spell
Unigram precision Candidate 1 34 = 075 Candidate 2 44 = 1
Formulating BLEU (Step 2) Modified Precision
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spell
bull matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Modified n-gram precision
For entire test corpus for a given n
n-gram Matching n-grams in C
n-gramrsquo All n-grams in CModified precision for n-grams Overall candidates of
test corpus
Formula from [2]
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEU
• Handling incorrect words: METEOR matches via alignment chunks, and matching can be done using different techniques (adaptable); BLEU counts an n-gram mismatch
• Handling incorrect word order: METEOR's chunks may be ordered in any manner, which METEOR does not capture; BLEU counts an n-gram mismatch
• Handling recall: METEOR's idea of alignment incorporates missing-word handling; in BLEU, precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation
– BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
NIST Metric: Introduction [8]
• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
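A sketch of the information weights at the heart of the metric, following the definition in Doddington [8]; the helper names are ours, and the corpus is whatever reference corpus the evaluator supplies:

from collections import Counter
from math import log2

def ngram_counts(corpus, n):
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def nist_info(corpus, n):
    """Info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn));
    for n = 1 the numerator is the total number of words.
    Rare n-grams ('Amendment Act') thus outweigh common ones ('of the')."""
    grams = ngram_counts(corpus, n)
    if n == 1:
        total = sum(grams.values())
        return {g: log2(total / c) for g, c in grams.items()}
    prefixes = ngram_counts(corpus, n - 1)
    return {g: log2(prefixes[g[:-1]] / c) for g, c in grams.items()}

The NIST score then replaces BLEU's flat n-gram counts with these weights, and uses a gentler brevity factor based on the ratio of system length (Lsys) to reference length (Lref).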
NIST vs BLEU
• Handling incorrect words: NIST adds information gain as an additional level of 'matching'; BLEU counts an n-gram mismatch
• Handling incorrect word order: NIST has no additional provision, only the n-gram mismatch (weighted by information gain); BLEU counts an n-gram mismatch
• Handling recall: NIST uses a brevity-penalty factor based on Lsys and Lref; in BLEU, precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation
– BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
GTM [10]
• General Text Matcher
• An F-score built from precision and recall
• Does not rely on correlation with 'human judgment' (what does a BLEU score of 0.006 mean?), so comparison across systems is easier
GTM Scores: Precision and Recall
• MMS: Maximum Match Size; precision and recall are the MMS normalized by candidate and reference length, respectively
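A greedy sketch of the GTM F-score: the exact MMS computation in [10] differs, so repeatedly grabbing the longest common run is our approximation, and the exponent e rewards longer runs (i.e., better word order) when e > 1:

def longest_common_run(cand, ref):
    """Longest contiguous word sequence shared by the two lists."""
    best = (0, 0, 0)                   # (length, cand start, ref start)
    for i in range(len(cand)):
        for j in range(len(ref)):
            k = 0
            while (i + k < len(cand) and j + k < len(ref)
                   and cand[i + k] == ref[j + k]):
                k += 1
            best = max(best, (k, i, j))
    return best

def gtm_fscore(reference: str, candidate: str, e: float = 1.0) -> float:
    cand, ref = candidate.split(), reference.split()
    len_c, len_r = len(cand), len(ref)
    size = 0.0
    while True:
        k, i, j = longest_common_run(cand, ref)
        if k == 0:
            break
        size += k ** e                 # longer runs count for more when e > 1
        del cand[i:i + k]              # each word may be matched only once
        del ref[j:j + k]
    mms = size ** (1 / e)              # approximate Maximum Match Size
    if mms == 0:
        return 0.0
    p, r = mms / len_c, mms / len_r    # precision and recall
    return 2 * p * r / (p + r)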
GTM vs BLEU
• Handling incorrect words: GTM's precision is based on the maximum match size; BLEU counts an n-gram mismatch
• Handling incorrect word order: GTM considers maximum runs; BLEU counts an n-gram mismatch
• Handling recall: GTM's recall is based on the maximum match size; in BLEU, precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation
– BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है
mujhe bhookh lag rahi hai
Candidate: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
BLEU score: 0, yet the candidate is clearly entailed by the reference translation
Entailment-based evaluation [12]
Evaluating MT output for adequacy is an entailment task: a candidate translation (i.e., the MT system's output) should entail the reference translation, and vice versa.
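In pseudocode terms (the entails predicate below is a placeholder for an RTE system such as the one described next, not a real API):

def adequate(candidate: str, reference: str, entails) -> bool:
    """Adequacy as bidirectional entailment: the candidate must entail
    the reference AND the reference must entail the candidate."""
    return entails(candidate, reference) and entails(reference, candidate)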
The Stanford RTE system
A three-stage pipeline:
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
RTE for MT evaluation
The same three-stage pipeline is applied: analysis (POS tagging, dependency parsing, NER, etc.), alignment of words and phrases, and entailment prediction.
Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
Experimental Setup
• Single reference translation; two MT dataset families with English as the target language
• NIST MT datasets:
– NIST06: the NIST MT-06 Arabic-to-English dataset
– NIST08A: the NIST MT-08 Arabic-to-English dataset
– NIST08C: the NIST MT-08 Chinese-to-English dataset
– NIST08U: the NIST MT-08 Urdu-to-English dataset
• ACL SMT datasets:
– SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
– SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
– SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
– SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
• Four systems are used: BLEU-4, MT, RTE, and RTE + MT
• Experiment 1: individual corpora
• Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
– considering the constituents of the formula
– calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on the maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words, based on word order; also PER
• TER: based on the edits required to get the reference translation; also Human TER (HTER)
• Manual Evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
EvaluationEvaluation [1][1]
Of NLP systemsOf MT systems
Evaluation in NLP PrecisionRecall
Precision How many results returned were correct
RecallWhat portion of correct results were returned
Adapting precisionrecall to NLP tasks
Evaluation in NLP PrecisionRecall
bull Document Retrieval
Precision = |Documents relevant and retrieved||Documents retrieved|
Recall=|Documents relevant and retrieved|| Documents relevant|
bull Classification
Precision = |True Positives||True Positives + False Positives|
Recall=|True Positives|| True Positives + False Negatives|
Evaluation in MT [1]
bull Operational evaluationndash ldquoIs MT system A operationally better than MT system B
Does MT system A cost lessrdquo
bull Typological evaluationndash ldquoHave you ensured which linguistic phenomena the MT
system coversrdquo
bull Declarative evaluationndash ldquoHow does quality of output of system A fare with respect
to that of Brdquo
Operational evaluation
bull Cost-benefit is the focusbull To establish cost-per-unit figures and use this
as a basis for comparisonbull Essentially lsquoblack boxrsquobull Realism requirement
Word-based SA
Sense-based SA
Cost of pre-processing (Stemming etc)Cost of learning a classifier
Cost of pre-processing (Stemming etc)Cost of learning a classifierCost of sense annotation
Typological evaluation
bull Use a test suite of examplesbull Ensure all relevant phenomena to be tested
are covered
bull Specific-to-language phenomenaFor example हर आख वाला लड़का म कराया
hari aankhon-wala ladka muskurayagreen eyes-with boy smiledThe boy with green eyes smiled
Declarative evaluation
bull Assign scores to specific qualities of outputndash Intelligibility How good the output is as a well-
formed target language entityndash Accuracy How good the output is in terms of
preserving content of the source text
For example I am attending a lectureम एक या यान बठा ह Main ek vyaakhyan baitha hoonI a lecture sit (Present-first person)I sit a lecture Accurate but not intelligibleम या यान ह Main vyakhyan hoonI lecture amI am lecture Intelligible but not accurate
Evaluation bottleneck
bull Typological evaluation is time-consumingbull Operational evaluation needs accurate
modeling of cost-benefit
bull Automatic MT evaluation DeclarativeBLEU Bilingual Evaluation Understudy
Deriving BLEU Deriving BLEU [2][2]
Incorporating PrecisionIncorporating Recall
How is translation performance measured
The closer a machine translation is to a professional human translation the better it is
bull A corpus of good quality human reference translations
bull A numerical ldquotranslation closenessrdquo metric
Preliminaries
bull Candidate Translation(s) Translation returned by an MT system
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
Goal of BLEU To correlate with human judgment
Formulating BLEU (Step 1) PrecisionI had lunch now
Candidate 1 मन अब खाना खाया
maine ab khana khaya matching unigrams 3I now food ate matching bigrams 1I ate food now
Candidate 2 मन अभी लच एटmaine abhi lunch ate matching unigrams 2I now lunch ate
I ate lunch(OOV) now(OOV) matching bigrams 1Unigram precision Candidate 1 34 = 075 Candidate 2 24 = 05Similarly bigram precision Candidate 1 033 Candidate 2 = 033
Reference 1 मन अभी खाना खायाmaine abhi khana khayaI now food ateI ate food now
Reference 2 मन अभी भोजन कयाmaine abhi bhojan kiyaa
I now meal didI did meal now
Precision Not good enough
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 1 मर तरा स र छाया matching unigram 3
mere tera suroor chhaayamy your spell castYour spell cast my
Candidate 2 तरा तरा तरा स र matching unigrams 4tera tera tera surooryour your your spell
Unigram precision Candidate 1 34 = 075 Candidate 2 44 = 1
Formulating BLEU (Step 2) Modified Precision
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spell
bull matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Modified n-gram precision
For entire test corpus for a given n
n-gram Matching n-grams in C
n-gramrsquo All n-grams in CModified precision for n-grams Overall candidates of
test corpus
Formula from [2]
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice: Evaluation 2 (1/2) [3]

• French-English translation
• System 1: rule-based system using Systran
• Systems 2, 3: SMT systems trained on two sets of Europarl data (28 million words per language)
• 300 test sentences
Failures in practice: Evaluation 2 (2/2) [3]

• BLEU underestimates the human judgment of the rule-based system.

[Graph: human score vs. BLEU for the three systems]
Admits too much and too little variation [4]

Indicative translations; linguistic divergences and evaluation
Indicative translation

• Draft-quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi):

• Structural transfer
  – S V O to S O V
• Lexical transfer
  – Looking up the corresponding word
  – Adding gender, aspect, tense, etc. information
Indicative translation: Example

Rohan goes to school by bus

Structural transfer (SVO to SOV):
(Rohan) (to school) (by bus) (goes)

Lexical transfer (word translation):
(Rohan) (paathshaala) (bus) (jaana)

Lexical transfer (adding GNP/TAM information):
(Rohan) (paathshaala) (bus se) (jaata hai)

Indicative output: Rohan paathshaala bus se jaata hai
Natural order: Rohan bus se paathshaala jaata hai
Questions in the context of indicative translations [4]

• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is "failure of MT" (defined using any measure) simply "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences

• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence

• Change in lexical category
(E: English source; H: Hindi reference translation; I: indicative MT output)

E: I am feeling hungry
H: मझ भख लग रह ह
mujhe bhookh lag rahi hai
to-me hunger feeling is

I: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon
I hungry feel doing am

Unigram precision: 0; Bigram precision: 0; BLEU score: 0

E: I am sweating
H: मझ पसीना आ रहा ह
Mujhe paseena aa raha hai
to-me sweat coming is
Noun-noun compounds

E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस
Aamir khaan ki dus sarvottam performances
Aamir Khan-of ten best performances

I: दस सव तम आ मर ख़ान पफॉम सस
Dus sarvottam aamir khaan performances
ten best Aamir Khan performances

Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences

• For example, kinship terms such as 'maamaa' and 'taau' have no exact English equivalents

E: Food, clothing and shelter are a man's basic needs
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
bread clothing and house a man-of basic needs are

I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
food clothing and shelter a man-of basic needs are

Unigram precision: 8/10; Bigram precision: 6/9
Pleonastic divergence

• Words with no semantic content in the target language

E: It is raining
H: बा रश हो रह ह
baarish ho rahi hai
rain happening is

I: यह बा रश हो रह ह
yeh baarish ho rahi hai
it rain happening is

Unigram precision: 4/5; Bigram precision: 3/4

E: One should not trouble the weak
H: दबल को परशान नह करना चा हए
durbalon ko pareshaan nahi karna chahiye
to-weak trouble not do should
(Should not trouble the weak)
Stylistic differences

• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hain
Lok Sabha in 545 members are

I: लोक सभा क पास 545 सद य ह
lok sabha ke paas 545 sadasya hain
Lok Sabha has/near 545 members are

Unigram precision: 5/7; Bigram precision: 3/6
WSD errors

• Synonyms vs. incorrect senses

E: I purchased a bat
H (reference): मन एक ब ला खर दा
maine ek ballaa kharidaa
I a cricket-bat bought
(I bought a cricket bat)

I: मन एक चमगादड़ खर दा
maine ek chamgaadaD kharidaa
I a bat (mammal) bought
(I bought a bat, the mammal)

E: The thieves were held
H: चोर को गर तार कया
choron ko giraftaar kiyaa
thieves arrested done
(The thieves were arrested)

I1: चोर को पकड़ा
choron ko pakdaa
thieves caught
(The thieves were caught)

I2: चोर को आयोिजत कया
choron ko aayojit kiya
thieves organized done
(The thieves were organized)
Evaluation: Precision

• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU

• Failed correlation with human judgment [3]
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating language-specific divergences (see the sketch below)
  – Limiting the use of BLEU [3][4]: do not use BLEU to compare radically different systems; evaluate on the basis of the nature of the indicative translation
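As referenced in the list above, a sketch of what re-defining a match could look like; `naive_stem` and the synonym table are hypothetical stand-ins for a real stemmer and synonym lexicon, so treat this as an illustration of the idea, not a specification.

```python
def relaxed_match(cand_word, ref_word, stem, synonyms):
    """Hypothetical relaxed unigram match: exact form, shared root
    form, or a listed synonym all count as a match."""
    return (cand_word == ref_word
            or stem(cand_word) == stem(ref_word)
            or ref_word in synonyms.get(cand_word, set()))

def naive_stem(word):
    # Stand-in for a real stemmer.
    return word.rstrip("s")

synonyms = {"pounced": {"jumped"}}  # stand-in for a synonym lexicon

print(relaxed_match("performances", "performance", naive_stem, synonyms))  # True
print(relaxed_match("pounced", "jumped", naive_stem, synonyms))            # True
```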
Conclusion

• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences exist in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics

Presented as a part of CS712 by
Aditya Joshi
Kashyap Popat
Shubham Gautam

21st March 2013

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.

Source: www.jaishreeganesha.com

Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."

Who among them performed better? Who won the race?
Outline

• Manual Evaluation
• Automatic Evaluation
  – BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Preliminaries

• Candidate Translation(s): translation returned by an MT system
  – Also called the hypothesis translation
• Reference Translation(s): 'perfect' translation by humans
• Each metric below is compared with BLEU on three axes:
  – Handling incorrect words
  – Handling incorrect word order
  – Handling recall
Manual evaluation [11]

Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency

• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy (is the meaning translated correctly?):
  5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency (is the sentence grammatically valid?):
  5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

म एक या यान बठा ह
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
"I sit a lecture": adequate but not fluent

म या यान ह
Main vyakhyan hoon
I lecture am
"I am lecture": fluent but not adequate
BLEU [2]

• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Translation Edit Rate (TER) [5]

• Introduced in the GALE MT task

Central idea: the number of edits required to change a hypothesis translation into a reference translation.

Source: Prime Minister of India will address the nation today

Reference translation: भारत क धान-म ी आज रा को सबो धत करग
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge

Candidate translation: धान-म ी भारत क आज को अ स करग आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj

Edit types involved: shift, deletion, substitution, insertion
Formula for TER

TER = number of edits / average number of reference words

• The 'distance' of a shift is not incorporated into its cost
• Mis-capitalization is also considered an error

For the example above (1 shift, 1 deletion, 1 substitution, 1 insertion over an 8-word reference): TER = 4/8 = 0.5
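A minimal sketch of the score computation, with the edit count taken as given (a full TER implementation also searches for the cheapest sequence of shifts):

```python
def ter(num_edits: int, ref_lengths: list) -> float:
    """TER = edits / average number of reference words. Insertions,
    deletions, substitutions and phrase shifts all cost 1; the
    distance a phrase is shifted is not part of the cost."""
    return num_edits / (sum(ref_lengths) / len(ref_lengths))

# The example above: 1 shift + 1 deletion + 1 substitution + 1 insertion
# against a single 8-word reference.
print(ter(4, [8]))  # 0.5
```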
HTER: Human-targeted TER

• A semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
  – They must consider fluency and adequacy while generating the closest target translation
TER vs. BLEU

                              | TER                                     | BLEU
Handling incorrect words      | Substitution                            | N-gram mismatch
Handling incorrect word order | Shift, or delete + insert, captures it  | N-gram mismatch
Handling recall               | Missed words become deleted words       | Precision cannot detect 'missing' words; hence the brevity penalty

(TER = edits / avg. number of reference words)
Word Error Rate (WER) [9]

• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions needed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant, PER (position-independent word error rate), also exists
WER: Example

• Order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence

Number of errors = 3
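A small Python sketch of WER via the standard dynamic-programming edit distance; whitespace tokenization is an assumption for illustration.

```python
def wer(reference: list, hypothesis: list) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the reference length, computed with dynamic programming."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n] / m

ref = "this looks like the correct sentence".split()
hyp = "this seems the right sentence".split()
print(wer(ref, hyp))  # 3 errors / 6 reference words = 0.5
```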
WER vs. BLEU

                              | WER                                | BLEU
Handling incorrect words      | Substitution                       | N-gram mismatch
Handling incorrect word order | Delete + insert                    | N-gram mismatch
Handling recall               | Missed words become deleted words  | Precision cannot detect 'missing' words; hence the brevity penalty
WER vs. TER

WER                                                 | TER
Shift is not taken into account                     | Shift is incorporated, though not weighted by distance
Intuition is based on errors in unigram translation | Intuition is based on the edits a human would need

(TER = edits / avg. number of reference words)
ROUGE [6]

• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N

ROUGE-N incorporates recall: it is the fraction of reference n-grams that are matched by the candidate.

Will BLEU be able to understand the quality of long sentences?

Reference translation: या ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega

Candidate translation: लब वा य
Lambe vaakya

ROUGE-2 = 1/8 (one of the eight reference bigrams is matched), while the modified n-gram precision is 1.
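A sketch of ROUGE-N with clipped counts, using the transliterated tokens above in place of the Devanagari; the helper is our own illustration.

```python
from collections import Counter

def rouge_n(candidate: list, reference: list, n: int) -> float:
    """ROUGE-N: recall of reference n-grams, i.e. the clipped number
    of reference n-grams also found in the candidate, divided by the
    total number of reference n-grams."""
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / sum(ref.values())

ref = "kya bloo lambe waakya ki gunvatta ko samajh paaega".split()
cand = "lambe waakya".split()
print(rouge_n(cand, ref, 2))  # 1/8 = 0.125, as on the slide above
```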
Other ROUGEs

• ROUGE-L
  – Considers the longest common subsequence
• ROUGE-W
  – Weighted ROUGE-L: all common subsequences are considered, with weights based on length
• ROUGE-S
  – Precision/recall computed by matching skip-bigrams
ROUGE vs. BLEU

                              | ROUGE (suite of metrics)           | BLEU
Handling incorrect words      | Skip bigrams, ROUGE-N              | N-gram mismatch
Handling incorrect word order | Longest common subsequence         | N-gram mismatch
Handling recall               | ROUGE-N incorporates missing words | Precision cannot detect 'missing' words; hence the brevity penalty
METEOR [7]

Aims to do better than BLEU.

Central idea: have a good unigram matching strategy.

METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score be zero?
METEOR: Process

Phase I: find an alignment between words
  – List all possible mappings based on matches
Phase II: select the best possible alignment
  – Select the alignment with the least number of 'alignment crosses/overlaps'

Repeat with different matching strategies:
  – Incorporate stemmers
  – Consider synonyms, etc.

Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced
            The intelligent and excited dude jumped
METEOR: The score

• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean.
• Penalty: find 'as many chunks' as match:
    The bright boy sits on the black bench
    The intelligent guy sat on the dark bench
• More accurate -> fewer chunks -> less penalty
• Less accurate -> more chunks -> more penalty
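A sketch of the sentence-level score, assuming the parameters of the original METEOR paper [7] (recall weighted 9:1 in the harmonic mean; penalty = 0.5 x (chunks/matches)^3); the alignment statistics are inputs that the two-phase matcher above would produce.

```python
def meteor_score(matches: int, cand_len: int, ref_len: int, chunks: int) -> float:
    """METEOR sentence score from unigram alignment statistics:
    recall-weighted harmonic mean of precision and recall, scaled
    down by a fragmentation penalty based on the number of chunks."""
    if matches == 0:
        return 0.0
    p = matches / cand_len
    r = matches / ref_len
    f_mean = 10 * p * r / (r + 9 * p)
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# Same match count, different fragmentation: fewer chunks, higher score.
print(meteor_score(6, 8, 8, 2))  # ~0.74: matches fall in 2 contiguous chunks
print(meteor_score(6, 8, 8, 6))  # 0.375: matches are fully scattered
```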
METEOR vs. BLEU

                              | METEOR                                                                | BLEU
Handling incorrect words      | Alignment chunks; matching can use different techniques (adaptable)   | N-gram mismatch
Handling incorrect word order | Chunks may be ordered in any manner; METEOR does not capture this     | N-gram mismatch
Handling recall               | The idea of alignment incorporates missing-word handling              | Precision cannot detect 'missing' words; hence the brevity penalty
NIST Metric: Introduction [8]

• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones)
  – Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given

NIST Metric

Information weight of an n-gram, counted over the reference corpus:
info(w1..wn) = log2( count(w1..w(n-1)) / count(w1..wn) )
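A sketch of these information weights, assuming the definition in [8]; the two-sentence corpus is a toy example of our own.

```python
from collections import Counter
from math import log2

def nist_info_weights(ref_corpus, n):
    """NIST information weights over a reference corpus:
    info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn)).
    Rare continuations score high; predictable ones score near zero."""
    counts = Counter()
    total_words = 0
    for sent in ref_corpus:
        total_words += len(sent)
        for k in (n, n - 1):       # count both n-grams and their prefixes
            if k > 0:
                counts.update(zip(*[sent[i:] for i in range(k)]))
    weights = {}
    for gram, c in counts.items():
        if len(gram) == n:
            prefix_count = counts[gram[:-1]] if n > 1 else total_words
            weights[gram] = log2(prefix_count / c)
    return weights

corpus = [s.split() for s in [
    "the act was passed by the house of the people",
    "the people of the house passed the act",
]]
w = nist_info_weights(corpus, 2)
print(w[("the", "act")], w[("of", "the")])  # ~1.58 vs 0.0: 'of the' is predictable
```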
NIST vs. BLEU

                              | NIST                                                            | BLEU
Handling incorrect words      | Information gain: an additional level of 'matching'             | N-gram mismatch
Handling incorrect word order | No additional provision; n-gram mismatch (weighted by info gain) | N-gram mismatch
Handling recall               | Brevity penalty factor based on L_sys and L_ref                 | Precision cannot detect 'missing' words; hence the brevity penalty
GTM [10]

• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  – What does a BLEU score of 0.006 mean?
• Comparison across systems is easier
GTM: Scores Precision and Recall

• MMS: Maximum Match Size
• Precision = MMS / |candidate|; Recall = MMS / |reference|
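A sketch assuming GTM's run-rewarding match size with exponent 2 (match size = square root of the sum of squared run lengths, as in [10]); the run lengths are taken as given rather than computed by the full maximum-matching algorithm.

```python
from math import sqrt

def gtm_fscore(run_lengths: list, cand_len: int, ref_len: int) -> float:
    """GTM F-score sketch: the maximum match size (MMS) rewards long
    contiguous runs of matched words via sqrt(sum of run^2);
    precision = MMS / |candidate|, recall = MMS / |reference|."""
    mms = sqrt(sum(r * r for r in run_lengths))
    p, r = mms / cand_len, mms / ref_len
    return 2 * p * r / (p + r) if p + r else 0.0

# Four matched words in one contiguous run beat four scattered
# single-word matches, so word order affects the score:
print(gtm_fscore([4], 8, 8))           # MMS = 4 -> F = 0.5
print(gtm_fscore([1, 1, 1, 1], 8, 8))  # MMS = 2 -> F = 0.25
```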
GTM vs. BLEU

                              | GTM                                   | BLEU
Handling incorrect words      | Precision based on maximum match size | N-gram mismatch
Handling incorrect word order | By considering maximum runs           | N-gram mismatch
Handling recall               | Recall based on maximum match size    | Precision cannot detect 'missing' words; hence the brevity penalty
Entailment-based evaluation [12]

Why entailment?

E: I am feeling hungry
H (Ref): मझ भख लग रह ह
mujhe bhookh lag rahi hai

Candidate: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon

BLEU score: 0, yet the candidate is clearly entailed by the reference translation.
Entailment-based evaluation [12]

Evaluating MT output for adequacy is an entailment task: a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system

Analysis (POS tagging, dependency parsing, NER, etc.) -> Alignment (of words and phrases) -> Entailment (predicting whether entailment holds or not)
RTE for MT evaluation

The same pipeline is applied: analysis (POS tagging, dependency parsing, NER, etc.) -> alignment of words and phrases -> entailment prediction.

Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
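A sketch of the bidirectional check; `entails` is a hypothetical stand-in for a trained RTE model such as the pipeline sketched above.

```python
def entails(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an RTE model (analysis ->
    alignment -> entailment decision); returns a confidence
    that `premise` entails `hypothesis`."""
    raise NotImplementedError("plug in a trained RTE system here")

def adequacy(candidate: str, reference: str) -> float:
    """Adequacy as bidirectional entailment: a fully adequate
    candidate entails the reference and vice versa, so the score
    is limited by the weaker direction."""
    return min(entails(candidate, reference),
               entails(reference, candidate))
```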
Experimental Setup

• Single reference translation
• Two MT datasets with English as the target language:

NIST MT datasets:
  – NIST06: the NIST MT-06 Arabic-to-English dataset
  – NIST08A: the NIST MT-08 Arabic-to-English dataset
  – NIST08C: the NIST MT-08 Chinese-to-English dataset
  – NIST08U: the NIST MT-08 Urdu-to-English dataset

ACL SMT datasets:
  – SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
  – SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
  – SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
  – SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup

Four scoring systems are compared:
  – BLEU-4
  – MT (features from standard MT metrics)
  – RTE (entailment features)
  – RTE + MT (combined)

Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)

• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
  – Considering the constituents of the formula
  – Calculating BLEU for a dummy example
• Compared BLEU with human judgment

Conclusion: Shortcomings of BLEU (2/3)

• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences exist in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)

• Manual evaluation: basic techniques and bottlenecks
• WER: error in words, based on word order; also PER
• TER: based on the edits required to reach the reference translation; also human-targeted TER (HTER)
• ROUGE: suite of metrics incorporating n-gram recall, skip bigrams, etc.
• METEOR: considers alignments; allows the addition of different types of matching
• NIST: information-gain based; weights content and function words differently
• GTM: modifies precision and recall based on the maximum match size
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References

[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of HLT 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Evaluation in NLP PrecisionRecall
Precision How many results returned were correct
RecallWhat portion of correct results were returned
Adapting precisionrecall to NLP tasks
Evaluation in NLP PrecisionRecall
bull Document Retrieval
Precision = |Documents relevant and retrieved||Documents retrieved|
Recall=|Documents relevant and retrieved|| Documents relevant|
bull Classification
Precision = |True Positives||True Positives + False Positives|
Recall=|True Positives|| True Positives + False Negatives|
Evaluation in MT [1]
bull Operational evaluationndash ldquoIs MT system A operationally better than MT system B
Does MT system A cost lessrdquo
bull Typological evaluationndash ldquoHave you ensured which linguistic phenomena the MT
system coversrdquo
bull Declarative evaluationndash ldquoHow does quality of output of system A fare with respect
to that of Brdquo
Operational evaluation
bull Cost-benefit is the focusbull To establish cost-per-unit figures and use this
as a basis for comparisonbull Essentially lsquoblack boxrsquobull Realism requirement
Word-based SA
Sense-based SA
Cost of pre-processing (Stemming etc)Cost of learning a classifier
Cost of pre-processing (Stemming etc)Cost of learning a classifierCost of sense annotation
Typological evaluation
bull Use a test suite of examplesbull Ensure all relevant phenomena to be tested
are covered
bull Specific-to-language phenomenaFor example हर आख वाला लड़का म कराया
hari aankhon-wala ladka muskurayagreen eyes-with boy smiledThe boy with green eyes smiled
Declarative evaluation
bull Assign scores to specific qualities of outputndash Intelligibility How good the output is as a well-
formed target language entityndash Accuracy How good the output is in terms of
preserving content of the source text
For example I am attending a lectureम एक या यान बठा ह Main ek vyaakhyan baitha hoonI a lecture sit (Present-first person)I sit a lecture Accurate but not intelligibleम या यान ह Main vyakhyan hoonI lecture amI am lecture Intelligible but not accurate
Evaluation bottleneck
bull Typological evaluation is time-consumingbull Operational evaluation needs accurate
modeling of cost-benefit
bull Automatic MT evaluation DeclarativeBLEU Bilingual Evaluation Understudy
Deriving BLEU Deriving BLEU [2][2]
Incorporating PrecisionIncorporating Recall
How is translation performance measured
The closer a machine translation is to a professional human translation the better it is
bull A corpus of good quality human reference translations
bull A numerical ldquotranslation closenessrdquo metric
Preliminaries
bull Candidate Translation(s) Translation returned by an MT system
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
Goal of BLEU To correlate with human judgment
Formulating BLEU (Step 1) PrecisionI had lunch now
Candidate 1 मन अब खाना खाया
maine ab khana khaya matching unigrams 3I now food ate matching bigrams 1I ate food now
Candidate 2 मन अभी लच एटmaine abhi lunch ate matching unigrams 2I now lunch ate
I ate lunch(OOV) now(OOV) matching bigrams 1Unigram precision Candidate 1 34 = 075 Candidate 2 24 = 05Similarly bigram precision Candidate 1 033 Candidate 2 = 033
Reference 1 मन अभी खाना खायाmaine abhi khana khayaI now food ateI ate food now
Reference 2 मन अभी भोजन कयाmaine abhi bhojan kiyaa
I now meal didI did meal now
Precision Not good enough
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 1 मर तरा स र छाया matching unigram 3
mere tera suroor chhaayamy your spell castYour spell cast my
Candidate 2 तरा तरा तरा स र matching unigrams 4tera tera tera surooryour your your spell
Unigram precision Candidate 1 34 = 075 Candidate 2 44 = 1
Formulating BLEU (Step 2) Modified Precision
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spell
bull matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Modified n-gram precision
For entire test corpus for a given n
n-gram Matching n-grams in C
n-gramrsquo All n-grams in CModified precision for n-grams Overall candidates of
test corpus
Formula from [2]
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Evaluation in NLP PrecisionRecall
bull Document Retrieval
Precision = |Documents relevant and retrieved||Documents retrieved|
Recall=|Documents relevant and retrieved|| Documents relevant|
bull Classification
Precision = |True Positives||True Positives + False Positives|
Recall=|True Positives|| True Positives + False Negatives|
Evaluation in MT [1]
bull Operational evaluationndash ldquoIs MT system A operationally better than MT system B
Does MT system A cost lessrdquo
bull Typological evaluationndash ldquoHave you ensured which linguistic phenomena the MT
system coversrdquo
bull Declarative evaluationndash ldquoHow does quality of output of system A fare with respect
to that of Brdquo
Operational evaluation
bull Cost-benefit is the focusbull To establish cost-per-unit figures and use this
as a basis for comparisonbull Essentially lsquoblack boxrsquobull Realism requirement
Word-based SA
Sense-based SA
Cost of pre-processing (Stemming etc)Cost of learning a classifier
Cost of pre-processing (Stemming etc)Cost of learning a classifierCost of sense annotation
Typological evaluation
bull Use a test suite of examplesbull Ensure all relevant phenomena to be tested
are covered
bull Specific-to-language phenomenaFor example हर आख वाला लड़का म कराया
hari aankhon-wala ladka muskurayagreen eyes-with boy smiledThe boy with green eyes smiled
Declarative evaluation
bull Assign scores to specific qualities of outputndash Intelligibility How good the output is as a well-
formed target language entityndash Accuracy How good the output is in terms of
preserving content of the source text
For example I am attending a lectureम एक या यान बठा ह Main ek vyaakhyan baitha hoonI a lecture sit (Present-first person)I sit a lecture Accurate but not intelligibleम या यान ह Main vyakhyan hoonI lecture amI am lecture Intelligible but not accurate
Evaluation bottleneck
bull Typological evaluation is time-consumingbull Operational evaluation needs accurate
modeling of cost-benefit
bull Automatic MT evaluation DeclarativeBLEU Bilingual Evaluation Understudy
Deriving BLEU Deriving BLEU [2][2]
Incorporating PrecisionIncorporating Recall
How is translation performance measured
The closer a machine translation is to a professional human translation the better it is
bull A corpus of good quality human reference translations
bull A numerical ldquotranslation closenessrdquo metric
Preliminaries
bull Candidate Translation(s) Translation returned by an MT system
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
Goal of BLEU To correlate with human judgment
Formulating BLEU (Step 1) PrecisionI had lunch now
Candidate 1 मन अब खाना खाया
maine ab khana khaya matching unigrams 3I now food ate matching bigrams 1I ate food now
Candidate 2 मन अभी लच एटmaine abhi lunch ate matching unigrams 2I now lunch ate
I ate lunch(OOV) now(OOV) matching bigrams 1Unigram precision Candidate 1 34 = 075 Candidate 2 24 = 05Similarly bigram precision Candidate 1 033 Candidate 2 = 033
Reference 1 मन अभी खाना खायाmaine abhi khana khayaI now food ateI ate food now
Reference 2 मन अभी भोजन कयाmaine abhi bhojan kiyaa
I now meal didI did meal now
Precision Not good enough
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 1 मर तरा स र छाया matching unigram 3
mere tera suroor chhaayamy your spell castYour spell cast my
Candidate 2 तरा तरा तरा स र matching unigrams 4tera tera tera surooryour your your spell
Unigram precision Candidate 1 34 = 075 Candidate 2 44 = 1
Formulating BLEU (Step 2): Modified Precision
• Clip the total count of each candidate word by its maximum reference count:
Count_clip(n-gram) = min(count, max_ref_count)

Reference: मझपर तरा स र छाया
mujh-par tera suroor chhaaya (me-on your spell cast / Your spell was cast on me)

Candidate 2: तरा तरा तरा स र
tera tera tera suroor (your your your spell)

• Matching unigrams: (तरा: min(3, 1) = 1), (स र: min(1, 1) = 1)
Modified unigram precision: 2/4 = 0.5
Modified n-gram precision
For the entire test corpus, for a given n:

p_n = [ Σ_{C ∈ Candidates} Σ_{n-gram ∈ C} Count_clip(n-gram) ] / [ Σ_{C ∈ Candidates} Σ_{n-gram' ∈ C} Count(n-gram') ]

Numerator: clipped matching n-grams in each candidate C; denominator: all n-grams in C; both summed over all candidates of the test corpus. Formula from [2].
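A minimal Python sketch of this computation, with illustrative function names (not the reference implementation from [2]):

from collections import Counter

def ngrams(tokens, n):
    # All n-grams (as tuples) in a token sequence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # Clipped n-gram precision of one candidate against multiple references
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], cnt)  # max_ref_count
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
    return clipped / sum(cand_counts.values())

# The 'tera tera tera suroor' example: clipping caps 'tera' at 1
ref = "mujhpar tera suroor chhaaya".split()
print(modified_precision("tera tera tera suroor".split(), [ref], 1))  # 0.5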
Calculating modified n-gram precision (1/2)
• 127 source sentences were translated by two human translators and three MT systems
• Translated sentences were evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (2/2)
• Decaying precision with increasing n
• Comparative ranking of the five systems
• Combining precision for different values of n
Graph from [2]
Formulation of BLEU: recap
• Precision cannot be used as-is
• Modified precision considers the 'clipped word count'
Recall for MT (1/2)
• Candidates shorter than references
Reference: या ल लब वा य क गणव ता को समझ पाएगा
kya blue lambe vaakya ki guNvatta ko samajh paaega
(will blue long sentence-of quality understand be-able / Will BLEU be able to understand the quality of long sentences?)

Candidate: लब वा य
lambe vaakya (long sentence)

Modified unigram precision: 2/2 = 1
Modified bigram precision: 1/1 = 1
Recall for MT (2/2)
• Candidates longer than references
Reference 1: मन खाना खाया
maine khaana khaayaa (I food ate / I ate food)
Reference 2: मन भोजन कया
maine bhojan kiyaa (I meal did / I had a meal)

Candidate 1: मन खाना भोजन कया
maine khaana bhojan kiya (I food meal did / I had food meal)
Modified unigram precision: 1

Candidate 2: मन खाना खाया
maine khaana khaayaa (I food ate / I ate food)
Modified unigram precision: 1
Formulating BLEU (Step 3): Incorporating recall
• Sentence length indicates the 'best match'
• Brevity penalty (BP):
– Multiplicative factor
– Candidate translations that match reference translations in length must be ranked higher

Candidate 1: लब वा य
Candidate 2: या ल लब वा य क गणव ता समझ पाएगा
Formulating BLEU (Step 3): Brevity Penalty

BP = 1 for c > r
BP = e^(1 - r/c) for c ≤ r
where r = reference sentence length and c = candidate sentence length.
(Plot of e^(1-x) against x = r/c drawn using www.fooplot.com; formula from [2])

Why does BP leave out longer translations (BP = 1 for c > r)?
Translations longer than the reference are already penalized by modified precision.
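A one-line Python sketch of the brevity penalty under the definitions above:

import math

def brevity_penalty(c, r):
    # BP = 1 if candidate length c exceeds reference length r, else e^(1 - r/c)
    return 1.0 if c > r else math.exp(1.0 - r / c)

print(brevity_penalty(7, 7))  # 1.0
print(brevity_penalty(2, 8))  # e^(-3) ~ 0.05: a two-word candidate is heavily penalized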
Validating the claim

BLEU score:
BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n )
Precision → modified n-gram precision p_n; recall → brevity penalty BP. Formula from [2].
Understanding BLEU
Dissecting the formula; using it on a news headline example
Decay in precision
Why log p_n? To accommodate the decay in precision values with increasing n.
Graph from [2]; formula from [2]
Dissecting the Formula
Claim: BLEU should lie between 0 and 1
Reason: to intuitively satisfy "1 implies perfect translation"
Understanding the constituents of the formula to validate the claim:
• Brevity penalty BP
• Modified precision p_n
• Weights w_n, set to 1/N
Formula from [2]
Validation of range of BLEU
p_n: between 0 and 1, so log p_n: between -infinity and 0
A = Σ w_n log p_n: between -infinity and 0, so e^A: between 0 and 1
BP: between 0 and 1
Hence BLEU = BP · e^A lies between 0 and 1.
(Plots of BP against r/c and of p_n, log p_n, A, e^A drawn using www.fooplot.com)
Calculating BLEU: an actual example
Ref: भावना मक एक करण लान क लए एक योहार
bhaavanaatmak ekikaran laane ke liye ek tyohaar (emotional unity bring-to a festival / A festival to bring emotional unity)

C1: लान क लए एक योहार भावना मक एक करण
laane ke liye ek tyohaar bhaavanaatmak ekikaran (bring-to a festival emotional unity; invalid Hindi ordering)

C2: को एक उ सव क बार म लाना भावना मक एक करण
ko ek utsav ke baare mein laana bhaavanaatmak ekikaran (for a festival-about to-bring emotional unity; invalid Hindi ordering)
Modified n-gram precision
Ref: भावना मक एक करण लान क लए एक योहार
C1: लान क लए एक योहार भावना मक एक करण
C2: को एक उ सव क बार म लाना भावना मक एक करण

C1 (r = 7, c = 7):
n | matching n-grams | total n-grams | modified n-gram precision
1 | 7 | 7 | 1
2 | 5 | 6 | 0.83

C2 (r = 7, c = 9):
n | matching n-grams | total n-grams | modified n-gram precision
1 | 4 | 9 | 0.44
2 | 1 | 8 | 0.125
Calculating BLEU score
w_n = 1/N = 1/2, r = 7

C1: c = 7, BP = exp(1 - 7/7) = 1
n | w_n | p_n | log p_n | w_n · log p_n
1 | 1/2 | 1 | 0 | 0
2 | 1/2 | 0.83 | -0.186 | -0.093
Total: -0.093, so BLEU = 1 · e^(-0.093) = 0.911

C2: c = 9, BP = 1
n | w_n | p_n | log p_n | w_n · log p_n
1 | 1/2 | 0.44 | -0.821 | -0.4105
2 | 1/2 | 0.125 | -2.07 | -1.035
Total: -1.445, so BLEU = 1 · e^(-1.445) = 0.235
Hence the BLEU scores:
Ref: भावना मक एक करण लान क लए एक योहार
C1: लान क लए एक योहार भावना मक एक करण : BLEU = 0.911
C2: को एक उ सव क बार म लाना भावना मक एक करण : BLEU = 0.235
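The whole computation can be checked with a short Python sketch (the slide's 0.911 comes from rounding p_2 to 0.83 before taking the log; unrounded values give 0.913):

import math

def bleu(ps, ws, c, r):
    # BLEU = BP * exp(sum_n w_n log p_n), with BP = 1 if c > r else e^(1 - r/c)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(ws, ps)))

print(round(bleu([7/7, 5/6], [0.5, 0.5], 7, 7), 3))  # C1: ~0.913
print(round(bleu([4/9, 1/8], [0.5, 0.5], 9, 7), 3))  # C2: ~0.236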
BLEU vs. human judgment [2]
Source language: Chinese; target language: English

Setup:
Five systems perform the translation: 3 automatic MT systems and 2 human translators.
BLEU scores are obtained for each system.
Human judgment (on a scale of 5) is obtained for each system from:
• Group 1: ten monolingual speakers of the target language (English)
• Group 2: ten bilingual speakers of Chinese and English
BLEU vs. human judgment
• Monolingual speakers: correlation coefficient 0.99
• Bilingual speakers: correlation coefficient 0.96
BLEU and human evaluation for S2 and S3: graph from [2]
Comparison of normalized values
• High correlation between the monolingual group and the BLEU score
• The bilingual group was lenient on 'fluency' for H1
• The demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
– Considering the constituents of the formula
– Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Part II: Shortcomings of BLEU (continued from 7th March 2013)

Presented by (as a part of CS 712):
Aditya Joshi, Kashyap Popat, Shubham Gautam
IIT Bombay

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
BLEU score revisited
Precision → modified n-gram precision; recall → brevity penalty
Formula from [2]
Outline
Part I:
• Evaluation
• Formulating the BLEU metric
• Understanding the BLEU formula
Part II:
• Shortcomings of BLEU in general
• Shortcomings in the context of English-Hindi MT
Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of Bleu in Machine Translation Research. European ACL (EACL), 2006.
R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[A domesticated bird wants to learn to fly]
"I can do this. I just have to work out the physics. I have quadrated my vector angles. I have adjusted for wind shear. This is it. Let's fly. Just keep it simple: thrust, lift, drag and wait. Thrust, lift, drag, wait. [speeds up to get to the edge of the table] Thrust, lift, drag, wai-wai-wait!" [gets scared and tries to stop himself, but instead falls from the table]
Blu the Macaw, lines from Rio (2011)
Image source: www.fanpop.com
Use of the BLEU metric [3]
• Evaluating incremental system changes
– Neglecting actual examples
• Systems ranked on the basis of BLEU
• Does minimizing the error rate with respect to BLEU indeed guarantee translation improvements?
Criticisms of BLEU [4]
• Admits too much variation [3]
• Admits too little variation [3], [4]
• Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrases; drawing different items from the reference set
Admits too much variation
• BLEU relies on n-gram matching only
• It puts very few constraints on how n-gram matches can be drawn from multiple reference translations
(BLEU formula recap: brevity penalty incorporates recall; modified precision; weights set to 1/N)
Modified n-gram precision in BLEU [2]
• Clip the total count of each candidate word by its maximum reference count:
Count_clip(n-gram) = min(count, max_ref_count)

Reference: मझपर तरा स र छाया
mujhpar teraa suroor chhaaya (me-on your spell has-been-cast / Your spell has been cast on me)
Candidate 2: तरा तरा तरा स र
tera tera tera suroor (your your your spell)

Matching unigrams: (तरा: min(3, 1) = 1), (स र: min(1, 1) = 1)
Modified unigram precision: 2/4 = 0.5
Permuting phrases [3]
• Reordering of unmatched phrases does not affect precision
• Bigram mismatch sites can be freely permuted
• It is possible to randomly produce other hypothesis translations that have the same BLEU score

For example, matched bigram chunks B1 B2 B3 B4, separated by bigram mismatch sites, can be permuted to B4 B2 B1 B3 without changing the score.
Issues with precision (1/2)
Source: The king and the queen went to the jungle to hunt

Candidate: राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein gaye (king and queen to-do-hunting to-jungle went)
Matching bigrams = 4/8

Reference 1: राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gaye (king and queen to-jungle for-hunting went)
Reference 2: राजा और उनक बीवी शकार करन जगल गय
raaja aur unki biwi shikaar karne jangal gaye (king and his wife to-do-hunting jungle went)

Permuted candidate: राजा और रानी शकार करन जगल गय म
raaja aur raani shikaar karne jangal gaye mein (grammatically incorrect)
Matching bigrams = 4/8
Issues with precision (2/2)
Source: The king and the queen went to the jungle to hunt

Candidate: राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein gaye (king and queen to-do-hunting to-jungle went)
Matching bigrams = 4/8

Reference 1: राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gaye (king and queen to-jungle for-hunting went)
Reference 2: राजा और उनक बीवी शकार करन जगल गय
raaja aur unki biwi shikaar karne jangal gaye (king and his wife to-do-hunting jungle went)

Permuted candidate: शकार करन जगल राजा और रानी म गय
shikaar karne jungle raaja aur raani mein gaye (grammatically incorrect)
Matching bigrams = 4/8
Permuting phrases in general
• For 'b' bigram matches in a candidate translation of length 'k', there are (k - b)! possible ways to generate similarly scored items using only the words in this translation.

In a sentence of length k:
total bigrams = k - 1
matched bigrams = b
number of mismatched bigrams = k - 1 - b
number of matched chunks = (k - 1 - b) + 1 = k - b
These (k - b) chunks can be reordered in (k - b)! ways.

In our example: (8 - 4)! = 24 candidate translations (see the sketch below).
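The count can be sanity-checked in two lines of Python:

import math
k, b = 8, 4  # sentence length 8, 4 matched bigrams, as in the example above
print(math.factorial(k - b))  # 24 reorderings with the same BLEU score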
Permuting phrases: evaluation
• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translations
Drawing different items from the reference set
• Suppose two systems 'recall' x words each from a reference sentence:
– Precision may remain the same
– The brevity penalty remains the same
– Yet translation quality need not remain the same
Drawing different items: example
You may omit lexical items altogether.

Reference: घर क मग मतलब दाल होती ह
Ghar ki murgi matlab daal jaisi hoti hai (house-of chicken means daal-like is / Chicken at home is like daal)

Candidate 1: मग मतलब होती ह
Ki matlab jaisi hai (of means like is)
Unigram precision: 1/1; bigram precision: 2/3

Candidate 2: घर क मग दाल
Ghar murgi daal hoti (house chicken daal is)
Unigram precision: 1/1; bigram precision: 2/3
Drawing different items: in general
• If there are x unigram matches and y bigram matches in both candidates C1 and C2:
– The BLEU score remains the same
– Which words are actually matched is not accounted for
– Synonyms also count as 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT evaluation exercise
• Failure to correlate with human judgment
– In the 2005 exercise, the system ranked 1st in human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]
• The system with the high human judgment score is an outlier for BLEU

Adequacy:
Outlier included? | Pearson correlation (R^2)
Yes | 0.14
No | 0.87

Fluency:
Outlier included? | Pearson correlation (R^2)
Yes | 0.002
No | 0.742
(Graphs from [3])
Failures in practice: Evaluation 2 (1/2) [3]
• French-English
• System 1: rule-based system using Systran
• Systems 2, 3: SMT systems using two sets of Europarl data (28 million words per language)
• 300 test sentences
Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates human judgment of the rule-based system
(Graph from [3])
Admits too much and too little variation [4]
Indicative translations; linguistic divergences and evaluation
Indicative translation
• Draft-quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi)
• Structural transfer
– S V O to S O V
• Lexical transfer
– Looking up the corresponding word
– Adding gender, aspect, tense, etc. information
Indicative translation: example
Source: Rohan goes to school by bus

Structural transfer, SVO to SOV: (Rohan) (to school) (by bus) (goes)
Lexical transfer, word translation: (Rohan) (paathshaala) (bus) (jaana)
Lexical transfer, adding GNP/TAM information: (Rohan) (paathshaala) (bus se) (jaata hai)

Indicative output: Rohan paathshaala bus se jaata hai
Natural order: Rohan bus se paathshaala jaata hai
Questions in the context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
Is the "failure of MT" (defined using any measure) simply a "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• Change in lexical category

E: I am feeling hungry
H (reference): मझ भख लग रह ह
mujhe bhookh lag rahi hai (to-me hunger feeling is)
I (indicative): म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon (I hungry feel doing am)
Unigram precision: 0; bigram precision: 0; BLEU score: 0

Similarly:
E: I am sweating
H: मझ पसीना आ रहा ह
Mujhe paseena aa raha hai (to-me sweat coming is)
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस
Aamir khaan ki dus sarvottam performances (Aamir Khan-of ten best performances)
I: दस सव तम आ मर ख़ान पफॉम सस
Dus sarvottam aamir khaan performances (ten best Aamir Khan performances)
Unigram precision: 5/5; bigram precision: 2/4
Cultural differences
E: Food, clothing and shelter are a man's basic needs
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain (bread, clothing and house a man's basic needs are)
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain (food, clothing and shelter a man's basic needs are)
Unigram precision: 8/10; bigram precision: 6/9
• Similarly: 'maamaa'/'taau' (kinship terms without exact equivalents)
Pleonastic divergence
• Words with no semantic content in the target language

E: It is raining
H: बा रश हो रह ह
baarish ho rahi hai (rain happening is)
I: यह बा रश हो रह ह
yeh baarish ho rahi hai (it rain happening is)
Unigram precision: 4/5; bigram precision: 3/4

E: One should not trouble the weak
H: दबल को परशान नह करना चा हए
durbalon ko pareshaan nahi karna chahiye (to-weak trouble not do should / Should not trouble the weak)
Stylistic differences
• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hain (Lok Sabha in 545 members are)
I: लोक सभा क पास 545 सद य ह
lok sabha ke paas 545 sadasya hain (Lok Sabha has/near 545 members are)
Unigram precision: 5/7; bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses

E: I purchased a bat
H (reference): मन एक ब ला खर दा
maine ek ballaa kharidaa (I a cricket-bat bought)
I: मन एक चमगादड़ खर दा
maine ek chamgaadaD kharidaa (I a bat (mammal) bought)

E: The thieves were held
H: चोर को गर तार कया
choron ko giraftaar kiyaa (thieves arrest done / The thieves were arrested)
I1: चोर को पकड़ा
choron ko pakdaa (thieves caught: an acceptable synonym)
I2: चोर को आयोिजत कया
choron ko aayojit kiya (thieves organized done: an incorrect sense)
Evaluation: precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score
• Do not be overly reliant on BLEU; use it only as an 'evaluation understudy'
Re-defining a match [4]:
• Allowing synonyms
• Allowing root forms of words
• Incorporating specific language divergences

Limiting the use of BLEU [3], [4]:
• Do not use BLEU for radically different systems
• Evaluate on the basis of the nature of the indicative translation
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics

Presented as a part of CS712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
(Source: www.jaishreeganesha.com)
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
Manual evaluation
Automatic evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate Translation(s): translation returned by an MT system
– Also called hypothesis translation
• Reference Translation(s): 'perfect' translation by humans
• Each metric below is compared with BLEU on three axes:
– Handling incorrect words
– Handling incorrect word order
– Handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy (is the meaning translated correctly?):
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency (is the sentence grammatically valid?):
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

म एक या यान बठा ह (Main ek vyaakhyan baitha hoon; I a lecture sit, present first person): "I sit a lecture" is adequate but not fluent
म या यान ह (Main vyakhyan hoon; I lecture am): "I am lecture" is fluent but not adequate
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
Central idea: the number of edits required to change a hypothesis translation into a reference translation.

E: Prime Minister of India will address the nation today
Reference translation: भारत क धान-म ी आज रा को सबो धत करग
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: धान-म ी भारत क आज को अ स करग आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj
Edits: one shift, one deletion, one substitution, one insertion

Formula for TER:
TER = number of edits / average number of reference words
• The 'distance' of a shift is not incorporated in its cost
• Mis-capitalization is also considered an error
In the example above: TER = 4/8
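Given an already-counted edit sequence, the score itself is a one-liner in Python; this sketch assumes the edits are supplied, whereas a real TER scorer also searches over shifts for the minimum-cost sequence:

def ter(num_edits, ref_lengths):
    # TER = number of edits / average number of reference words
    return num_edits / (sum(ref_lengths) / len(ref_lengths))

# The example above: 1 shift + 1 deletion + 1 substitution + 1 insertion = 4 edits
print(ter(4, [8]))  # 0.5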
HTER: Human-targeted TER
• Semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
– They must consider fluency and adequacy while generating the closest target translation
TER vs. BLEU
Criterion | TER | BLEU
Handling incorrect words | Substitution | N-gram mismatch
Handling incorrect word order | A shift, or a delete + insert, incorporates this error | N-gram mismatch
Handling recall | Missed words become deleted words (TER = edits / avg. number of reference words) | Precision cannot detect 'missing' words; hence the brevity penalty
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• A variant: position-independent word error rate (PER)
WER: example
• The order of words is important
• A dynamic programming-based implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
Number of errors: 3 (substitute 'seems' for 'looks', delete 'like', substitute 'right' for 'correct')
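A compact dynamic-programming sketch of word-level Levenshtein distance (illustrative, lower-cased tokens):

def wer(ref, hyp):
    # d[i][j]: minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub/del/ins
    return d[len(ref)][len(hyp)] / len(ref)

ref = "this looks like the correct sentence".split()
hyp = "this seems the right sentence".split()
print(wer(ref, hyp))  # 3 errors / 6 reference words = 0.5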
WER vs. BLEU
Criterion | WER | BLEU
Handling incorrect words | Substitution | N-gram mismatch
Handling incorrect word order | Delete + insert | N-gram mismatch
Handling recall | Missed words become deleted words | Precision cannot detect 'missing' words; hence the brevity penalty
WER vs. TER
Criterion | WER | TER
Word shifts | A shift is not taken into account | A shift is incorporated; however, its cost is not weighted by distance
Intuition | Based on errors in unigram translation | Based on the edits a human would require (TER = edits / avg. number of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S

ROUGE-N
ROUGE-N incorporates recall:
E: Will BLEU be able to understand quality of long sentences?
Reference translation: या ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation: लब वा य (Lambe vaakya)
ROUGE-N (bigrams): 1/8, while the modified bigram precision is 1
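A sketch of ROUGE-N as n-gram recall (illustrative; the ROUGE package adds further options):

from collections import Counter

def rouge_n(candidate, reference, n):
    # Overlapping n-grams divided by the number of reference n-grams (recall)
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())
    return overlap / sum(ref.values())

ref = "kya bloo lambe waakya ki guNvatta ko samajh paaega".split()
print(rouge_n("lambe waakya".split(), ref, 2))  # 1/8 = 0.125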
Other ROUGEs
• ROUGE-L
– Considers the longest common subsequence
• ROUGE-W
– Weighted ROUGE-L: all common subsequences are considered, with weights based on length
• ROUGE-S
– Precision/recall by matching skip bigrams
ROUGE vs. BLEU
Criterion | ROUGE (suite of metrics) | BLEU
Handling incorrect words | Skip bigrams, ROUGE-N | N-gram mismatch
Handling incorrect word order | Longest common subsequence | N-gram mismatch
Handling recall | ROUGE-N incorporates missing words | Precision cannot detect 'missing' words; hence the brevity penalty
METEOR [7]
Aims to do better than BLEU.
Central idea: have a good unigram matching strategy.

METEOR: criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: process
Phase I: find an alignment between words
– List all possible mappings based on matches
Phase II: select the best possible alignment
– The alignment with the least number of alignment crosses/overlaps
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced / The intelligent and excited dude jumped
METEOR: the score
• Using the unigram mappings, precision and recall are calculated and combined as a harmonic mean.
• Penalty: group the matched unigrams into the fewest possible chunks; the penalty grows with the number of chunks.
Example: "The bright boy sits on the black bench" vs. "The intelligent guy sat on the dark bench"
More accurate alignment → fewer chunks → less penalty; less accurate → more chunks → more penalty
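A sketch of the score with the weights from Banerjee and Lavie (2005) [7]; the match, length and chunk counts are assumed to come from the alignment phase above:

def meteor(matches, cand_len, ref_len, chunks):
    p = matches / cand_len                   # unigram precision
    r = matches / ref_len                    # unigram recall
    fmean = 10 * p * r / (r + 9 * p)         # recall-weighted harmonic mean
    penalty = 0.5 * (chunks / matches) ** 3  # fragmentation penalty
    return fmean * (1 - penalty)

# 6 of 8 candidate words match 6 of 8 reference words, in 4 chunks
print(round(meteor(6, 8, 8, 4), 3))  # ~0.639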
METEOR vs. BLEU
Criterion | METEOR | BLEU
Handling incorrect words | Alignment chunks; matching can be done using different techniques, so it is adaptable | N-gram mismatch
Handling incorrect word order | Chunks may be ordered in any manner; METEOR does not capture this | N-gram mismatch
Handling recall | The idea of alignment incorporates missing-word handling | Precision cannot detect 'missing' words; hence the brevity penalty
NIST metric: introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., the rarer ones)
– Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
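The information weight behind this idea, as defined in [8], is computed over the reference data; a sketch with illustrative counts:

import math

def info_weight(count_prefix, count_ngram):
    # Info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn)): rarer continuations weigh more
    return math.log2(count_prefix / count_ngram)

print(info_weight(1000, 900))  # frequent continuation like 'of the': ~0.15
print(info_weight(10, 1))      # rare continuation like 'Amendment Act': ~3.32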
NIST vs. BLEU
Criterion | NIST | BLEU
Handling incorrect words | Information gain: an additional level of 'matching' | N-gram mismatch
Handling incorrect word order | No additional provision: n-gram mismatch (based on information gain) | N-gram mismatch
Handling recall | Brevity penalty factor based on L_sys and L_ref | Precision cannot detect 'missing' words; hence the brevity penalty
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
– What does a BLEU score of 0.006 mean?
• Comparison across systems is easier

GTM scores: precision and recall
• MMS: Maximum Match Size
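A sketch under the assumption that the maximum match size (MMS) has already been computed by the matcher:

def gtm_fscore(mms, cand_len, ref_len):
    # precision = MMS/|candidate|, recall = MMS/|reference|, combined as an F-score
    p = mms / cand_len
    r = mms / ref_len
    return 2 * p * r / (p + r) if p + r else 0.0

print(round(gtm_fscore(5, 7, 8), 3))  # illustrative numbers: ~0.667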
GTM vs. BLEU
Criterion | GTM | BLEU
Handling incorrect words | Precision based on maximum match size | N-gram mismatch
Handling incorrect word order | By considering maximum runs | N-gram mismatch
Handling recall | Recall based on maximum match size | Precision cannot detect 'missing' words; hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (ref): मझ भख लग रह ह (mujhe bhookh lag rahi hai)
Candidate: म भखा महसस कर रहा ह (main bhookha mehsoos kar raha hoon)
BLEU score: 0, yet the candidate is clearly entailed by the reference translation.

Evaluation of MT output for adequacy is an entailment task:
a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system
Pipeline: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)

RTE for MT evaluation
The same pipeline is used: Analysis → Alignment → Entailment

Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
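A sketch of the bidirectional check, where entails(text, hypothesis) stands in for an RTE system's decision (an assumed predicate, not an API from [12]):

def fully_adequate(entails, candidate, reference):
    # Adequate only if entailment holds in both directions
    return entails(candidate, reference) and entails(reference, candidate)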
Experimental setup
Single reference translation. Two MT datasets with English as the target language:

NIST MT dataset:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset

ACL SMT dataset:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental setup
Four scoring systems are compared: BLEU-4, MT, RTE, RTE + MT
Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
– Considering the constituents of the formula
– Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
Manual evaluation: basic techniques and bottlenecks
WER: error in words based on word order; also PER
TER: based on the edits required to get the reference translation; also human-targeted TER (HTER)
ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
METEOR: considers alignments; allows the addition of different types of matching
NIST: information gain-based; comparatively weights content and function words
GTM: modifies precision and recall based on maximum match size
Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM research report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of Bleu in Machine Translation Research. European ACL (EACL), 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
Evaluation in MT [1]
bull Operational evaluationndash ldquoIs MT system A operationally better than MT system B
Does MT system A cost lessrdquo
bull Typological evaluationndash ldquoHave you ensured which linguistic phenomena the MT
system coversrdquo
bull Declarative evaluationndash ldquoHow does quality of output of system A fare with respect
to that of Brdquo
Operational evaluation
bull Cost-benefit is the focusbull To establish cost-per-unit figures and use this
as a basis for comparisonbull Essentially lsquoblack boxrsquobull Realism requirement
Word-based SA
Sense-based SA
Cost of pre-processing (Stemming etc)Cost of learning a classifier
Cost of pre-processing (Stemming etc)Cost of learning a classifierCost of sense annotation
Typological evaluation
bull Use a test suite of examplesbull Ensure all relevant phenomena to be tested
are covered
bull Specific-to-language phenomenaFor example हर आख वाला लड़का म कराया
hari aankhon-wala ladka muskurayagreen eyes-with boy smiledThe boy with green eyes smiled
Declarative evaluation
bull Assign scores to specific qualities of outputndash Intelligibility How good the output is as a well-
formed target language entityndash Accuracy How good the output is in terms of
preserving content of the source text
For example I am attending a lectureम एक या यान बठा ह Main ek vyaakhyan baitha hoonI a lecture sit (Present-first person)I sit a lecture Accurate but not intelligibleम या यान ह Main vyakhyan hoonI lecture amI am lecture Intelligible but not accurate
Evaluation bottleneck
bull Typological evaluation is time-consumingbull Operational evaluation needs accurate
modeling of cost-benefit
bull Automatic MT evaluation DeclarativeBLEU Bilingual Evaluation Understudy
Deriving BLEU Deriving BLEU [2][2]
Incorporating PrecisionIncorporating Recall
How is translation performance measured
The closer a machine translation is to a professional human translation the better it is
bull A corpus of good quality human reference translations
bull A numerical ldquotranslation closenessrdquo metric
Preliminaries
bull Candidate Translation(s) Translation returned by an MT system
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
Goal of BLEU To correlate with human judgment
Formulating BLEU (Step 1) PrecisionI had lunch now
Candidate 1 मन अब खाना खाया
maine ab khana khaya matching unigrams 3I now food ate matching bigrams 1I ate food now
Candidate 2 मन अभी लच एटmaine abhi lunch ate matching unigrams 2I now lunch ate
I ate lunch(OOV) now(OOV) matching bigrams 1Unigram precision Candidate 1 34 = 075 Candidate 2 24 = 05Similarly bigram precision Candidate 1 033 Candidate 2 = 033
Reference 1 मन अभी खाना खायाmaine abhi khana khayaI now food ateI ate food now
Reference 2 मन अभी भोजन कयाmaine abhi bhojan kiyaa
I now meal didI did meal now
Precision Not good enough
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 1 मर तरा स र छाया matching unigram 3
mere tera suroor chhaayamy your spell castYour spell cast my
Candidate 2 तरा तरा तरा स र matching unigrams 4tera tera tera surooryour your your spell
Unigram precision Candidate 1 34 = 075 Candidate 2 44 = 1
Formulating BLEU (Step 2) Modified Precision
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spell
bull matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Modified n-gram precision
For entire test corpus for a given n
n-gram Matching n-grams in C
n-gramrsquo All n-grams in CModified precision for n-grams Overall candidates of
test corpus
Formula from [2]
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
METEOR: Process

Find an alignment between the words of the candidate and the reference:
• Phase I: list all possible unigram mappings based on matches
• Phase II: select the best possible alignment, i.e. the one with the least number of alignment crosses/overlaps

Repeat with progressively looser matching strategies: exact match, then stemmed forms, then synonyms, etc.

Reference 1: The intelligent and excited boy jumped
Reference 2: The excited and intelligent guy pounced
Candidate:   The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean
• Penalty: group the matched unigrams into 'as many chunks' as needed, where a chunk is a contiguous, identically ordered run of matches
  Reference: The bright boy sits on the black bench
  Candidate: The intelligent guy sat on the dark bench
• More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
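The slide leaves the formula implicit; as a reference point, the original formulation in [7] combines these pieces as below (a sketch, using the paper's standard constants):

```python
def meteor_score(matches: int, cand_len: int, ref_len: int, chunks: int) -> float:
    """METEOR as in Banerjee & Lavie [7]: a recall-weighted harmonic mean of
    unigram precision and recall, discounted by a fragmentation penalty."""
    if matches == 0:
        return 0.0
    p = matches / cand_len                   # unigram precision
    r = matches / ref_len                    # unigram recall
    f_mean = 10 * p * r / (r + 9 * p)        # recall weighted 9:1 over precision
    penalty = 0.5 * (chunks / matches) ** 3  # fewer, longer chunks -> smaller penalty
    return f_mean * (1 - penalty)

# Same matches, different fragmentation:
print(meteor_score(8, 9, 9, 1))   # one contiguous chunk: ~0.888
print(meteor_score(8, 9, 9, 4))   # four chunks: ~0.833
```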
METEOR vs BLEU

|                               | METEOR                                                                        | BLEU                                                            |
|-------------------------------|-------------------------------------------------------------------------------|-----------------------------------------------------------------|
| Handling incorrect words      | Alignment chunks; matching can be done using different techniques (adaptable) | N-gram mismatch                                                 |
| Handling incorrect word order | Chunks may be ordered in any manner; METEOR does not capture this             | N-gram mismatch                                                 |
| Handling recall               | The idea of alignment incorporates missing-word handling                      | Precision cannot detect 'missing' words; hence brevity penalty  |
NIST Metric Introduction [8]
• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., the rarer ones)
  – Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
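The formula image on this slide does not survive extraction. Per Doddington [8], the heart of the metric is the information weight of each n-gram, estimated from the reference data; a sketch:

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def nist_info_weights(ref_tokens: List[str], max_n: int = 5) -> Dict[Tuple[str, ...], float]:
    """NIST n-gram information weight:
    info(w1..wn) = log2( count(w1..w_{n-1}) / count(w1..wn) ).
    The rarer an n-gram is, given its (n-1)-gram prefix, the more a match
    of it is worth; frequent function-word n-grams get low weight."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(ref_tokens) - n + 1):
            counts[tuple(ref_tokens[i:i + n])] += 1
    info = {}
    for gram, c in counts.items():
        prefix = counts[gram[:-1]] if len(gram) > 1 else len(ref_tokens)
        info[gram] = math.log2(prefix / c)
    return info
```

The full score then sums the information of co-occurring n-grams, normalizes by the number of system n-grams, and applies NIST's own brevity factor based on Lsys and Lref.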
NIST vs BLEU

|                               | NIST                                                                     | BLEU                                                            |
|-------------------------------|--------------------------------------------------------------------------|-----------------------------------------------------------------|
| Handling incorrect words      | Information gain: an additional level of 'matching'                      | N-gram mismatch                                                 |
| Handling incorrect word order | No additional provision; n-gram mismatch (weighted by information gain)  | N-gram mismatch                                                 |
| Handling recall               | Brevity penalty factor based on Lsys and Lref                            | Precision cannot detect 'missing' words; hence brevity penalty  |
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment' for interpretability
  – What does a BLEU score of 0.006 mean?
• Comparison across systems is easier
GTM scores: Precision and Recall
• MMS: Maximum Match Size, the size of a maximum matching between candidate and reference that rewards longer contiguous runs of matched words
• Precision = MMS / |candidate|; Recall = MMS / |reference|
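A sketch of this idea. Note the hedge: Turian et al. [10] compute the match size via an exact maximum-matching optimization; the version below uses a greedy longest-run heuristic, which is a simplification of mine, not the paper's algorithm.

```python
import difflib
from typing import List

def max_match_size(cand: List[str], ref: List[str], e: float = 1.0) -> float:
    """Greedy approximation of GTM's maximum match size (MMS): repeatedly
    remove the longest common contiguous run, then score
    (sum of run_len**e) ** (1/e). With e = 1 this reduces to unigram
    overlap; e > 1 rewards longer contiguous runs."""
    cand, ref = list(cand), list(ref)
    runs = []
    while True:
        m = difflib.SequenceMatcher(None, cand, ref).find_longest_match(
            0, len(cand), 0, len(ref))
        if m.size == 0:
            break
        runs.append(m.size)
        del cand[m.a:m.a + m.size]
        del ref[m.b:m.b + m.size]
    return sum(r ** e for r in runs) ** (1 / e) if runs else 0.0

def gtm_f1(candidate: str, reference: str, e: float = 1.0) -> float:
    c, r = candidate.split(), reference.split()
    mms = max_match_size(c, r, e)
    if mms == 0:
        return 0.0
    prec, rec = mms / len(c), mms / len(r)
    return 2 * prec * rec / (prec + rec)
```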
GTM vs BLEU

|                               | GTM                                   | BLEU                                                            |
|-------------------------------|---------------------------------------|-----------------------------------------------------------------|
| Handling incorrect words      | Precision based on maximum match size | N-gram mismatch                                                 |
| Handling incorrect word order | By considering maximum runs           | N-gram mismatch                                                 |
| Handling recall               | Recall based on maximum match size    | Precision cannot detect 'missing' words; hence brevity penalty  |
Entailment-based evaluation [12]
Why entailment?

E: I am feeling hungry
H (Ref): मझ भख लग रह ह
mujhe bhookh lag rahi hai
Candidate: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon

BLEU score: 0, yet the candidate is clearly entailed by the reference translation.
Entailment-based evaluation [12]
Evaluating MT output for adequacy is an entailment task:
a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system

1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
RTE for MT evaluation

The same pipeline (Analysis → Alignment → Entailment) is applied to candidate/reference pairs.

Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
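The Stanford RTE system itself is not something we can script here; as an illustrative stand-in, an off-the-shelf NLI model can approximate the same bidirectional check. A sketch, assuming the publicly available roberta-large-mnli checkpoint (my choice of model, not the one used in [12]):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"   # label order: 0 contradiction, 1 neutral, 2 entailment
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entail_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis."""
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.softmax(dim=-1)[0, 2].item()

def adequacy(candidate: str, reference: str) -> float:
    """Bidirectional check: entailment must hold both ways for full
    adequacy, so score by the weaker direction."""
    return min(entail_prob(reference, candidate),
               entail_prob(candidate, reference))
```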
Experimental setup: single reference translation; two MT dataset families with English as the target language.

NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset

ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental setup

Four scoring systems are compared: BLEU-4, MT, RTE, and RTE + MT.
• Experiment 1: individual corpora
• Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by
  – considering the constituents of the formula
  – calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: errors in words based on word order; also PER
• TER: based on the edits required to reach the reference translation; also Human TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Typological evaluation
bull Use a test suite of examplesbull Ensure all relevant phenomena to be tested
are covered
bull Specific-to-language phenomenaFor example हर आख वाला लड़का म कराया
hari aankhon-wala ladka muskurayagreen eyes-with boy smiledThe boy with green eyes smiled
Declarative evaluation
bull Assign scores to specific qualities of outputndash Intelligibility How good the output is as a well-
formed target language entityndash Accuracy How good the output is in terms of
preserving content of the source text
For example I am attending a lectureम एक या यान बठा ह Main ek vyaakhyan baitha hoonI a lecture sit (Present-first person)I sit a lecture Accurate but not intelligibleम या यान ह Main vyakhyan hoonI lecture amI am lecture Intelligible but not accurate
Evaluation bottleneck
bull Typological evaluation is time-consumingbull Operational evaluation needs accurate
modeling of cost-benefit
bull Automatic MT evaluation DeclarativeBLEU Bilingual Evaluation Understudy
Deriving BLEU Deriving BLEU [2][2]
Incorporating PrecisionIncorporating Recall
How is translation performance measured
The closer a machine translation is to a professional human translation the better it is
bull A corpus of good quality human reference translations
bull A numerical ldquotranslation closenessrdquo metric
Preliminaries
bull Candidate Translation(s) Translation returned by an MT system
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
Goal of BLEU To correlate with human judgment
Formulating BLEU (Step 1) PrecisionI had lunch now
Candidate 1 मन अब खाना खाया
maine ab khana khaya matching unigrams 3I now food ate matching bigrams 1I ate food now
Candidate 2 मन अभी लच एटmaine abhi lunch ate matching unigrams 2I now lunch ate
I ate lunch(OOV) now(OOV) matching bigrams 1Unigram precision Candidate 1 34 = 075 Candidate 2 24 = 05Similarly bigram precision Candidate 1 033 Candidate 2 = 033
Reference 1 मन अभी खाना खायाmaine abhi khana khayaI now food ateI ate food now
Reference 2 मन अभी भोजन कयाmaine abhi bhojan kiyaa
I now meal didI did meal now
Precision Not good enough
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 1 मर तरा स र छाया matching unigram 3
mere tera suroor chhaayamy your spell castYour spell cast my
Candidate 2 तरा तरा तरा स र matching unigrams 4tera tera tera surooryour your your spell
Unigram precision Candidate 1 34 = 075 Candidate 2 44 = 1
Formulating BLEU (Step 2) Modified Precision
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spell
bull matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Modified n-gram precision
For entire test corpus for a given n
n-gram Matching n-grams in C
n-gramrsquo All n-grams in CModified precision for n-grams Overall candidates of
test corpus
Formula from [2]
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
• Change in lexical category

E: I am feeling hungry.
H (reference): मझ भख लग रह ह
  mujhe bhookh lag rahi hai
  to-me hunger feeling is
I (indicative): म भखा महसस कर रहा ह
  main bhookha mehsoos kar raha hoon
  I hungry feel doing am
Unigram precision: 0; Bigram precision: 0; BLEU score: 0

E: I am sweating.
H: मझ पसीना आ रहा ह
  mujhe paseena aa raha hai
  to-me sweat coming is
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस
  Aamir khaan ki dus sarvottam performances
  Aamir Khan-of ten best performances
I: दस सव तम आ मर ख़ान पफॉम सस
  Dus sarvottam aamir khaan performances
  Ten best Aamir Khan performances
Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences
• 'maamaa'/'taau': distinct Hindi kinship terms that both map to English 'uncle'

E: Food, clothing and shelter are a man's basic needs.
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
  roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
  bread clothing and house a man-of basic needs are
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह
  khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
  food clothing and shelter a man-of basic needs are
Unigram precision: 8/10; Bigram precision: 6/9
Pleonastic divergence
• Words with no semantic content in the target language

E: It is raining.
H: बा रश हो रह ह
  baarish ho rahi hai
  rain happening is
I: यह बा रश हो रह ह
  yeh baarish ho rahi hai
  it rain happening is
Unigram precision: 4/5; Bigram precision: 3/4

E: One should not trouble the weak.
H: दबल को परशान नह करना चा हए
  durbalon ko pareshaan nahi karna chahiye
  to-weak trouble not do should
  (Should not trouble the weak)
Stylistic differences
• Typical styles in a language

E: The Lok Sabha has 545 members.
H: लोक सभा म 545 सद य ह
  lok sabha mein 545 sadasya hain
  Lok Sabha in 545 members are
I: लोक सभा क पास 545 सद य ह
  lok sabha ke paas 545 sadasya hain
  Lok Sabha has/near 545 members are
Unigram precision: 5/7; Bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses

E: I purchased a bat.
H (reference): मन एक ब ला खर दा
  maine ek ballaa kharidaa
  I a cricket-bat bought
I: मन एक चमगादड़ खर दा
  maine ek chamgaadaD kharidaa
  I a bat (the mammal) bought: wrong sense

E: The thieves were held.
H: चोर को गर तार कया
  choron ko giraftaar kiyaa
  thieves arrest done (The thieves were arrested)
I1: चोर को पकड़ा
  choron ko pakdaa
  thieves caught (The thieves were caught): acceptable synonym, yet penalized
I2: चोर को आयोिजत कया
  choron ko aayojit kiya
  thieves organized done (The thieves were organized): wrong sense
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  - Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating specific language divergences (a sketch of such a relaxed match follows below)
  - Limiting the use of BLEU [3][4]: do not use BLEU for radically different systems; evaluate on the basis of the nature of the indicative translation
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
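As a sketch of what re-defining a match could look like, using NLTK's Porter stemmer and WordNet; this is one possible realization of the idea, not the specific proposal in [4]:

```python
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet  # requires a one-time nltk.download('wordnet')

stem = PorterStemmer().stem

def relaxed_match(cand_word, ref_word):
    """Unigram match that also accepts shared stems or WordNet synonyms."""
    if stem(cand_word) == stem(ref_word):
        return True
    cand_lemmas = {lemma.name() for synset in wordnet.synsets(cand_word)
                   for lemma in synset.lemmas()}
    return ref_word in cand_lemmas

print(relaxed_match("jumped", "jump"))      # True, via the shared stem
print(relaxed_match("car", "automobile"))   # True, via WordNet synonymy
```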
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics

Presented as a part of CS712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
(Image source: www.jaishreeganesha.com)
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation
  - BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation

Preliminaries
• Candidate Translation(s): translation returned by an MT system
  - Also called hypothesis translation
• Reference Translation(s): 'perfect' translation by humans
• Each metric is compared with BLEU on:
  - Handling incorrect words
  - Handling incorrect word order
  - Handling recall
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy (is the meaning translated correctly?):
  5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency (is the sentence grammatically valid?):
  5 = Flawless English, 4 = Good English, 3 = Non-native English,
  2 = Disfluent English, 1 = Incomprehensible

म एक या यान बठा ह
  Main ek vyaakhyan baitha hoon
  I a lecture sit (present, first person)
  "I sit a lecture": adequate but not fluent

म या यान ह
  Main vyakhyan hoon
  I lecture am
  "I am lecture": fluent but not adequate
Automatic Evaluation
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: count the edits required to change a hypothesis translation into a reference translation

E: Prime Minister of India will address the nation today.
Reference translation: भारत क धान-म ी आज रा को सबो धत करग
  Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: धान-म ी भारत क आज को अ स करग आज
  Pradhaan-mantri bhaarat ke aaj ko address karenge aaj
Edits: one shift (pradhaan-mantri), one deletion (raashtra), one substitution (sambodhit -> address), one insertion (the extra aaj)

Formula for TER:
  TER = #Edits / Average number of reference words
• The 'distance' of a shift is not incorporated into its cost
• Mis-capitalization is also considered an error

For the example above: TER = 4 / 8 = 0.5
HTER: Human-targeted TER
• A semi-supervised technique
• Human annotators make a new reference for each translation, based on the system output
  - They must preserve fluency and adequacy while generating the closest possible target translation
TER vs. BLEU
• Handling incorrect words - TER: substitution; BLEU: n-gram mismatch
• Handling incorrect word order - TER: shift, or delete + insert, incorporates this error; BLEU: n-gram mismatch
• Handling recall - TER: missed words become deleted words (TER = #Edits / Avg. number of reference words); BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions needed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant, PER (position-independent word error rate), also exists

WER: Example
• The order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors (a sketch follows below)

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
Number of errors: 3 (insert 'like', substitute 'seems' -> 'looks' and 'right' -> 'correct')
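A minimal dynamic-programming sketch of word-level WER (the standard Levenshtein recurrence, applied to the example above):

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance (dynamic programming)."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = min edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                     # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                     # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)

print(wer("This looks like the correct sentence",
          "This seems the right sentence"))   # 3 edits / 6 words = 0.5
```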
WER vs. BLEU
• Handling incorrect words - WER: substitution; BLEU: n-gram mismatch
• Handling incorrect word order - WER: delete + insert; BLEU: n-gram mismatch
• Handling recall - WER: missed words become deleted words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty

WER vs. TER
• WER: a shift is not taken into account (it costs a delete plus an insert); TER: a shift is incorporated, though its cost is not weighted by distance
• WER: the intuition is based on errors in unigram translation; TER: the intuition is based on the edits a human would require (TER = #Edits / Avg. number of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S

ROUGE-N
ROUGE-N incorporates recall.
E: Will BLEU be able to understand the quality of long sentences?
Reference translation: या ल लब वा य क गणव ता को समझ पाएगा
  Kya bloo lambe vaakya ki guNvatta ko samajh paaega
Candidate translation: लब वा य
  Lambe vaakya
ROUGE-2: 1/8 (one matching bigram out of the eight in the reference)
Modified bigram precision: 1
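A simplified ROUGE-N sketch for this example (real ROUGE also clips repeated n-grams; that detail is omitted here):

```python
def rouge_n(candidate, reference, n=2):
    """Simplified ROUGE-N: n-gram recall of the candidate against one reference."""
    ngrams = lambda words: list(zip(*(words[i:] for i in range(n))))
    ref_ngrams = ngrams(reference.split())
    cand_ngrams = set(ngrams(candidate.split()))
    return sum(g in cand_ngrams for g in ref_ngrams) / len(ref_ngrams)

print(rouge_n("lambe vaakya",
              "kya bloo lambe vaakya ki gunvatta ko samajh paaega"))
# 0.125 = 1/8: recall is low, although the candidate's bigram precision is 1
```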
Other ROUGEs
• ROUGE-L
  - Considers the longest common subsequence
• ROUGE-W
  - Weighted ROUGE-L: all common subsequences are considered, with weights based on length
• ROUGE-S
  - Precision/recall over matching skip-bigrams
ROUGE (suite of metrics) vs. BLEU
• Handling incorrect words - ROUGE: skip-bigrams, ROUGE-N; BLEU: n-gram mismatch
• Handling incorrect word order - ROUGE: longest common subsequence; BLEU: n-gram mismatch
• Handling recall - ROUGE: ROUGE-N incorporates missing words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy

METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process
Phase I: Find an alignment between words
  - List all possible mappings based on matches
Phase II: Select the best possible alignment
  - Choose the alignment with the least number of alignment crosses/overlaps
Repeat with different matching strategies:
  - Incorporate stemmers
  - Consider synonyms, etc.

Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced / The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined into a recall-weighted harmonic mean
• Penalty: find 'as many chunks' as match; fewer, longer chunks mean the word order is closer to the reference

Reference: The bright boy sits on the black bench
Candidate: The intelligent guy sat on the dark bench

More accurate -> fewer chunks -> smaller penalty
Less accurate -> more chunks -> larger penalty
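The published METEOR score (Banerjee and Lavie, 2005) combines these pieces as below; the match and chunk counts for the bench example are filled in by hand here, using exact matches only:

```python
def meteor(matches, cand_len, ref_len, chunks):
    """METEOR: recall-weighted harmonic mean of unigram precision and
    recall, reduced by a fragmentation penalty based on the chunk count."""
    p, r = matches / cand_len, matches / ref_len
    f_mean = 10 * p * r / (r + 9 * p)        # recall weighted 9 : 1
    penalty = 0.5 * (chunks / matches) ** 3  # more chunks => bigger penalty
    return f_mean * (1 - penalty)

# 'The intelligent guy sat on the dark bench' vs. the reference
# 'The bright boy sits on the black bench': 4 exact unigram matches
# ('The', 'on', 'the', 'bench') falling into 3 chunks.
print(meteor(matches=4, cand_len=8, ref_len=8, chunks=3))  # ~0.39
```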
METEOR vs. BLEU
• Handling incorrect words - METEOR: alignment chunks; matching can be done using different techniques, so it is adaptable; BLEU: n-gram mismatch
• Handling incorrect word order - METEOR: chunks may be ordered in any manner; METEOR does not capture this; BLEU: n-gram mismatch
• Handling recall - METEOR: the idea of alignment incorporates missing-word handling; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
NIST metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones)
  - Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given

NIST metric: The information weight
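The formula slide itself did not survive the slide-to-text conversion. For reference, NIST's information weight is usually written Info(w1..wn) = log2(count(w1..wn-1) / count(w1..wn)), with counts taken over the reference data. A sketch with invented counts:

```python
import math

def nist_info(counts, ngram):
    """NIST information weight: how surprising the n-gram is given its
    (n-1)-gram prefix; the empty prefix () holds the total word count."""
    return math.log2(counts[ngram[:-1]] / counts[ngram])

# Hypothetical corpus counts, chosen only to illustrate the weighting.
counts = {(): 10_000, ("the",): 600, ("amendment",): 4}
print(nist_info(counts, ("the",)))        # ~4.06 bits
print(nist_info(counts, ("amendment",)))  # ~11.29 bits: rarer => heavier
```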
NIST vs. BLEU
• Handling incorrect words - NIST: information gain, an additional level of 'matching'; BLEU: n-gram mismatch
• Handling incorrect word order - NIST: no additional provision; n-gram mismatch (based on information gain); BLEU: n-gram mismatch
• Handling recall - NIST: brevity penalty factor based on L_sys and L_ref; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  - What does a BLEU score of 0.006 mean?
• Comparison across systems is easier

GTM scores: Precision and Recall
• Both are defined in terms of the MMS: Maximum Match Size (a sketch follows below)
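Given an MMS value, the GTM score is a plain F-measure; a sketch (the MMS computation itself, which searches for the best matching of word runs, is omitted, and the numbers below are invented):

```python
def gtm_fscore(mms, cand_len, ref_len):
    """GTM F-measure: precision = MMS/|candidate|, recall = MMS/|reference|."""
    p, r = mms / cand_len, mms / ref_len
    return 2 * p * r / (p + r)

# E.g., a maximum match size of 5 against a 7-word candidate
# and an 8-word reference:
print(gtm_fscore(mms=5, cand_len=7, ref_len=8))  # ~0.67
```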
GTM vs. BLEU
• Handling incorrect words - GTM: precision based on the Maximum Match Size; BLEU: n-gram mismatch
• Handling incorrect word order - GTM: by considering maximum runs; BLEU: n-gram mismatch
• Handling recall - GTM: recall based on the Maximum Match Size; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry.
H (reference): मझ भख लग रह ह
  mujhe bhookh lag rahi hai
Candidate: म भखा महसस कर रहा ह
  main bhookha mehsoos kar raha hoon
BLEU score: 0, yet the candidate is clearly entailed by the reference translation.
Entailment-based evaluation [12]
• Evaluation of MT output for adequacy is an entailment task
• A candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa
The Stanford RTE system
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether the pair is an entailment or not

RTE for MT evaluation
The same three-stage pipeline (analysis, alignment, entailment prediction) is run on candidate-reference pairs.
Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
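The directionality requirement is simple to state in code; here `entails` stands for any RTE classifier with a hypothetical interface `entails(text, hypothesis) -> bool`:

```python
def fully_adequate(entails, candidate, reference):
    """MT adequacy check: unlike standard (asymmetric) RTE,
    entailment must hold in both directions."""
    return entails(candidate, reference) and entails(reference, candidate)

# Usage: fully_adequate(my_rte_system, system_output, reference_translation)
```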
Experimental setup
• Single reference translation
• Two MT dataset collections with English as the target language:
  - NIST MT datasets:
    NIST06: the NIST MT-06 Arabic-to-English dataset
    NIST08A: the NIST MT-08 Arabic-to-English dataset
    NIST08C: the NIST MT-08 Chinese-to-English dataset
    NIST08U: the NIST MT-08 Urdu-to-English dataset
  - ACL SMT datasets:
    SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
    SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
    SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
    SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental setup (contd.)
• Four systems are compared: BLEU-4, MT, RTE, and RTE + MT
• Experiment 1: individual corpora
• Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
  - Considering the constituents of the formula
  - Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on the Maximum Match Size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words based on word order; also PER
• TER: based on the edits required to obtain the reference translation; also Human TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM research report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of Bleu in machine translation research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries, 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology, 2008.
Declarative evaluation
bull Assign scores to specific qualities of outputndash Intelligibility How good the output is as a well-
formed target language entityndash Accuracy How good the output is in terms of
preserving content of the source text
For example I am attending a lectureम एक या यान बठा ह Main ek vyaakhyan baitha hoonI a lecture sit (Present-first person)I sit a lecture Accurate but not intelligibleम या यान ह Main vyakhyan hoonI lecture amI am lecture Intelligible but not accurate
Evaluation bottleneck
bull Typological evaluation is time-consumingbull Operational evaluation needs accurate
modeling of cost-benefit
bull Automatic MT evaluation DeclarativeBLEU Bilingual Evaluation Understudy
Deriving BLEU Deriving BLEU [2][2]
Incorporating PrecisionIncorporating Recall
How is translation performance measured
The closer a machine translation is to a professional human translation the better it is
bull A corpus of good quality human reference translations
bull A numerical ldquotranslation closenessrdquo metric
Preliminaries
bull Candidate Translation(s) Translation returned by an MT system
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
Goal of BLEU To correlate with human judgment
Formulating BLEU (Step 1) PrecisionI had lunch now
Candidate 1 मन अब खाना खाया
maine ab khana khaya matching unigrams 3I now food ate matching bigrams 1I ate food now
Candidate 2 मन अभी लच एटmaine abhi lunch ate matching unigrams 2I now lunch ate
I ate lunch(OOV) now(OOV) matching bigrams 1Unigram precision Candidate 1 34 = 075 Candidate 2 24 = 05Similarly bigram precision Candidate 1 033 Candidate 2 = 033
Reference 1 मन अभी खाना खायाmaine abhi khana khayaI now food ateI ate food now
Reference 2 मन अभी भोजन कयाmaine abhi bhojan kiyaa
I now meal didI did meal now
Precision Not good enough
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 1 मर तरा स र छाया matching unigram 3
mere tera suroor chhaayamy your spell castYour spell cast my
Candidate 2 तरा तरा तरा स र matching unigrams 4tera tera tera surooryour your your spell
Unigram precision Candidate 1 34 = 075 Candidate 2 44 = 1
Formulating BLEU (Step 2) Modified Precision
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spell
bull matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Modified n-gram precision
For entire test corpus for a given n
n-gram Matching n-grams in C
n-gramrsquo All n-grams in CModified precision for n-grams Overall candidates of
test corpus
Formula from [2]
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Evaluation bottleneck
bull Typological evaluation is time-consumingbull Operational evaluation needs accurate
modeling of cost-benefit
bull Automatic MT evaluation DeclarativeBLEU Bilingual Evaluation Understudy
Deriving BLEU Deriving BLEU [2][2]
Incorporating PrecisionIncorporating Recall
How is translation performance measured
The closer a machine translation is to a professional human translation the better it is
bull A corpus of good quality human reference translations
bull A numerical ldquotranslation closenessrdquo metric
Preliminaries
bull Candidate Translation(s) Translation returned by an MT system
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
Goal of BLEU To correlate with human judgment
Formulating BLEU (Step 1) PrecisionI had lunch now
Candidate 1 मन अब खाना खाया
maine ab khana khaya matching unigrams 3I now food ate matching bigrams 1I ate food now
Candidate 2 मन अभी लच एटmaine abhi lunch ate matching unigrams 2I now lunch ate
I ate lunch(OOV) now(OOV) matching bigrams 1Unigram precision Candidate 1 34 = 075 Candidate 2 24 = 05Similarly bigram precision Candidate 1 033 Candidate 2 = 033
Reference 1 मन अभी खाना खायाmaine abhi khana khayaI now food ateI ate food now
Reference 2 मन अभी भोजन कयाmaine abhi bhojan kiyaa
I now meal didI did meal now
Precision Not good enough
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 1 मर तरा स र छाया matching unigram 3
mere tera suroor chhaayamy your spell castYour spell cast my
Candidate 2 तरा तरा तरा स र matching unigrams 4tera tera tera surooryour your your spell
Unigram precision Candidate 1 34 = 075 Candidate 2 44 = 1
Formulating BLEU (Step 2) Modified Precision
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spell
bull matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Modified n-gram precision
For entire test corpus for a given n
n-gram Matching n-grams in C
n-gramrsquo All n-grams in CModified precision for n-grams Overall candidates of
test corpus
Formula from [2]
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment: Results

• Monolingual speakers: correlation coefficient 0.99
• Bilingual speakers: correlation coefficient 0.96
(BLEU and human evaluation for S2 and S3: graph from [2])

Comparison of normalized values (graph from [2]):
• High correlation between the monolingual group and the BLEU score
• The bilingual group was lenient on 'fluency' for H1
• The demarcation between S1-S3 and H1-H2 is captured by BLEU
Conclusion

• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by
  – Considering the constituents of the formula
  – Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Part II: Shortcomings of BLEU
(continued from 7th March 2013)

Presented by (as a part of CS 712):
Aditya Joshi, Kashyap Popat, Shubham Gautam
IIT Bombay

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
BLEU score revisited
Precision → Modified n-gram precision
Recall → Brevity Penalty
(Formula from [2])

Outline
Part I:
• Evaluation
• Formulating the BLEU metric
• Understanding the BLEU formula
Part II:
• Shortcomings of BLEU in general
• Shortcomings in the context of English-Hindi MT
Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of Bleu in Machine Translation Research. European ACL (EACL) 2006, 2006.

R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[A domesticated bird wants to learn to fly]

"I can do this. I just have to work out the physics. I have quadrated my vector angles. I have adjusted for wind shear.
This is it. Let's fly. Just keep it simple: thrust, lift, drag and wait. Thrust, lift, drag, wait.
[speeds up to get to the edge of the table]
Thrust, lift, drag, wai... wai... wait!"
[gets scared and tries to stop himself, but instead falls from the table]

- Blu the Macaw, lines from Rio (2011)
(Image source: www.fanpop.com)
Use of the BLEU metric [3]

• Evaluating incremental system changes
  – Neglecting actual examples
• Systems ranked on the basis of BLEU
• Does minimizing the error rate with respect to BLEU indeed guarantee translation improvements?
Criticisms of BLEU [4]

• Admits too much variation [3]
• Admits too little variation [4] [3]
• Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrases; drawing different items from the reference set

• BLEU relies on n-gram matching only
• It puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Recall the formula: BLEU = BP · exp( Σ wn · log pn ), with the Brevity Penalty (incorporating recall), modified precision pn, and weights wn set to 1/N.

Modified n-gram precision in BLEU [2]

• Clip the total count of each candidate word with its maximum reference count
• Count_clip(n-gram) = min(count, max_ref_count)

Reference: मझपर तरा स र छाया
mujhpar teraa suroor chhaaya
me-on your spell has-been-cast
Your spell has been cast on me

Candidate 2: तरा तरा तरा स र
tera tera tera suroor
your your your spell
Your your your spell

Matching unigrams: (तरा: min(3, 1) = 1), (स र: min(1, 1) = 1)
Modified unigram precision: 2/4 = 0.5
Permuting phrases [3]

• Reordering of unmatched phrases does not affect precision
• Bigram mismatch sites can be freely permuted
• Hence it is possible to randomly produce other hypothesis translations that have the same BLEU score

B1 | B2 | B3 | B4  →  B4 | B2 | B1 | B3
(vertical bars mark bigram mismatch sites)
Issues with precision (1/2)

E: The king and the queen went to the jungle to hunt.

Reference 1: राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gaye
King and queen to-jungle for-hunting went

Reference 2: राजा और उनक बीवी शकार करन जगल गय
raaja aur unki biwi shikaar karne jangal gaye
king and his wife to-do-hunting jungle went

Candidate: राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gaye
King and queen to-do-hunting to-jungle went
Matching bi-grams = 4/8

Candidate: राजा और रानी शकार करन जगल गय म
raaja aur raani shikaar karne jangal gaye mein
King and queen to-do-hunting jungle went to (grammatically incorrect)
Matching bi-grams = 4/8
Issues with precision (2/2)

E: The king and the queen went to the jungle to hunt.

Reference 1: राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gaye
King and queen to-jungle for-hunting went

Reference 2: राजा और उनक बीवी शकार करन जगल गय
raaja aur unki biwi shikaar karne jangal gaye
king and his wife to-do-hunting jungle went

Candidate: शकार करन जगल राजा और रानी म गय
shikaar karne jungle raaja aur raani mein gaye
to-do-hunting jungle king and queen in went (grammatically incorrect)
Matching bi-grams = 4/8

Both scrambled candidates receive the same bigram score as the sensible one.
Permuting phrases in general

• For 'b' bigram matches in a candidate translation of length 'k', there are (k - b)! possible ways to generate similarly scored items using only the words in this translation.
• In our example: (8 - 4)! = 24 candidate translations.

Derivation: in a sentence of length k,
total bigrams = k - 1
matched bigrams = b
number of mismatched bigrams = k - 1 - b
number of matched chunks = (k - 1 - b) + 1 = k - b
These (k - b) chunks can be reordered in (k - b)! ways. A small demonstration follows.
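A constructed demonstration in Python (the toy English sentences are our own, not from [3]): gluing the same matched chunks together in two different orders leaves the clipped unigram and bigram precisions, and hence BLEU, unchanged.

```python
from collections import Counter

def clipped_precision(cand, ref, n):
    """Modified n-gram precision against a single reference."""
    grams = lambda t: Counter(tuple(t[i:i + n]) for i in range(len(t) - n + 1))
    c, r = grams(cand), grams(ref)
    return sum(min(v, r[g]) for g, v in c.items()) / sum(c.values())

ref = "john quickly ate the red apple yesterday".split()
# Same chunks ("john quickly" / "the red apple" / "ate" / "yesterday")
# in two different orders; every junction bigram is a mismatch in both
# orders, so only chunk-internal bigrams score.
cand_a = "john quickly yesterday the red apple ate".split()
cand_b = "the red apple ate john quickly yesterday".split()

for cand in (cand_a, cand_b):
    print(clipped_precision(cand, ref, 1), clipped_precision(cand, ref, 2))
# Both print 1.0 and 0.5: different orderings, identical precisions
# (and identical length, hence identical brevity penalty and BLEU).
```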
Permuting phrases: Evaluation

• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translation
Drawing different items from the reference set

• If two systems 'recall' x words each from a reference sentence:
  – Precision may remain the same
  – The brevity penalty remains the same
  – But the translation quality need not remain the same
Drawing different items: Example
(One may omit lexical items altogether.)

Reference: घर क मग मतलब दाल होती ह
Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like is
Chicken at home is like daal

Candidate 1: मग मतलब होती ह
Ki matlab jaisi hai
Of means like is
Unigram precision: 1; Bigram precision: 2/3

Candidate 2: घर क मग दाल
Ghar murgi daal hoti
House chicken daal is
Unigram precision: 1; Bigram precision: 2/3
Drawing different items: In general

• If there are x unigram matches and y bigram matches in both candidate C1 and candidate C2:
  – The BLEU score remains the same
  – Which words are actually matched is not accounted for
  – Synonyms also count as a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]

• NIST MT Evaluation exercise
• Failure to correlate with human judgment
  – In the 2005 exercise, the system ranked 1st in human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]

• The system with a high human judgment score is an outlier for BLEU.

Adequacy (graph in [3]):
Outlier included? | Pearson correlation (R²)
Yes               | 0.14
No                | 0.87

Fluency (graph in [3]):
Outlier included? | Pearson correlation (R²)
Yes               | 0.002
No                | 0.742
Failures in practice: Evaluation 2 (1/2) [3]

• French-English
• System 1: rule-based system using Systran
• Systems 2, 3: SMT systems using two sets of Europarl data (28 million words per language)
• 300 test sentences
Failures in practice: Evaluation 2 (2/2) [3]

• BLEU underestimates human judgment (graph in [3])
Admits too much and too little variation [4]
Indicative translations; linguistic divergences and evaluation

Indicative translation

• Draft-quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi):

• Structural transfer
  – S V O to S O V
• Lexical transfer
  – Looking up the corresponding word
  – Adding gender, aspect, tense, etc. information

Indicative translation: Example
Rohan goes to school by bus

Structural transfer (SVO to SOV):
(Rohan) (to school) (by bus) (goes)
Lexical transfer (word translation):
(Rohan) (paathshaala) (bus) (jaana)
Lexical transfer (adding GNP/TAM information):
(Rohan) (paathshaala) (bus se) (jaata hai)

Indicative output: Rohan paathshaala bus se jaata hai
Natural Hindi order: Rohan bus se paathshaala jaata hai
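A toy Python sketch of these two steps (the reordering rule and the dictionary entries are illustrative assumptions, not a description of any real MT system):

```python
# Toy transfer-based 'indicative translation' for a fixed SVO pattern.
# Both the reordering rule and the lexicon below are illustrative assumptions.
lexicon = {
    "Rohan": "Rohan",
    "school": "paathshaala",
    "bus": "bus se",        # lexical transfer plus case marker (GNP/TAM info)
    "goes": "jaata hai",    # verb with agreement information added
}

def indicative_translation(subject, verb, obj, adjunct):
    # Structural transfer: S V O (+ adjunct) -> S O adjunct V
    sov_order = [subject, obj, adjunct, verb]
    # Lexical transfer: word-by-word dictionary lookup
    return " ".join(lexicon[w] for w in sov_order)

print(indicative_translation("Rohan", "goes", "school", "bus"))
# Rohan paathshaala bus se jaata hai
```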
Questions in the context of indicative translations [4]

• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is the "failure of MT" (defined using any measure) simply a "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences

• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors

Categorical divergence

• Change in lexical category

E: I am feeling hungry
H: मझ भख लग रह ह
mujhe bhookh lag rahi hai
to-me hunger feeling is
I: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon
I hungry feel doing am
Unigram precision: 0; Bigram precision: 0; BLEU score: 0

E: I am sweating
H: मझ पसीना आ रहा ह
Mujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds

E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस
Aamir khaan ki dus sarvottam performances
Aamir khan-of ten best performances
I: दस सव तम आ मर ख़ान पफॉम सस
Dus sarvottam aamir khaan performances
Ten best Aamir Khan performances

Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences

E: Food, clothing and shelter are a man's basic needs.
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
bread clothing and house a man of basic needs are
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
food clothing and shelter a man of basic needs are

Unigram precision: 8/10; Bigram precision: 6/9
• Similarly, distinct kinship terms such as 'maamaa'/'taau' both map to English 'uncle'
Pleonastic divergence

• Words with no semantic content in the target language

E: It is raining
H: बा रश हो रह ह
baarish ho rahi hai
rain happening is
I: यह बा रश हो रह ह
yeh baarish ho rahi hai
it rain happening is
Unigram precision: 4/5; Bigram precision: 3/4

E: One should not trouble the weak
H: दबल को परशान नह करना चा हए
durbalon ko pareshaan nahi karna chahiye
to-weak trouble not do should
Should not trouble the weak
Stylistic differences

• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hain
Lok Sabha in 545 members are
I: लोक सभा क पास 545 सद य ह
lok sabha ke paas 545 sadasya hain
Lok Sabha has/near 545 members are

Unigram precision: 5/7; Bigram precision: 3/6
WSD errors

• Synonyms vs. incorrect senses

E: I purchased a bat
H (reference): मन एक ब ला खर दा
maine ek ballaa kharidaa
I a cricket-bat bought
I bought a cricket bat
I: मन एक चमगादड़ खर दा
maine ek chamgaadaD kharidaa
I a bat (mammal) bought
I bought a bat (mammal)

E: The thieves were held
H: चोर को गर तार कया
choron ko giraftaar kiyaa
thieves arrest done
The thieves were arrested
I1: चोर को पकड़ा
choron ko pakdaa
thieves caught
The thieves were caught
I2: चोर को आयोिजत कया
choron ko aayojit kiya
thieves organized done
The thieves were organized
Evaluation: Precision

• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU

• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating specific language divergences
  – Limiting the use of BLEU [3] [4]: not using BLEU for radically different systems; evaluating on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU; use it only as an 'evaluation understudy'
Conclusion

• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics

Presented as a part of CS712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay

Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
(Source: www.jaishreeganesha.com)
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline

• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation

Preliminaries

• Candidate Translation(s): translation returned by an MT system
  – Also called hypothesis translation
• Reference Translation(s): 'perfect' translation by humans
• Each metric below is compared with BLEU on three criteria:
  – Handling incorrect words
  – Handling incorrect word order
  – Handling recall
Outline: Manual Evaluation

Manual evaluation [11]

Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency

• Evaluators use their own perception to rate
• Often the adequacy and fluency scores correlate, which is undesirable

Adequacy (is the meaning translated correctly?):
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency (is the sentence grammatically valid?):
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

म एक या यान बठा ह
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
I sit a lecture
→ Adequate but not fluent

म या यान ह
Main vyakhyan hoon
I lecture am
I am lecture
→ Fluent but not adequate
Outline: Automatic Evaluation

BLEU [2]

• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Outline: Automatic Evaluation (TER)

Translation Edit Rate (TER) [5]

• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation

E: Prime Minister of India will address the nation today

Reference translation: भारत क धान-म ी आज रा को सबो धत करग
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge

Candidate translation: धान-म ी भारत क आज को अ स करग आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj

Edits: 1 shift, 1 deletion, 1 substitution, 1 insertion

TER = number of edits / average number of reference words
• The 'distance' of a shift is not incorporated into its cost
• Mis-capitalization is also considered an error

Here: TER = 4/8 = 0.5
HTER: Human-targeted TER

• A semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
  – They must consider fluency and adequacy while generating the closest target translation
TER vs BLEU

• Handling incorrect words: TER uses substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: TER incorporates this error via a shift, or a delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in TER, missed words become deleted words (TER = edits / avg. number of reference words); BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline: Automatic Evaluation (WER)

Word Error Rate (WER) [9]

• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• A related measure: position-independent word error rate (PER)

WER: Example

• The order of words is important
• A dynamic programming-based implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
Number of errors: 3 (looks→seems, 'like' deleted, correct→right); WER = 3/6
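A minimal dynamic-programming sketch in Python (our own illustration) that reproduces the example's three errors:

```python
def word_error_rate(reference, hypothesis):
    """WER via word-level Levenshtein distance (dynamic programming)."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = min edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution / match
    return d[len(r)][len(h)] / len(r)

print(word_error_rate("This looks like the correct sentence",
                      "This seems the right sentence"))  # 3 errors / 6 words = 0.5
```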
WER vs BLEU

• Handling incorrect words: WER uses substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: WER uses delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty

WER vs TER

• Shifts: WER does not take shifts into account (a moved word costs a deletion plus an insertion); TER incorporates a shift as a single edit, though not weighted by distance
• Intuition: WER is based on word-level errors; TER is based on the edits a human would require (TER = edits / avg. number of reference words)
Outline: Automatic Evaluation (ROUGE)

ROUGE [6]

• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S

ROUGE-N

ROUGE-N incorporates recall. Will BLEU be able to understand the quality of long sentences?

Reference translation: या ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega

Candidate translation: लब वा य
Lambe vaakya

Modified n-gram precision: 1, yet ROUGE-N (n = 2) = 1/8: only one of the reference's eight bigrams is recalled.
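A minimal sketch of ROUGE-N as n-gram recall (Python; `rouge_n` is our own illustrative function, not the official ROUGE package), reproducing the 1/8 above:

```python
from collections import Counter

def rouge_n(reference, candidate, n):
    """ROUGE-N: n-gram recall of the candidate against the reference."""
    grams = lambda t: Counter(tuple(t[i:i + n]) for i in range(len(t) - n + 1))
    ref, cand = grams(reference.split()), grams(candidate.split())
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / sum(ref.values())   # denominator: reference n-grams

print(rouge_n("kya bloo lambe waakya ki guNvatta ko samajh paaega",
              "lambe waakya", 2))        # 1 matching bigram / 8 = 0.125
```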
Other ROUGEs

• ROUGE-L
  – Considers the longest common subsequence
• ROUGE-W
  – Weighted ROUGE-L: all common subsequences are considered, with weights based on length
• ROUGE-S
  – Precision/recall by matching skip bigrams

ROUGE (suite of metrics) vs BLEU

• Handling incorrect words: ROUGE uses skip bigrams and ROUGE-N; BLEU sees an n-gram mismatch
• Handling incorrect word order: ROUGE uses the longest common subsequence; BLEU sees an n-gram mismatch
• Handling recall: ROUGE-N incorporates missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline: Automatic Evaluation (METEOR)

METEOR [7]

• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy

METEOR: Criticisms of BLEU

• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process

Phase I: Find an alignment between words
• List all possible mappings based on matches
Phase II: Select the best possible alignment
• Select the alignment with the least number of 'alignment crosses/overlaps'
Repeat with different matching strategies:
• Incorporate stemmers
• Consider synonyms, etc.

Reference: The intelligent and excited boy jumped
Candidate: The intelligent and excited dude jumped
(The alignment illustration also uses: The excited and intelligent guy pounced)
METEOR: The score

• Using the unigram mappings, precision and recall are calculated and combined as a harmonic mean.
• Penalty: find 'as many chunks' as match:
The bright boy sits on the black bench
The intelligent guy sat on the dark bench
• More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
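In Banerjee and Lavie's formulation [7], the combination step is: F_mean = 10PR / (R + 9P), Penalty = 0.5 · (chunks / matched unigrams)^3, Score = F_mean · (1 - Penalty). A one-function Python sketch of that step, assuming the alignment phase has already produced the counts:

```python
def meteor_score(precision, recall, num_chunks, num_matches):
    """METEOR combination step, following Banerjee & Lavie (2005) [7].
    Assumes the alignment phase has produced unigram precision/recall
    and the chunk and match counts."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    f_mean = 10 * precision * recall / (recall + 9 * precision)  # recall-weighted harmonic mean
    penalty = 0.5 * (num_chunks / num_matches) ** 3              # fragmentation penalty
    return f_mean * (1 - penalty)
```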
METEOR vs BLEU

• Handling incorrect words: METEOR uses alignment chunks; matching can be done using different techniques (adaptable); BLEU sees an n-gram mismatch
• Handling incorrect word order: chunks may be ordered in any manner, and METEOR does not capture this; BLEU sees an n-gram mismatch
• Handling recall: METEOR's idea of alignment incorporates missing-word handling; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline: Automatic Evaluation (NIST)

NIST Metric: Introduction [8]

• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• It weights more heavily those n-grams that are more informative (i.e., rarer ones)
  – Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given

NIST Metric
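(The formula image on this slide did not survive extraction. For reference, the information weight in [8] is computed from reference-corpus counts as:

Info(w1 … wn) = log2( count(w1 … wn-1) / count(w1 … wn) )

so an n-gram that is rare given its (n-1)-gram prefix carries more information.)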
NIST vs BLEU

• Handling incorrect words: NIST uses information gain, an additional level of 'matching'; BLEU sees an n-gram mismatch
• Handling incorrect word order: NIST has no additional provision, just n-gram mismatch (weighted by information gain); BLEU sees an n-gram mismatch
• Handling recall: NIST uses a brevity penalty factor based on L_sys and L_ref; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline: Automatic Evaluation (GTM)

GTM [10]

• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  – What does a BLEU score of 0.006 mean?
• Comparison across systems is easier

GTM Scores: Precision and Recall

• MMS: Maximum Match Size
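(The formula image on this slide was lost. A sketch of the definitions, stated from memory of [10] and therefore to be checked against the paper: for candidate C and reference R,

precision = MMS(C, R) / |C|,  recall = MMS(C, R) / |R|,  F = 2PR / (P + R),

where MMS is the maximum match size between the two texts, with longer contiguous runs of matching words rewarded more than scattered single-word matches.)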
GTM vs BLEU

• Handling incorrect words: GTM uses precision based on maximum match size; BLEU sees an n-gram mismatch
• Handling incorrect word order: GTM handles this by considering maximum runs; BLEU sees an n-gram mismatch
• Handling recall: GTM uses recall based on maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline: Automatic Evaluation (Entailment-based MT evaluation)

Entailment-based evaluation [12]

Why entailment?
E: I am feeling hungry
H (Ref): मझ भख लग रह ह
mujhe bhookh lag rahi hai
Candidate: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon
BLEU score: 0, yet the candidate is clearly entailed by the reference translation.

• Evaluation of MT output for adequacy is an entailment task:
• A candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system

Pipeline: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)

RTE for MT evaluation

The same analysis → alignment → entailment pipeline is applied, with one change:

Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
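A schematic of the bidirectional check in Python (`entails` is a placeholder for any trained RTE/NLI classifier, not a real library call):

```python
def adequacy_by_entailment(candidate, reference, entails):
    """Bidirectional entailment check for adequacy, after [12].
    `entails(premise, hypothesis)` is a placeholder for any trained
    RTE/NLI classifier returning True/False -- not a real library API."""
    forward = entails(candidate, reference)    # candidate => reference
    backward = entails(reference, candidate)   # reference => candidate
    return forward and backward                # fully adequate only if both hold
```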
Experimental setup: single reference translation; two MT datasets with English as the target language.

NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset

ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset

Four systems are used: BLEU-4, MT, RTE, and RTE + MT.
Experiment 1: individual corpora. Experiment 2: combined corpora.
Conclusion: Formulation of BLEU (1/3)

• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by
  – Considering the constituents of the formula
  – Calculating BLEU for a dummy example
• Compared BLEU with human judgment

Conclusion: Shortcomings of BLEU (2/3)

• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)

• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: errors in words, based on word order; also PER
• TER: based on the edits required to reach the reference translation; also human-targeted TER
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation and vice versa
References

[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM research report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of Bleu in Machine Translation Research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA), 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-)evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Formulating BLEU (Step 1): Precision

Source sentence: I had lunch now

Candidate 1: मैंने अब खाना खाया
maine ab khana khaya (I now food ate / "I ate food now")
matching unigrams: 3, matching bigrams: 1

Candidate 2: मैंने अभी लंच एट
maine abhi lunch(OOV) ate(OOV) (I now lunch ate / "I ate lunch now")
matching unigrams: 2, matching bigrams: 1

Reference 1: मैंने अभी खाना खाया
maine abhi khana khaya (I now food ate / "I ate food now")

Reference 2: मैंने अभी भोजन किया
maine abhi bhojan kiyaa (I now meal did / "I had a meal now")

Unigram precision: Candidate 1: 3/4 = 0.75; Candidate 2: 2/4 = 0.5
Similarly, bigram precision: Candidate 1: 1/3 = 0.33; Candidate 2: 1/3 = 0.33
Precision: Not good enough

Reference: मुझपर तेरा सुरूर छाया
mujh-par tera suroor chhaaya (me-on your spell cast / "Your spell was cast on me")

Candidate 1: मेरे तेरा सुरूर छाया
mere tera suroor chhaaya (my your spell cast) ... matching unigrams: 3

Candidate 2: तेरा तेरा तेरा सुरूर
tera tera tera suroor (your your your spell) ... matching unigrams: 4

Unigram precision: Candidate 1: 3/4 = 0.75; Candidate 2: 4/4 = 1
The clearly worse translation gets the higher precision.
Formulating BLEU (Step 2): Modified Precision

• Clip the total count of each candidate word by its maximum reference count:
Count_clip(n-gram) = min(Count, Max_Ref_Count)

Reference: मुझपर तेरा सुरूर छाया
mujh-par tera suroor chhaaya (me-on your spell cast / "Your spell was cast on me")

Candidate 2: तेरा तेरा तेरा सुरूर
tera tera tera suroor (your your your spell)

• matching unigrams: (तेरा: min(3, 1) = 1), (सुरूर: min(1, 1) = 1)
Modified unigram precision: 2/4 = 0.5
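A minimal sketch of this clipping step, run on the transliterated words of the example above (a sketch only; it takes the maximum reference count over however many references are supplied, here just one):

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision: Count_clip = min(count, max_ref_count)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.split())
    max_ref = Counter()
    for ref in references:
        for gram, c in ngrams(ref.split()).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[gram]) for gram, c in cand.items())
    return clipped / sum(cand.values())

ref = "mujh-par tera suroor chhaaya"
print(modified_precision("tera tera tera suroor", [ref]))  # 2/4 = 0.5, not 1.0
```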
Modified n-gram precision

For the entire test corpus, for a given n:

p_n = [ Σ over candidates C, Σ over n-grams in C of Count_clip(n-gram) ] / [ Σ over candidates C′, Σ over n-grams′ in C′ of Count(n-gram′) ]

i.e., the clipped matching n-grams, summed over all candidates of the test corpus, divided by all candidate n-grams. (Formula from [2])
Calculating modified n-gram precision (1/2)

• 127 source sentences were translated by two human translators and three MT systems
• The translated sentences were evaluated against professional reference translations using modified n-gram precision

Calculating modified n-gram precision (2/2)

• Precision decays with increasing n
• The comparative ranking of the five systems is preserved across n
• Precision for different values of n must therefore be combined (Graph from [2])
Formulation of BLEU: Recap

• Precision cannot be used as-is
• Modified precision uses the 'clipped' word count
Recall for MT (1/2)

• Candidates shorter than references:

Reference: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा
kya blue lambe vaakya ki guNvatta ko samajh paaega
(will blue long sentence-of quality (case-marker) understand able (III-person, male, singular) / "Will Blue be able to understand the quality of a long sentence?")

Candidate: लंबे वाक्य
lambe vaakya (long sentence)

Modified unigram precision: 2/2 = 1
Modified bigram precision: 1/1 = 1
Precision is perfect, even though almost the whole reference is missing.
Recall for MT (2/2)

• Candidates longer than references:

Reference 1: मैंने खाना खाया
maine khaana khaaya (I food ate / "I ate food")

Reference 2: मैंने भोजन किया
maine bhojan kiyaa (I meal did / "I had a meal")

Candidate 1: मैंने खाना भोजन किया
maine khaana bhojan kiya (I food meal did / "I had food meal")
Modified unigram precision: 1

Candidate 2: मैंने खाना खाया
maine khaana khaaya (I food ate / "I ate food")
Modified unigram precision: 1
Formulating BLEU (Step 3): Incorporating recall

• Sentence length indicates the 'best match'
• Brevity penalty (BP):
– a multiplicative factor
– candidate translations that match reference translations in length must be ranked higher

Candidate 1: लंबे वाक्य (lambe vaakya)
Candidate 2: क्या ब्लू लंबे वाक्य की गुणवत्ता समझ पाएगा (kya blue lambe vaakya ki guNvatta samajh paaega)
Formulating BLEU (Step 3): Brevity Penalty

BP = e^(1 - x), where x = r/c, for c ≤ r
BP = 1 for c > r

r: reference sentence length
c: candidate sentence length
(Formula from [2]; graph of e^(1 - x) drawn using www.fooplot.com)

Why is BP = 1 for c > r, i.e., why does BP leave out longer translations?
Because translations longer than the reference are already penalized by modified precision; the earlier "candidates longer than references" example validates this claim. (Formula from [2])
BLEU score

BLEU = BP × exp( Σ_{n=1..N} w_n · log p_n )

Precision → modified n-gram precision p_n
Recall → brevity penalty BP
(Formula from [2])
Understanding BLEU
Dissecting the formula; using it on a news-headline example

Decay in precision

Why log p_n? To accommodate the decay in precision values with increasing n.
(Graph from [2]; formula from [2])
Dissecting the Formula

Claim: BLEU should lie between 0 and 1.
Reason: to intuitively satisfy "1 implies a perfect translation".
Understanding the constituents of the formula validates the claim:

BLEU = BP × exp( Σ_{n=1..N} w_n · log p_n )

• BP: brevity penalty
• p_n: modified precision
• w_n: set to 1/N
(Formula from [2])
Validation of range of BLEU

Let A = Σ_{n=1..N} w_n · log p_n. Then:
• p_n: between 0 and 1
• log p_n: between -∞ and 0
• A: between -∞ and 0
• e^A: between 0 and 1
• BP: between 0 and 1

Hence BLEU = BP × e^A lies between 0 and 1.
(Graphs of BP against r/c, and of p_n, log p_n, A and e^A, drawn using www.fooplot.com)
Calculating BLEU: An actual example

Ref: भावनात्मक एकीकरण लाने के लिए एक त्योहार
bhaavanaatmak ekikaran laane ke liye ek tyohaar
(emotional unity bring-to a festival / "A festival to bring emotional unity")

C1: लाने के लिए एक त्योहार भावनात्मक एकीकरण
laane ke liye ek tyohaar bhaavanaatmak ekikaran
(bring-to a festival emotional unity; invalid Hindi ordering)

C2: को एक उत्सव के बारे में लाना भावनात्मक एकीकरण
ko ek utsav ke baare mein laana bhaavanaatmak ekikaran
(for a festival-about to-bring emotional unity; invalid Hindi ordering)
Modified n-gram precision

Ref: भावनात्मक एकीकरण लाने के लिए एक त्योहार

C1: लाने के लिए एक त्योहार भावनात्मक एकीकरण (r = 7, c = 7)
n | Matching n-grams | Total n-grams | Modified n-gram precision
1 | 7 | 7 | 1
2 | 5 | 6 | 0.83

C2: को एक उत्सव के बारे में लाना भावनात्मक एकीकरण (r = 7, c = 9)
n | Matching n-grams | Total n-grams | Modified n-gram precision
1 | 4 | 9 | 0.44
2 | 1 | 8 | 0.125
Calculating the BLEU score (w_n = 1/N, N = 2, r = 7)

C1: c = 7, so BP = exp(1 - 7/7) = 1
n | w_n | p_n | log p_n | w_n · log p_n
1 | 1/2 | 1 | 0 | 0
2 | 1/2 | 0.83 | -0.186 | -0.093
Total: -0.093 → BLEU = 1 × e^(-0.093) ≈ 0.911

C2: c = 9 > r, so BP = 1
n | w_n | p_n | log p_n | w_n · log p_n
1 | 1/2 | 0.44 | -0.821 | -0.4105
2 | 1/2 | 0.125 | -2.07 | -1.035
Total: -1.445 → BLEU = 1 × e^(-1.445) ≈ 0.235
Hence the BLEU scores:

Ref: भावनात्मक एकीकरण लाने के लिए एक त्योहार
C1: लाने के लिए एक त्योहार भावनात्मक एकीकरण → 0.911
C2: को एक उत्सव के बारे में लाना भावनात्मक एकीकरण → 0.235
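The whole computation fits in a few lines. This is a minimal sentence-level BLEU sketch (single reference, uniform weights, no smoothing, so it assumes every p_n is non-zero), run on the transliterated example; the slide's 0.911/0.235 come from rounding p_n and log p_n first:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, N=2):
    cand, ref = candidate.split(), reference.split()
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    log_sum = 0.0
    for n in range(1, N + 1):
        cand_n, ref_n = ngram_counts(cand, n), ngram_counts(ref, n)
        clipped = sum(min(cnt, ref_n[g]) for g, cnt in cand_n.items())
        log_sum += (1 / N) * math.log(clipped / sum(cand_n.values()))
    return bp * math.exp(log_sum)

ref = "bhaavanaatmak ekikaran laane ke liye ek tyohaar"
print(round(bleu("laane ke liye ek tyohaar bhaavanaatmak ekikaran", ref), 3))   # ~0.913
print(round(bleu("ko ek utsav ke baare mein laana bhaavanaatmak ekikaran", ref), 3))  # ~0.236
```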
BLEU vs human judgment [2]

Setup:
• Source language: Chinese; target language: English
• Five systems perform the translation: 3 automatic MT systems and 2 human translators
• BLEU scores are obtained for each system
• Human judgment (on a scale of 5) is obtained for each system from two groups:
– Group 1: ten monolingual speakers of the target language (English)
– Group 2: ten bilingual speakers of Chinese and English
BLEU vs human judgment

• Monolingual speakers: correlation coefficient 0.99
• Bilingual speakers: correlation coefficient 0.96
(BLEU and human evaluation for S2 and S3; graph from [2])

Comparison of normalized values:
• High correlation between the monolingual group and the BLEU score
• The bilingual group was lenient on 'fluency' for H1
• The demarcation between S1-S3 and H1-H2 is captured by BLEU
(Graph from [2])
Conclusion

• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
– considering the constituents of the formula
– calculating BLEU for a dummy example
• Compared BLEU with human judgment
Part II: Shortcomings of BLEU
(continued from 7th March 2013)

Presented by (as a part of CS 712):
Aditya Joshi, Kashyap Popat, Shubham Gautam
IIT Bombay

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
BLEU score revisited

BLEU = BP × exp( Σ_{n=1..N} w_n · log p_n )
Precision → modified n-gram precision; recall → brevity penalty. (Formula from [2])
Outline

Part I:
• Evaluation
• Formulating the BLEU metric
• Understanding the BLEU formula
Part II:
• Shortcomings of BLEU in general
• Shortcomings in the context of English-Hindi MT
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]

"I can do this. I just have to work out the physics. I have quadrated my vector angles. I have adjusted for wind shear. This is it. Let's fly. Just keep it simple: thrust, lift, drag and wait. Thrust, lift, drag, wait... [speeds up to get to the end of the table] Thrust, lift, drag, wai-wai-wait!" [gets scared and tries to stop himself, but instead falls from the table]

- Blu the Macaw, lines from Rio (2011)
(Image source: www.fanpop.com)
Use of the BLEU metric [3]

• Evaluating incremental system changes, while neglecting actual examples
• Systems ranked on the basis of BLEU
• Does minimizing the error rate with respect to BLEU indeed guarantee translation improvements?
Criticisms of BLEU [4]

• Admits too much variation [3]
• Admits too little variation [3], [4]
• Poor correlation with human judgment [3]
Admits too much variation [3]

• Permuting phrases
• Drawing different items from the reference set

• BLEU relies on n-gram matching only
• It puts very few constraints on how n-gram matches can be drawn from multiple reference translations

(Recall the constituents of the formula: the brevity penalty incorporating recall, the modified precision p_n, and w_n set to 1/N.)
Modified n-gram precision in BLEU [2]

• Clip the total count of each candidate word by its maximum reference count:
Count_clip(n-gram) = min(Count, Max_Ref_Count)

Reference: मुझपर तेरा सुरूर छाया
mujhpar teraa suroor chhaaya (me-on your spell has-been-cast / "Your spell has been cast on me")

Candidate 2: तेरा तेरा तेरा सुरूर
tera tera tera suroor (your your your spell)

matching unigrams: (तेरा: min(3, 1) = 1), (सुरूर: min(1, 1) = 1)
Modified unigram precision: 2/4 = 0.5
Permuting phrases [3]

• Reordering the unmatched phrases does not affect precision
• Bigram mismatch sites can be freely permuted
• It is therefore possible to randomly produce other hypothesis translations that have the same BLEU score:

B1 ✗ B2 ✗ B3 ✗ B4  →  B4 ✗ B2 ✗ B1 ✗ B3
(B1..B4: runs of matching bigrams; ✗: bigram mismatch sites)
Issues with precision (1/2)

Source: The king and the queen went to the jungle to hunt.

Candidate 1: राजा और रानी शिकार करने जंगल में चले गये
raaja aur raani shikaar karne jungal mein chale gaye
(king and queen to-do-hunting to-jungle went)
Matching bigrams = 4/8

Reference 1: राजा और रानी जंगल को शिकार के लिए गये
raaja aur raani jangal ko shikaar ke liye gaye
(king and queen to-jungle for-hunting went)

Reference 2: राजा और उनकी बीवी शिकार करने जंगल गये
raaja aur unki biwi shikaar karne jangal gaye
(king and his wife to-do-hunting jungle went)

Candidate 2: राजा और रानी शिकार करने गये जंगल में
raaja aur raani shikaar karne gaye jungle mein
(king and queen to-do-hunting went jungle to; grammatically incorrect)
Matching bigrams = 4/8
Issues with precision (2/2)

Against the same references, an even more scrambled candidate:

Candidate 3: शिकार करने जंगल राजा और रानी में गये
shikaar karne jungle raaja aur raani mein gaye
(to-do-hunting jungle king and queen in went; grammatically incorrect)
Matching bigrams = 4/8

Fluent or scrambled, the candidates receive the same bigram precision.
Permuting phrases in general

• For b bigram matches in a candidate translation of length k, there are (k - b)! possible ways to generate similarly scored items using only the words in this translation.
• In our example: (8 - 4)! = 24 candidate translations.

Derivation: in a sentence of length k,
total bigrams = k - 1
matched bigrams = b
number of mismatched bigrams = k - 1 - b
number of matched chunks = (k - 1 - b) + 1 = k - b
These (k - b) chunks can be reordered in (k - b)! ways.
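A small illustration of the claim, under simplifying assumptions (a single reference rather than the slide's two, and bigram matches as the proxy for BLEU): the candidate below splits into k - b = 7 chunks, and all 7! = 5040 reorderings of those chunks keep the same bigram match count.

```python
import itertools
from collections import Counter

def bigram_matches(cand, ref):
    return sum((Counter(zip(cand, cand[1:])) & Counter(zip(ref, ref[1:]))).values())

ref  = "raaja aur raani jangal ko shikaar ke liye gaye".split()
cand = "raaja aur raani shikaar karne jungal mein chale gaye".split()

# Split the candidate into chunks at bigram-mismatch sites.
ref_bigrams = set(zip(ref, ref[1:]))
chunks, current = [], [cand[0]]
for a, b in zip(cand, cand[1:]):
    if (a, b) in ref_bigrams:
        current.append(b)
    else:
        chunks.append(current)
        current = [b]
chunks.append(current)

base = bigram_matches(cand, ref)
same = sum(1 for p in itertools.permutations(chunks)
           if bigram_matches([w for ch in p for w in ch], ref) == base)
print(len(chunks), base, same)  # 7 chunks, 2 matches, 5040 reorderings scoring alike
```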
Permuting phrases: Evaluation

• 2005 NIST evaluation data
• A candidate translation produced by an MT system (2nd best in the task) was permuted in exactly this way, yielding equally scored variants
Drawing different items from the reference set

• If two systems 'recall' x words each from a reference sentence:
– precision may remain the same
– the brevity penalty remains the same
– yet the translation quality need not remain the same
Drawing different items: Example

You may omit lexical items altogether.

Reference: घर की मुर्गी मतलब दाल जैसी होती है
ghar ki murgi matlab daal jaisi hoti hai
(house-of chicken means daal-like is / "Chicken at home is like daal")

Candidate 1: मुर्गी मतलब होती है
murgi matlab hoti hai (chicken means is)
Unigram precision: 1/1; bigram precision: 2/3

Candidate 2: घर की मुर्गी दाल
ghar ki murgi daal (house-of chicken daal)
Unigram precision: 1/1; bigram precision: 2/3
Drawing different items: In general

• If there are x unigram matches and y bigram matches in both candidate C1 and candidate C2:
– the BLEU score remains the same
– which words are actually matched is not accounted for
– synonyms also count as a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]

• NIST MT evaluation exercise
• Failure to correlate with human judgment: in the 2005 exercise, the system ranked 1st in human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency

Failures in practice: Evaluation 1 (2/2) [3]

• One system receives a high human judgment score but a poor BLEU score (the outlier). Pearson correlation (R²) between BLEU and human judgment, for the two judgment dimensions:

Outlier included? | R² | R²
Yes | 0.14 | 0.002
No | 0.87 | 0.742

(Graphs from [3])
Failures in practice: Evaluation 2 (1/2) [3]

• French-English translation
• System 1: a rule-based system using Systran
• Systems 2 and 3: SMT systems using two sets of Europarl data (28 million words per language)
• 300 test sentences

Failures in practice: Evaluation 2 (2/2) [3]

• BLEU underestimates human judgment of the rule-based system
(Graph from [3])
Admits too much and too little variation [4]

• Indicative translations
• Linguistic divergences & evaluation

Indicative translation

• Draft-quality translations
• Meant for assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi)

• Structural transfer: S V O → S O V
• Lexical transfer: looking up the corresponding word; adding gender, aspect, tense, etc.

Indicative translation: Example

Rohan goes to school by bus.
Structural transfer (SVO → SOV): (Rohan) (to school) (by bus) (goes)
Lexical transfer (word translation): (Rohan) (paathshaala) (bus) (jaana)
Lexical transfer (adding GNP/TAM information): (Rohan) (paathshaala) (bus se) (jaata hai)

Indicative output: Rohan paathshaala bus se jaata hai
Natural Hindi order: Rohan bus se paathshaala jaata hai
Questions in the context of indicative translations [4]

• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, when the system only produces indicative translations?
• Is the "failure of MT" (defined using any measure) simply a "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences
(E: English source; H: reference human translation; I: indicative MT output)

• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence

• A change in lexical category across the two languages

E: I am feeling hungry
H: मुझे भूख लग रही है
mujhe bhookh lag rahi hai (to-me hunger feeling is)
I: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon (I hungry feel doing am)

Unigram precision: 0; bigram precision: 0; BLEU score: 0, although the indicative translation is acceptable.

Another example:
E: I am sweating
H: मुझे पसीना आ रहा है
mujhe paseena aa raha hai (to-me sweat coming is)
Noun-noun compounds

E: The ten best Aamir Khan performances
H: आमिर ख़ान की दस सर्वोत्तम परफॉर्मेंसेस
aamir khaan ki dus sarvottam performances (Aamir Khan-of ten best performances)
I: दस सर्वोत्तम आमिर ख़ान परफॉर्मेंसेस
dus sarvottam aamir khaan performances (ten best Aamir Khan performances)

Unigram precision: 5/5; bigram precision: 2/4
Cultural differences

• Some concepts do not carry over, e.g., kinship terms like 'maamaa'/'taau' have no exact English equivalents

E: Food, clothing and shelter are a man's basic needs
H: रोटी कपड़ा और मकान एक मनुष्य की बुनियादी ज़रूरतें हैं
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain (bread, clothing and house a man-of basic needs are)
I: खाना कपड़ा और आश्रय एक मनुष्य की बुनियादी ज़रूरतें हैं
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain (food, clothing and shelter a man-of basic needs are)

Unigram precision: 8/10; bigram precision: 6/9
Pleonastic divergence

• Words with no semantic content in the target language

E: It is raining
H: बारिश हो रही है
baarish ho rahi hai (rain happening is)
I: यह बारिश हो रही है
yeh baarish ho rahi hai (it rain happening is)

Unigram precision: 4/5; bigram precision: 3/4

Another example:
E: One should not trouble the weak
H: दुर्बलों को परेशान नहीं करना चाहिए
durbalon ko pareshaan nahi karna chahiye (to-weak trouble not do should / "Should not trouble the weak")
Stylistic differences

• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा में 545 सदस्य हैं
lok sabha mein 545 sadasya hain (Lok Sabha in 545 members are)
I: लोक सभा के पास 545 सदस्य हैं
lok sabha ke paas 545 sadasya hain (Lok Sabha has/near 545 members are)

Unigram precision: 5/7; bigram precision: 3/6
WSD errors

• Synonyms vs. incorrect senses

E: I purchased a bat
H (reference): मैंने एक बल्ला खरीदा
maine ek ballaa kharidaa (I a cricket-bat bought / "I bought a cricket bat")
I: मैंने एक चमगादड़ खरीदा
maine ek chamgaadaD kharidaa (I a bat (the mammal) bought); an incorrect sense

E: The thieves were held
H: चोरों को गिरफ़्तार किया
choron ko giraftaar kiyaa (thieves arrest done / "The thieves were arrested")
I1: चोरों को पकड़ा
choron ko pakdaa (thieves caught); an acceptable synonym
I2: चोरों को आयोजित किया
choron ko aayojit kiya (thieves organized done); an incorrect sense

BLEU treats the acceptable synonym and the wrong sense alike: both are mismatches.
Evaluation: Precision

• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU

• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
– re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating specific language divergences
• Do not be overly reliant on BLEU; use it only as an 'evaluation understudy'
• Limit the use of BLEU [3], [4]: do not use BLEU to compare radically different systems; evaluate on the basis of the nature of indicative translation
Conclusion

• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics

Presented as a part of CS 712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.

Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."

Who among them performed better? Who won the race?
(Source: www.jaishreeganesha.com)
Outline

• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries

• Candidate translation(s): the translation returned by an MT system; also called the hypothesis translation
• Reference translation(s): 'perfect' translation(s) produced by humans
• Each metric below is compared with BLEU along three axes: handling incorrect words, handling incorrect word order, and handling recall
Outline

• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Manual evaluation [11]

Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)

Manual evaluation: Assigning adequacy and fluency

• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy (is the meaning translated correctly?): 5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency (is the sentence grammatically valid?): 5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

मैं एक व्याख्यान बैठा हूँ
main ek vyaakhyan baitha hoon (I a lecture sit (present, first person) / "I sit a lecture")
→ Adequate but not fluent

मैं व्याख्यान हूँ
main vyakhyan hoon (I lecture am / "I am lecture")
→ Fluent but not adequate
Outline

• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
BLEU [2]

• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Outline

• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Translation Edit Rate (TER) [5]

• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation

Source: Prime Minister of India will address the nation today

Reference translation: भारत के प्रधान-मंत्री आज राष्ट्र को संबोधित करेंगे
bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge

Candidate translation: प्रधान-मंत्री भारत के आज को एड्रेस करेंगे आज
pradhaan-mantri bhaarat ke aaj ko address karenge aaj

One minimal edit sequence over the candidate: 1 shift (भारत के moved to the front), 1 insertion (राष्ट्र), 1 substitution (एड्रेस → संबोधित), 1 deletion (the trailing आज)
Formula for TER

TER = number of edits / average number of reference words

• The cost of a shift is fixed; the 'distance' shifted is not incorporated
• Mis-capitalization is also considered an error

Here: 4 edits (1 shift + 1 deletion + 1 substitution + 1 insertion) against an 8-word reference, so TER = 4/8 = 0.5
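As a minimal sketch, assuming the edits have already been identified (the real metric, e.g., the tercom tool, searches for the cheapest sequence of shifts and edits automatically):

```python
def ter(num_edits, ref_lengths):
    """TER = edits / average number of reference words."""
    avg_ref_words = sum(ref_lengths) / len(ref_lengths)
    return num_edits / avg_ref_words

# Slide example: 1 shift + 1 deletion + 1 substitution + 1 insertion = 4 edits,
# against a single 8-word reference.
print(ter(4, [8]))  # 0.5
```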
HTER: Human-targeted TER

• A semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output; they must preserve fluency and adequacy while generating this closest target translation
• HTER is then TER computed against the targeted reference
TER vs BLEU

Aspect | TER | BLEU
Handling incorrect words | Substitution | N-gram mismatch
Handling incorrect word order | Shift, or deletion + insertion, incorporates this error | N-gram mismatch
Handling recall | Missed words become deleted words (TER = edits / avg. number of reference words) | Precision cannot detect 'missing' words, hence the brevity penalty
Outline

• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Word Error Rate (WER) [9]

• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant, PER (position-independent word error rate), ignores word order
WER: Example

• The order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence

Number of errors: 3 (substitute looks → seems, insert like, substitute right → correct), so WER = 3/6 = 0.5
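A minimal WER sketch: word-level Levenshtein distance via dynamic programming, normalized by the reference length (PER would instead compare bags of words, ignoring positions):

```python
def wer(candidate, reference):
    hyp, ref = candidate.split(), reference.split()
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                              # delete all hyp words
    for j in range(n + 1):
        d[0][j] = j                              # insert all ref words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n] / n

print(wer("This seems the right sentence",
          "This looks like the correct sentence"))  # 3 errors / 6 words = 0.5
```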
WER vs BLEU

Aspect | WER | BLEU
Handling incorrect words | Substitution | N-gram mismatch
Handling incorrect word order | Deletion + insertion | N-gram mismatch
Handling recall | Missed words become deleted words | Precision cannot detect 'missing' words, hence the brevity penalty
WER vs TER

Aspect | WER | TER
Shifts | Not taken into account (a moved phrase costs a deletion plus an insertion) | Incorporated, though not weighted by distance
Intuition | Based on errors in unigram translation | Based on the edits a human would have to make
(TER = edits / avg. number of reference words)
Outline

• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
ROUGE [6]

• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N

• ROUGE-N incorporates recall: matching n-grams are divided by the total n-grams of the *reference*

Will BLEU be able to understand the quality of long sentences?

Reference translation: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा
kya bloo lambe waakya ki guNvatta ko samajh paaega

Candidate translation: लंबे वाक्य
lambe vaakya

ROUGE-2 = 1/8 (one matching bigram out of the reference's eight), while the modified bigram precision is 1
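A minimal ROUGE-N sketch, reproducing the example above with the transliterated words (the denominator is the reference's n-gram count, which BLEU's precision never looks at):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(ref.values()), 1)

ref = "kya bloo lambe waakya ki guNvatta ko samajh paaega"
print(rouge_n("lambe waakya", ref, 2))  # 1 matching bigram / 8 = 0.125
```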
Other ROUGEs

• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length
• ROUGE-S: precision/recall computed by matching skip-bigrams
ROUGE (suite of metrics) vs BLEU

Aspect | ROUGE | BLEU
Handling incorrect words | Skip-bigrams, ROUGE-N | N-gram mismatch
Handling incorrect word order | Longest common subsequence | N-gram mismatch
Handling recall | ROUGE-N incorporates missing words | Precision cannot detect 'missing' words, hence the brevity penalty
Outline

• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
METEOR [7]

• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy

METEOR: Criticisms of BLEU

• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process

Phase I: find an alignment between words
• List all possible mappings based on matches

Phase II: select the best possible alignment
• Select the alignment with the least number of 'alignment crosses/overlaps'

Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced / The intelligent and excited dude jumped
(boy-guy-dude and jumped-pounced can be matched via synonymy)
METEOR: The score

• Using the unigram mappings, precision and recall are calculated and combined into a recall-weighted harmonic mean
• Penalty: group the matches into as few contiguous chunks as possible; the fewer the chunks, the smaller the penalty

The bright boy sits on the black bench
The intelligent guy sat on the dark bench

More accurate → fewer chunks → less penalty
Less accurate → more chunks → more penalty
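A simplified METEOR sketch, assuming exact unigram matching only and a greedy in-order alignment (the real metric minimizes alignment crossings and adds stem/synonym matching modules); the F-mean and chunk penalty follow the original Banerjee and Lavie formulation:

```python
def meteor_sketch(candidate, reference):
    hyp, ref = candidate.split(), reference.split()
    used = [False] * len(ref)
    pairs = []                                   # (hyp index, ref index) matches
    for i, w in enumerate(hyp):
        for j, r in enumerate(ref):
            if not used[j] and r == w:
                used[j] = True
                pairs.append((i, j))
                break
    m = len(pairs)
    if m == 0:
        return 0.0
    p, rec = m / len(hyp), m / len(ref)
    fmean = 10 * p * rec / (rec + 9 * p)         # recall weighted 9x precision
    chunks = 1
    for (i1, j1), (i2, j2) in zip(pairs, pairs[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:         # contiguity broken -> new chunk
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return fmean * (1 - penalty)

print(meteor_sketch("The intelligent guy sat on the dark bench",
                    "The bright boy sits on the black bench"))  # ~0.39 (3 chunks)
```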
METEOR vs BLEU

Aspect | METEOR | BLEU
Handling incorrect words | Alignment chunks; matching can be done using different techniques, so the metric is adaptable | N-gram mismatch
Handling incorrect word order | Chunks may be ordered in any manner; METEOR does not capture this | N-gram mismatch
Handling recall | The idea of alignment incorporates missing-word handling | Precision cannot detect 'missing' words, hence the brevity penalty
Outline

• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
NIST Metric: Introduction [8]

• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same? NIST weights more heavily those n-grams that are more informative (i.e., the rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given

NIST Metric

The weight of an n-gram is its information, estimated from the reference corpus:
info(w1 ... wn) = log2( count(w1 ... wn-1) / count(w1 ... wn) )
The score sums these weights over matched n-grams, normalized by the number of candidate n-grams, and applies its own brevity factor based on Lsys and Lref.
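A sketch of the information weights, with counts taken over a toy set of reference sentences (an assumption for illustration; Doddington's metric estimates them the same way from its reference data):

```python
import math
from collections import Counter

def nist_info(references, n):
    """info(w1..wn) = log2(count(w1..wn-1) / count(w1..wn)); for n = 1 the
    numerator is the total number of words in the reference corpus."""
    counts, total_words = Counter(), 0
    for ref in references:
        toks = ref.split()
        total_words += len(toks)
        for k in (n - 1, n):
            if k > 0:
                for i in range(len(toks) - k + 1):
                    counts[tuple(toks[i:i + k])] += 1
    def info(ngram):
        numer = counts[ngram[:-1]] if n > 1 else total_words
        return math.log2(numer / counts[ngram])
    return info

refs = ["the Amendment Act was passed by the house",
        "the members of the house agreed"]
info = nist_info(refs, 1)
print(info(("the",)))        # log2(14/4) ~ 1.81: frequent word, low weight
print(info(("Amendment",)))  # log2(14/1) ~ 3.81: rare word, high weight
```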
NIST vs BLEU

Aspect | NIST | BLEU
Handling incorrect words | Information gain: an additional level of 'matching' | N-gram mismatch
Handling incorrect word order | N-gram mismatch (based on information gain); no additional provision | N-gram mismatch
Handling recall | A brevity-penalty factor based on Lsys and Lref | Precision cannot detect 'missing' words, hence the brevity penalty
Outline

• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
GTM [10]

• General Text Matcher
• An F-score: combines precision and recall
• Does not rely on correlation with 'human judgment' (what does a BLEU score of 0.006 mean?)
• Comparison across systems is easier

GTM Scores: Precision and Recall

• MMS: Maximum Match Size, the size of the maximum matching between candidate C and reference R, where longer contiguous runs can be rewarded
• precision = MMS(C, R) / |C|, recall = MMS(C, R) / |R|, combined as F = 2PR / (P + R)
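A rough GTM sketch, under the assumption that greedily taking the longest common runs approximates the true maximum matching (the published metric solves a maximum-matching problem); with exponent e = 1 the match size reduces to unigram overlap, while e > 1 rewards longer runs, i.e., correct word order:

```python
def longest_run(cand, ref):
    """Longest contiguous run of not-yet-matched words common to both lists."""
    best = (0, 0, 0)                             # (length, cand index, ref index)
    for i in range(len(cand)):
        for j in range(len(ref)):
            k = 0
            while (i + k < len(cand) and j + k < len(ref)
                   and cand[i + k] is not None
                   and cand[i + k] == ref[j + k]):
                k += 1
            if k > best[0]:
                best = (k, i, j)
    return best

def gtm(candidate, reference, e=1.0):
    cand, ref = candidate.split(), reference.split()
    c, r = list(cand), list(ref)
    run_sizes = []
    while True:                                  # greedily take the longest runs
        k, i, j = longest_run(c, r)
        if k == 0:
            break
        run_sizes.append(k)
        for t in range(k):                       # retire matched words (distinct
            c[i + t], r[j + t] = None, "\0"      # sentinels can never re-match)
    mms = sum(s ** e for s in run_sizes) ** (1 / e)
    p, rec = mms / len(cand), mms / len(ref)
    return 2 * p * rec / (p + rec) if p + rec else 0.0

print(gtm("the king went to the jungle",
          "the king and queen went to the jungle"))  # ~0.86
```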
GTM vs BLEU

Aspect | GTM | BLEU
Handling incorrect words | Precision based on the maximum match size | N-gram mismatch
Handling incorrect word order | By considering maximum runs | N-gram mismatch
Handling recall | Recall based on the maximum match size | Precision cannot detect 'missing' words, hence the brevity penalty
Outline

• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Entailment-based evaluation [12]

Why entailment?

E: I am feeling hungry
H (Ref): मुझे भूख लग रही है
mujhe bhookh lag rahi hai
Candidate: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon

BLEU score: 0, yet the candidate is clearly entailed by the reference translation (and vice versa).
Entailment-based evaluation [12]

• Evaluating MT output for adequacy is an entailment task
• A candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa
The Stanford RTE system

Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)

RTE for MT evaluation

The same pipeline, applied in both directions:
Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)

Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
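As an illustrative sketch only: the `entails` function below is a hypothetical placeholder for a trained RTE classifier's confidence (such as the Stanford RTE system's output), not a real API.

```python
def adequacy_score(candidate, reference, entails):
    """Bidirectional entailment check for MT adequacy.

    `entails(premise, hypothesis)` is assumed (hypothetically) to return a
    confidence in [0, 1] from some RTE model; adequacy requires entailment
    in BOTH directions, so we keep the weaker one.
    """
    forward = entails(candidate, reference)    # candidate => reference
    backward = entails(reference, candidate)   # reference => candidate
    return min(forward, backward)
```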
Experimental setup: a single reference translation; two MT dataset families with English as the target language.

NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset

ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental setup

Four scoring systems are compared: BLEU-4, MT, RTE, and RTE + MT.

Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)

• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by considering the constituents of the formula and calculating BLEU for a dummy example
• Compared BLEU with human judgment

Conclusion: Shortcomings of BLEU (2/3)

• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)

• Manual evaluation: basic techniques and bottlenecks
• WER: errors in words, based on word order; also PER
• TER: based on the edits required to reach the reference translation; also human-targeted TER (HTER)
• ROUGE: a suite of metrics incorporating n-gram recall, skip-bigrams, etc.
• METEOR: considers alignments; allows the addition of different types of matching
• NIST: information-gain based; weights content and function words differently
• GTM: modifies precision and recall based on the maximum match size
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References

[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of Bleu in machine translation research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Precision Not good enough
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 1 मर तरा स र छाया matching unigram 3
mere tera suroor chhaayamy your spell castYour spell cast my
Candidate 2 तरा तरा तरा स र matching unigrams 4tera tera tera surooryour your your spell
Unigram precision Candidate 1 34 = 075 Candidate 2 44 = 1
Formulating BLEU (Step 2) Modified Precision
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujh-par tera suroor chhaaya
me-on your spell castYour spell was cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spell
bull matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Modified n-gram precision
For entire test corpus for a given n
n-gram Matching n-grams in C
n-gramrsquo All n-grams in CModified precision for n-grams Overall candidates of
test corpus
Formula from [2]
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
• Candidate translation(s): translation returned by an MT system; also called the hypothesis translation
• Reference translation(s): 'perfect' translation(s) produced by humans
• Each metric below is compared with BLEU on three axes: handling incorrect words, handling incorrect word order, and handling recall
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation: assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable
Adequacy: is the meaning translated correctly? 5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency: is the sentence grammatically valid? 5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible
म एक या यान बठा ह (Main ek vyaakhyan baitha hoon, 'I sit a lecture'): adequate but not fluent
म या यान ह (Main vyakhyan hoon, 'I am lecture'): fluent but not adequate
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
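To make the summary concrete, here is a minimal single-reference sketch of the score (our own illustration of clipped n-gram precision plus the brevity penalty, not the official implementation; no smoothing, no corpus-level aggregation):

```python
# Minimal BLEU sketch: modified (clipped) n-gram precision, geometric mean,
# brevity penalty. Single candidate, single reference, for illustration.
import math
from collections import Counter

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram count is clipped
    at its count in the reference."""
    cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

def bleu(candidate, reference, max_n=4):
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(1, c))  # brevity penalty
    ps = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(ps) == 0:
        return 0.0  # geometric mean collapses to zero if any p_n is zero
    return bp * math.exp(sum(math.log(p) for p in ps) / max_n)

print(bleu("the cat sat on the mat".split(),
           "the cat is on the mat".split(), max_n=2))  # ~0.707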
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation
E: Prime Minister of India will address the nation today
Reference translation: भारत क धान-म ी आज रा को सबो धत करग
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: धान-म ी भारत क आज को अ स करग आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj
Four edits: a shift ('pradhaan-mantri' moved to the front), a deletion ('raashtra'), a substitution ('sambodhit' → 'address'), and an insertion (the extra trailing 'aaj')
Formula: TER = #edits / average number of reference words
• The 'distance' of a shift is not incorporated into its cost
• Mis-capitalization is also considered an error
TER = 4 / 8 = 0.5
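Plugging the example into the formula (the reference has 8 words, and the candidate needs one edit of each type):

$$\mathrm{TER} = \frac{\#\,\text{edits}}{\text{avg. \# reference words}} = \frac{1_{\mathrm{shift}} + 1_{\mathrm{del}} + 1_{\mathrm{sub}} + 1_{\mathrm{ins}}}{8} = \frac{4}{8} = 0.5$$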
HTER: Human-targeted TER
• A semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
• They must consider fluency and adequacy while generating this closest target translation
TER vs. BLEU
• Handling incorrect words — TER: substitution; BLEU: n-gram mismatch
• Handling incorrect word order — TER: shift, or delete + insert, incorporates this error; BLEU: n-gram mismatch
• Handling recall — TER: missed words become deleted words (TER = #edits / avg. number of reference words); BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions needed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant, PER (position-independent word error rate), also exists
WER: Example
• The order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors
Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
#errors = 3 (two substitutions: 'looks' → 'seems', 'correct' → 'right'; one deletion: 'like'), so WER = 3/6 = 0.5
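A minimal dynamic-programming sketch of this computation (illustrative; word-level Levenshtein distance normalized by reference length):

```python
# Word Error Rate via Levenshtein distance over words.
# Reproduces the example above: 2 substitutions + 1 deletion = 3 errors.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn hyp[:j] into ref[:i]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions (words missing from hyp)
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions (extra words in hyp)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitution
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[-1][-1] / len(ref)

print(wer("This looks like the correct sentence",
          "This seems the right sentence"))   # 3 errors / 6 words = 0.5
```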
WER vs. BLEU
• Handling incorrect words — WER: substitution; BLEU: n-gram mismatch
• Handling incorrect word order — WER: delete + insert; BLEU: n-gram mismatch
• Handling recall — WER: missed words become deleted words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
WER vs. TER
• Shift: not taken into account in WER; incorporated in TER (though not weighted by distance)
• Intuition: WER is based on errors at the word level; TER is based on the edits a human would require (TER = #edits / avg. number of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
• ROUGE-N incorporates recall: will BLEU be able to understand the quality of long sentences?
Reference translation: या ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
('Will BLEU be able to understand the quality of long sentences?')
Candidate translation: लब वा य (Lambe vaakya, 'long sentences')
ROUGE-2 = 1/8 (one matching bigram out of the 8 reference bigrams), whereas the modified bigram precision is a perfect 1/1
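A small sketch of ROUGE-N as n-gram recall (illustrative; real ROUGE [6] also handles multiple references and jackknifing):

```python
# ROUGE-N sketch: matching n-grams divided by total n-grams in the reference.
from collections import Counter

def rouge_n(reference, candidate, n):
    ref, cand = reference.split(), candidate.split()
    ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
    cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
    overlap = sum(min(c, cand_ngrams[g]) for g, c in ref_ngrams.items())
    return overlap / max(1, sum(ref_ngrams.values()))  # recall, not precision

# The example above: the 2-word candidate matches 1 of the 8 reference bigrams.
ref = "kya bloo lambe waakya ki guNvatta ko samajh paaega"
print(rouge_n(ref, "lambe waakya", n=2))   # 1/8 = 0.125
```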
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length
• ROUGE-S: precision/recall by matching skip-bigrams
ROUGE (suite of metrics) vs. BLEU
• Handling incorrect words — ROUGE: skip-bigrams, ROUGE-N; BLEU: n-gram mismatch
• Handling incorrect word order — ROUGE: longest common subsequence; BLEU: n-gram mismatch
• Handling recall — ROUGE-N incorporates missing words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score be zero?
METEOR: Process — find an alignment between words
Phase I: list all possible word mappings based on matches
Phase II: select the best possible alignment, the one with the least number of alignment crosses/overlaps
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.
References: 'The intelligent and excited boy jumped' / 'The excited and intelligent guy pounced'
Candidate: 'The intelligent and excited dude jumped'
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean
• Penalty: find 'as many chunks' as match, e.g. 'The bright boy sits on the black bench' vs. 'The intelligent guy sat on the dark bench'
• More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
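For reference, this is how the pieces combine in the formulation of [7], where m is the number of mapped unigrams, c and r are the candidate and reference lengths, and ch is the number of chunks:

$$P = \frac{m}{c}, \qquad R = \frac{m}{r}, \qquad F_{mean} = \frac{10\,P\,R}{R + 9\,P}$$

$$\mathrm{Penalty} = 0.5\left(\frac{ch}{m}\right)^{3}, \qquad \mathrm{Score} = F_{mean}\,(1 - \mathrm{Penalty})$$

Note how recall dominates the harmonic mean, and how a perfectly ordered match (one chunk) keeps the penalty small.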
METEOR vs. BLEU
• Handling incorrect words — METEOR: alignment chunks; matching can be done using different techniques, so it is adaptable; BLEU: n-gram mismatch
• Handling incorrect word order — METEOR: chunks may be ordered in any manner; METEOR does not capture this; BLEU: n-gram mismatch
• Handling recall — METEOR: the idea of alignment incorporates missing-word handling; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e. rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
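Concretely, the information weight in [8] is computed from n-gram counts over the reference corpus, so rare continuations score higher:

$$\mathrm{Info}(w_1 \ldots w_n) = \log_2\!\left(\frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}\right)$$

A frequent continuation such as 'of the' makes the ratio close to 1 (weight near 0), while a rare one such as 'Amendment Act' yields a large weight.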
NIST vs. BLEU
• Handling incorrect words — NIST: information gain gives an additional level of 'matching'; BLEU: n-gram mismatch
• Handling incorrect word order — NIST: no additional provision beyond n-gram mismatch (weighted by information gain); BLEU: n-gram mismatch
• Handling recall — NIST: brevity penalty factor based on Lsys and Lref; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment' (what does a BLEU score of 0.006 mean?)
• Comparison across systems is easier
GTM scores: precision and recall
• MMS: Maximum Match Size
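A reconstruction of how the scores combine, in the spirit of [10] (assuming MMS(C, R) denotes the maximum match size between candidate C and reference R; the run-length weighting of [10] is omitted here):

$$\mathrm{precision} = \frac{\mathrm{MMS}(C,R)}{|C|}, \qquad \mathrm{recall} = \frac{\mathrm{MMS}(C,R)}{|R|}, \qquad F = \frac{2\,P\,R}{P + R}$$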
GTM vs. BLEU
• Handling incorrect words — GTM: precision based on maximum match size; BLEU: n-gram mismatch
• Handling incorrect word order — GTM: by considering maximum runs; BLEU: n-gram mismatch
• Handling recall — GTM: recall based on maximum match size; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मझ भख लग रह ह
mujhe bhookh lag rahi hai
Candidate: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon
BLEU score: 0, yet the candidate is clearly entailed by the reference translation
Entailment-based evaluation [12]
• Evaluation of MT output for adequacy is essentially an entailment task
• A candidate translation (i.e. the MT system output) should entail the reference translation, and vice versa
The Stanford RTE system (pipeline):
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
RTE for MT evaluation
• Same pipeline: analysis (POS tagging, dependency parsing, NER, etc.) → alignment of words and phrases → entailment prediction
• Directionality: the standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate
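A minimal sketch of that bidirectional adequacy check. Here `entails` stands in for a hypothetical RTE classifier (e.g. a wrapper around the Stanford RTE system); the name and signature are illustrative, not an actual API:

```python
# Bidirectional adequacy check: unlike standard (asymmetric) RTE,
# MT adequacy needs entailment in BOTH directions.
from typing import Callable

def is_adequate(candidate: str, reference: str,
                entails: Callable[[str, str], bool]) -> bool:
    # candidate -> reference AND reference -> candidate
    return entails(candidate, reference) and entails(reference, candidate)
```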
Experimental setup
• Single reference translation; two families of MT datasets with English as the target language
• NIST MT datasets: NIST06 (the NIST MT-06 Arabic-to-English dataset), NIST08A (MT-08 Arabic-to-English), NIST08C (MT-08 Chinese-to-English), NIST08U (MT-08 Urdu-to-English)
• ACL SMT datasets: SMT06E (the NAACL 2006 SMT workshop EUROPARL dataset), SMT06C (NAACL 2006 Czech-English), SMT07E (the ACL 2007 SMT workshop EUROPARL dataset), SMT07C (ACL 2007 Czech-English)
Experimental setup
• Four systems are compared: BLEU-4, MT, RTE, and RTE + MT
• Experiment 1: individual corpora
• Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by considering the constituents of the formula and by calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• Manual evaluation: basic techniques and bottlenecks
• TER: based on the edits required to get the reference translation; also Human TER (HTER)
• WER: error in words, based on word order; also PER
• ROUGE: suite of metrics incorporating n-gram recall, skip-bigrams, etc.
• METEOR: considers alignments; allows addition of different types of matching
• NIST: information gain-based; comparatively weights content and function words
• GTM: modifies precision and recall based on maximum match size
• Entailment-based evaluation: reference translation must entail candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA), 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Modified n-gram precision
For entire test corpus for a given n
n-gram Matching n-grams in C
n-gramrsquo All n-grams in CModified precision for n-grams Overall candidates of
test corpus
Formula from [2]
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
• Change in lexical category
E: I am feeling hungry
H: मझ भख लग रह ह
mujhe bhookh lag rahi hai
to-me hunger feeling is
I: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon
I hungry feel doing am
Unigram precision: 0; Bigram precision: 0; BLEU score: 0
Another example:
E: I am sweating
H: मझ पसीना आ रहा ह
Mujhe paseena aa raha hai
to-me sweat coming is
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस
Aamir khaan ki dus sarvottam performances
Aamir Khan-of ten best performances
I: दस सव तम आ मर ख़ान पफॉम सस
Dus sarvottam aamir khaan performances
Ten best Aamir Khan performances
Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences
• e.g. 'maamaa'/'taau': Hindi kinship terms with no exact English equivalent
E: Food, clothing and shelter are a man's basic needs
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
bread clothing and house a man-of basic needs are
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
food clothing and shelter a man-of basic needs are
Unigram precision: 8/10; Bigram precision: 6/9
Pleonastic divergence
• Words with no semantic content in the target language
E: It is raining
H: बा रश हो रह ह
baarish ho rahi hai
rain happening is
I: यह बा रश हो रह ह
yeh baarish ho rahi hai
it rain happening is
Unigram precision: 4/5; Bigram precision: 3/4
Another example:
E: One should not trouble the weak
H: दबल को परशान नह करना चा हए
durbalon ko pareshaan nahi karna chahiye
to-weak trouble not do should
(Should not trouble the weak)
Stylistic differences
• Typical styles in a language
E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hain
Lok Sabha in 545 members are
I: लोक सभा क पास 545 सद य ह
lok sabha ke paas 545 sadasya hain
Lok Sabha has/near 545 members are
Unigram precision: 5/7; Bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses
E: I purchased a bat
H: मन एक ब ला खर दा (reference)
maine ek ballaa kharidaa
I a cricket-bat bought
(I bought a cricket bat)
I: मन एक चमगादड़ खर दा
maine ek chamgaadaD kharidaa
I a bat (mammal) bought
(I bought a bat, the mammal)
Another example:
E: The thieves were held
H: चोर को गर तार कया
choron ko giraftaar kiyaa
thieves arrest done
(The thieves were arrested)
I1: चोर को पकड़ा
choron ko pakdaa
thieves caught
(The thieves were caught: an acceptable synonym)
I2: चोर को आयोिजत कया
choron ko aayojit kiya
thieves organized done
(The thieves were organized: an incorrect sense)
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
– Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating language-specific divergences
– Limiting the use of BLEU [3][4]: do not use BLEU for radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS712 by
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
Source: www.jaishreeganesha.com
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation
– BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Preliminaries
• Candidate translation(s): translation returned by an MT system
– Also called hypothesis translation
• Reference translation(s): 'perfect' translation by humans
• Each metric is compared with BLEU on three axes:
– Handling incorrect words
– Handling incorrect word order
– Handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable
Adequacy: Is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency: Is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible
Example 1: म एक या यान बठा ह
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
I sit a lecture
Adequate but not fluent
Example 2: म या यान ह
Main vyakhyan hoon
I lecture am
I am lecture
Fluent but not adequate
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
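As a reference point for the comparisons that follow, here is a minimal sentence-level sketch of BLEU as formulated in [2], with uniform weights wn = 1/N. This is our sketch, not the official implementation; production tools such as NLTK's corpus_bleu aggregate counts over a whole test set instead of scoring one sentence.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(cand, refs, n):
    # Clip each candidate n-gram count by its maximum count in any reference
    cand_counts = Counter(ngrams(cand, n))
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def bleu(cand, refs, N=4):
    c = len(cand)
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]  # closest ref length
    bp = 1.0 if c > r else math.exp(1 - r / c)                 # brevity penalty
    ps = [modified_precision(cand, refs, n) for n in range(1, N + 1)]
    if min(ps) == 0:   # any zero n-gram precision zeroes the whole score
        return 0.0
    return bp * math.exp(sum(math.log(p) / N for p in ps))

refs = ["the cat was sitting on the mat".split()]
print(round(bleu("the cat sat on the mat".split(), refs, N=2), 3))  # ~0.599
```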
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation
English source: Prime Minister of India will address the nation today
Reference translation: भारत क धान-म ी आज रा को सबो धत करग
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: धान-म ी भारत क आज को अ स करग आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj
Edits required: one shift, one deletion, one substitution, one insertion

Formula for TER:
TER = #edits / average number of reference words
• The 'distance' of a shift is not incorporated into its cost
• Mis-capitalization is also considered an error
For the example above: TER = 4 / 8 = 0.5
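A toy tally of the slide's example under this formula (the edit labels are taken from the slide; the shift-finding search that a real TER implementation performs is not shown here):

```python
# Edits identified on the slide's candidate vs. reference
edits = {"shift": 1, "deletion": 1, "substitution": 1, "insertion": 1}
ref_lengths = [8]  # a single 8-word reference

ter = sum(edits.values()) / (sum(ref_lengths) / len(ref_lengths))
print(ter)  # 4 edits / 8 reference words = 0.5
```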
HTER: Human-targeted TER
• Semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
– They must consider fluency and adequacy while generating the closest target translation
TER vs BLEU
Handling incorrect words:
– TER: substitution. BLEU: n-gram mismatch.
Handling incorrect word order:
– TER: shift, or delete + insert, incorporates this error. BLEU: n-gram mismatch.
Handling recall:
– TER: missed words become deleted words. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
(TER = #edits / average number of reference words)
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant also exists: position-independent word error rate (PER)
WER: Example
• The order of words is important
• A dynamic programming-based implementation finds the 'minimum' number of errors
Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
Number of errors: 3 (looks→seems and correct→right are substitutions; like is deleted), so WER = 3/6 = 0.5
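The 'minimum' is found with the classic edit-distance dynamic program. A short sketch of ours, normalized by reference length as is conventional for WER:

```python
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j]: min edits turning the first i words of ref into the first j of hyp
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution or match
    return d[len(r)][len(h)] / len(r)

print(wer("This looks like the correct sentence",
          "This seems the right sentence"))      # 3 / 6 = 0.5
```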
WER vs BLEU
Handling incorrect words:
– WER: substitution. BLEU: n-gram mismatch.
Handling incorrect word order:
– WER: delete + insert. BLEU: n-gram mismatch.
Handling recall:
– WER: missed words become deleted words. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
WER vs TER
Handling shifts:
– TER: shift is incorporated, though not weighted by distance. WER: shift is not taken into account (a shifted word costs a deletion plus an insertion).
Intuition:
– WER: based on errors in unigram translation. TER: based on the edits a human would require.
(TER = #edits / average number of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall.
English: Will BLEU be able to understand the quality of long sentences?
Reference translation: या ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation: लब वा य
Lambe vaakya
ROUGE-N (bigram recall): 1/8, since only one of the eight reference bigrams is matched
Modified n-gram precision: 1
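A simplified sketch of ROUGE-N as n-gram recall; Lin [6] defines it over all references, and the per-n-gram clipping here is our simplification for a single score:

```python
from collections import Counter

def rouge_n(cand, refs, n):
    def bag(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    c = bag(cand)
    matched = total = 0
    for ref in refs:
        r = bag(ref)
        matched += sum(min(c[g], r[g]) for g in r)  # reference n-grams recalled
        total += sum(r.values())                    # all reference n-grams
    return matched / total if total else 0.0

# The slide's example: candidate "lambe vaakya" vs. the 9-word reference
ref = "kya bloo lambe waakya ki guNvatta ko samajh paaega".split()
print(rouge_n("lambe vaakya".split(), [ref], 2))    # 1 / 8 = 0.125
```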
Other ROUGEs
• ROUGE-L
– Considers the longest common subsequence
• ROUGE-W
– Weighted ROUGE-L: all common subsequences are considered, with weights based on length
• ROUGE-S
– Precision/recall by matching skip bigrams
ROUGE (suite of metrics) vs BLEU
Handling incorrect words:
– ROUGE: skip bigrams, ROUGE-N. BLEU: n-gram mismatch.
Handling incorrect word order:
– ROUGE: longest common subsequence. BLEU: n-gram mismatch.
Handling recall:
– ROUGE: ROUGE-N incorporates missing words. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process
Find an alignment between the candidate's and the reference's words:
Phase I: list all possible unigram mappings based on matches
Phase II: select the best possible alignment, i.e. the one with the least number of alignment crosses/overlaps
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.
Example strings from the slide:
Reference: The intelligent and excited boy jumped
The excited and intelligent guy pounced
Candidate: The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated, then combined in a recall-weighted harmonic mean
• Penalty: cover the matched unigrams with as few contiguous chunks as possible
Example: "The bright boy sits on the black bench" vs. "The intelligent guy sat on the dark bench"
More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
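The scoring formula from Banerjee and Lavie [7], as a sketch; the alignment statistics (matches, chunks) are assumed to come from the matching phases above:

```python
def meteor(matches, cand_len, ref_len, chunks):
    if matches == 0:
        return 0.0
    p, r = matches / cand_len, matches / ref_len
    f_mean = 10 * p * r / (r + 9 * p)            # recall-weighted harmonic mean
    penalty = 0.5 * (chunks / matches) ** 3      # fragmentation penalty
    return f_mean * (1 - penalty)

# The bench example: 4 exact unigram matches ("The", "on", "the", "bench")
# falling into 3 chunks, with 8 words on each side
print(round(meteor(matches=4, cand_len=8, ref_len=8, chunks=3), 3))  # 0.395
```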
METEOR vs BLEU
Handling incorrect words:
– METEOR: alignment chunks; matching can be done using different techniques, so it is adaptable. BLEU: n-gram mismatch.
Handling incorrect word order:
– METEOR: chunks may be ordered in any manner; METEOR does not capture this. BLEU: n-gram mismatch.
Handling recall:
– METEOR: the idea of alignment incorporates missing-word handling. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e. rarer ones)
– Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given

NIST Metric
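The formula image on this slide did not survive extraction; for reference, the information weight used by [8] is:

```latex
\mathrm{Info}(w_1 \ldots w_n) =
  \log_2 \frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}
```

so rare continuations (small count of the full n-gram) receive large weights.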
NIST vs BLEU
Handling incorrect words:
– NIST: information gain provides an additional level of 'matching'. BLEU: n-gram mismatch.
Handling incorrect word order:
– NIST: no additional provision; n-gram mismatch (weighted by information gain). BLEU: n-gram mismatch.
Handling recall:
– NIST: brevity penalty factor based on Lsys and Lref. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
GTM [10]
• General Text Matcher
• An F-score, using precision and recall
• Does not rely on correlation with 'human judgment'
– What does a BLEU score of 0.006 mean?
• Comparison across systems is easier

GTM: Scoring precision and recall
• MMS: Maximum Match Size
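A simplified sketch of the GTM scores (ours): the real metric of [10] computes MMS as a maximum matching that rewards long contiguous runs; a plain bag-of-words overlap stands in for it here.

```python
from collections import Counter

def gtm_f1(cand, ref):
    c, r = Counter(cand.split()), Counter(ref.split())
    mms = sum((c & r).values())        # stand-in for Maximum Match Size
    if mms == 0:
        return 0.0
    precision = mms / sum(c.values())  # MMS / candidate length
    recall = mms / sum(r.values())     # MMS / reference length
    return 2 * precision * recall / (precision + recall)

print(round(gtm_f1("the cat sat", "the cat sat on the mat"), 3))  # 0.667
```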
GTM vs BLEU
Handling incorrect words:
– GTM: precision based on Maximum Match Size. BLEU: n-gram mismatch.
Handling incorrect word order:
– GTM: by considering maximum runs. BLEU: n-gram mismatch.
Handling recall:
– GTM: recall based on Maximum Match Size. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मझ भख लग रह ह
mujhe bhookh lag rahi hai
Candidate: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon
BLEU score: 0, yet the candidate is clearly entailed by the reference translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy is an entailment task:
a candidate translation (i.e. MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system: a three-stage pipeline
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not

RTE for MT evaluation: the same pipeline (analysis, alignment, entailment) is applied to the candidate/reference pair.
Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
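A hypothetical sketch of the bidirectional check; the entails() scorer and the threshold are illustrative stand-ins for an RTE system, not an API from [12]:

```python
def fully_adequate(candidate, reference, entails, threshold=0.5):
    # entails(premise, hypothesis) -> confidence that premise entails hypothesis
    forward = entails(candidate, reference)   # candidate => reference
    backward = entails(reference, candidate)  # reference => candidate
    # Entailment must hold in BOTH directions for full adequacy
    return forward >= threshold and backward >= threshold
```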
Experimental setup: single reference translation; two MT datasets with English as the target language.
NIST MT datasets:
– NIST06: the NIST MT-06 Arabic-to-English dataset
– NIST08A: the NIST MT-08 Arabic-to-English dataset
– NIST08C: the NIST MT-08 Chinese-to-English dataset
– NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
– SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
– SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
– SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
– SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental setup
Four systems are compared: BLEU-4, MT, RTE, and RTE + MT.
Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by
– considering the constituents of the formula
– calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• Manual evaluation: basic techniques and bottlenecks
• WER: error in words, based on word order; also PER
• TER: based on the edits required to reach the reference translation; also human-targeted TER
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• METEOR: considers alignments; allows the addition of different types of matching
• NIST: information gain-based; weights content and function words differently
• GTM: modifies precision and recall based on Maximum Match Size
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA), 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Calculating modified n-gram precision (12)
bull 127 source sentences were translated by two human translators and three MT systems
bull Translated sentences evaluated against professional reference translations using modified n-gram precision
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Calculating modified n-gram precision (22)
bull Decaying precision with increasing nbull Comparative ranking of the five
Combining precision for different values of n-grams
Graph from [2]
Formulation of BLEU Recap
bull Precision cannot be used as isbull Modified precision considers lsquoclipped word
countrsquo
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items: Example
You may omit lexical items altogether.
Reference: घर क मग मतलब दाल होती ह
ghar ki murgi matlab daal jaisi hoti hai (House-of chicken means daal-like is: Chicken at home is like daal)
Candidate 1: मग मतलब होती ह
murgi matlab hoti hai (chicken means is)
Unigram precision: 1; bigram precision: 2/3
Candidate 2: घर क मग दाल
ghar ki murgi daal (house-of chicken daal)
Unigram precision: 1; bigram precision: 2/3
Drawing different items: In general
• If there are x unigram matches and y bigram matches in both candidates C1 and C2:
  - the BLEU score remains the same
  - which words are actually matched is not accounted for
  - synonyms also count as a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT Evaluation exercise
• Failure to correlate with human judgment
  - In the 2005 exercise, the system ranked 1st by human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency

Failures in practice: Evaluation 1 (2/2) [3]
• The system with the high human judgment score is an outlier in the BLEU-vs-human plots [graphs in [3]]:
  Adequacy: Pearson correlation (R²) = 0.14 with the outlier included, 0.87 with it excluded
  Fluency: R² = 0.002 with the outlier included, 0.742 with it excluded
Failures in practice: Evaluation 2 (1/2) [3]
• French-English
• System 1: rule-based system (Systran)
• Systems 2, 3: SMT systems trained on two sets of Europarl data (2.8 million words per language)
• 300 test sentences

Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates human judgment [graph in [3]]
Admits too much and too little variation [4]
• Indicative translations
• Linguistic divergences and evaluation

Indicative translation
• Draft-quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi)
• Structural transfer: S V O to S O V
• Lexical transfer:
  - looking up the corresponding word
  - adding gender, aspect, tense, etc. information

Indicative translation: Example
Rohan goes to school by bus
Structural transfer (SVO to SOV): (Rohan) (to school) (by bus) (goes)
Lexical transfer (word translation): (Rohan) (paathshaala) (bus) (jaana)
Lexical transfer (adding GNP/TAM information): (Rohan) (paathshaala) (bus se) (jaata hai)
Indicative output: Rohan paathshaala bus se jaata hai (vs the natural order: Rohan bus se paathshaala jaata hai)
Questions in the context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is "failure of MT" (defined using any measure) simply "failure in relation to inappropriate goals" (translating like a human)?

Linguistic divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• Change in lexical category
E: I am feeling hungry
H: मझ भख लग रह ह (mujhe bhookh lag rahi hai: to-me hunger feeling is)
I: म भखा महसस कर रहा ह (main bhookha mehsoos kar raha hoon: I hungry feel doing am)
Unigram precision: 0; bigram precision: 0; BLEU score: 0

E: I am sweating
H: मझ पसीना आ रहा ह (mujhe paseena aa raha hai: to-me sweat coming is)
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस (aamir khaan ki dus sarvottam performances: Aamir Khan-of ten best performances)
I: दस सव तम आ मर ख़ान पफॉम सस (dus sarvottam aamir khaan performances: ten best Aamir Khan performances)
Unigram precision: 5/5; bigram precision: 2/4
Cultural differences
• Kinship terms such as 'maamaa'/'taau' have no exact English equivalent
E: Food, clothing and shelter are a man's basic needs
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह (roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain: bread, clothing and house a man-of basic needs are)
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह (khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain: food, clothing and shelter a man-of basic needs are)
Unigram precision: 8/10; bigram precision: 6/9
Pleonastic divergence
• Words with no semantic content in the target language
E: It is raining
H: बा रश हो रह ह (baarish ho rahi hai: rain happening is)
I: यह बा रश हो रह ह (yeh baarish ho rahi hai: it rain happening is)
Unigram precision: 4/5; bigram precision: 3/4

E: One should not trouble the weak
H: दबल को परशान नह करना चा हए (durbalon ko pareshaan nahi karna chahiye: to-weak trouble not do should)
Stylistic differences
• Typical styles in a language
E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह (lok sabha mein 545 sadasya hain: Lok Sabha in 545 members are)
I: लोक सभा क पास 545 सद य ह (lok sabha ke paas 545 sadasya hain: Lok Sabha has/near 545 members are)
Unigram precision: 5/7; bigram precision: 3/6
WSD errors
• Synonyms vs incorrect senses
E: I purchased a bat
H (reference): मन एक ब ला खर दा (maine ek ballaa kharidaa: I bought a cricket bat)
I: मन एक चमगादड़ खर दा (maine ek chamgaadaD kharidaa: I bought a bat, the mammal)

E: The thieves were held
H: चोर को गर तार कया (choron ko giraftaar kiyaa: The thieves were arrested)
I1: चोर को पकड़ा (choron ko pakdaa: The thieves were caught)
I2: चोर को आयोिजत कया (choron ko aayojit kiya: The thieves were 'organized')

Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  - Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating language-specific divergences
  - Limiting the use of BLEU [3][4]: do not use BLEU for radically different systems; evaluate on the basis of the nature of the indicative translation
• Do not be overly reliant on BLEU; use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS712 by: Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya, IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world. (Source: www.jaishreeganesha.com)
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual evaluation
• Automatic evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation

Preliminaries
• Candidate translation(s): translation returned by an MT system (also called hypothesis translation)
• Reference translation(s): 'perfect' translation by humans
• Each metric is compared with BLEU on three axes: handling incorrect words, handling incorrect word order, and handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of one to five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy (is the meaning translated correctly?): 5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency (is the sentence grammatically valid?): 5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

म एक या यान बठा ह (main ek vyaakhyan baitha hoon: I sit a lecture): adequate but not fluent
म या यान ह (main vyakhyan hoon: I am lecture): fluent but not adequate
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation

Source: Prime Minister of India will address the nation today
Reference translation: भारत क धान-म ी आज रा को सबो धत करग (bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge)
Candidate translation: धान-म ी भारत क आज को अ स करग आज (pradhaan-mantri bhaarat ke aaj ko address karenge aaj)
Edits here: one shift, one deletion, one substitution, one insertion

Formula for TER:
TER = # edits / average number of reference words
• The 'distance' of a shift is not incorporated in its cost
• Mis-capitalization is also considered an error
Here, TER = 4/8 = 0.5
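A minimal sketch of the TER formula, plugging in the edit counts from the example above (a real implementation, as in [5], searches for the cheapest edit sequence, shifts included):

```python
def ter(num_edits, ref_lengths):
    # Average reference length in the denominator (single reference here)
    return num_edits / (sum(ref_lengths) / len(ref_lengths))

# 1 shift + 1 deletion + 1 substitution + 1 insertion, 8-word reference
print(ter(num_edits=4, ref_lengths=[8]))  # 0.5
```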
HTER: Human TER
• Semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
  - They must consider fluency and adequacy while generating the closest target translation

TER vs BLEU
• Handling incorrect words: TER, substitution; BLEU, n-gram mismatch
• Handling incorrect word order: TER, a shift (or delete + insert) incorporates this error; BLEU, n-gram mismatch
• Handling recall: TER, missed words become deleted words; BLEU, precision cannot detect 'missing' words, hence the brevity penalty
(TER = # edits / average number of reference words)
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant also exists: position-independent word error rate (PER)

WER: Example
• The order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
# errors = 3 (two substitutions, one deletion), so WER = 3/6
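A sketch of WER via the standard word-level Levenshtein dynamic program, reproducing the example above:

```python
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                   # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                   # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # delete r[i-1]
                          d[i][j - 1] + 1,            # insert h[j-1]
                          d[i - 1][j - 1] + sub)      # substitute / match
    return d[-1][-1] / len(r)

print(wer("This looks like the correct sentence",
          "This seems the right sentence"))  # 3 / 6 = 0.5
```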
WER vs BLEU
• Handling incorrect words: WER, substitution; BLEU, n-gram mismatch
• Handling incorrect word order: WER, delete + insert; BLEU, n-gram mismatch
• Handling recall: WER, missed words become deleted words; BLEU, precision cannot detect 'missing' words, hence the brevity penalty

WER vs TER
• TER incorporates shifts (though a shift's cost is not weighted by its distance); WER does not take shifts into account
• WER's intuition is based on errors in unigram translation; TER's intuition is based on the edits required by humans
(TER = # edits / average number of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S

ROUGE-N
• ROUGE-N incorporates recall
Will BLEU be able to understand the quality of long sentences?
Reference translation: या ल लब वा य क गणव ता को समझ पाएगा (kya bloo lambe waakya ki guNvatta ko samajh paaega)
Candidate translation: लब वा य (lambe waakya)
Modified n-gram precision: 1; ROUGE-N (n = 2): 1/8
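A sketch of ROUGE-N as n-gram recall, reproducing the 1/8 above (whitespace tokenization assumed):

```python
def rouge_n(candidate, reference, n):
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = {tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)}
    ref_ngrams = [tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)]
    hits = sum(ng in cand_ngrams for ng in ref_ngrams)
    return hits / len(ref_ngrams)  # recall: matched / reference n-grams

reference = "kya bloo lambe waakya ki guNvatta ko samajh paaega"
print(rouge_n("lambe waakya", reference, n=2))  # 1 / 8 = 0.125
```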
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length
• ROUGE-S: precision/recall by matching skip bigrams

ROUGE (suite of metrics) vs BLEU
• Handling incorrect words: ROUGE, skip bigrams and ROUGE-N; BLEU, n-gram mismatch
• Handling incorrect word order: ROUGE, longest common subsequence; BLEU, n-gram mismatch
• Handling recall: ROUGE-N incorporates missing words; BLEU, precision cannot detect 'missing' words, hence the brevity penalty
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy

METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score be zero?

METEOR: Process
Phase I: find an alignment between words
  - List all possible mappings based on matches
  - Select the best possible alignment: the one with the least number of alignment crosses/overlaps
Phase II: repeat with different matching strategies
  - Incorporate stemmers, consider synonyms, etc.

Reference: The intelligent and excited boy jumped / The excited and intelligent guy pounced
Candidate: The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated, and their harmonic mean is taken
• Penalty: find as few contiguous 'chunks' of matches as possible
Example pair: "The bright boy sits on the black bench" vs "The intelligent guy sat on the dark bench"
More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
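A sketch of the sentence-level METEOR score from the unigram-match statistics, using the F-mean and penalty constants of [7]; the match and chunk counts below are hypothetical, for illustration only:

```python
def meteor(matches, cand_len, ref_len, chunks):
    precision, recall = matches / cand_len, matches / ref_len
    f_mean = 10 * precision * recall / (recall + 9 * precision)  # recall-heavy
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# Hypothetical counts for the sentence pair above: suppose stem/synonym
# matching aligns all 8 words, grouped into 2 contiguous chunks
print(meteor(matches=8, cand_len=8, ref_len=8, chunks=2))  # ~0.99
```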
METEOR vs BLEU
• Handling incorrect words: METEOR, alignment chunks; matching can be done using different techniques (adaptable); BLEU, n-gram mismatch
• Handling incorrect word order: METEOR, chunks may be ordered in any manner and METEOR does not capture this; BLEU, n-gram mismatch
• Handling recall: METEOR, the idea of alignment incorporates missing-word handling; BLEU, precision cannot detect 'missing' words, hence the brevity penalty
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
[NIST formula from [8]]
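A sketch of this information weight, Info(w1…wn) = log2(count(w1…wn-1) / count(w1…wn)), computed over the reference corpus (for unigrams the numerator is the corpus size). The toy corpus below is ours, chosen so that a rare bigram visibly outweighs a common one:

```python
import math
from collections import Counter

def info_weight(ngram, counts, total_words):
    prefix = ngram[:-1]
    numerator = counts[prefix] if prefix else total_words
    return math.log2(numerator / counts[ngram])

corpus = "the cat sat on the mat and the cat ran".split()
counts = Counter()
for n in (1, 2):  # collect unigram and bigram counts
    counts.update(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))

print(info_weight(("the", "cat"), counts, len(corpus)))  # log2(3/2) ~ 0.58
print(info_weight(("the", "mat"), counts, len(corpus)))  # log2(3/1) ~ 1.58, rarer -> heavier
```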
NIST vs BLEU
• Handling incorrect words: NIST, information gain adds a level to 'matching'; BLEU, n-gram mismatch
• Handling incorrect word order: NIST, no additional provision beyond n-gram mismatch (based on information gain); BLEU, n-gram mismatch
• Handling recall: NIST, a brevity penalty factor based on Lsys and Lref; BLEU, precision cannot detect 'missing' words, hence the brevity penalty
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment' (what does a BLEU score of 0.006 mean?)
• Comparison across systems is easier

GTM: Scores
• Precision and recall are defined via the MMS (Maximum Match Size)
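A simplified sketch of GTM's F-score; here MMS is approximated by the plain word overlap, whereas the full metric of [10] computes a maximum matching and can reward longer contiguous runs:

```python
from collections import Counter

def gtm_f(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    # Approximate MMS by the multiset intersection of words
    mms = sum((Counter(cand) & Counter(ref)).values())
    precision, recall = mms / len(cand), mms / len(ref)
    return 2 * precision * recall / (precision + recall)

print(gtm_f("This seems the right sentence",
            "This looks like the correct sentence"))  # ~0.55
```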
GTM vs BLEU
• Handling incorrect words: GTM, precision based on maximum match size; BLEU, n-gram mismatch
• Handling incorrect word order: GTM, by considering maximum runs; BLEU, n-gram mismatch
• Handling recall: GTM, recall based on maximum match size; BLEU, precision cannot detect 'missing' words, hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (ref): मझ भख लग रह ह (mujhe bhookh lag rahi hai)
Candidate: म भखा महसस कर रहा ह (main bhookha mehsoos kar raha hoon)
BLEU score: 0. Yet the candidate is clearly entailed by the reference translation.

Evaluation of MT output for adequacy is an entailment task: a candidate translation (i.e., MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system:
  Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)

RTE for MT evaluation: the same pipeline, with one change.
Directionality: the standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
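A sketch of how this bidirectional requirement could be turned into an adequacy score; entails() is a hypothetical stand-in for an RTE/NLI system such as the pipeline above, not a real API:

```python
def adequacy_score(candidate, reference, entails):
    # entails(premise, hypothesis) -> probability of entailment;
    # hypothetical placeholder for an RTE system's output
    return min(entails(candidate, reference),
               entails(reference, candidate))
```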
Experimental setup: single reference translation; two MT datasets with English as the target language.
• NIST MT datasets: NIST06 (the NIST MT-06 Arabic-to-English dataset), NIST08A (MT-08 Arabic-to-English), NIST08C (MT-08 Chinese-to-English), NIST08U (MT-08 Urdu-to-English)
• ACL SMT datasets: SMT06E (NAACL 2006 SMT workshop EUROPARL), SMT06C (NAACL 2006 SMT workshop Czech-English), SMT07E (ACL 2007 SMT workshop EUROPARL), SMT07C (ACL 2007 SMT workshop Czech-English)

Four systems are compared: BLEU-4, MT, RTE, and RTE + MT.
Experiment 1: individual corpora. Experiment 2: combined corpora.
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by considering the constituents of the formula and calculating BLEU for a dummy example
• Compared BLEU with human judgment

Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; weights content and function words differently
• METEOR: considers alignments; allows addition of different types of matching
• ROUGE: suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: errors in words, based on word order; also PER
• TER: based on the edits required to get the reference translation; also Human TER
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. HLT 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. StatMT 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Formulation of BLEU: Recap
• Precision cannot be used as-is
• Modified precision considers the 'clipped word count'

Recall for MT (1/2)
• Candidates shorter than references
Reference: या ल लब वा य क गणव ता को समझ पाएगा (kya blue lambe vaakya ki guNvatta ko samajh paaega: Will BLEU be able to understand the quality of a long sentence?)
Candidate: लब वा य (lambe vaakya: long sentence)
Modified unigram precision: 2/2 = 1; modified bigram precision: 1/1 = 1
Recall for MT (2/2)
• Candidates longer than references
Reference 1: मन खाना खाया (maine khaana khaaya: I ate food)
Reference 2: मन भोजन कया (maine bhojan kiyaa: I had a meal)
Candidate 1: मन खाना भोजन कया (maine khaana bhojan kiya: I had 'food meal'); modified unigram precision: 1
Candidate 2: मन खाना खाया (maine khaana khaaya: I ate food); modified unigram precision: 1
Formulating BLEU (Step 3): Incorporating recall
• Sentence length indicates the 'best match'
• Brevity penalty (BP):
  - a multiplicative factor
  - candidate translations that match reference translations in length must be ranked higher
Candidate 1: लब वा य
Candidate 2: या ल लब वा य क गणव ता समझ पाएगा
Formulating BLEU (Step 3): Brevity Penalty
BP = e^(1 - x), where x = r/c (r: reference sentence length, c: candidate sentence length), and BP = 1 for c > r (formula from [2]; graph of e^(1 - x) drawn using www.fooplot.com)
Why does BP leave longer translations alone (BP = 1 for c > r)? Because translations longer than the reference are already penalized by the modified precision.
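A sketch of the brevity penalty as defined in [2]:

```python
import math

def brevity_penalty(c, r):
    # c: candidate length, r: reference length
    return 1.0 if c > r else math.exp(1 - r / c)

print(brevity_penalty(c=2, r=9))  # short candidate: exp(1 - 4.5) ~ 0.03
print(brevity_penalty(c=9, r=7))  # longer than the reference: BP = 1
```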
Validating the claim (formula from [2]):
BLEU = BP · exp( Σ_{n=1..N} w_n log p_n )
Precision enters through the modified n-gram precision p_n; recall through the brevity penalty BP.
Understanding BLEU: dissecting the formula, using a news-headline example

Decay in precision
• Why log p_n? To accommodate the decay in precision values as n grows (graph and formula from [2])

Dissecting the formula
• Claim: BLEU should lie between 0 and 1, to intuitively satisfy "1 implies perfect translation"
• We examine the constituents of the formula to validate the claim: the brevity penalty, the modified precision p_n, and the weights w_n, set to 1/N (formula from [2])
Validation of the range of BLEU
• p_n: between 0 and 1, so log p_n: between -∞ and 0
• A = Σ w_n log p_n: between -∞ and 0, so e^A: between 0 and 1
• BP: between 0 and 1
Hence BLEU = BP · e^A lies between 0 and 1. (Graphs of p_n, log p_n, A and e^A against r/c drawn using www.fooplot.com)
Calculating BLEU: An actual example
Ref: भावना मक एक करण लान क लए एक योहार (bhaavanaatmak ekikaran laane ke liye ek tyohaar: A festival to bring emotional unity)
C1: लान क लए एक योहार भावना मक एक करण (laane ke liye ek tyohaar bhaavanaatmak ekikaran; invalid Hindi ordering)
C2: को एक उ सव क बार म लाना भावना मक एक करण (ko ek utsav ke baare mein laana bhaavanaatmak ekikaran; invalid Hindi ordering)

Modified n-gram precision:
C1 (r = 7, c = 7): n = 1: 7 matching n-grams out of 7, precision 1; n = 2: 5 out of 6, precision ≈ 0.83
C2 (r = 7, c = 9): n = 1: 4 matching n-grams out of 9, precision ≈ 0.44; n = 2: 1 out of 8, precision 0.125
Calculating the BLEU score (w_n = 1/N, N = 2, r = 7):
C1 (c = 7, BP = exp(1 - 7/7) = 1):
  n = 1: p_n = 1, log p_n = 0, w_n·log p_n = 0
  n = 2: p_n = 0.83, log p_n = -0.186, w_n·log p_n = -0.093
  Total: -0.093; BLEU = e^(-0.093) ≈ 0.911
C2 (c = 9 > r, so BP = 1):
  n = 1: p_n = 4/9 ≈ 0.44, log p_n = -0.811, w_n·log p_n = -0.405
  n = 2: p_n = 0.125, log p_n = -2.079, w_n·log p_n = -1.040
  Total: -1.445; BLEU = e^(-1.445) ≈ 0.235
Hence, the BLEU scores:
Ref: भावना मक एक करण लान क लए एक योहार
C1: लान क लए एक योहार भावना मक एक करण → BLEU 0.911
C2: को एक उ सव क बार म लाना भावना मक एक करण → BLEU 0.235
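A sketch of sentence-level BLEU (N = 2, uniform weights) with clipped n-gram precision and the brevity penalty; computed with unrounded logs it gives 0.913 and 0.236, matching the slide's 0.911 and 0.235 up to the rounding of intermediate values:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        # Assumes at least one match at every order (else log(0))
        log_sum += (1.0 / max_n) * math.log(clipped / sum(cand.values()))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_sum)

ref = "bhaavanaatmak ekikaran laane ke liye ek tyohaar".split()
c1 = "laane ke liye ek tyohaar bhaavanaatmak ekikaran".split()
c2 = "ko ek utsav ke baare mein laana bhaavanaatmak ekikaran".split()
print(round(bleu(c1, ref), 3), round(bleu(c2, ref), 3))  # 0.913 0.236
```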
BLEU vs human judgment [2]
Setup: source language Chinese, target language English. Five systems perform the translation: 3 automatic MT systems and 2 human translators. BLEU scores are obtained for each system, and human judgments (on a scale of 5) are obtained for each system from two groups:
• Group 1: ten monolingual speakers of the target language (English)
• Group 2: ten bilingual speakers of Chinese and English

BLEU vs human judgment:
• Monolingual speakers: correlation coefficient 0.99
• Bilingual speakers: correlation coefficient 0.96
(BLEU and human evaluation for S2 and S3; graph from [2])

Comparison of normalized values:
• High correlation between the monolingual group and the BLEU score
• The bilingual group was lenient on 'fluency' for H1
• The demarcation between S1-S3 and H1-H2 is captured by BLEU (graph from [2])
Conclusion
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by considering the constituents of the formula and calculating BLEU for a dummy example
• Compared BLEU with human judgment
Part II: Shortcomings of BLEU (continued from 7th March 2013)
Presented by (as a part of CS 712): Aditya Joshi, Kashyap Popat, Shubham Gautam, IIT Bombay
Guide: Prof. Pushpak Bhattacharyya, IIT Bombay

BLEU score revisited: precision enters through the modified n-gram precision; recall through the brevity penalty (formula from [2])
Outline
Part I: • Evaluation • Formulating the BLEU metric • Understanding the BLEU formula
Part II: • Shortcomings of BLEU in general • Shortcomings in the context of English-Hindi MT
Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006. [3]
R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007. [4]
[A domesticated bird wants to learn to fly]
"I can do this. I just have to work out the physics. I have quadrated my vector angles. I have adjusted for wind shear. This is it. Let's fly. Just keep it simple: thrust, lift, drag and wait. Thrust, lift, drag, wait. [speeds up to get to the edge of the table] Thrust, lift, drag, wai... wai... wait!" [gets scared and tries to stop himself, but instead falls from the table]
Blu the Macaw, lines from Rio (2011)
Image source: www.fanpop.com
Use of BLEU metric [3]
• Evaluating incremental system changes (neglecting actual examples)
• Systems ranked on the basis of BLEU
• Does minimizing error rate with respect to BLEU indeed guarantee translation improvements?
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Recall for MT (12)
bull Candidates shorter than referencesbull Reference या ल लब वा य क गणव ता को समझ
पाएगाkya blue lambe vaakya ki guNvatta ko samajh paaegawill blue long sentence-of quality (case-marker) understand able(III-person-male-singular)Will blue be able to understand quality of long sentence
Candidate लब वा य
lambe vaakyalong sentencelong sentence
modified unigram precision 22 = 1modified bigram precision 11 = 1
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (1/2)
Candidate: राजा और रानी शकार करन जगल म गय (raaja aur raani shikaar karne jungal mein chale gaye; King and queen to-do-hunting to-jungle went)
Reference 1: राजा और रानी जगल को शकार क लए गय (raaja aur raani jangal ko shikaar ke liye gaye; King and queen to-jungle for-hunting went)
Reference 2: राजा और उनक बीवी शकार करन जगल गय (raaja aur unki biwi shikaar karne jangal gaye; king and his wife to-do-hunting jungle went)
English: The king and the queen went to the jungle to hunt.
Matching bigrams = 4/8
Scrambled candidate: राजा और रानी शकार करन जगल गय म (raaja aur raani shikaar karne jangal gaye mein; grammatically incorrect)
Matching bigrams = 4/8, the same score
Issues with precision (2/2)
Same candidate and references as above; matching bigrams = 4/8.
Another scrambled candidate: शकार करन जगल राजा और रानी म गय (shikaar karne jungle raaja aur raani mein gaye; grammatically incorrect)
Matching bigrams = 4/8, again the same score
Permuting phrases in general
• For b bigram matches in a candidate translation of length k, there are (k − b)! possible ways to generate similarly scored items using only the words in this translation
• In our example: (8 − 4)! = 24 candidate translations
Derivation: in a sentence of length k,
  total bigrams = k − 1
  matched bigrams = b
  number of mismatched bigrams = k − 1 − b
  number of matched chunks = (k − 1 − b) + 1 = k − b
These (k − b) chunks can be reordered in (k − b)! ways (a toy check follows).
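A toy check of this claim with abstract tokens; a sketch only, where the reference is chosen so that no cross-chunk bigram ever matches, isolating the effect the slide describes:

```python
from itertools import permutations

def matched_bigrams(tokens, ref_bigrams):
    return sum((tokens[i], tokens[i + 1]) in ref_bigrams
               for i in range(len(tokens) - 1))

ref = "a b X c d Y e f Z g h".split()
ref_bigrams = {(ref[i], ref[i + 1]) for i in range(len(ref) - 1)}

# k = 8 candidate words, b = 4 matched bigrams, hence k - b = 4 matched chunks
chunks = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]

scores = {matched_bigrams([t for ch in p for t in ch], ref_bigrams)
          for p in permutations(chunks)}
print(scores)  # {4}: all (k - b)! = 24 orderings have the same match count
```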
Permuting phrases: evaluation
• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translation
Drawing different items from the reference set
• If two systems 'recall' x words each from a reference sentence:
  – precision may remain the same
  – the brevity penalty remains the same
  – yet translation quality need not remain the same
Drawing different items: example
(You may omit lexical items altogether.)
Reference: घर क मग मतलब दाल होती ह (Ghar ki murgi matlab daal jaisi hoti hai; House-of chicken means daal-like is; 'Chicken at home is like daal')
Candidate 1: मग मतलब होती ह (murgi matlab hoti hai; chicken means is)
  Unigram precision: 1; bigram precision: 2/3
Candidate 2: घर क मग दाल (ghar ki murgi daal; house-of chicken daal)
  Unigram precision: 1; bigram precision: 2/3
Drawing different items: in general
• If there are x unigram matches and y bigram matches in both candidate C1 and candidate C2:
  – the BLEU score remains the same
  – which words are actually matched is not accounted for
  – synonyms also count as 'mismatches'
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT Evaluation exercise: failure to correlate with human judgment
  – In the 2005 exercise, the system ranked 1st by human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]
• The system with the high human judgment score is an outlier for BLEU:
  Outlier included: R² (adequacy) = 0.14, R² (fluency) = 0.002
  Outlier excluded: R² (adequacy) = 0.87, R² (fluency) = 0.742
[Graphs from [3]]
Failures in practice: Evaluation 2 (1/2) [3]
• French-English
• System 1: rule-based system using Systran
• Systems 2 and 3: SMT systems using two sets of Europarl data (28 million words per language)
• 300 test sentences
Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates human judgment
[Graph from [3]]
Admits too much and too little variation [4]
Two themes: indicative translations; linguistic divergences and evaluation.
Indicative translation
• Draft-quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi)
• Structural transfer: S V O → S O V
• Lexical transfer: looking up the corresponding word; adding gender, aspect, tense, etc.
Indicative translation example: 'Rohan goes to school by bus'
1. Structural transfer (SVO → SOV): (Rohan) (to school) (by bus) (goes)
2. Lexical transfer (word translation): (Rohan) (paathshaala) (bus) (jaana)
3. Lexical transfer (adding GNP/TAM information): (Rohan) (paathshaala) (bus se) (jaata hai)
Output: 'Rohan paathshaala bus se jaata hai' or 'Rohan bus se paathshaala jaata hai'
A toy sketch of these two steps follows.
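A minimal sketch of the transfer pipeline; the lexicon and the GNP/TAM handling are drastically simplified and purely illustrative:

```python
# Illustrative lexicon; real lexical transfer needs morphology and agreement.
lexicon = {"Rohan": "Rohan", "school": "paathshaala",
           "by bus": "bus se", "goes": "jaata hai"}

def indicative_translate(subject, verb, obj, adjunct):
    # Structural transfer: S V O Adjunct  ->  S O Adjunct V
    reordered = [subject, obj, adjunct, verb]
    # Lexical transfer: word lookup (tense/aspect folded into the entries here)
    return " ".join(lexicon[w] for w in reordered)

print(indicative_translate("Rohan", "goes", "school", "by bus"))
# Rohan paathshaala bus se jaata hai
```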
Questions in the context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, when the output is an indicative translation?
• Is 'failure of MT' (defined using any measure) simply 'failure in relation to inappropriate goals' (translating like a human)?
Linguistic divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• A change in lexical category
E: I am feeling hungry
H (reference): मझ भख लग रह ह (mujhe bhookh lag rahi hai; to-me hunger feeling is)
I (indicative): म भखा महसस कर रहा ह (main bhookha mehsoos kar raha hoon; I hungry feel doing am)
Unigram precision: 0; bigram precision: 0; BLEU score: 0
Another example — E: I am sweating; H: मझ पसीना आ रहा ह (mujhe paseena aa raha hai; to-me sweat coming is)
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस (Aamir khaan ki dus sarvottam performances; Aamir Khan-of ten best performances)
I: दस सव तम आ मर ख़ान पफॉम सस (dus sarvottam Aamir khaan performances; ten best Aamir Khan performances)
Unigram precision: 5/5; bigram precision: 2/4
Cultural differences
E: Food, clothing and shelter are a man's basic needs
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह (roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain; bread, clothing and house are a man's basic needs)
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह (khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain; food, clothing and shelter are a man's basic needs)
Unigram precision: 8/10; bigram precision: 6/9
• Similarly: 'maamaa'/'taau' (kinship terms with no exact English equivalent)
Pleonastic divergence
• Words with no semantic content in the target language
E: It is raining
H: बा रश हो रह ह (baarish ho rahi hai; rain happening is)
I: यह बा रश हो रह ह (yeh baarish ho rahi hai; it rain happening is)
Unigram precision: 4/5; bigram precision: 3/4
Another example — E: One should not trouble the weak; H: दबल को परशान नह करना चा हए (durbalon ko pareshaan nahi karna chahiye; to-weak trouble not do should)
Stylistic differences
• Typical styles in a language
E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह (lok sabha mein 545 sadasya hain; Lok Sabha-in 545 members are)
I: लोक सभा क पास 545 सद य ह (lok sabha ke paas 545 sadasya hain; Lok Sabha has/near 545 members are)
Unigram precision: 5/7; bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses
E: I purchased a bat
H (reference): मन एक ब ला खर दा (maine ek ballaa kharidaa; I bought a cricket bat)
I: मन एक चमगादड़ खर दा (maine ek chamgaadaD kharidaa; I bought a bat, the mammal)
E: The thieves were held
H: चोर को गर तार कया (choron ko giraftaar kiyaa; the thieves were arrested)
I1: चोर को पकड़ा (choron ko pakdaa; the thieves were caught) — an acceptable synonym
I2: चोर को आयोिजत कया (choron ko aayojit kiyaa; the thieves were 'organized') — an incorrect sense
Evaluation precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms; allowing root forms of words; incorporating specific language divergences
  – Limiting the use of BLEU [3], [4]: do not use BLEU to compare radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible without changing the score [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS 712 by: Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya, IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world. (Source: www.jaishreeganesha.com)
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual evaluation
• Automatic evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate translation(s): the translation returned by an MT system; also called the hypothesis translation
• Reference translation(s): 'perfect' translations produced by humans
• Each metric below is compared with BLEU on three axes: handling incorrect words, handling incorrect word order, and handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable
Adequacy (is the meaning translated correctly?): 5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency (is the sentence grammatically valid?): 5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible
म एक या यान बठा ह (Main ek vyaakhyan baitha hoon; 'I sit a lecture') — adequate but not fluent
म या यान ह (Main vyakhyan hoon; 'I am lecture') — fluent but not adequate
Automatic evaluation
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation
E: Prime Minister of India will address the nation today
Reference translation: भारत क धान-म ी आज रा को सबो धत करग (Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge)
Candidate translation: धान-म ी भारत क आज को अ स करग आज (Pradhaan-mantri bhaarat ke aaj ko address karenge aaj)
Edits: 1 shift, 1 deletion, 1 substitution, 1 insertion
Formula: TER = #edits / average number of reference words
• The 'distance' of a shift is not incorporated into its cost
• Mis-capitalization is also considered an error
Here: TER = 4/8 = 0.5
HTER: Human-targeted TER
• A semi-automatic technique (a human is in the loop)
• Human annotators create a new reference for each translation, based on the system output
  – They must consider fluency and adequacy while generating the closest target translation
TER vs. BLEU
• Handling incorrect words — TER: substitution; BLEU: n-gram mismatch
• Handling incorrect word order — TER: a shift, or delete + insert, incorporates this error; BLEU: n-gram mismatch
• Handling recall — TER: missed words become deleted words (TER = #edits / average number of reference words); BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant also exists: the position-independent word error rate (PER)
WER: example
• The order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors
Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
Number of errors: 3 (looks → seems, 'like' deleted, correct → right), so WER = 3/6
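A minimal WER sketch using the dynamic program mentioned above (whitespace tokenization assumed); TER's edit distance is the same recurrence, extended with block shifts:

```python
def wer(hyp, ref):
    """Word error rate: word-level Levenshtein distance / reference length."""
    h, r = hyp.split(), ref.split()
    # d[i][j] = minimum edits to turn h[:i] into r[:j]
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i                              # delete everything in h[:i]
    for j in range(len(r) + 1):
        d[0][j] = j                              # insert everything in r[:j]
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution or match
    return d[len(h)][len(r)] / len(r)

print(wer("This seems the right sentence",
          "This looks like the correct sentence"))  # 3 errors / 6 words = 0.5
```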
WER vs. BLEU
• Handling incorrect words — WER: substitution; BLEU: n-gram mismatch
• Handling incorrect word order — WER: delete + insert; BLEU: n-gram mismatch
• Handling recall — WER: missed words become deleted words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
WER vs. TER
• Shifts — WER: not taken into account; TER: incorporated, though not weighted by distance
• Intuition — WER: based on word-level errors; TER: based on the edits a human would require (TER = #edits / average number of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
• ROUGE-N incorporates recall: will BLEU be able to judge the quality of long sentences?
Reference translation: या ल लब वा य क गणव ता को समझ पाएगा (Kya bloo lambe waakya ki guNvatta ko samajh paaega; 'Will BLEU be able to understand the quality of long sentences?')
Candidate translation: लब वा य (Lambe vaakya; 'long sentences')
ROUGE-2: 1/8 (the reference has eight bigrams, of which one is matched), while the modified bigram precision is 1 (the candidate's only bigram matches)
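A minimal ROUGE-N sketch (recall of reference n-grams; whitespace tokenization and a single reference assumed):

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    """ROUGE-N: fraction of the reference's n-grams found in the candidate."""
    def ngrams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(cnt, cand[gram]) for gram, cnt in ref.items())
    return overlap / max(1, sum(ref.values()))

# The example above, transliterated: 1 of the reference's 8 bigrams matches
print(rouge_n("lambe waakya",
              "kya bloo lambe waakya ki gunvatta ko samajh paaega"))  # 0.125
```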
Other ROUGE variants
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; common subsequences are weighted by length
• ROUGE-S: precision/recall over matching skip-bigrams
ROUGE (suite of metrics) vs. BLEU
• Handling incorrect words — ROUGE: skip-bigrams, ROUGE-N; BLEU: n-gram mismatch
• Handling incorrect word order — ROUGE: longest common subsequence; BLEU: n-gram mismatch
• Handling recall — ROUGE: ROUGE-N incorporates missing words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
METEOR [7]
• Aims to do better than BLEU
• Central idea: a good unigram matching strategy
METEOR's criticisms of BLEU:
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero — should a score ever be zero?
METEOR: process — find an alignment between words
• Phase I: list all possible unigram mappings, based on matches
• Phase II: select the best possible alignment, i.e. the one with the fewest alignment crosses/overlaps
• Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.
[Illustration: candidate 'The intelligent and excited dude jumped' aligned against the references 'The intelligent and excited boy jumped' and 'The excited and intelligent guy pounced']
METEOR: the score
• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean
• Penalty: group the matched unigrams into as few contiguous chunks as possible, e.g. 'The bright boy sits on the black bench' vs. 'The intelligent guy sat on the dark bench'
• More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
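A sketch of the score from the match statistics, using the constants of [7] (a recall-weighted harmonic mean and a chunk-based fragmentation penalty); the inputs are assumed to come from the alignment phase:

```python
def meteor(matches, cand_len, ref_len, chunks):
    """METEOR from unigram-match statistics (constants as in [7])."""
    if matches == 0:
        return 0.0
    p, r = matches / cand_len, matches / ref_len
    f_mean = 10 * p * r / (r + 9 * p)          # recall weighted 9:1
    penalty = 0.5 * (chunks / matches) ** 3    # fewer chunks -> less penalty
    return f_mean * (1 - penalty)

# Six matched words out of eight, contiguous (2 chunks) vs. scattered (6):
print(meteor(6, 8, 8, 2))  # ~0.736
print(meteor(6, 8, 8, 6))  # 0.375
```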
METEOR vs. BLEU
• Handling incorrect words — METEOR: alignment chunks; matching can be done using different techniques, so it is adaptable; BLEU: n-gram mismatch
• Handling incorrect word order — METEOR: chunks may be ordered in any manner, and METEOR does not capture this; BLEU: n-gram mismatch
• Handling recall — METEOR: the idea of alignment incorporates missing-word handling; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
NIST metric: introduction [8]
• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same? NIST weights more heavily those n-grams that are more informative (i.e., rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST metric [formula from [8]]
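The formula slide is lost in this transcript; the information weight defined in [8] is

\mathrm{Info}(w_1 \ldots w_n) = \log_2 \frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}

computed over the reference corpus, so an n-gram whose final word is hard to predict from its history (a rarer continuation) earns a larger weight.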
NIST vs. BLEU
• Handling incorrect words — NIST: information gain provides an additional level of 'matching' on top of n-gram mismatch; BLEU: n-gram mismatch
• Handling incorrect word order — NIST: no additional provision beyond n-gram mismatch (weighted by information gain); BLEU: n-gram mismatch
• Handling recall — NIST: a brevity-penalty factor based on L_sys and L_ref; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
GTM [10]
• General Text Matcher
• An F-score over precision and recall
• Does not rely on correlation with 'human judgment': what does a BLEU score of 0.006 mean?
• Comparison across systems is easier
GTM scores: precision and recall
• MMS: Maximum Match Size [formula from [10]]
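A sketch of how the MMS feeds precision, recall and the F-score, assuming the lengths of the matched runs have already been computed; GTM rewards longer contiguous runs by squaring their lengths, and this helper is illustrative rather than the full matching algorithm of [10]:

```python
import math

def gtm_f_score(run_lengths, cand_len, ref_len):
    """GTM-style F-score: the MMS rewards long contiguous runs of matches."""
    mms = math.sqrt(sum(l * l for l in run_lengths))  # maximum match size
    precision, recall = mms / cand_len, mms / ref_len
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Same four matched words: one run of 4 vs. four scattered runs of 1
print(gtm_f_score([4], 8, 8))           # 0.5
print(gtm_f_score([1, 1, 1, 1], 8, 8))  # 0.25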
GTM vs. BLEU
• Handling incorrect words — GTM: precision based on the maximum match size; BLEU: n-gram mismatch
• Handling incorrect word order — GTM: by considering maximum runs; BLEU: n-gram mismatch
• Handling recall — GTM: recall based on the maximum match size; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (reference): मझ भख लग रह ह (mujhe bhookh lag rahi hai)
Candidate: म भखा महसस कर रहा ह (main bhookha mehsoos kar raha hoon)
BLEU score: 0 — yet the candidate is clearly entailed by the reference translation
• Evaluating MT output for adequacy is an entailment task
• A candidate translation (i.e., MT system output) should entail the reference translation, and vice versa
The Stanford RTE system
Pipeline: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)
RTE for MT evaluation uses the same pipeline, with one change.
Directionality: the standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate. A sketch of this bidirectional check follows.
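In the sketch below, entails() stands in for an RTE classifier's confidence score; it is a hypothetical placeholder, not the Stanford system's actual API:

```python
def adequacy(candidate, reference, entails):
    """Bidirectional entailment as an adequacy score: a translation is fully
    adequate only if each side entails the other, so take the weaker direction.
    `entails(premise, hypothesis)` -> confidence in [0, 1] (hypothetical)."""
    return min(entails(candidate, reference),
               entails(reference, candidate))
```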
Experimental setup
• Single reference translation; two MT dataset families with English as the target language
• NIST MT datasets: NIST06 (NIST MT-06 Arabic-to-English), NIST08A (MT-08 Arabic-to-English), NIST08C (MT-08 Chinese-to-English), NIST08U (MT-08 Urdu-to-English)
• ACL SMT datasets: SMT06E (NAACL 2006 SMT workshop EUROPARL), SMT06C (NAACL 2006 Czech-English), SMT07E (ACL 2007 SMT workshop EUROPARL), SMT07C (ACL 2007 Czech-English)
• Four scoring systems are compared: BLEU-4, MT, RTE, and RTE + MT
• Experiment 1: individual corpora; Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods; formulated the BLEU score
• Analyzed the BLEU score by considering the constituents of the formula and calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible without changing the score [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3): metrics at a glance
• Manual evaluation: basic techniques and bottlenecks
• WER: errors in words, based on word order; also PER
• TER: based on the edits required to reach the reference translation; also human-targeted TER
• ROUGE: a suite of metrics incorporating n-gram recall, skip-bigrams, etc.
• METEOR: considers alignments; allows different types of matching to be added
• NIST: information-gain based; weights content and function words differently
• GTM: modifies precision and recall based on the maximum match size
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of BLEU in Machine Translation Research. EACL 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. Proceedings of AMTA 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of HLT 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
Recall for MT (22)
bull Candidates longer than referencesReference 2 मन भोजन कया
maine bhojan kiyaaI meal didI had meal
Candidate 1 मन खाना भोजन कया
maine khaana bhojan kiyaI food meal didI had food meal
Modified unigram precision 1
Reference 1 मन खाना खायाmaine khaana khaayaI food ateI ate food
Candidate 2 मन खाना खायाmaine khaana khaayaI food ateI ate food
Modified unigram precision 1
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Formulating BLEU (Step 3) Incorporating recall
bull Sentence length indicates lsquobest matchrsquobull Brevity penalty (BP)
ndash Multiplicative factorndash Candidate translations that match reference
translations in length must be ranked higher
Candidate 1 लब वा य
Candidate 2 या ल लब वा य क गणव तासमझ पाएगा
Formulating BLEU (Step 3) Brevity Penalty
e^(1-x)
Graph drawn using wwwfooplotcom
BP
BP = 1 for c gt r
Why
x = ( r c )Formula from [2]
r Reference sentence lengthc Candidate sentence length
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
• Clip the total count of each candidate word with its maximum reference count
• Count_clip(n-gram) = min(count, max_ref_count)

Reference: मुझपर तेरा सुरूर छाया
mujhpar teraa suroor chhaaya
me-on your spell has-been-cast
Your spell has been cast on me

Candidate 2: तेरा तेरा तेरा सुरूर
teraa teraa teraa suroor
your your your spell
Your your your spell

Matching unigrams: (तेरा: min(3, 1) = 1), (सुरूर: min(1, 1) = 1)
Modified unigram precision: 2/4 = 0.5
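Using the modified_precision helper from the BLEU sketch above, the clipping behaviour can be checked directly (tokens normalized to 'teraa', since the Devanagari form is identical in reference and candidate):

```python
ref = "mujhpar teraa suroor chhaaya".split()
cand = "teraa teraa teraa suroor".split()
print(modified_precision(cand, [ref], n=1))  # 0.5: 'teraa' is clipped at 1
```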
Permuting phrases [3]
• Reordering of unmatched phrases does not affect precision
• Bigram mismatch sites can be freely permuted
It is possible to randomly produce other hypothesis translations that have the same BLEU score:
B1 B2 B3 B4 → B4 B2 B1 B3 (B1…B4: chunks delimited by bigram mismatch sites)
Issues with precision (1/2)
Candidate: राजा और रानी शिकार करने जंगल में गये
raaja aur raani shikaar karne jangal mein gaye
King and queen to-do-hunting to-jungle went
The king and the queen went to the jungle to hunt.

Reference 1: राजा और रानी जंगल को शिकार के लिए गये
raaja aur raani jangal ko shikaar ke liye gaye
King and queen to-jungle for-hunting went

Reference 2: राजा और उनकी बीवी शिकार करने जंगल गये
raaja aur unki biwi shikaar karne jangal gaye
King and his wife to-do-hunting jungle went

Matching bi-grams = 4/8

Reordered candidate: राजा और रानी शिकार करने जंगल गये में
raaja aur raani shikaar karne jangal gaye mein
King and queen to-do-hunting jungle went to (grammatically incorrect)

Matching bi-grams = 4/8
Issues with precision (2/2)
Candidate: राजा और रानी शिकार करने जंगल में गये
raaja aur raani shikaar karne jangal mein gaye
King and queen to-do-hunting to-jungle went
The king and the queen went to the jungle to hunt.

Reference 1: राजा और रानी जंगल को शिकार के लिए गये
raaja aur raani jangal ko shikaar ke liye gaye
King and queen to-jungle for-hunting went

Reference 2: राजा और उनकी बीवी शिकार करने जंगल गये
raaja aur unki biwi shikaar karne jangal gaye
King and his wife to-do-hunting jungle went

Matching bi-grams = 4/8

Reordered candidate: शिकार करने जंगल राजा और रानी में गये
shikaar karne jangal raaja aur raani mein gaye
to-do-hunting jungle king and queen in went (grammatically incorrect)

Matching bi-grams = 4/8
Permuting phrases in general
• For 'b' bi-gram matches in a candidate translation of length 'k', there are (k – b)! possible ways to generate similarly scored items using only the words in this translation
• In our example: (8 – 4)! = 24 candidate translations

In a sentence of length k:
  total bigrams = k – 1
  matched bigrams = b
  number of mismatched bigrams = k – 1 – b
  number of matched chunks = (k – 1 – b) + 1 = k – b
These (k – b) chunks can be reordered in (k – b)! ways.
Permuting phrases Evaluation
• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translation
Drawing different items from the reference set
• If two systems 'recall' x words each from a reference sentence:
  – precision may remain the same
  – the brevity penalty remains the same
  – but translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether.

Reference: घर की मुर्गी मतलब दाल जैसी होती है
ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like is
Chicken at home is like daal.

Candidate 1: मुर्गी मतलब होती है
murgi matlab hoti hai
chicken means is
Unigram precision: 1; Bigram precision: 2/3

Candidate 2: घर की मुर्गी दाल
ghar ki murgi daal
house-of chicken daal
Unigram precision: 1; Bigram precision: 2/3
Drawing different items In general
• If there are x unigram matches and y bigram matches in both candidates C1 and C2:
  – the BLEU score remains the same
  – which words are actually matched is not accounted for
  – synonyms also indicate a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT Evaluation exercise
• Failure to correlate with human judgment:
  – In the 2005 exercise, the system ranked 1st by human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]
• The system with the high human judgment score is an outlier
• Pearson correlation (R²) between BLEU and adequacy: 0.14 with the outlier included, 0.87 with it excluded
• Pearson correlation (R²) between BLEU and fluency: 0.002 with the outlier included, 0.742 with it excluded
(Graphs from [3])
Failures in practice: Evaluation 2 (1/2) [3]
• French-English
• System 1: rule-based system using Systran
• Systems 2, 3: SMT systems using two sets of Europarl data (28 million words per language)
• 300 test sentences
Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates human judgment
(Graph from [3])
Admits too much and too little variation [4]
Indicative translations; linguistic divergences & evaluation
Indicative translation
• Draft-quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi):
• Structural transfer
  – S V O to S O V
• Lexical transfer
  – Looking up the corresponding word
  – Adding gender, aspect, tense, etc. information
Indicative translation example: "Rohan goes to school by bus"
Structural transfer (SVO → SOV): (Rohan) (to school) (by bus) (goes)
Lexical transfer (word translation): (Rohan) (paathshaala) (bus) (jaana)
Lexical transfer (adding GNP/TAM information): (Rohan) (paathshaala) (bus se) (jaata hai)
Indicative output: Rohan paathshaala bus se jaata hai
(Natural Hindi order: Rohan bus se paathshaala jaata hai)
Questions in context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is "failure of MT" (defined using any measure) simply "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• Change in lexical category

E: I am feeling hungry
H: मुझे भूख लग रही है
mujhe bhookh lag rahi hai
to-me hunger feeling is
I: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
I hungry feel doing am

Unigram precision: 0; Bigram precision: 0; BLEU score: 0

E: I am sweating
H: मुझे पसीना आ रहा है
mujhe paseena aa raha hai
to-me sweat coming is
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आमिर ख़ान की दस सर्वोत्तम परफॉर्मेंसेस
aamir khaan ki dus sarvottam performances
Aamir Khan-of ten best performances
I: दस सर्वोत्तम आमिर ख़ान परफॉर्मेंसेस
dus sarvottam aamir khaan performances
ten best Aamir Khan performances

Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences
• 'maamaa'/'taau'

E: Food, clothing and shelter are a man's basic needs
H: रोटी कपड़ा और मकान एक मनुष्य की बुनियादी ज़रूरतें हैं
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
bread clothing and house a man-of basic needs are
I: खाना कपड़ा और आश्रय एक मनुष्य की बुनियादी ज़रूरतें हैं
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
food clothing and shelter a man-of basic needs are

Unigram precision: 8/10; Bigram precision: 6/9
Pleonastic divergence
• Words with no semantic content in the target language

E: It is raining
H: बारिश हो रही है
baarish ho rahi hai
rain happening is
I: यह बारिश हो रही है
yeh baarish ho rahi hai
it rain happening is

Unigram precision: 4/5; Bigram precision: 3/4

E: One should not trouble the weak
H: दुर्बलों को परेशान नहीं करना चाहिए
durbalon ko pareshaan nahi karna chahiye
to-weak trouble not do should
Should not trouble the weak
Stylistic differences
• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा में 545 सदस्य हैं
lok sabha mein 545 sadasya hain
Lok Sabha in 545 members are
I: लोक सभा के पास 545 सदस्य हैं
lok sabha ke paas 545 sadasya hain
Lok Sabha has/near 545 members are

Unigram precision: 5/7; Bigram precision: 3/6
WSD errors
• Synonyms vs incorrect senses

E: I purchased a bat
H: मैंने एक बल्ला खरीदा (reference)
maine ek ballaa kharidaa
I a cricket-bat bought
I bought a cricket bat.
I: मैंने एक चमगादड़ खरीदा
maine ek chamgaadaD kharidaa
I a bat (mammal) bought
I bought a bat (the mammal).

E: The thieves were held
H: चोरों को गिरफ़्तार किया
choron ko giraftaar kiyaa
thieves arrest done
The thieves were arrested.
I1: चोरों को पकड़ा
choron ko pakdaa
thieves caught
The thieves were caught.
I2: चोरों को आयोजित किया
choron ko aayojit kiyaa
thieves organized done
The thieves were organized.
Evaluation Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms; allowing root forms of words; incorporating language-specific divergences
  – Limiting the use of BLEU [3] [4]: do not use BLEU for radically different systems; evaluate on the basis of the nature of the indicative translation
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS712 by: Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation
  – BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate translation(s): translation returned by an MT system
  – Also called the hypothesis translation
• Reference translation(s): 'perfect' translation by humans
• Each metric is compared with BLEU on three aspects:
  – Handling incorrect words
  – Handling incorrect word order
  – Handling recall
Outline
• Manual Evaluation
• Automatic Evaluation
  – BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy: is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency: is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible
मैं एक व्याख्यान बैठा हूँ
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
"I sit a lecture" — adequate but not fluent

मैं व्याख्यान हूँ
Main vyakhyan hoon
I lecture am
"I am lecture" — fluent but not adequate
Outline
• Manual Evaluation
• Automatic Evaluation
  – BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Outline
• Manual Evaluation
• Automatic Evaluation
  – BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the edits required to change a hypothesis translation into a reference translation

E: Prime Minister of India will address the nation today
Reference translation: भारत के प्रधान-मंत्री आज राष्ट्र को संबोधित करेंगे
bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: प्रधान-मंत्री भारत के आज को एड्रेस करेंगे आज
pradhaan-mantri bhaarat ke aaj ko address karenge aaj
Edits: one shift (प्रधान-मंत्री to the front), one deletion (राष्ट्र), one substitution (एड्रेस for संबोधित), one insertion (the trailing आज)

Formula for TER:
TER = #edits / average number of reference words
• The 'distance' of a shift is not incorporated in its cost
• Mis-capitalization is also considered an error
Here: TER = 4/8 = 0.5
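In code, the slide's arithmetic is just the edit count over the reference length; the breakdown below is the one annotated above:

```python
# Worked TER arithmetic for the example above (the four annotated edits).
edits = {"shift": 1, "deletion": 1, "substitution": 1, "insertion": 1}
ref_len = 8  # words in the reference translation
print(sum(edits.values()) / ref_len)  # TER = 4/8 = 0.5
```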
HTER: Human TER
• Semi-supervised technique
• Human annotators create a new reference for each translation based on the system output
  – They must consider fluency and adequacy while generating the closest target translation
TER vs BLEU
• Handling incorrect words — TER: substitution; BLEU: n-gram mismatch
• Handling incorrect word order — TER: shift, or delete + insert, incorporates this error; BLEU: n-gram mismatch
• Handling recall — TER: missed words become deleted words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
(TER = #edits / average number of reference words)
Outline
• Manual Evaluation
• Automatic Evaluation
  – BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• Minimum number of substitutions, deletions and insertions needed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant also exists: position-independent word error rate (PER)

WER: Example
• The order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
#errors: 3 (substitute 'seems' for 'looks', delete 'like', substitute 'right' for 'correct')
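The Levenshtein computation the slide alludes to fits in a few lines; a minimal sketch (word-level, uniform edit costs):

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over words (dynamic programming)."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn h[:j] into r[:i]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("This looks like the correct sentence",
          "This seems the right sentence"))   # 3 errors / 6 words = 0.5
```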
WER vs BLEU
• Handling incorrect words — WER: substitution; BLEU: n-gram mismatch
• Handling incorrect word order — WER: delete + insert; BLEU: n-gram mismatch
• Handling recall — WER: missed words become deleted words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
WER vs TER
• WER: shift is not taken into account (it costs a deletion plus an insertion); TER: shift is incorporated, though not weighted by distance
• WER: intuition is based on errors in unigram translation; TER: intuition is based on the edits a human would require
(TER = #edits / average number of reference words)
Outline
• Manual Evaluation
• Automatic Evaluation
  – BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall: will BLEU be able to understand the quality of long sentences?

Reference translation: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा
kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation: लंबे वाक्य
lambe vaakya

ROUGE-2 = 1/8 (one matching bigram out of the reference's 8), while the modified bigram precision is 1.
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weight based on length
• ROUGE-S: precision/recall by matching skip-bigrams
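ROUGE-N is plain n-gram recall against the reference; a minimal sketch, applied to the long-sentence example with transliterated tokens:

```python
def rouge_n(candidate, reference, n):
    """ROUGE-N as n-gram recall: matched n-grams / n-grams in the reference."""
    def ngrams(words):
        counts = {}
        for i in range(len(words) - n + 1):
            g = tuple(words[i:i + n])
            counts[g] = counts.get(g, 0) + 1
        return counts
    ref, cand = ngrams(reference.split()), ngrams(candidate.split())
    matched = sum(min(c, cand.get(g, 0)) for g, c in ref.items())
    return matched / sum(ref.values())

ref = "kya bloo lambe waakya ki gunvatta ko samajh paaega"
print(rouge_n("lambe waakya", ref, n=2))   # 1/8 = 0.125
```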
ROUGE (suite of metrics) vs BLEU
• Handling incorrect words — ROUGE: skip-bigrams, ROUGE-N; BLEU: n-gram mismatch
• Handling incorrect word order — ROUGE: longest common subsequence; BLEU: n-gram mismatch
• Handling recall — ROUGE: ROUGE-N incorporates missing words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation
  – BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
METEOR [7]
Aims to do better than BLEU
Central idea: have a good unigram matching strategy.
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score be zero?
METEOR: Process
Find an alignment between words, in two phases:
• Phase I: list all possible unigram mappings based on matches
• Phase II: select the best possible alignment, i.e., the one with the least number of alignment crosses/overlaps
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Reference: The intelligent and excited boy jumped / The excited and intelligent guy pounced
Candidate: The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined as a harmonic mean
• Penalty: find 'as many chunks' as match
  The bright boy sits on the black bench / The intelligent guy sat on the dark bench
• More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
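A sketch of the scoring arithmetic with the constants from [7] (harmonic mean weighted 9:1 towards recall, cubic fragmentation penalty); the match and chunk counts for the bench example are assumed for illustration:

```python
# Assumes the unigram alignment has already been computed: `matches` is the
# number of aligned unigrams, `chunks` the number of contiguous aligned chunks.
def meteor_score(matches, chunks, cand_len, ref_len):
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    fmean = 10 * precision * recall / (recall + 9 * precision)  # recall-weighted
    penalty = 0.5 * (chunks / matches) ** 3                     # fragmentation
    return fmean * (1 - penalty)

# 'The bright boy sits on the black bench' vs
# 'The intelligent guy sat on the dark bench': 4 exact unigram matches
# ('The', 'on', 'the', 'bench') forming 3 chunks ('The', 'on the', 'bench').
print(meteor_score(matches=4, chunks=3, cand_len=8, ref_len=8))
```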
METEOR vs BLEU
• Handling incorrect words — METEOR: alignment chunks; matching can be done using different techniques, so it is adaptable; BLEU: n-gram mismatch
• Handling incorrect word order — METEOR: chunks may be ordered in any manner, which METEOR does not capture; BLEU: n-gram mismatch
• Handling recall — METEOR: the idea of alignment incorporates missing-word handling; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation
  – BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones)
  – Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
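The formula on this slide is an image in the original; from [8], the information weight given to a matched n-gram is:

```latex
\mathrm{Info}(w_1 \dots w_n) = \log_2\!\left(
\frac{\#\,\text{occurrences of } w_1 \dots w_{n-1}}
     {\#\,\text{occurrences of } w_1 \dots w_n}\right)
```

so an n-gram whose prefix is common but whose full form is rare receives a large weight.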
NIST vs BLEU
• Handling incorrect words — NIST: information gain, an additional level of 'matching'; BLEU: n-gram mismatch
• Handling incorrect word order — NIST: no additional provision; n-gram mismatch (based on information gain); BLEU: n-gram mismatch
• Handling recall — NIST: brevity penalty factor based on Lsys and Lref; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation
  – BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
GTM [10]
• General Text Matcher
• F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  – What does a BLEU score of 0.006 mean?
• Comparison is easier

GTM scores precision and recall using the MMS (Maximum Match Size).
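A simplified sketch of the idea: precision and recall are computed from the MMS — the total size of the matched runs — and combined into an F-score. The greedy run-removal below only approximates the real maximum matching, and GTM can additionally reward long runs superlinearly (run length raised to an exponent e):

```python
def longest_common_run(c, r):
    """Longest common contiguous run of words: (length, start in c, start in r)."""
    best = (0, 0, 0)
    for i in range(len(c)):
        for j in range(len(r)):
            k = 0
            while i + k < len(c) and j + k < len(r) and c[i + k] == r[j + k]:
                k += 1
            if k > best[0]:
                best = (k, i, j)
    return best

def gtm_fscore(candidate, reference):
    c, r = candidate.split(), reference.split()
    clen, rlen = len(c), len(r)
    mms = 0
    while True:
        k, i, j = longest_common_run(c, r)
        if k == 0:
            break
        mms += k
        del c[i:i + k]; del r[j:j + k]   # consume the matched run
    if mms == 0:
        return 0.0
    precision, recall = mms / clen, mms / rlen
    return 2 * precision * recall / (precision + recall)

print(gtm_fscore("This seems the right sentence",
                 "This looks like the correct sentence"))
```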
GTM vs BLEU
• Handling incorrect words — GTM: precision based on Maximum Match Size; BLEU: n-gram mismatch
• Handling incorrect word order — GTM: by considering maximum runs; BLEU: n-gram mismatch
• Handling recall — GTM: recall based on Maximum Match Size; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation
  – BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है
mujhe bhookh lag rahi hai
Candidate: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
BLEU score: 0. Yet, clearly, the candidate is entailed by the reference translation.
Entailment-based evaluation [12]
Evaluation of MT output for adequacy is essentially an entailment task: a candidate translation (i.e., MT system output) should entail the reference translation, and vice versa.
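As a sketch, the bidirectional requirement is just a two-way check over an entailment predicate; `entails` here is a hypothetical stand-in for an RTE/NLI system such as the Stanford RTE system used in [12]:

```python
def adequate(candidate, reference, entails):
    """Fully adequate only if entailment holds in both directions.

    `entails(premise, hypothesis)` is a hypothetical boolean predicate
    backed by a textual-entailment system.
    """
    return entails(candidate, reference) and entails(reference, candidate)
```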
The Stanford RTE system
Pipeline: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)
RTE for MT evaluation
The same pipeline (analysis → alignment → entailment decision), with one difference — directionality: the standard entailment recognition task is asymmetric, but for MT evaluation entailment must hold in both directions for the system translation to be fully adequate.
Experimental setup: single reference translation; two MT dataset families with English as the target language.
NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
Four systems are used: BLEU-4, MT, RTE, RTE + MT
• Experiment 1: individual corpora
• Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
  – considering constituents of the formula
  – calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: suite of metrics incorporating n-gram recall, skip-bigrams, etc.
• WER: error in words based on word order; also PER
• TER: based on the edits required to get the reference translation; also Human TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation and vice versa
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM research report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of BLEU in Machine Translation Research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
BP leaves out longer translations
WhyTranslations longer than reference are already
penalized by modified precision
Validating the claim
Formula from [2]
BLEU scorePrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
• Reordering of unmatched phrases does not affect precision
• Bigram mismatch sites can be freely permuted
• It is thus possible to randomly produce other hypothesis translations that have the same BLEU score, e.g.:
  B1 B2 B3 B4 → B4 B2 B1 B3 (the joins between the B_i chunks are bigram mismatch sites)
Issues with precision (1/2)

Candidate: राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jangal mein gaye
King and queen to-do-hunting to-jungle went

Reference 1: राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gaye
King and queen to-jungle for-hunting went

Reference 2: राजा और उनक बीवी शकार करन जगल गय
raaja aur unki biwi shikaar karne jangal gaye
King and his wife to-do-hunting jungle went

Matching bi-grams = 4/8

Reordered candidate: राजा और रानी शकार करन जगल गय म
raaja aur raani shikaar karne jangal gaye mein
King and queen to-do-hunting jungle went in (grammatically incorrect)
(The king and the queen went to the jungle to hunt)

Matching bi-grams = 4/8
Issues with precision (2/2)

Candidate: राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jangal mein gaye
King and queen to-do-hunting to-jungle went

Reference 1: राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gaye
King and queen to-jungle for-hunting went

Reference 2: राजा और उनक बीवी शकार करन जगल गय
raaja aur unki biwi shikaar karne jangal gaye
King and his wife to-do-hunting jungle went

Matching bi-grams = 4/8

Reordered candidate: शकार करन जगल राजा और रानी म गय
shikaar karne jangal raaja aur raani mein gaye
to-do-hunting jungle king and queen in went (grammatically incorrect)
(The king and the queen went to the jungle to hunt)

Matching bi-grams = 4/8
Permuting phrases: in general
• For b bigram matches in a candidate translation of length k, there are (k − b)! possible ways to generate similarly scored items using only the words in this translation
• In our example: (8 − 4)! = 24 candidate translations

Derivation: in a sentence of length k,
  total bigrams = k − 1
  matched bigrams = b
  number of mismatched bigrams = k − 1 − b
  number of matched chunks = (k − 1 − b) + 1 = k − b
These (k − b) chunks can be reordered in (k − b)! ways.
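A toy sketch of this effect, with hypothetical tokens (k1, k2, ..., x, y) chosen so that matching bigrams occur only inside chunks; every permutation of the chunks then yields the same clipped bigram-match count:

```python
import itertools
from collections import Counter

def clipped_bigram_matches(tokens, ref_bigrams):
    # Number of candidate bigrams matching the reference (clipped counts).
    cand = Counter(zip(tokens, tokens[1:]))
    return sum(min(c, ref_bigrams[g]) for g, c in cand.items())

ref = "k1 k2 z k3 k4 w".split()
ref_bigrams = Counter(zip(ref, ref[1:]))

# Chunks whose internal bigrams all match the reference; the joins between
# chunks are mismatch sites and can be permuted freely.
chunks = [["k1", "k2"], ["x"], ["k3", "k4"], ["y"]]   # k = 6 tokens, 4 chunks

scores = set()
for perm in itertools.permutations(chunks):           # 4! = 24 orderings
    tokens = [t for chunk in perm for t in chunk]
    scores.add(clipped_bigram_matches(tokens, ref_bigrams))

print(scores)  # {2}: all 24 permutations have the same bigram-match count
```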
Permuting phrases: Evaluation
• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translation
Drawing different items from the reference set
• If two systems 'recall' x words each from a reference sentence:
  – Precision may remain the same
  – The brevity penalty remains the same
  – But translation quality need not remain the same
Drawing different items: Example
You may omit lexical items altogether.

Reference: घर क मग मतलब दाल होती ह
Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like is
(Chicken at home is like daal)

Candidate 1: मग मतलब होती ह
murgi matlab hoti hai
chicken means is
Unigram precision: 1/1; Bigram precision: 2/3

Candidate 2: घर क मग दाल
Ghar ki murgi daal
House-of chicken daal
Unigram precision: 1/1; Bigram precision: 2/3
Drawing different items: In general
• If there are x unigram matches and y bigram matches in both candidate C1 and candidate C2:
  – The BLEU score remains the same
  – Which words are actually matched is not accounted for
  – Synonyms also indicate a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT Evaluation exercise
• Failure to correlate with human judgment
  – In the 2005 exercise, the system ranked 1st in human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]
• The system with the high human judgment score is an outlier for BLEU.

Adequacy (graph from [3]):
  Outlier included: yes → Pearson correlation (R²) = 0.14
  Outlier included: no → Pearson correlation (R²) = 0.87

Fluency (graph from [3]):
  Outlier included: yes → Pearson correlation (R²) = 0.002
  Outlier included: no → Pearson correlation (R²) = 0.742
Failures in practice: Evaluation 2 (1/2) [3]
• French-English
• System 1: rule-based system using Systran
• Systems 2, 3: SMT systems using two sets of Europarl data (28 million words per language)
• 300 test sentences
Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates human judgment (graph from [3])
Admits too much and too little variation [4]
Indicative translations; linguistic divergences & evaluation
Indicative translation
• Draft-quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi):
• Structural transfer
  – S V O to S O V
• Lexical transfer
  – Looking up the corresponding word
  – Adding gender, aspect, tense, etc. information
Indicative translation: Example
"Rohan goes to school by bus"
Structural transfer (SVO to SOV): (Rohan) (to school) (by bus) (goes)
Lexical transfer (word translation): (Rohan) (paathshaala) (bus) (jaana)
Lexical transfer (adding GNP/TAM information): (Rohan) (paathshaala) (bus se) (jaata hai)
Indicative output: Rohan paathshaala bus se jaata hai
Natural Hindi order: Rohan bus se paathshaala jaata hai
Questions in the context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is "failure of MT" (defined using any measure) simply "failure in relation to inappropriate goals" (translating like a human)?
Linguistic divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
(In the examples that follow: E = English source, H = human reference translation, I = indicative machine translation.)
Categorical divergence
• Change in lexical category

E: I am feeling hungry
H: मझ भख लग रह ह
mujhe bhookh lag rahi hai
to-me hunger feeling is
I: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon
I hungry feel doing am
Unigram precision: 0; Bigram precision: 0; BLEU score: 0

E: I am sweating
H: मझ पसीना आ रहा ह
Mujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस
Aamir khaan ki dus sarvottam performances
Aamir Khan-of ten best performances
I: दस सव तम आ मर ख़ान पफॉम सस
Dus sarvottam aamir khaan performances
Ten best Aamir Khan performances
Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences
• e.g. kinship terms such as 'maamaa'/'taau' have no exact equivalents across the language pair

E: Food, clothing and shelter are a man's basic needs
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
bread clothing and house a man of basic needs are
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
food clothing and shelter a man of basic needs are
Unigram precision: 8/10; Bigram precision: 6/9
Pleonastic divergence
• Words with no semantic content in the target language

E: It is raining
H: बा रश हो रह ह
baarish ho rahi hai
rain happening is
I: यह बा रश हो रह ह
yeh baarish ho rahi hai
it rain happening is
Unigram precision: 4/5; Bigram precision: 3/4

E: One should not trouble the weak
H: दबल को परशान नह करना चा हए
durbalon ko pareshaan nahi karna chahiye
to-weak trouble not do should
(Should not trouble the weak)
Stylistic differences
• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hain
Lok Sabha in 545 members are
I: लोक सभा क पास 545 सद य ह
lok sabha ke paas 545 sadasya hain
Lok Sabha has/near 545 members are
Unigram precision: 5/7; Bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses

E: I purchased a bat
H (reference): मन एक ब ला खर दा
maine ek ballaa kharidaa
I a cricket-bat bought
(I bought a cricket bat)
I: मन एक चमगादड़ खर दा
maine ek chamgaadaD kharidaa
I a bat (mammal) bought
(I bought a bat (mammal))

E: The thieves were held
H: चोर को गर तार कया
choron ko giraftaar kiyaa
thieves arrest done
(The thieves were arrested)
I1: चोर को पकड़ा
choron ko pakdaa
thieves caught
(The thieves were caught)
I2: चोर को आयोिजत कया
choron ko aayojit kiya
thieves organized done
(The thieves were organized)
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms; allowing root forms of words; incorporating specific language divergences
  – Limiting the use of BLEU [3] [4]: do not use BLEU for radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not sit well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS712 by: Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
(Source: www.jaishreeganesha.com)
Outline
• Manual evaluation
• Automatic evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate translation(s): translation returned by an MT system
  – Also called hypothesis translation
• Reference translation(s): 'perfect' translation by humans
• Each metric below is compared with BLEU on three axes:
  – Handling incorrect words
  – Handling incorrect word order
  – Handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy (is the meaning translated correctly?):
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency (is the sentence grammatically valid?):
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

म एक या यान बठा ह
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
"I sit a lecture": adequate but not fluent

म या यान ह
Main vyakhyan hoon
I lecture am
"I am lecture": fluent but not adequate
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation
E: Prime Minister of India will address the nation today
Reference translation: भारत क धान-म ी आज रा को सबो धत करग
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: धान-म ी भारत क आज को अ स करग आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj
Edits marked on the slide: one shift, one deletion, one substitution, one insertion
Formula for TER
TER = # edits / average number of reference words
• The 'distance' of a shift is not incorporated in its cost
• Mis-capitalization is also considered an error
For the example above (1 shift + 1 deletion + 1 substitution + 1 insertion): TER = 4 / 8 = 0.5
HTER: Human TER
• Semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
  – They must consider fluency and adequacy while generating the closest target translation
TER vs BLEU
• Handling incorrect words: TER: substitution. BLEU: n-gram mismatch.
• Handling incorrect word order: TER: shift, or delete + insert, incorporates this error. BLEU: n-gram mismatch.
• Handling recall: TER: missed words become deleted words (TER = # edits / average number of reference words). BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• A related variant: position-independent word error rate (PER)
WER: Example
• Order of words is important
• A dynamic programming-based implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
# errors = 3 (substitute 'looks' → 'seems', delete 'like', substitute 'correct' → 'right')
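A minimal dynamic-programming sketch of WER for the example above:

```python
def wer(ref, hyp):
    # Word Error Rate via Levenshtein distance: minimum substitutions,
    # deletions and insertions, divided by the reference length.
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("This looks like the correct sentence",
          "This seems the right sentence"))  # 3 errors / 6 words = 0.5
```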
WER vs BLEU
• Handling incorrect words: WER: substitution. BLEU: n-gram mismatch.
• Handling incorrect word order: WER: delete + insert. BLEU: n-gram mismatch.
• Handling recall: WER: missed words become deleted words. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
WER vs TER
• WER: shift is not taken into account; the intuition is based on errors at the word level.
• TER: shift is incorporated (though not weighted by distance); the intuition is based on the edits a human would require (TER = # edits / average number of reference words).
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall. Would BLEU be able to judge the quality of long sentences?

Reference translation: या ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
(Will BLEU be able to understand the quality of long sentences?)

Candidate translation: लब वा य
Lambe vaakya
ROUGE-N: 1/8; modified n-gram precision: 1
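A minimal sketch of ROUGE-N as n-gram recall against a single reference, assuming whitespace tokenisation (the slide counts slightly differently, but the point is the same: the short candidate has perfect precision yet poor recall):

```python
from collections import Counter

def rouge_n(candidate, reference, n):
    # ROUGE-N: n-gram *recall* -- the fraction of reference n-grams that
    # also appear in the candidate (with clipped counts).
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

ref = "kya bloo lambe waakya ki gunvatta ko samajh paaega".split()
cand = "lambe waakya".split()
print(rouge_n(cand, ref, 1))  # 2/9 ~ 0.22: low recall despite precision 1.0
```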
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length
• ROUGE-S: precision/recall computed by matching skip-bigrams
ROUGE (suite of metrics) vs BLEU
• Handling incorrect words: ROUGE: skip bigrams, ROUGE-N. BLEU: n-gram mismatch.
• Handling incorrect word order: ROUGE: longest common subsequence. BLEU: n-gram mismatch.
• Handling recall: ROUGE-N incorporates missing words. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score be zero?
METEOR: Process (finding an alignment between words)
• Phase I: list all possible mappings based on matches
• Phase II: select the best possible alignment, i.e. the one with the least number of 'alignment crosses/overlaps'
• Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Example:
Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced; The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined as a harmonic mean
• Penalty: find 'as many chunks' as match, e.g. between:
  The bright boy sits on the black bench
  The intelligent guy sat on the dark bench
• More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
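A sketch of the scoring arithmetic, using the constants from the METEOR paper [7] (the F-mean weights recall 9:1 over precision; the penalty is 0.5 · (chunks/matches)³); the unigram alignment itself is assumed to be already computed:

```python
def meteor_score(matches, cand_len, ref_len, chunks):
    # METEOR-style score from an existing unigram alignment.
    # matches: number of aligned unigrams; chunks: number of contiguous
    # aligned runs. Constants follow Banerjee & Lavie (2005).
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    fmean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (chunks / matches) ** 3
    return fmean * (1 - penalty)

# 8 words on each side; suppose 6 unigrams align in 3 contiguous chunks:
print(meteor_score(matches=6, cand_len=8, ref_len=8, chunks=3))  # ~0.70
```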
METEOR vs BLEU
• Handling incorrect words: METEOR: alignment chunks; matching can be done using different techniques, so it is adaptable. BLEU: n-gram mismatch.
• Handling incorrect word order: METEOR: chunks may be ordered in any manner; METEOR does not capture this. BLEU: n-gram mismatch.
• Handling recall: METEOR: the idea of alignment incorporates missing-word handling. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
NIST metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e. rarer ones)
  – Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
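The formula image from this slide is lost; the heart of the metric is the information weight assigned to each matched n-gram, computed from reference counts as in [8]. A small sketch with hypothetical counts:

```python
import math

def info_weight(ngram_count, prefix_count):
    # NIST information weight of an n-gram w1..wn, from reference counts:
    #   info(w1..wn) = log2(count(w1..wn-1) / count(w1..wn))
    # Rarer continuations receive higher weight.
    return math.log2(prefix_count / ngram_count)

# Hypothetical counts: 'of the' is common, 'Amendment Act' is rare.
print(info_weight(ngram_count=500, prefix_count=1000))  # 1.0 bit  ('of the')
print(info_weight(ngram_count=1,   prefix_count=40))    # ~5.3 bits ('Amendment Act')
```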
NIST vs BLEU
• Handling incorrect words: NIST: information gain adds a level of 'matching'. BLEU: n-gram mismatch.
• Handling incorrect word order: NIST: no additional provision; n-gram mismatch (weighted by information gain). BLEU: n-gram mismatch.
• Handling recall: NIST: brevity penalty factor based on L_sys and L_ref. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  – What does a BLEU score of 0.006 mean?
• Comparison across systems is easier

GTM scores: precision and recall are defined via the MMS (Maximum Match Size)
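A simplified sketch of the idea, assuming the run reward from [10] (match size = square root of the sum of squared run lengths); the greedy longest-run-first search below is an illustrative approximation rather than GTM's exact maximum matching:

```python
import math

def gtm_fscore(candidate, reference):
    # Simplified GTM: find non-overlapping common runs greedily (longest
    # first), size them as sqrt(sum of squared run lengths), then combine
    # precision and recall into an F-score.
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    used_c, used_r, runs = set(), set(), []
    while True:
        best = None
        for i in range(len(cand)):
            for j in range(len(ref)):
                l = 0
                while (i + l < len(cand) and j + l < len(ref)
                       and cand[i + l] == ref[j + l]
                       and (i + l) not in used_c and (j + l) not in used_r):
                    l += 1
                if l and (best is None or l > best[0]):
                    best = (l, i, j)
        if best is None:
            break
        l, i, j = best
        used_c.update(range(i, i + l))
        used_r.update(range(j, j + l))
        runs.append(l)
    mms = math.sqrt(sum(l * l for l in runs))   # longer runs are rewarded
    p, r = mms / len(cand), mms / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

print(gtm_fscore("the king and queen went hunting",
                 "the king and the queen went to hunt"))  # ~0.52
```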
GTM vs BLEU
• Handling incorrect words: GTM: precision based on maximum match size. BLEU: n-gram mismatch.
• Handling incorrect word order: GTM: by considering maximum runs. BLEU: n-gram mismatch.
• Handling recall: GTM: recall based on maximum match size. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मझ भख लग रह ह
mujhe bhookh lag rahi hai
Candidate: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon
BLEU score: 0, yet the candidate is clearly entailed by the reference translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy is an entailment task:
a candidate translation (i.e. MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system
Pipeline: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (alignment of words and phrases) → Entailment (predicting whether entailment holds or not)
RTE for MT evaluation
The same pipeline is used: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (alignment of words and phrases) → Entailment (predicting whether entailment holds or not)
Directionality: the standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
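A sketch of that bidirectional requirement; the `entails` classifier and the min-combination here are illustrative assumptions, not the actual system of [12]:

```python
from typing import Callable

def adequacy_by_entailment(candidate: str, reference: str,
                           entails: Callable[[str, str], float]) -> float:
    # `entails(premise, hypothesis)` is assumed to be an RTE classifier
    # returning a confidence in [0, 1]. Adequacy requires entailment in
    # BOTH directions, so one simple combination is the minimum of the two.
    return min(entails(candidate, reference),
               entails(reference, candidate))
```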
Experimental setup: single reference translation; two MT dataset families with English as the target language.
NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental setup: four systems are compared:
• BLEU-4
• MT
• RTE
• RTE + MT
Experiment 1: individual corpora. Experiment 2: combined corpora.
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
  – Considering constituents of the formula
  – Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not sit well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: errors in words, based on word order; also PER
• TER: based on the edits required to obtain the reference translation; also Human TER
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation and vice versa
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM research report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of BLEU in Machine Translation Research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Understanding BLEUUnderstanding BLEUDissecting the formulaUsing it news headline example
Decay in precision
Why log pnTo accommodate decay in precision values
Graph from [2]Formula from [2]
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (2/2)
E: The king and the queen went to the jungle to hunt

Reference 1: राजा और रानी जगल को शकार क लए गय (raaja aur raani jangal ko shikaar ke liye gaye: King and queen to-jungle for-hunting went)
Reference 2: राजा और उनक बीवी शकार करन जगल गय (raaja aur unki biwi shikaar karne jangal gaye: king and his wife to-do-hunting jungle went)

Candidate: राजा और रानी शकार करन जगल म गय (raaja aur raani shikaar karne jungal mein chale gaye: King and queen to-do-hunting to-jungle went)
Matching bi-grams = 4/8

Candidate: शकार करन जगल राजा और रानी म गय (shikaar karne jungle raaja aur raani mein gaye: to-do-hunting jungle king and queen in went; grammatically incorrect)
Matching bi-grams = 4/8
Permuting phrases in general
• For 'b' bigram matches in a candidate translation of length 'k', there are (k − b)! possible ways to generate similarly scored items using only the words in this translation
• In our example: (8 − 4)! = 24 candidate translations

In a sentence of length k:
  total bigrams = k − 1
  matched bigrams = b
  number of mismatched bigrams = k − 1 − b
  number of matched chunks = (k − 1 − b) + 1 = k − b
These (k − b) chunks can be reordered in (k − b)! ways.
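This invariance can be checked mechanically. The sketch below (with a hypothetical English example for readability) splits a candidate at its bigram-mismatch sites and verifies that every reordering of the resulting (k − b) chunks preserves at least the original number of matched bigrams; the chunk junctions were mismatches before, so permutation can only add accidental matches:

```python
from itertools import permutations

def matched_bigrams(tokens, ref_bigrams):
    # Number of candidate bigrams that occur in the reference.
    return sum(1 for bg in zip(tokens, tokens[1:]) if bg in ref_bigrams)

def chunks_at_mismatches(tokens, ref_bigrams):
    # Split the candidate at bigram-mismatch sites, so every bigram
    # *inside* a chunk is a match; with b matches this yields k - b chunks.
    chunks, current = [], [tokens[0]]
    for a, b in zip(tokens, tokens[1:]):
        if (a, b) in ref_bigrams:
            current.append(b)
        else:
            chunks.append(current)
            current = [b]
    chunks.append(current)
    return chunks

ref = "the king and the queen went to the jungle".split()
cand = "the king went to the jungle and the queen".split()
ref_bigrams = set(zip(ref, ref[1:]))

base = matched_bigrams(cand, ref_bigrams)
chunks = chunks_at_mismatches(cand, ref_bigrams)
for p in permutations(chunks):
    reordered = [tok for chunk in p for tok in chunk]
    assert matched_bigrams(reordered, ref_bigrams) >= base
print(len(chunks), "chunks;", base, "of", len(cand) - 1, "bigrams matched")
```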
Permuting phrases: Evaluation
• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translations
Drawing different items from the reference set
• If two systems 'recall' x words each from a reference sentence:
  – Precision may remain the same
  – Brevity penalty remains the same
  – Translation quality need not remain the same
Drawing different items: Example
You may omit lexical items altogether.

Reference: घर क मग मतलब दाल होती ह
Ghar ki murgi matlab daal jaisi hoti hai (House-of chicken means daal-like is: Chicken at home is like daal)

Candidate 1: मग मतलब होती ह
Murgi matlab hoti hai (chicken means is)
Unigram precision: 1; Bigram precision: 2/3

Candidate 2: घर क मग दाल
Ghar ki murgi daal (house-of chicken daal)
Unigram precision: 1; Bigram precision: 2/3
Drawing different items: In general
• If there are x unigram matches and y bigram matches in both candidate C1 and candidate C2:
  – The BLEU score remains the same
  – Which words are actually matched is not accounted for
  – Synonyms also count as a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT Evaluation exercise
• Failure to correlate with human judgment
  – In the 2005 exercise, the system ranked 1st in human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]
• The system with the high human judgment score is an outlier for BLEU

Adequacy (graph omitted):
  Outlier included? Yes → Pearson correlation (R²) = 0.14; No → 0.87
Fluency (graph omitted):
  Outlier included? Yes → Pearson correlation (R²) = 0.002; No → 0.742
Failures in practice: Evaluation 2 (1/2) [3]
• French-English
• System 1: rule-based system using Systran
• Systems 2, 3: SMT systems using two sets of Europarl data (28 million words per language)
• 300 test sentences

Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates human judgment (graph omitted)
Admits too much and too little variation [4]
• Indicative translations
• Linguistic divergences & evaluation
Indicative translation
• Draft-quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi)
• Structural transfer
  – S V O to S O V
• Lexical transfer
  – Looking up the corresponding word
  – Adding gender, aspect, tense, etc. information
Indicative translation: Example
'Rohan goes to school by bus'
Structural transfer (SVO → SOV): (Rohan) (to school) (by bus) (goes)
Lexical transfer (word translation): (Rohan) (paathshaala) (bus) (jaana)
Lexical transfer (adding GNP/TAM information): (Rohan) (paathshaala) (bus se) (jaata hai)
Indicative output: 'Rohan paathshaala bus se jaata hai' (vs. the more natural 'Rohan bus se paathshaala jaata hai')
Questions in the context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is 'failure of MT' (defined using any measure) simply 'failure in relation to inappropriate goals' (translating like a human)?
Linguistic Divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• Change in lexical category
(Notation: E = English source, H = Hindi reference, I = indicative translation)

E: I am feeling hungry
H: मझ भख लग रह ह (mujhe bhookh lag rahi hai: to-me hunger feeling is)
I: म भखा महसस कर रहा ह (main bhookha mehsoos kar raha hoon: I hungry feel doing am)
Unigram precision: 0; Bigram precision: 0; BLEU score: 0

E: I am sweating
H: मझ पसीना आ रहा ह (Mujhe paseena aa raha hai: to-me sweat coming is)
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस (Aamir khaan ki dus sarvottam performances: Aamir Khan-of ten best performances)
I: दस सव तम आ मर ख़ान पफॉम सस (Dus sarvottam aamir khaan performances: ten best Aamir Khan performances)
Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences
• e.g. 'maamaa' / 'taau': kinship terms with no exact equivalent across the language pair
E: Food, clothing and shelter are a man's basic needs
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह (roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain: bread, clothing and house a man-of basic needs are)
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह (khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain: food, clothing and shelter a man-of basic needs are)
Unigram precision: 8/10; Bigram precision: 6/9
Pleonastic divergence
• Words with no semantic content in the target language
E: It is raining
H: बा रश हो रह ह (baarish ho rahi hai: rain happening is)
I: यह बा रश हो रह ह (yeh baarish ho rahi hai: it rain happening is)
Unigram precision: 4/5; Bigram precision: 3/4

E: One should not trouble the weak
H: दबल को परशान नह करना चा हए (durbalon ko pareshaan nahi karna chahiye: to-weak trouble not do should; Should not trouble the weak)
Stylistic differences
• Typical styles in a language
E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह (lok sabha mein 545 sadasya hain: Lok Sabha in 545 members are)
I: लोक सभा क पास 545 सद य ह (lok sabha ke paas 545 sadasya hain: Lok Sabha has/near 545 members are)
Unigram precision: 5/7; Bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses
E: I purchased a bat
H: मन एक ब ला खर दा (maine ek ballaa kharidaa: I a cricket-bat bought; I bought a cricket bat)
I: मन एक चमगादड़ खर दा (maine ek chamgaadaD kharidaa: I a bat (mammal) bought; I bought a bat, the mammal)

E: The thieves were held
H: चोर को गर तार कया (choron ko giraftaar kiyaa: thieves arrest done; The thieves were arrested)
I1: चोर को पकड़ा (choron ko pakdaa: thieves caught; The thieves were caught, an acceptable synonym)
I2: चोर को आयोिजत कया (choron ko aayojit kiya: thieves organized done; The thieves were 'organized', an incorrect sense)
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms; allowing root forms of words; incorporating specific language divergences
  – Limiting the use of BLEU [3] [4]: do not use BLEU for radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS 712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
(Source: www.jaishreeganesha.com)
Lord Karthikeya: “I won! I went around the earth, the world, once.”
Lord Ganesha: “I won! I went around my parents. They are my world.”
Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Preliminaries
• Candidate translation(s): translation returned by an MT system
  – Also called hypothesis translation
• Reference translation(s): 'perfect' translation by humans
• Each metric is compared with BLEU along three dimensions:
  – Handling incorrect words
  – Handling incorrect word order
  – Handling recall
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Often adequacy and fluency scores correlate, which is undesirable

Adequacy (is the meaning translated correctly?):
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency (is the sentence grammatically valid?):
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

म एक या यान बठा ह (Main ek vyaakhyan baitha hoon: I a lecture sit; 'I sit a lecture')
→ Adequate but not fluent
म या यान ह (Main vyakhyan hoon: I lecture am; 'I am lecture')
→ Fluent but not adequate
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation

Source: Prime Minister of India will address the nation today
Reference translation: भारत क धान-म ी आज रा को सबो धत करग (Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge)
Candidate translation: धान-म ी भारत क आज को अ स करग आज (Pradhaan-mantri bhaarat ke aaj ko address karenge aaj)
(Edits marked on the slide: shift, deletion, substitution, insertion)

Formula for TER:
TER = #edits / (average number of reference words)
• The 'distance' of a shift is not incorporated in its cost
• Mis-capitalization is also considered an error
Here: TER = 4/8
HTER: Human TER
• Semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
  – They must consider fluency and adequacy while generating this closest target translation
TER vs. BLEU
• Handling incorrect words: TER, substitution; BLEU, n-gram mismatch
• Handling incorrect word order: TER, shift or delete + insert incorporates this error; BLEU, n-gram mismatch
• Handling recall: TER, missed words become deleted words (TER = #edits / avg. number of reference words); BLEU, precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• Minimum number of substitutions, deletions and insertions needed to convert the generated text (hyp) into the reference text (ref)
• A related variant: position-independent word error rate (PER)
WER: Example
• Order of words is important
• A dynamic-programming-based implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
Number of errors: 3 (so WER = 3/6)
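A minimal dynamic-programming sketch of WER, assuming whitespace tokenization and unit costs for substitution, deletion and insertion:

```python
def wer(reference, hypothesis):
    # Levenshtein distance at the word level, divided by reference length.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution or match
    return d[len(r)][len(h)] / len(r)

print(wer("This looks like the correct sentence",
          "This seems the right sentence"))  # 3 / 6 = 0.5
```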
WER vs. BLEU
• Handling incorrect words: WER, substitution; BLEU, n-gram mismatch
• Handling incorrect word order: WER, delete + insert; BLEU, n-gram mismatch
• Handling recall: WER, missed words become deleted words; BLEU, precision cannot detect 'missing' words, hence the brevity penalty
WER vs. TER
• Shift: WER, not taken into account; TER, incorporated (however, not weighted by distance)
• Intuition: WER, based on word-level errors; TER, based on the edits humans would require (TER = #edits / avg. number of reference words)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
• ROUGE-N incorporates recall
• Will BLEU be able to judge the quality of long sentences?

Reference translation: या ल लब वा य क गणव ता को समझ पाएगा (Kya bloo lambe waakya ki guNvatta ko samajh paaega)
Candidate translation: लब वा य (Lambe vaakya)
Modified n-gram precision: 1, but ROUGE-N = 1/8 (one matching bigram out of eight reference bigrams)
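A small sketch of ROUGE-N as n-gram recall, reproducing the slide's 1/8 (the romanization of the Hindi is assumed; the function name is illustrative):

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    # Recall-oriented: matched n-grams divided by the n-grams of the
    # *reference*, not of the candidate (contrast with BLEU's precision).
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / sum(ref.values())

ref = "kya bloo lambe waakya ki guNvatta ko samajh paaega".split()
cand = "lambe waakya".split()
print(rouge_n(cand, ref, n=2))  # 1 matching bigram / 8 reference bigrams = 0.125
```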
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weight based on length
• ROUGE-S: precision/recall by matching skip bigrams
ROUGE (suite of metrics) vs. BLEU
• Handling incorrect words: ROUGE, skip bigrams and ROUGE-N; BLEU, n-gram mismatch
• Handling incorrect word order: ROUGE, longest common subsequence; BLEU, n-gram mismatch
• Handling recall: ROUGE, ROUGE-N incorporates missing words; BLEU, precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy

METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score be zero?
METEOR: Process
Find an alignment between words:
• Phase I: list all possible mappings based on matches
• Phase II: select the best possible alignment, i.e. the one with the least number of 'alignment crosses/overlaps'
• Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Example (alignments drawn on the slide):
Reference: The intelligent and excited boy jumped
Candidate: The excited and intelligent guy pounced
Candidate: The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated, and then a harmonic mean
• Penalty: group the matched unigrams into as few chunks as possible, e.g.
  'The bright boy sits on the black bench' vs. 'The intelligent guy sat on the dark bench'
• More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
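A sketch of how the chunk penalty interacts with the score, using the parameters of Banerjee and Lavie [7] (recall weighted 9:1 in the harmonic mean; penalty 0.5 · (chunks/matches)³); the match counts below are hypothetical:

```python
def meteor(matches, cand_len, ref_len, chunks):
    # F_mean weights recall 9:1 over precision; the fragmentation
    # penalty grows with the number of chunks the matches fall into.
    p, r = matches / cand_len, matches / ref_len
    f_mean = 10 * p * r / (r + 9 * p)
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# Hypothetical counts: 7 matched unigrams, 8-word candidate and reference.
print(meteor(7, 8, 8, chunks=2))  # fluent order, few chunks   -> ~0.86
print(meteor(7, 8, 8, chunks=6))  # scrambled order, many chunks -> ~0.60
```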
METEOR vs. BLEU
• Handling incorrect words: METEOR, alignment chunks, where matching can be done using different techniques (adaptable); BLEU, n-gram mismatch
• Handling incorrect word order: METEOR, chunks may be ordered in any manner and METEOR does not capture this; BLEU, n-gram mismatch
• Handling recall: METEOR, the idea of alignment incorporates missing-word handling; BLEU, precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e. rarer ones)
  – Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
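A sketch of the information weight behind this behaviour, following Doddington [8]; the n-gram counts below form a hypothetical toy reference corpus:

```python
import math

def nist_info(ngram, counts):
    # Info(w1..wn) = log2( count(w1..w_{n-1}) / count(w1..wn) );
    # counts[()] holds the total number of words, so for unigrams this
    # is log2(N / count(w)): rarer n-grams carry more information.
    return math.log2(counts[ngram[:-1]] / counts[ngram])

counts = {(): 10000,
          ("the",): 700, ("of",): 500, ("amendment",): 5,
          ("of", "the"): 400, ("amendment", "act"): 1}
print(nist_info(("the",), counts))             # ~3.8 bits
print(nist_info(("amendment",), counts))       # ~11.0 bits
print(nist_info(("of", "the"), counts))        # ~0.3 bits
print(nist_info(("amendment", "act"), counts)) # ~2.3 bits
```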
NIST Metric [formula slide; formula from [8] not reproduced here]
NIST vs. BLEU
• Handling incorrect words: NIST, information gain gives an additional level of 'matching'; BLEU, n-gram mismatch
• Handling incorrect word order: NIST, no additional provision, n-gram mismatch (weighted by information gain); BLEU, n-gram mismatch
• Handling recall: NIST, brevity penalty factor based on L_sys and L_ref; BLEU, precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
GTM [10]
• General Text Matcher
• F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  – What does a BLEU score of 0.006 mean?
• Comparison across systems is easier

GTM scores: Precision and Recall
• MMS: Maximum Match Size; precision divides the MMS by the candidate length, recall divides it by the reference length
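A simplified sketch of how the MMS turns into an F-score. Real GTM computes a maximum matching between candidate and reference; here the matched run lengths are assumed to be given, and the exponent that rewards longer contiguous runs is set to 2:

```python
def gtm_fscore(run_lengths, cand_len, ref_len, e=2.0):
    # Match size aggregates matched runs, rewarding longer runs via e;
    # with e = 1 it reduces to the plain number of matched words.
    mms = sum(l ** e for l in run_lengths) ** (1 / e)
    p, r = mms / cand_len, mms / ref_len
    return 2 * p * r / (p + r)

# Hypothetical: 6 of 8 words matched, either as one run or scattered.
print(gtm_fscore([6], 8, 8))        # one contiguous run -> 0.75
print(gtm_fscore([2, 2, 2], 8, 8))  # three short runs   -> ~0.43
```

The same matched words thus score differently depending on word order, which is exactly the dimension BLEU handles only implicitly through higher-order n-grams.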
GTM vs. BLEU
• Handling incorrect words: GTM, precision based on maximum match size; BLEU, n-gram mismatch
• Handling incorrect word order: GTM, by considering maximum runs; BLEU, n-gram mismatch
• Handling recall: GTM, recall based on maximum match size; BLEU, precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (ref): मझ भख लग रह ह (mujhe bhookh lag rahi hai)
Candidate: म भखा महसस कर रहा ह (main bhookha mehsoos kar raha hoon)
BLEU score: 0, yet the candidate is clearly entailed by the reference translation
Entailment-based evaluation [12]
• Evaluation of MT output for adequacy is an entailment task
• A candidate translation (i.e. MT system output) should entail the reference translation, and vice versa
The Stanford RTE system
Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)
RTE for MT evaluation
The same pipeline: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)
Directionality: the standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
Experimental Setup: single reference translation; two MT dataset families with English as the target language.
NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
• Four systems are used: BLEU-4, MT, RTE, RTE + MT
• Experiment 1: individual corpora
• Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by
  – considering the constituents of the formula
  – calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information-gain-based; weights content and function words differently
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words based on word order; also PER
• TER: based on the edits required to obtain the reference translation; also Human TER
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation and vice versa
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM research report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of BLEU in Machine Translation Research. European ACL (EACL), 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
Decay in precision
Why log p_n? To accommodate the decay in precision values as n grows.
[Graph and formula from [2]]
Dissecting the Formula
Claim: BLEU should lie between 0 and 1.
Reason: to intuitively satisfy '1 implies perfect translation'.
We examine the constituents of the formula to validate this claim.
[BLEU formula from [2]: BLEU = BP · exp( Σ_{n=1..N} w_n log p_n ), annotated on the slide as: brevity penalty BP (incorporating recall), modified precision p_n, and weights w_n set to 1/N]
Validation of range of BLEU
• p_n: between 0 and 1, so log p_n: between −∞ and 0
• A = Σ w_n log p_n: between −∞ and 0, so e^A: between 0 and 1
• BP: between 0 and 1
Hence BLEU = BP · e^A lies between 0 and 1.
[Plots of BP against (r/c) and of p_n, log p_n, A, e^A; graphs drawn using www.fooplot.com]
Calculating BLEU: An actual example
Ref: भावना मक एक करण लान क लए एक योहार
bhaavanaatmak ekikaran laane ke liye ek tyohaar (emotional unity bring-to a festival: A festival to bring emotional unity)

C1: लान क लए एक योहार भावना मक एक करण
laane ke liye ek tyohaar bhaavanaatmak ekikaran (bring-to a festival emotional unity; this is invalid Hindi ordering)

C2: को एक उ सव क बार म लाना भावना मक एक करण
ko ek utsav ke baare mein laana bhaavanaatmak ekikaran (for a festival-about to-bring emotional unity; this is invalid Hindi ordering)
Modified n-gram precision
Ref: भावना मक एक करण लान क लए एक योहार

C1: लान क लए एक योहार भावना मक एक करण (r = 7, c = 7)
  n = 1: matching n-grams 7, total n-grams 7, modified precision 1
  n = 2: matching n-grams 5, total n-grams 6, modified precision 0.83

C2: को एक उ सव क बार म लाना भावना मक एक करण (r = 7, c = 9)
  n = 1: matching n-grams 4, total n-grams 9, modified precision 0.44
  n = 2: matching n-grams 1, total n-grams 8, modified precision 0.125
Calculating BLEU score (w_n = 1/N = 1/2, r = 7)

C1 (c = 7, BP = exp(1 − 7/7) = 1):
  n | w_n | p_n  | log p_n | w_n log p_n
  1 | 1/2 | 1    |  0      |  0
  2 | 1/2 | 0.83 | −0.186  | −0.093
  Total: −0.093, so BLEU = 1 · e^(−0.093) = 0.911

C2 (c = 9, BP = 1):
  n | w_n | p_n   | log p_n | w_n log p_n
  1 | 1/2 | 0.44  | −0.821  | −0.4105
  2 | 1/2 | 0.125 | −2.07   | −1.035
  Total: −1.445, so BLEU = 1 · e^(−1.445) = 0.235
Hence, the BLEU scores:
Ref: भावना मक एक करण लान क लए एक योहार
C1: लान क लए एक योहार भावना मक एक करण → 0.911
C2: को एक उ सव क बार म लाना भावना मक एक करण → 0.235
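The whole worked example fits in a few lines of code. This sketch recomputes both scores with N = 2, uniform weights and a single reference; exact arithmetic gives 0.913 and 0.236, while the slide's 0.911 and 0.235 come from rounding the intermediate values:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    # Uniform weights w_n = 1/N; single reference; p_n assumed nonzero.
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        log_sum += math.log(clipped / sum(cand.values())) / max_n
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(log_sum)

ref = "bhaavanaatmak ekikaran laane ke liye ek tyohaar".split()
c1 = "laane ke liye ek tyohaar bhaavanaatmak ekikaran".split()
c2 = "ko ek utsav ke baare mein laana bhaavanaatmak ekikaran".split()
print(bleu(c1, ref))  # ~0.913
print(bleu(c2, ref))  # ~0.236
```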
BLEU vs. human judgment [2]
Setup: target language English; source language Chinese
• Five systems perform translation: 3 automatic MT systems, 2 human translators
• BLEU scores obtained for each system
• Human judgment (on a scale of 5) obtained for each system from:
  – Group 1: ten monolingual speakers of the target language (English)
  – Group 2: ten bilingual speakers of Chinese and English
BLEU vs. human judgment
• Monolingual speakers: correlation coefficient 0.99
• Bilingual speakers: correlation coefficient 0.96
[BLEU and human evaluation for S2 and S3; graph from [2]]
Comparison of normalized values
• High correlation between the monolingual group and the BLEU score
• The bilingual group was lenient on 'fluency' for H1
• The demarcation between S1–S3 and H1–H2 is captured by BLEU
[Graph from [2]]
Conclusion
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by
  – considering the constituents of the formula
  – calculating BLEU for a dummy example
• Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Dissecting the Formula
Claim BLEU should lie between 0 and 1Reason To intuitively satisfy ldquo1 implies perfect
translationrdquoUnderstanding constituents of the formula to
validate the claim
Brevity Penalty
Modified precision
Set to 1NFormula from [2]
Validation of range of BLEU
BP
pn Between 0 and 1log pn Between ndash(infinity) and 0 A Between ndash(infinity) and 0 e ^ (A) Between 0 and 1
A
BP Between 0 and 1
Graph drawn using wwwfooplotcom
(rc)
pn
log pn
A
e^(A)
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items: Example
You may omit lexical items altogether.

Reference: घर की मुर्गी मतलब दाल जैसी होती है
Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like is
Chicken at home is like daal

Candidate 1: मुर्गी मतलब होती है
Murgi matlab hoti hai
Chicken means is
Unigram precision: 1; bigram precision: 2/3

Candidate 2: घर की मुर्गी दाल
Ghar ki murgi daal
House-of chicken daal
Unigram precision: 1; bigram precision: 2/3
Drawing different items: In general
• If there are x unigram matches and y bigram matches in both candidate C1 and candidate C2:
– The BLEU score remains the same
– Which words are actually matched is not accounted for
– Synonyms also count as a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT Evaluation exercise
• Failure to correlate with human judgment
– In the 2005 exercise, the system ranked 1st in human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]
• The system with the highest human judgment score is an outlier for BLEU
[Graphs: human adequacy and fluency scores vs. BLEU]

Adequacy:
Outlier included? | Pearson correlation (R²)
Yes | 0.14
No | 0.87

Fluency:
Outlier included? | Pearson correlation (R²)
Yes | 0.002
No | 0.742
Failures in practice: Evaluation 2 (1/2) [3]
• French–English
• System 1: rule-based system (Systran)
• Systems 2 and 3: SMT systems trained on two different sets of Europarl data (28 million words per language)
• 300 test sentences

Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates the human judgment of the rule-based system
[Graph: human judgment vs. BLEU score for the three systems]
Admits too much and too little variation [4]
• Indicative translations
• Linguistic divergences & evaluation

Indicative translation
• Draft-quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English–Hindi):
• Structural transfer
– S V O to S O V
• Lexical transfer
– Looking up the corresponding word
– Adding gender, aspect, tense, etc. information
Indicative translation: Example
Rohan goes to school by bus
Structural transfer, SVO to SOV: (Rohan) (to school) (by bus) (goes)
Lexical transfer, word translation: (Rohan) (paathshaala) (bus) (jaana)
Lexical transfer, adding GNP/TAM information: (Rohan) (paathshaala) (bus se) (jaata hai)
Indicative output: Rohan paathshaala bus se jaata hai
Natural Hindi order: Rohan bus se paathshaala jaata hai
Questions in the context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is "failure of MT" (defined using any measure) simply a "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors

(In the examples that follow: E = English source, H = human reference translation, I = indicative MT output.)
Categorical divergence
• Change in lexical category

E: I am feeling hungry
H: मुझे भूख लग रही है
mujhe bhookh lag rahi hai
to-me hunger feeling is
I: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
I hungry feel doing am
Unigram precision: 0; bigram precision: 0; BLEU score: 0

E: I am sweating
H: मुझे पसीना आ रहा है
Mujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आमिर ख़ान की दस सर्वोत्तम परफॉर्मेंसेस
Aamir khaan ki dus sarvottam performances
Aamir Khan-of ten best performances
I: दस सर्वोत्तम आमिर ख़ान परफॉर्मेंसेस
Dus sarvottam aamir khaan performances
Ten best Aamir Khan performances
Unigram precision: 5/5; bigram precision: 2/4
Cultural differences
• e.g. 'maamaa'/'taau': Hindi kinship terms with no single English equivalent

E: Food, clothing and shelter are a man's basic needs
H: रोटी कपड़ा और मकान एक मनुष्य की बुनियादी ज़रूरतें हैं
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
bread clothing and house a man of basic needs are
I: खाना कपड़ा और आश्रय एक मनुष्य की बुनियादी ज़रूरतें हैं
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
food clothing and shelter a man of basic needs are
Unigram precision: 8/10; bigram precision: 6/9
Pleonastic divergence
• Words with no semantic content in the target language

E: It is raining
H: बारिश हो रही है
baarish ho rahi hai
rain happening is
I: यह बारिश हो रही है
yeh baarish ho rahi hai
it rain happening is
Unigram precision: 4/5; bigram precision: 3/4

E: One should not trouble the weak
H: दुर्बलों को परेशान नहीं करना चाहिए
durbalon ko pareshaan nahi karna chahiye
to-weak trouble not do should
Should not trouble the weak
Stylistic differences
• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा में 545 सदस्य हैं
lok sabha mein 545 sadasya hain
Lok Sabha in 545 members are
I: लोक सभा के पास 545 सदस्य हैं
lok sabha ke paas 545 sadasya hain
Lok Sabha has/near 545 members are
Unigram precision: 5/7; bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses

E: I purchased a bat
H: मैंने एक बल्ला खरीदा (reference)
maine ek ballaa kharidaa
I a cricket-bat bought (I bought a cricket bat)
I: मैंने एक चमगादड़ खरीदा
maine ek chamgaadaD kharidaa
I a bat (mammal) bought (I bought a bat, the mammal)

E: The thieves were held
H: चोरों को गिरफ्तार किया
choron ko giraftaar kiyaa
thieves arrest done (The thieves were arrested)
I1: चोरों को पकड़ा
choron ko pakdaa
thieves caught (acceptable synonym: The thieves were caught)
I2: चोरों को आयोजित किया
choron ko aayojit kiya
thieves organized done (incorrect sense: The thieves were organized)
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
– Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating language-specific divergences
– Limiting the use of BLEU [3] [4]: not using BLEU to compare radically different systems; evaluating on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU; use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences exist in English–Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS712 by
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
[Image source: www.jaishreeganesha.com]
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate translation(s): translation returned by an MT system
– Also called hypothesis translation
• Reference translation(s): 'perfect' translation by humans
• Each metric below is compared with BLEU along three axes: handling incorrect words, handling incorrect word order, and handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy: is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency: is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

मैं एक व्याख्यान बैठा हूँ
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
I sit a lecture — adequate but not fluent

मैं व्याख्यान हूँ
Main vyakhyan hoon
I lecture am
I am lecture — fluent but not adequate
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation
Source: Prime Minister of India will address the nation today
Reference translation: भारत के प्रधान-मंत्री आज राष्ट्र को संबोधित करेंगे
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: प्रधान-मंत्री भारत के आज को एड्रेस करेंगे आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj
Edits: shift (pradhaan-mantri), deletion (raashtra), substitution (sambodhit → address), insertion (the extra aaj)
Formula for TER
TER = # Edits / Avg. number of reference words
• The 'distance' of a shift is not incorporated into its cost
• Mis-capitalization is also considered an error
In the example above: TER = 4/8 = 0.5
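A simplified sketch of the computation (our own illustration: it omits the greedy block-shift search of full TER [5], so shifted phrases are charged as delete + insert instead of as one shift):

    def word_edit_distance(hyp, ref):
        # Minimum word-level substitutions, insertions and deletions.
        d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
        for i in range(len(hyp) + 1):
            d[i][0] = i
        for j in range(len(ref) + 1):
            d[0][j] = j
        for i in range(1, len(hyp) + 1):
            for j in range(1, len(ref) + 1):
                d[i][j] = min(d[i - 1][j] + 1,                               # deletion
                              d[i][j - 1] + 1,                               # insertion
                              d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))  # substitution
        return d[len(hyp)][len(ref)]

    def ter_no_shifts(hyp, refs):
        # TER = edits / average number of reference words; without the
        # shift move this can only overestimate the true TER.
        avg_len = sum(len(r) for r in refs) / len(refs)
        return min(word_edit_distance(hyp, r) for r in refs) / avg_len

    print(ter_no_shifts("this is a small test".split(),
                        ["this is a big test".split()]))  # 1 edit / 5 words = 0.2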
HTER: Human TER
• A semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
– They must consider fluency and adequacy while generating the closest target translation
TER vs. BLEU
• Handling incorrect words — TER: substitution; BLEU: n-gram mismatch
• Handling incorrect word order — TER: shift, or delete + insert, incorporates this error; BLEU: n-gram mismatch
• Handling recall — TER: missed words become deleted words (TER = edits / avg. number of reference words); BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
• A related measure: position-independent word error rate (PER)
WER: Example
• The order of words matters
• A dynamic-programming implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
# errors = 3 (two substitutions, one deletion); WER = 3/6 = 0.5
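A self-contained sketch (rolling-row Levenshtein; the function name is ours):

    def wer(hyp, ref):
        # WER = word-level Levenshtein distance / number of reference words.
        prev = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            cur = [i]
            for j, r in enumerate(ref, 1):
                cur.append(min(prev[j] + 1,               # delete h
                               cur[j - 1] + 1,            # insert r
                               prev[j - 1] + (h != r)))   # substitute
            prev = cur
        return prev[-1] / len(ref)

    print(wer("This seems the right sentence".split(),
              "This looks like the correct sentence".split()))  # 3/6 = 0.5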
WER vs. BLEU
• Handling incorrect words — WER: substitution; BLEU: n-gram mismatch
• Handling incorrect word order — WER: delete + insert; BLEU: n-gram mismatch
• Handling recall — WER: missed words become deleted words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
WER vs. TER
• WER: shift is not taken into account; TER: shift is incorporated, although it is not weighted by distance
• WER: the intuition is based on errors in unigram translation; TER: the intuition is based on the edits a human would require (TER = edits / avg. number of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall. Will BLEU be able to judge the quality of long sentences?

Reference translation: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation: लंबे वाक्य
Lambe waakya
ROUGE-2: 1/8 (only one of the reference's eight bigrams is recalled)
Modified bigram precision: 1
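A minimal sketch of ROUGE-N as n-gram recall (our illustration of the idea in [6], not the official package):

    from collections import Counter

    def rouge_n(cand, ref, n):
        # Fraction of the reference's n-grams that the candidate recalls.
        grams = lambda t: Counter(tuple(t[i:i + n]) for i in range(len(t) - n + 1))
        cg, rg = grams(cand), grams(ref)
        return sum(min(c, cg[g]) for g, c in rg.items()) / sum(rg.values())

    ref = "kya bloo lambe waakya ki guNvatta ko samajh paaega".split()
    cand = "lambe waakya".split()
    print(rouge_n(cand, ref, 2))  # 1/8: one of the reference's 8 bigrams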
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length
• ROUGE-S: precision/recall by matching skip bigrams
ROUGE (suite of metrics) vs. BLEU
• Handling incorrect words — ROUGE: skip bigrams, ROUGE-N; BLEU: n-gram mismatch
• Handling incorrect word order — ROUGE: longest common subsequence; BLEU: n-gram mismatch
• Handling recall — ROUGE: ROUGE-N incorporates missing words; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy

METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process
Find an alignment between the words of the candidate and the reference:
• Phase I: list all possible mappings based on matches
• Phase II: select the best possible alignment, i.e. the one with the fewest alignment crosses/overlaps
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced / The intelligent and excited dude jumped
(Synonym matching lets guy/dude align with boy, and pounced with jumped.)
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean
• Penalty: group the matched unigrams into as few chunks as possible, e.g. when aligning "The bright boy sits on the black bench" with "The intelligent guy sat on the dark bench"
• More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
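Once Phases I/II have produced the alignment, the score itself is simple. A sketch of the scoring formulas from [7], assuming the alignment (m matched unigrams grouped into a number of chunks) is already computed:

    def meteor_score(m, cand_len, ref_len, chunks):
        # F-mean weights recall 9 times more than precision;
        # the penalty grows with the number of chunks.
        if m == 0:
            return 0.0
        p, r = m / cand_len, m / ref_len
        f_mean = 10 * p * r / (r + 9 * p)
        penalty = 0.5 * (chunks / m) ** 3
        return f_mean * (1 - penalty)

    # Bench example: 4 exact matches ('The', 'on', 'the', 'bench') in
    # 3 chunks; candidate and reference are both 8 words long.
    print(meteor_score(4, 8, 8, 3))  # ~0.39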
METEOR vs. BLEU
• Handling incorrect words — METEOR: alignment chunks; matching can be done using different techniques, so it is adaptable; BLEU: n-gram mismatch
• Handling incorrect word order — METEOR: chunks may be ordered in any manner; METEOR does not capture this; BLEU: n-gram mismatch
• Handling recall — METEOR: the idea of alignment incorporates missing-word handling; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
NIST Metric: Introduction [8]
• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e. rarer ones)
• Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given

NIST Metric: each matched n-gram is weighted by its information gain, estimated from reference-corpus counts.
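The information weight behind this (from [8]), with a toy unigram demo; the reference-corpus counts are assumed to be precomputed:

    import math
    from collections import Counter

    def info_weight(counts, ngram):
        # Info(w1..wn) = log2( count(w1..w_{n-1}) / count(w1..wn) ),
        # counted over the reference corpus. For unigrams the numerator
        # is the total number of words, so rare words score higher.
        return math.log2(counts[ngram[:-1]] / counts[ngram])

    words = "the cat sat on the mat near the door".split()
    counts = Counter({(): len(words)})
    for w in words:
        counts[(w,)] += 1
    print(info_weight(counts, ("the",)))   # log2(9/3) ~ 1.58: common, low info
    print(info_weight(counts, ("door",)))  # log2(9/1) ~ 3.17: rare, high info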
NIST vs. BLEU
• Handling incorrect words — NIST: information gain as an additional level of 'matching'; BLEU: n-gram mismatch
• Handling incorrect word order — NIST: no additional provision, n-gram mismatch (based on information gain); BLEU: n-gram mismatch
• Handling recall — NIST: a brevity penalty factor based on L_sys and L_ref; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
– What does a BLEU score of 0.006 mean?
• Comparison across systems is easier

GTM scores: precision and recall are computed from the MMS (Maximum Match Size)
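Given the MMS, the score is a plain F-measure. A sketch (computing the MMS itself, via a maximum matching over word runs as in [10], is omitted here):

    def gtm_fscore(mms, cand_len, ref_len):
        # Precision = MMS / |candidate|, Recall = MMS / |reference|,
        # combined by the harmonic mean.
        p, r = mms / cand_len, mms / ref_len
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    # e.g. a match of size 5 between an 8-word candidate and a 10-word reference
    print(gtm_fscore(5, 8, 10))  # ~0.56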
GTM vs. BLEU
• Handling incorrect words — GTM: precision based on maximum match size; BLEU: n-gram mismatch
• Handling incorrect word order — GTM: by considering maximum runs; BLEU: n-gram mismatch
• Handling recall — GTM: recall based on maximum match size; BLEU: precision cannot detect 'missing' words, hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (ref): मुझे भूख लग रही है
mujhe bhookh lag rahi hai
Candidate: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
BLEU score: 0, yet the candidate is clearly entailed by the reference translation.

Entailment-based evaluation [12]
• Evaluation of MT output for adequacy is an entailment task
• A candidate translation (i.e. the MT system output) should entail the reference translation, and vice versa
The Stanford RTE system
Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)

RTE for MT evaluation
The same pipeline, with one change: directionality. The standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
Experimental setup: a single reference translation; two MT dataset families with English as the target language.
NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental setup (contd.)
Four scoring systems are compared: BLEU-4, MT, RTE, and RTE + MT.
• Experiment 1: individual corpora
• Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by
– considering the constituents of the formula
– calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences exist in English–Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: errors in words, based on word order; also PER
• TER: based on the edits required to obtain the reference translation; also human TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: More blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Validation of the range of BLEU
Write BLEU = BP · e^A, with A = Σₙ wₙ log pₙ:
• pₙ: between 0 and 1
• log pₙ: between −∞ and 0
• A: between −∞ and 0
• e^A: between 0 and 1
• BP: between 0 and 1
Hence BLEU itself lies between 0 and 1.
[Graph of the brevity penalty as a function of r/c, drawn using www.fooplot.com]
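For reference, the quantities being bounded are those of the standard BLEU formula [2]:

    \[
    \mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big),
    \qquad
    \mathrm{BP} =
    \begin{cases}
      1 & \text{if } c > r \\
      e^{\,1 - r/c} & \text{if } c \le r
    \end{cases}
    \]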
Calculating BLEU: An actual example
Ref: भावनात्मक एकीकरण लाने के लिए एक त्योहार
bhaavanaatmak ekikaran laane ke liye ek tyohaar
emotional unity bring-to a festival
A festival to bring emotional unity

C1: लाने के लिए एक त्योहार भावनात्मक एकीकरण
laane ke liye ek tyohaar bhaavanaatmak ekikaran
bring-to a festival emotional unity (invalid Hindi ordering)

C2: को एक उत्सव के बारे में लाना भावनात्मक एकीकरण
ko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (invalid Hindi ordering)
Modified n-gram precision
C1 (r = 7, c = 7):
n | Matching n-grams | Total n-grams | Modified n-gram precision
1 | 7 | 7 | 1
2 | 5 | 6 | 0.83

C2 (r = 7, c = 9):
n | Matching n-grams | Total n-grams | Modified n-gram precision
1 | 4 | 9 | 0.44
2 | 1 | 8 | 0.125
Calculating the BLEU score (wₙ = 1/2, N = 2, r = 7)

C1 (c = 7, BP = exp(1 − 7/7) = 1):
n | wₙ | pₙ | log pₙ | wₙ · log pₙ
1 | 1/2 | 1 | 0 | 0
2 | 1/2 | 0.83 | −0.186 | −0.093
Total: −0.093 → BLEU = 1 · e^(−0.093) = 0.911

C2 (c = 9, BP = 1):
n | wₙ | pₙ | log pₙ | wₙ · log pₙ
1 | 1/2 | 0.44 | −0.821 | −0.4105
2 | 1/2 | 0.125 | −2.07 | −1.035
Total: −1.445 → BLEU = 1 · e^(−1.445) = 0.235
Hence the BLEU scores:
Ref: भावनात्मक एकीकरण लाने के लिए एक त्योहार
C1: लाने के लिए एक त्योहार भावनात्मक एकीकरण → 0.911
C2: को एक उत्सव के बारे में लाना भावनात्मक एकीकरण → 0.235
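The arithmetic above can be checked with a few lines of Python (exact fractions give 0.913 and 0.236; the slide's 0.911 and 0.235 come from the rounded precisions):

    import math

    def bleu(precisions, weights, r, c):
        # r = reference length, c = candidate length;
        # BP = exp(1 - r/c) when the candidate is not longer than the reference.
        bp = 1.0 if c > r else math.exp(1 - r / c)
        return bp * math.exp(sum(w * math.log(p)
                                 for w, p in zip(weights, precisions)))

    print(bleu([7/7, 5/6], [0.5, 0.5], r=7, c=7))  # C1 -> ~0.913
    print(bleu([4/9, 1/8], [0.5, 0.5], r=7, c=9))  # C2 -> ~0.236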
BLEU vs. human judgment [2]
Setup: target language English, source language Chinese.
• Five systems perform the translation: 3 automatic MT systems, 2 human translators
• BLEU scores are obtained for each system
• Human judgment (on a scale of 5) is obtained for each system from two groups:
– Group 1: ten monolingual speakers of the target language (English)
– Group 2: ten bilingual speakers of Chinese and English
BLEU vs. human judgment: Results
• Monolingual speakers: correlation coefficient 0.99
• Bilingual speakers: correlation coefficient 0.96
[Graph from [2]: BLEU and human evaluation for S2 and S3]

Comparison of normalized values:
• High correlation between the monolingual group and the BLEU score
• The bilingual group was lenient on 'fluency' for H1
• The demarcation between S1–S3 and H1–H2 is captured by BLEU
[Graph from [2]]
Conclusion
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by
– considering the constituents of the formula
– calculating BLEU for a dummy example
• Compared BLEU with human judgment
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Calculating BLEU An actual example
Ref भावना मक एक करण लान क लए एक योहारbhaavanaatmak ekikaran laane ke liye ek tyohaaremotional unity bring-to a festivalA festival to bring emotional unity
C1 लान क लए एक योहार भावना मक एक करणlaane-ke-liye ek tyoaar bhaavanaatmak ekikaranbring-to a festival emotional unity (This is invalid Hindi ordering)
C2 को एक उ सव क बार म लाना भावना मक एक करणko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (This is invalid Hindi ordering)
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation: Example
Rohan goes to school by bus

Structural transfer (SVO to SOV):
(Rohan) (to school) (by bus) (goes)
Lexical transfer (word translation):
(Rohan) (paathshaala) (bus) (jaana)
Lexical transfer (adding GNP/TAM information):
(Rohan) (paathshaala) (bus se) (jaata hai)

Indicative output: Rohan paathshaala bus se jaata hai
(compare the more natural order: Rohan bus se paathshaala jaata hai)
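A toy sketch of these two steps (illustrative only; the lexicon entries and the indicative_translation helper are made up for this example, and real systems use parsing and morphology rather than word lists):

```python
# hypothetical word-for-word lexicon (lexical transfer); GNP/TAM information
# is folded into the entries here for simplicity
lexicon = {"Rohan": "Rohan", "school": "paathshaala",
           "by bus": "bus se", "goes": "jaata hai"}

def indicative_translation(subject, verb, obj, adjunct):
    # structural transfer: S V O Adjunct -> S O Adjunct V
    reordered = [subject, obj, adjunct, verb]
    # lexical transfer: dictionary lookup, word by word
    return " ".join(lexicon.get(word, word) for word in reordered)

print(indicative_translation("Rohan", "goes", "school", "by bus"))
# -> Rohan paathshaala bus se jaata hai
```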
Questions in context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is "failure of MT" (defined using any measure) simply "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• Change in lexical category.
(E: English source, H: human reference translation, I: indicative translation)

E: I am feeling hungry
H: मझ भख लग रह ह
   mujhe bhookh lag rahi hai
   (to-me hunger feeling is)
I: म भखा महसस कर रहा ह
   main bhookha mehsoos kar raha hoon
   (I hungry feel doing am)
Unigram precision: 0; Bigram precision: 0; BLEU score: 0

E: I am sweating
H: मझ पसीना आ रहा ह
   mujhe paseena aa raha hai
   (to-me sweat coming is)
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस
   aamir khaan ki dus sarvottam performances
   (Aamir Khan-of ten best performances)
I: दस सव तम आ मर ख़ान पफॉम सस
   dus sarvottam aamir khaan performances
   (ten best Aamir Khan performances)
Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences
E: Food, clothing and shelter are a man's basic needs.
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
   roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
   (bread clothing and house a man of basic needs are)
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह
   khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
   (food clothing and shelter a man of basic needs are)
Unigram precision: 8/10; Bigram precision: 6/9
• Similarly 'maamaa'/'taau': Hindi distinguishes the maternal uncle from the father's elder brother, while English has only 'uncle'.
Pleonastic divergence
• Words with no semantic content in the target language.

E: It is raining
H: बा रश हो रह ह
   baarish ho rahi hai
   (rain happening is)
I: यह बा रश हो रह ह
   yeh baarish ho rahi hai
   (it rain happening is)
Unigram precision: 4/5; Bigram precision: 3/4

E: One should not trouble the weak
H: दबल को परशान नह करना चा हए
   durbalon ko pareshaan nahi karna chahiye
   (to-weak trouble not do should)
   Should not trouble the weak.
Stylistic differences
• Typical styles in a language.

E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह
   lok sabha mein 545 sadasya hain
   (Lok Sabha in 545 members are)
I: लोक सभा क पास 545 सद य ह
   lok sabha ke paas 545 sadasya hain
   (Lok Sabha has/near 545 members are)
Unigram precision: 5/7; Bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses.

E: I purchased a bat
H: मन एक ब ला खर दा (reference)
   maine ek ballaa kharidaa
   (I a cricket-bat bought) I bought a cricket bat.
I: मन एक चमगादड़ खर दा
   maine ek chamgaadaD kharidaa
   (I a bat (mammal) bought) I bought a bat (the mammal).

E: The thieves were held
H: चोर को गर तार कया
   choron ko giraftaar kiyaa
   (thieves arrest done) The thieves were arrested.
I1: चोर को पकड़ा
   choron ko pakdaa
   (thieves caught) The thieves were caught. (acceptable synonym)
I2: चोर को आयोिजत कया
   choron ko aayojit kiyaa
   (thieves organized done) The thieves were organized. (incorrect sense)
Evaluation: Precision
• Acceptable translations are rejected by BLEU.
• Unacceptable translations are accepted by BLEU.
Future of BLEU
• Failed correlation with human judgment [3].
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms; allowing root forms of words; incorporating specific language divergences.
  – Limiting use of BLEU [3][4]: do not use BLEU for radically different systems; evaluate on the basis of the nature of indicative translation.
• Do not be overly reliant on BLEU; use it only as an 'evaluation understudy'.
Conclusion
• Permutation of phrases is possible [3].
• Different items may be drawn from reference sets [3].
• Linguistic divergences in English-Hindi MT [4].
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4].
Part III: Overview of MT Evaluation Metrics

Presented as a part of CS712 by
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
(Source: www.jaishreeganesha.com)

Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."

Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate translation(s): translation returned by an MT system; also called hypothesis translation.
• Reference translation(s): 'perfect' translation by humans.
• Each metric below is compared with BLEU on three axes: handling incorrect words, handling incorrect word order, and handling recall.
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of one to five (absolute).
2. Ranking translated sentences relative to each other (relative).
3. Ranking translations of syntactic constituents drawn from the source sentence (relative).
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate.
• Adequacy and fluency scores often correlate, which is undesirable.

Adequacy (is the meaning translated correctly?):
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency (is the sentence grammatically valid?):
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

म एक या यान बठा ह
Main ek vyaakhyan baitha hoon
(I a lecture sit; present, first person) "I sit a lecture"
Adequate but not fluent.

म या यान ह
Main vyakhyan hoon
(I lecture am) "I am lecture"
Fluent but not adequate.
BLEU [2]
• Proposed by IBM; a popular metric.
• Uses modified n-gram precision and a brevity penalty (for recall).
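A minimal sentence-level sketch of those two pieces (assuming uniform weights and, for brevity, the shortest reference length in the penalty; corpus-level BLEU aggregates counts over all sentences and uses the closest reference length):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        clip = Counter()                      # max reference count per n-gram
        for ref in references:
            for ng, c in ngrams(ref, n).items():
                clip[ng] = max(clip[ng], c)
        matched = sum(min(c, clip[ng]) for ng, c in cand.items())
        if matched == 0:
            return 0.0                        # one zero precision zeroes BLEU
        log_p += math.log(matched / sum(cand.values())) / max_n
    c = len(candidate)
    r = min(len(ref) for ref in references)
    brevity = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity * math.exp(log_p)

cand = "the king and the queen went to the jungle".split()
ref = "the king and the queen went to the jungle to hunt".split()
print(round(bleu(cand, [ref]), 3))   # perfect precisions, penalized for brevity
```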
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task.
• Central idea: the number of edits required to change a hypothesis translation into a reference translation.
Source: Prime Minister of India will address the nation today

Reference translation: भारत क धान-म ी आज रा को सबो धत करग
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge

Candidate translation: धान-म ी भारत क आज को अ स करग आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj

Edits needed: one shift, one deletion, one substitution, one insertion.
Formula for TER
TER = (# of edits) / (average # of reference words)
• The 'distance' of a shift is not incorporated in its cost.
• Mis-capitalization is also considered an error.
For the example above (1 shift + 1 deletion + 1 substitution + 1 insertion, 8 reference words):
TER = 4/8 = 0.5
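As a formula this is direct; a tiny sketch follows (the hard part of real TER, searching for the cheapest sequence of shifts, is assumed done and only the final arithmetic is shown):

```python
def ter(shifts, deletions, substitutions, insertions, ref_lengths):
    # every edit type costs 1; normalize by the average reference length
    edits = shifts + deletions + substitutions + insertions
    avg_ref_words = sum(ref_lengths) / len(ref_lengths)
    return edits / avg_ref_words

# the slide's example: 4 edits against one 8-word reference
print(ter(1, 1, 1, 1, [8]))   # -> 0.5
```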
HTER: Human TER
• Semi-supervised technique.
• Human annotators make a new reference for each translation based on the system output.
  – They must consider fluency and adequacy while generating this closest target translation.
TER vs. BLEU
• Handling incorrect words. TER: substitution. BLEU: n-gram mismatch.
• Handling incorrect word order. TER: a shift, or a delete + insert, incorporates this error. BLEU: n-gram mismatch.
• Handling recall. TER: missed words become deleted words. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
(TER = # of edits / average # of reference words)
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966).
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref).
• A position-independent variant also exists: position-independent word error rate (PER).
WER: Example
• Order of words is important.
• A dynamic-programming implementation finds the 'minimum' number of errors.

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence

# of errors = 3 (two substitutions: looks/seems, correct/right; one missing word: like), so WER = 3/6
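A sketch of that dynamic program (word-level Levenshtein; dividing by the reference length is the assumed normalization here):

```python
def wer(hyp, ref):
    """Word-level Levenshtein errors divided by reference length."""
    m, n = len(hyp), len(ref)
    # d[i][j] = min edits to turn hyp[:i] into ref[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete every hypothesis word
    for j in range(n + 1):
        d[0][j] = j                       # insert every reference word
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n] / n

print(wer("This seems the right sentence".split(),
          "This looks like the correct sentence".split()))   # 3 / 6 = 0.5
```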
WER vs. BLEU
• Handling incorrect words. WER: substitution. BLEU: n-gram mismatch.
• Handling incorrect word order. WER: delete + insert. BLEU: n-gram mismatch.
• Handling recall. WER: missed words become deleted words. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
WER vs. TER
• Shifts. WER: a shift is not taken into account (it costs a delete plus an insert). TER: a shift is incorporated, though its cost is not weighted by distance.
• Intuition. WER: based on word-level errors in the output. TER: based on the edits a human would have to make.
(TER = # of edits / average # of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation.
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S.
ROUGE-N
• ROUGE-N incorporates recall: the fraction of the reference's n-grams that appear in the candidate.
• Will BLEU be able to understand the quality of long sentences?

Reference translation: या ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation: लब वा य (lambe waakya)

ROUGE-N (bigrams): 1/8, while the modified bigram precision is 1
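A sketch of that recall computation (simplified to one pooled score over the references; the exact aggregation in [6] differs in details):

```python
from collections import Counter

def rouge_n(candidate, references, n=2):
    def ngrams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand = ngrams(candidate)
    matched = total = 0
    for ref in references:
        ref_ng = ngrams(ref)
        matched += sum(min(c, cand[ng]) for ng, c in ref_ng.items())
        total += sum(ref_ng.values())    # recall: denominator is the reference
    return matched / total if total else 0.0

ref = "kya bloo lambe waakya ki gunvatta ko samajh paaega".split()
print(rouge_n("lambe waakya".split(), [ref]))   # 1 of 8 reference bigrams -> 0.125
```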
Other ROUGEs
• ROUGE-L: considers the longest common subsequence.
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length.
• ROUGE-S: precision/recall by matching skip bigrams.
ROUGE (suite of metrics) vs. BLEU
• Handling incorrect words. ROUGE: skip bigrams, ROUGE-N. BLEU: n-gram mismatch.
• Handling incorrect word order. ROUGE: longest common subsequence. BLEU: n-gram mismatch.
• Handling recall. ROUGE: ROUGE-N incorporates missing words. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
METEOR [7]
• Aims to do better than BLEU.
• Central idea: have a good unigram matching strategy.

METEOR: Criticisms of BLEU
• The brevity penalty is punitive.
• Higher-order n-grams may not indicate the grammatical correctness of a sentence.
• BLEU is often zero; should a score ever be zero?
METEOR: Process
Phase I: find an alignment between words; list all possible mappings based on matches.
Phase II: select the best possible alignment, the one with the least number of alignment crosses/overlaps.
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Reference: The intelligent and excited boy jumped
Candidate: The intelligent and excited dude jumped
(a variant such as "The excited and intelligent guy pounced" yields crossing alignments)
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined into a (recall-weighted) harmonic mean.
• Penalty: find 'as many chunks' as are needed to cover the matches.
Reference: The bright boy sits on the black bench
Candidate: The intelligent guy sat on the dark bench
More accurate means fewer chunks and less penalty; less accurate means more chunks and more penalty.
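A sketch of that scoring arithmetic, following the formulas in [7]; the alignment counts are assumed to be computed already (the matcher itself, with stemming and synonyms, is the larger piece):

```python
def meteor_score(matches, cand_len, ref_len, chunks):
    """Score from alignment counts, per the formulas in [7]."""
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    # harmonic mean weighted 9:1 towards recall
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # fewer chunks covering the matches -> smaller penalty
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# e.g. 6 matched unigrams in an 8-word candidate and reference, in 3 chunks
print(round(meteor_score(6, 8, 8, 3), 3))   # -> 0.703
```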
METEOR vs. BLEU
• Handling incorrect words. METEOR: alignment chunks; matching can be done using different techniques, so it is adaptable. BLEU: n-gram mismatch.
• Handling incorrect word order. METEOR: chunks may be ordered in any manner; METEOR does not capture this. BLEU: n-gram mismatch.
• Handling recall. METEOR: the idea of alignment incorporates missing-word handling. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology.
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones): matching 'Amendment Act' is better than matching 'of the'.
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given.
NIST Metric
The weight of an n-gram is its information value, computed over the reference translations [8]:
info(w1 ... wn) = log2( count(w1 ... wn-1) / count(w1 ... wn) )
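A sketch of that weight computation (the ngram_counts helper is made up for this example; counts are taken over the reference corpus, and the empty tuple stores the total word count used for unigrams):

```python
import math
from collections import Counter

def ngram_counts(references, max_n=5):
    counts = Counter()
    for ref in references:
        for n in range(1, max_n + 1):
            for i in range(len(ref) - n + 1):
                counts[tuple(ref[i:i + n])] += 1
    counts[()] = sum(len(ref) for ref in references)   # '0-gram' = total words
    return counts

def info(ngram, counts):
    # info(w1..wn) = log2(count of the (n-1)-gram prefix / count of the n-gram)
    if not counts[ngram]:
        return 0.0
    return math.log2(counts[ngram[:-1]] / counts[ngram])

refs = ["the act of the house of the people amendment act".split()]
c = ngram_counts(refs)
print(round(info(("the",), c), 2), round(info(("amendment",), c), 2))
# the frequent word carries less information (weight) than the rare one
```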
NIST vs. BLEU
• Handling incorrect words. NIST: information gain adds a level to 'matching'. BLEU: n-gram mismatch.
• Handling incorrect word order. NIST: no additional provision; n-gram mismatch (weighted by information gain). BLEU: n-gram mismatch.
• Handling recall. NIST: a brevity penalty factor based on Lsys and Lref. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
GTM [10]
• General Text Matcher.
• An F-score: uses precision and recall.
• Does not rely on correlation with 'human judgment':
  – What does a BLEU score of 0.006 mean?
• Comparison across systems is easier.

GTM: Scores, Precision and Recall
• MMS: Maximum Match Size, the size of a maximum matching between candidate and reference words; precision divides it by the candidate length, recall by the reference length.
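A hedged sketch of this scoring (the run-finding here is greedy, longest run first, which only approximates the true maximum matching of [10]; an exponent e > 1 rewards longer contiguous runs):

```python
def gtm_f1(candidate, reference, e=1.0):
    cand, ref = list(candidate), list(reference)
    runs = []
    while True:
        best = None
        for i in range(len(cand)):            # longest common contiguous run
            for j in range(len(ref)):
                k = 0
                while (i + k < len(cand) and j + k < len(ref)
                       and cand[i + k] is not None
                       and cand[i + k] == ref[j + k]):
                    k += 1
                if k and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            break
        k, i, j = best
        runs.append(k)
        cand[i:i + k] = [None] * k            # consume matched words
        ref[j:j + k] = [None] * k
    mms = sum(r ** e for r in runs) ** (1 / e)
    p, r = mms / len(candidate), mms / len(reference)
    return 2 * p * r / (p + r) if p + r else 0.0

print(gtm_f1("the king went to the jungle".split(),
             "the king and queen went to the jungle".split(), e=2.0))
```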
GTM vs. BLEU
• Handling incorrect words. GTM: precision based on maximum match size. BLEU: n-gram mismatch.
• Handling incorrect word order. GTM: by considering maximum runs. BLEU: n-gram mismatch.
• Handling recall. GTM: recall based on maximum match size. BLEU: precision cannot detect 'missing' words, hence the brevity penalty.
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मझ भख लग रह ह
   mujhe bhookh lag rahi hai
Candidate: म भखा महसस कर रहा ह
   main bhookha mehsoos kar raha hoon
BLEU score: 0, yet clearly the candidate is entailed by the reference translation.

Evaluation of MT output for adequacy is an entailment task:
a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system (a three-stage pipeline):
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases.
3. Entailment: predicting whether the entailment holds or not.

RTE for MT evaluation uses the same pipeline, with one change.
Directionality: the standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
Experimental Setup
• Single reference translation; two MT datasets with English as the target language.
• NIST MT datasets:
  – NIST06: the NIST MT-06 Arabic-to-English dataset
  – NIST08A: the NIST MT-08 Arabic-to-English dataset
  – NIST08C: the NIST MT-08 Chinese-to-English dataset
  – NIST08U: the NIST MT-08 Urdu-to-English dataset
• ACL SMT datasets:
  – SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
  – SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
  – SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
  – SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
• Four systems are used: BLEU-4, MT, RTE, and RTE + MT.
• Experiment 1: individual corpora.
• Experiment 2: combined corpora.
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods.
• Formulated the BLEU score.
• Analyzed the BLEU score by:
  – considering the constituents of the formula;
  – calculating BLEU for a dummy example.
• Compared BLEU with human judgment.
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3].
• Different items may be drawn from reference sets [3].
• Linguistic divergences in English-Hindi MT [4].
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4].
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size.
• NIST: information gain-based; comparatively weights content and function words.
• METEOR: considers alignments; allows the addition of different types of matching.
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words based on word order; also PER.
• TER: based on the edits required to get the reference translation; also Human TER.
• Manual evaluation: basic techniques and bottlenecks.
• Entailment-based evaluation: the reference translation must entail the candidate translation and vice versa.
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of BLEU in Machine Translation Research. EACL 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
Modified n-gram precision
Ref भावना मक एक करण लान क लए एक योहारC1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 7 7 1
2 5 6 083
lsquonrsquo Matching n-grams
Total n-grams Modified n-gram Precision
1 4 9 044
2 1 8 0125
r 7c 7
r 7c 9
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Calculating BLEU scoren wn pn log pn wn
log pn
1 1 2 1 0 0
2 1 2 083 -0186 -0093
Total -0093
BLEU 0911
wn= 1 N = 1 2r = 7
C1c = 7BP = exp(1 ndash 77) = 1
C2c = 9BP = 1
n wn pn log pn wn log pn
1 1 2 044 -0821 04105
2 1 2 0125 -207 -1035
Total -1445
BLEU 0235
C1
C2
Hence the BLEU scores
C1 लान क लए एक योहार भावना मक एक करण
C2 को एक उ सव क बार म लाना भावना मक एक करण
0911
0235
Ref भावना मक एक करण लान क लए एक योहार
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
• Permutation of phrases is possible without changing the BLEU score [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS712 by
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
Source: www.jaishreeganesha.com
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Preliminaries
• Candidate Translation(s): translation returned by an MT system; also called hypothesis translation
• Reference Translation(s): 'perfect' translation by humans
• Each metric below is compared with BLEU on three axes: handling incorrect words, handling incorrect word order, and handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (Absolute)
2. Ranking translated sentences relative to each other (Relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation: assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable
Adequacy: Is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency: Is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible
म एक या यान बठा ह
Main ek vyaakhyan baitha hoon
I a lecture sit (Present, first person)
I sit a lecture
Adequate but not fluent
म या यान ह
Main vyakhyan hoon
I lecture am
I am lecture
Fluent but not adequate
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
Central idea: the number of edits required to change a hypothesis translation into a reference translation
E: Prime Minister of India will address the nation today
Reference translation: भारत क धान-म ी आज रा को सबो धत करग
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: धान-म ी भारत क आज को अ स करग आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj
The candidate needs four edits: one shift, one deletion, one substitution, one insertion.
Formula for TER:
TER = # Edits / Avg. number of reference words
• Cost of shift 'distance' is not incorporated
• Mis-capitalization is also considered an error
Here, TER = 4 / 8 = 0.5
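To make the formula concrete, here is a minimal word-level sketch in Python. It is not the official implementation: it deliberately ignores shifts (the real TER searches over block moves), so it can overestimate TER when a shift would be cheaper than a delete plus an insert. All names are illustrative.

    def word_edits(hyp, ref):
        # Word-level Levenshtein distance: substitutions, deletions, insertions.
        m, n = len(hyp), len(ref)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if hyp[i - 1] == ref[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # drop a hypothesis word
                              d[i][j - 1] + 1,         # insert a reference word
                              d[i - 1][j - 1] + cost)  # match / substitution
        return d[m][n]

    def ter(hyp, references):
        # TER = edits (to the closest reference) / average reference length.
        edits = min(word_edits(hyp, ref) for ref in references)
        avg_len = sum(len(r) for r in references) / len(references)
        return edits / avg_len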
HTER: Human-targeted TER
• A semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
- They must consider fluency and adequacy while generating this closest target translation
TER vs BLEU
• Handling incorrect words: TER uses substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: TER uses a shift, or delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in TER, missed words become deleted words (TER = # Edits / Avg. number of reference words); BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions needed to convert the generated text (hyp) into the reference text (ref)
• A variant: position-independent word error rate (PER)
WER: Example
• Order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors
Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
Number of errors: 3 (two substitutions, one deletion)
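A compact sketch of WER, plus one common formulation of the position-independent variant PER; on the pair above it returns 3/6 = 0.5.

    from collections import Counter

    def wer(hyp, ref):
        # Levenshtein distance over words, normalized by reference length.
        prev = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            cur = [i]
            for j, r in enumerate(ref, 1):
                cur.append(min(prev[j] + 1,              # extra hypothesis word
                               cur[j - 1] + 1,           # missing reference word
                               prev[j - 1] + (h != r)))  # match / substitution
            prev = cur
        return prev[-1] / len(ref)

    def per(hyp, ref):
        # Position-independent variant: compare bags of words, ignore order.
        overlap = sum((Counter(hyp) & Counter(ref)).values())
        return (max(len(hyp), len(ref)) - overlap) / len(ref)

    ref = "This looks like the correct sentence".split()
    hyp = "This seems the right sentence".split()
    print(wer(hyp, ref))  # 0.5  (3 errors / 6 reference words)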
WER vs BLEU
• Handling incorrect words: WER uses substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: WER uses delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
WER vs TER
• Shifts: WER does not take shifts into account; TER incorporates them, though not weighted by distance
• Intuition: WER is based on word-level errors; TER is based on the edits a human would have to make (TER = # Edits / Avg. number of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall. Will BLEU be able to judge the quality of long sentences from a short candidate?
Reference translation: या ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation: लब वा य
Lambe vaakya
Only 1 of the reference's 8 bigrams is recalled, so ROUGE-N (N = 2) = 1/8, while the modified bigram precision is 1/1 = 1.
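A minimal ROUGE-N sketch (clipped n-gram overlap, recall against the reference); on the example above, rouge_n with n = 2 gives 1/8 even though the candidate's bigram precision is 1.

    from collections import Counter

    def ngram_counts(words, n):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    def rouge_n(candidate, reference, n=2):
        # Recall-oriented: matched n-grams / n-grams in the reference.
        ref, cand = ngram_counts(reference, n), ngram_counts(candidate, n)
        matched = sum((ref & cand).values())  # clipped overlap
        return matched / max(sum(ref.values()), 1)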
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length
• ROUGE-S: precision/recall over matching skip bigrams (see the sketch below)
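A sketch of the skip-bigram idea behind ROUGE-S: any in-order pair of words counts as a bigram, regardless of the gap between them (duplicate pairs are collapsed here for brevity).

    from itertools import combinations

    def skip_bigrams(words):
        # All ordered pairs (w_i, w_j) with i < j, any number of words between.
        return set(combinations(words, 2))

    def rouge_s(candidate, reference):
        match = len(skip_bigrams(candidate) & skip_bigrams(reference))
        p = match / max(len(skip_bigrams(candidate)), 1)
        r = match / max(len(skip_bigrams(reference)), 1)
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)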
ROUGE (suite of metrics) vs BLEU
• Handling incorrect words: ROUGE uses skip bigrams and ROUGE-N; BLEU sees an n-gram mismatch
• Handling incorrect word order: ROUGE uses the longest common subsequence; BLEU sees an n-gram mismatch
• Handling recall: ROUGE-N incorporates missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
METEOR [7]
Aims to do better than BLEU.
Central idea: have a good unigram matching strategy.
METEOR: criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score be zero?
METEOR: Process
• Phase I: find an alignment between words; list all possible unigram mappings based on matches
• Phase II: select the best possible alignment, i.e., the one with the least number of alignment crosses/overlaps
• Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.
Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced / The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated, then combined via a harmonic mean
• Penalty: group the matched unigrams into as few contiguous chunks as possible, e.g.:
The bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -> fewer chunks -> less penalty
Less accurate -> more chunks -> more penalty
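The published scoring (Banerjee and Lavie, 2005) combines unigram precision and recall into a recall-weighted harmonic mean and discounts it by a fragmentation penalty. This sketch assumes the alignment phase has already produced the match and chunk counts.

    def meteor(matches, hyp_len, ref_len, chunks):
        # matches: aligned unigrams; chunks: contiguous runs of aligned unigrams.
        if matches == 0:
            return 0.0
        p, r = matches / hyp_len, matches / ref_len
        f_mean = 10 * p * r / (r + 9 * p)        # recall weighted 9:1
        penalty = 0.5 * (chunks / matches) ** 3  # more fragmentation, bigger penalty
        return f_mean * (1 - penalty)

    # Same 6 matched unigrams, once in 1 chunk and once scattered over 6 chunks:
    print(meteor(6, 8, 8, 1), meteor(6, 8, 8, 6))  # the fragmented one scores lower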
METEOR vs BLEU
• Handling incorrect words: METEOR uses alignment chunks, and matching can be done using different techniques, so it is adaptable; BLEU sees an n-gram mismatch
• Handling incorrect word order: chunks may be ordered in any manner, and METEOR does not capture this; BLEU sees an n-gram mismatch
• Handling recall: METEOR's idea of alignment incorporates missing-word handling; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
NIST Metric: Introduction [8]
• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric (formula in [8])
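A sketch of the information weights at the heart of NIST [8]: an n-gram's weight is the log-ratio of how often its (n-1)-gram history occurs to how often the full n-gram occurs in the reference data, so rare continuations like 'Amendment Act' earn more credit than 'of the'. The function names are illustrative.

    import math
    from collections import Counter

    def build_info(reference_corpus, max_n=5):
        # reference_corpus: list of tokenized reference sentences.
        counts, total = Counter(), 0
        for sent in reference_corpus:
            total += len(sent)
            for n in range(1, max_n + 1):
                for i in range(len(sent) - n + 1):
                    counts[tuple(sent[i:i + n])] += 1

        def info(ngram):
            # Info(w1..wn) = log2( count(w1..w_{n-1}) / count(w1..wn) )
            if counts[ngram] == 0:
                return 0.0
            history = counts[ngram[:-1]] if len(ngram) > 1 else total
            return math.log2(history / counts[ngram])
        return info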
NIST vs BLEU
• Handling incorrect words: NIST's information gain gives an additional level of 'matching'; BLEU sees an n-gram mismatch
• Handling incorrect word order: NIST has no additional provision, just n-gram mismatch (weighted by information gain); BLEU sees an n-gram mismatch
• Handling recall: NIST uses a brevity penalty factor based on L_sys and L_ref; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment' (what does a BLEU score of 0.006 mean, anyway?)
• Comparison across systems is easier
GTM scores: precision and recall are both derived from the MMS (Maximum Match Size)
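A rough sketch of the GTM idea: greedily pull out the longest common contiguous runs of words, size the match as the sum of run lengths raised to an exponent (e = 2 rewards longer runs), and report the F-score. The real tool computes the maximum matching more carefully; this greedy version only approximates it.

    from math import sqrt

    def common_runs(cand, ref):
        # Greedy: repeatedly remove the longest common contiguous run of words.
        cand, ref = list(cand), list(ref)
        runs = []
        while True:
            best_len, best_ci, best_ri = 0, -1, -1
            for ci in range(len(cand)):
                for ri in range(len(ref)):
                    l = 0
                    while (ci + l < len(cand) and ri + l < len(ref)
                           and cand[ci + l] is not None
                           and cand[ci + l] == ref[ri + l]):
                        l += 1
                    if l > best_len:
                        best_len, best_ci, best_ri = l, ci, ri
            if best_len == 0:
                return runs
            runs.append(best_len)
            cand[best_ci:best_ci + best_len] = [None] * best_len  # mark as used
            ref[best_ri:best_ri + best_len] = [None] * best_len

    def gtm(cand, ref, e=2):
        mms = sum(l ** e for l in common_runs(cand, ref)) ** (1 / e)
        p, r = mms / max(len(cand), 1), mms / max(len(ref), 1)
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)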
GTM vs BLEU
• Handling incorrect words: GTM's precision is based on maximum match size; BLEU sees an n-gram mismatch
• Handling incorrect word order: GTM handles it by considering maximum runs; BLEU sees an n-gram mismatch
• Handling recall: GTM's recall is based on maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मझ भख लग रह ह
mujhe bhookh lag rahi hai
Candidate: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon
BLEU score: 0, yet the candidate is clearly entailed by the reference translation.
Evaluating MT output for adequacy is treated as an entailment task: a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
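A sketch of the bidirectional check. Here entails() is only a stand-in stub based on lexical coverage; in the actual setup it would be the Stanford RTE system's trained classifier, and the threshold is purely illustrative.

    def entails(text, hypothesis, threshold=0.75):
        # Stub for a real RTE classifier: crude lexical coverage of the
        # hypothesis by the text. The real pipeline uses parsing, NER,
        # word/phrase alignment and a trained entailment model.
        t = set(text.lower().split())
        h = set(hypothesis.lower().split())
        return len(t & h) / max(len(h), 1) >= threshold

    def fully_adequate(candidate, reference):
        # MT adequacy as entailment: must hold in BOTH directions.
        return entails(candidate, reference) and entails(reference, candidate)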
The Stanford RTE system (pipeline):
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
RTE for MT evaluation uses the same pipeline, with one change in directionality: the standard entailment recognition task is asymmetric, but for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
Experimental setup: a single reference translation; two MT datasets with English as the target language.
NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Four systems are compared: BLEU-4, MT, RTE, and RTE + MT.
Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
- considering the constituents of the formula
- calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible without changing the score [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information-gain based; comparatively weights content and function words
• METEOR: considers alignments; allows addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: errors in words, based on word order; also PER
• TER: based on the edits required to reach the reference translation; also Human-targeted TER
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in Machine Translation Research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Hence, the BLEU scores:
Ref: भावना मक एक करण लान क लए एक योहार
C1: लान क लए एक योहार भावना मक एक करण (BLEU: 0.911)
C2: को एक उ सव क बार म लाना भावना मक एक करण (BLEU: 0.235)
BLEU vs human judgment [2]
Setup: target language English, source language Chinese.
Five systems perform translation: 3 automatic MT systems and 2 human translators.
BLEU scores are obtained for each system.
Human judgments (on a scale of 5) are obtained for each system from two groups:
• Group 1: ten monolingual speakers of the target language (English)
• Group 2: ten bilingual speakers of Chinese and English
BLEU vs human judgment:
• Monolingual speakers: correlation coefficient 0.99
• Bilingual speakers: correlation coefficient 0.96
[Graph: BLEU and human evaluation for S2 and S3, from [2]]
Comparison of normalized values:
• High correlation between the monolingual group and the BLEU score
• The bilingual group was lenient on 'fluency' for H1
• The demarcation between S1-S3 and H1-H2 is captured by BLEU
[Graph from [2]]
Conclusion
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
- considering the constituents of the formula
- calculating BLEU for a dummy example
• Compared BLEU with human judgment
Part II: Shortcomings of BLEU (continued from 7th March 2013)
Presented by (as a part of CS 712):
Aditya Joshi, Kashyap Popat, Shubham Gautam
IIT Bombay
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
BLEU score revisited
Precision -> modified n-gram precision; Recall -> brevity penalty
Formula from [2]:
BLEU = BP * exp( sum_{n=1..N} w_n * log p_n ), where BP = 1 if c > r, else e^(1 - r/c)
(p_n: modified n-gram precision; w_n: weights, set to 1/N; c: candidate length; r: effective reference length)
Outline
Part I:
• Evaluation
• Formulating the BLEU metric
• Understanding the BLEU formula
Part II:
• Shortcomings of BLEU in general (Callison-Burch et al., EACL 2006) [3]
• Shortcomings in the context of English-Hindi MT (Ananthakrishnan et al., ICON 2007) [4]
[A domesticated bird wants to learn to fly]
"I can do this. I just have to work out the physics. I have quadrated my vector angles. I have adjusted for wind shear.
This is it. Let's fly. Just keep it simple. Thrust, lift, drag and wait. Thrust, lift, drag, wait. [speeds up to get to the edge of the table] Thrust, lift, drag, wai.. wai.. wait! [gets scared and tries to stop himself, but instead falls from the table]"
- Blu the Macaw, lines from Rio (2011)
Image source: www.fanpop.com
Use of the BLEU metric [3]
• Evaluating incremental system changes, often neglecting actual examples
• Systems ranked on the basis of BLEU
• Does minimizing the error rate with respect to BLEU indeed guarantee translation improvements?
Criticisms of BLEU [4]
• Admits too much variation [3]
• Admits too little variation [4] [3]
• Poor correlation with human judgment [3]
Admits too much variation [3]
• Permuting phrases
• Drawing different items from the reference set
Why: BLEU relies on n-gram matching only, and puts very few constraints on how n-gram matches can be drawn from multiple reference translations.
Recall the components of the BLEU formula [2]: the brevity penalty (incorporating recall), the modified n-gram precisions p_n, and the weights w_n, set to 1/N.
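Putting the pieces together, a minimal sketch of how the components combine, with uniform weights w_n = 1/N as in [2]:

    import math

    def bleu(precisions, cand_len, ref_len):
        # precisions: modified n-gram precisions p_1..p_N (see the next slide).
        if min(precisions) == 0:
            return 0.0  # the geometric mean collapses if any p_n is zero
        bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
        n = len(precisions)
        return bp * math.exp(sum(math.log(p) for p in precisions) / n)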
Modified n-gram precision in BLEU [2]
• Clip the total count of each candidate word by its maximum reference count:
Count_clip(n-gram) = min(count, max_ref_count)
Reference: मझपर तरा स र छाया
mujhpar teraa suroor chhaaya
me-on your spell has-been-cast
Your spell has been cast on me
Candidate 2: तरा तरा तरा स र
tera tera tera suroor
your your your spell
Your your your spell
Matching unigrams: (तरा: min(3, 1) = 1), (स र: min(1, 1) = 1)
Modified unigram precision: 2/4 = 0.5
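The clipping rule in code, a small sketch reproducing the example above (transliterations normalized so the repeated word matches):

    from collections import Counter

    def modified_precision(candidate, references, n=1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        total, clipped = sum(cand.values()), 0
        for ngram, count in cand.items():
            max_ref = max(sum(1 for i in range(len(ref) - n + 1)
                              if tuple(ref[i:i + n]) == ngram)
                          for ref in references)
            clipped += min(count, max_ref)  # Count_clip = min(count, max_ref_count)
        return clipped / max(total, 1)

    cand = "tera tera tera suroor".split()
    refs = ["mujhpar tera suroor chhaaya".split()]
    print(modified_precision(cand, refs))  # 0.5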
Permuting phrases [3]
• Reordering of unmatched phrases does not affect precision
• Bigram mismatch sites can be freely permuted
• It is thus possible to randomly produce other hypothesis translations that have the same BLEU score, e.g.:
B1 B2 B3 B4 -> B4 B2 B1 B3 (B_i: bigram-matched chunks; the boundaries between them are bigram mismatches)
Issues with precision (1/2)
E: The king and the queen went to the jungle to hunt
Reference 1: राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gaye
King and queen to-jungle for-hunting went
Reference 2: राजा और उनक बीवी शकार करन जगल गय
raaja aur unki biwi shikaar karne jangal gaye
king and his wife to-do-hunting jungle went
Candidate: राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gaye
King and queen to-do-hunting to-jungle went
Matching bigrams = 4/8
Candidate (permuted): राजा और रानी शकार करन जगल गय म
raaja aur raani shikaar karne gaye jungle mein
King and queen to-do-hunting went jungle to (grammatically incorrect)
Matching bigrams = 4/8
Issues with precision (2/2)
Same references and candidate as above (matching bigrams = 4/8). Another permutation:
Candidate (permuted): शकार करन जगल राजा और रानी म गय
shikaar karne jungle raaja aur raani mein gaye
to-do-hunting jungle king and queen in went (grammatically incorrect)
Matching bigrams = 4/8
Permuting phrases in general
• For 'b' bigram matches in a candidate translation of length 'k', there are (k - b)! possible ways to generate similarly scored items using only the words in this translation.
• In our example: (8 - 4)! = 24 candidate translations.
Derivation: in a sentence of length k,
total bigrams = k - 1
matched bigrams = b
number of mismatched bigrams = k - 1 - b
number of matched chunks = (k - 1 - b) + 1 = k - b
These (k - b) chunks can be reordered in (k - b)! ways.
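A small check of the counting argument, with hypothetical words: a candidate of k = 8 words with b = 4 matched bigrams splits into k - b = 4 chunks, and all (8 - 4)! = 24 chunk orderings keep the bigram-match count unchanged (the reference here is chosen so that no chunk junction accidentally forms a reference bigram).

    import math
    from itertools import permutations

    ref = "a b q c d r e f s g h".split()  # hypothetical reference
    ref_bigrams = {tuple(ref[i:i + 2]) for i in range(len(ref) - 1)}

    def matched_bigrams(words):
        return sum(tuple(words[i:i + 2]) in ref_bigrams
                   for i in range(len(words) - 1))

    # Candidate "a b c d e f g h": k = 8, b = 4 matched bigrams,
    # so it splits into k - b = 4 chunks at the bigram-mismatch sites.
    chunks = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
    scores = {matched_bigrams([w for ch in perm for w in ch])
              for perm in permutations(chunks)}
    print(math.factorial(len(chunks)), scores)  # 24 orderings -> {4}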
Permuting phrases: Evaluation
• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translations
Drawing different items from the reference set
• If two systems 'recall' x words each from a reference sentence:
• precision may remain the same, and the brevity penalty remains the same,
• yet the translation quality need not remain the same.
Drawing different items: Example
You may omit lexical items altogether.
Reference: घर क मग मतलब दाल होती ह
Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like is
(Chicken at home is like daal)
Candidate 1: मग मतलब होती ह
Murgi matlab hoti hai
Chicken means is
Unigram precision: 1; Bigram precision: 2/3
Candidate 2: घर क मग दाल
Ghar ki murgi daal
House-of chicken daal
Unigram precision: 1; Bigram precision: 2/3
Drawing different items: In general
• If there are x unigram matches and y bigram matches in both candidate C1 and candidate C2:
- the BLEU score remains the same
- which words are actually matched is not accounted for
- synonyms also count as a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT evaluation exercise
• Failure to correlate with human judgment: in the 2005 exercise, the system ranked 1st by human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]
• The system with the high human judgment score is an outlier for BLEU:
[Graph: adequacy vs BLEU]
Pearson correlation (R²) with adequacy: 0.14 with the outlier included, 0.87 with it excluded
[Graph: fluency vs BLEU]
Pearson correlation (R²) with fluency: 0.002 with the outlier included, 0.742 with it excluded
Failures in practice: Evaluation 2 (1/2) [3]
• French-English translation
• System 1: a rule-based system using Systran
• Systems 2 and 3: SMT systems trained on two sets of Europarl data (28 million words per language)
• 300 test sentences
Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates the human judgment of the rule-based system
[Graph from [3]]
Admits too much and too little variation [4]
• Indicative translations
• Linguistic divergences and evaluation
Indicative translation
• Draft-quality translations, meant for assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi):
• Structural transfer: S-V-O to S-O-V
• Lexical transfer: looking up the corresponding word; adding gender, aspect, tense, etc.
Indicative translation example: "Rohan goes to school by bus"
Structural transfer (SVO to SOV): (Rohan) (to school) (by bus) (goes)
Lexical transfer, word translation: (Rohan) (paathshaala) (bus) (jaana)
Lexical transfer, adding GNP/TAM information: (Rohan) (paathshaala) (bus se) (jaata hai)
Result: Rohan paathshaala bus se jaata hai / Rohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
BLEU vs human BLEU vs human judgment judgment [2][2]
Target language EnglishSource language Chinese
Setup
BLEU scores obtained for each system
Five systems perform translation3 automatic MT systems2 human translators
Human judgment (on scale of 5) obtained for each system
bull Group 1 Ten Monolingual speakers of target language (English)
bull Group 2 Ten Bilingual speakers of Chinese and English
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs. BLEU
- Handling incorrect words: WER uses substitution; BLEU sees an n-gram mismatch
- Handling incorrect word order: WER uses delete + insert; BLEU sees an n-gram mismatch
- Handling recall: in WER, missed words become deleted words; in BLEU, precision cannot detect 'missing' words, hence the brevity penalty

WER vs. TER
- WER does not take shifts into account; TER incorporates shifts, although they are not weighted by distance
- WER's intuition is based on word-level errors; TER's intuition is based on the edits a human would require (TER = edits / avg. number of reference words)
Outline
- Manual Evaluation
- Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
ROUGE [6]
- Recall-Oriented Understudy for Gisting Evaluation
- ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall. Would BLEU be able to judge the quality of this translation of a long sentence?

Reference translation: या ल लब वा य क गणव ता को समझ पाएगा
(Kya bloo lambe waakya ki guNvatta ko samajh paaega)
Candidate translation: लब वा य (Lambe vaakya)

For N = 2, the candidate's single bigram occurs in the reference, so the modified bigram precision is 1/1 = 1; but the reference has 8 bigrams, of which only 1 is matched, so ROUGE-2 = 1/8.
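A small sketch of ROUGE-N as n-gram recall against a single reference (illustrative only, not the official package from [6]):

```python
# ROUGE-N: what fraction of the reference's n-grams appear in the candidate?
from collections import Counter

def ngram_counts(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference: list[str], candidate: list[str], n: int) -> float:
    ref, cand = ngram_counts(reference, n), ngram_counts(candidate, n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)

ref = "kya bloo lambe waakya ki gunvatta ko samajh paaega".split()
cand = "lambe waakya".split()
print(rouge_n(ref, cand, 2))  # 1/8 = 0.125
```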
Other ROUGEs
- ROUGE-L: considers the longest common subsequence
- ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length
- ROUGE-S: precision/recall over matching skip-bigrams

ROUGE (suite of metrics) vs. BLEU
- Handling incorrect words: ROUGE uses skip-bigrams and ROUGE-N; BLEU sees an n-gram mismatch
- Handling incorrect word order: ROUGE uses the longest common subsequence; BLEU sees an n-gram mismatch
- Handling recall: ROUGE-N incorporates missing words; in BLEU, precision cannot detect 'missing' words, hence the brevity penalty

Outline
- Manual Evaluation
- Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
METEOR [7]
- Aims to do better than BLEU
- Central idea: have a good unigram matching strategy

METEOR: Criticisms of BLEU
- The brevity penalty is punitive
- Higher-order n-grams may not indicate the grammatical correctness of a sentence
- BLEU is often zero; should a score really be zero?
METEOR: Process
Each matching stage finds an alignment between words:
- Phase I: list all possible word mappings, based on matches
- Phase II: select the best possible alignment, i.e. the one with the least number of 'alignment crosses/overlaps'
The stages are repeated with different matching strategies: incorporating stemmers, considering synonyms, etc.

Example:
Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced; The intelligent and excited dude jumped
METEOR: The score
- Using the unigram mappings, precision and recall are calculated and combined as a harmonic mean
- Penalty: find 'as many chunks' as match, e.g.
  The bright boy sits on the black bench
  The intelligent guy sat on the dark bench
- More accurate translation: fewer chunks, less penalty; less accurate: more chunks, more penalty
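A sketch of the scoring stage, following the definitions in [7] (recall-weighted harmonic mean plus a chunk-based penalty), assuming the alignment has already produced m matched unigrams grouped into chunks:

```python
# METEOR scoring, given alignment statistics.
def meteor(m: int, cand_len: int, ref_len: int, chunks: int) -> float:
    if m == 0:
        return 0.0
    p, r = m / cand_len, m / ref_len
    f_mean = 10 * p * r / (r + 9 * p)   # harmonic mean, weighted toward recall
    penalty = 0.5 * (chunks / m) ** 3   # fewer chunks -> smaller penalty
    return f_mean * (1 - penalty)

# 6 matched unigrams out of 8 words on both sides:
print(meteor(m=6, cand_len=8, ref_len=8, chunks=2))  # ~0.74 (well-ordered)
print(meteor(m=6, cand_len=8, ref_len=8, chunks=5))  # ~0.53 (fragmented)
```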
METEOR vs. BLEU
- Handling incorrect words: METEOR uses alignment chunks, and matching can be done with different, adaptable techniques; BLEU sees an n-gram mismatch
- Handling incorrect word order: chunks may be ordered in any manner, which METEOR does not capture; BLEU sees an n-gram mismatch
- Handling recall: METEOR's idea of alignment incorporates missing-word handling; in BLEU, precision cannot detect 'missing' words, hence the brevity penalty

Outline
- Manual Evaluation
- Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
NIST Metric: Introduction [8]
- An MT evaluation metric proposed by the National Institute of Standards and Technology
- Are all n-gram matches the same?
- It weights more heavily those n-grams that are more informative (i.e. the rarer ones): matching 'Amendment Act' is better than matching 'of the'
- When a correct n-gram is found, the rarer that n-gram is, the more weight it is given

NIST Metric (formula slide): the core quantity is an information weight computed from reference n-gram counts, as sketched below.
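A sketch of that weight, following [8]: Info(w1..wn) = log2( count(w1..w(n-1)) / count(w1..wn) ), with counts taken over the reference translations (shown here for n >= 2):

```python
# NIST information weight over reference n-gram counts (n >= 2):
# n-grams that are unpredictable continuations get higher weight.
import math
from collections import Counter

def info_weight_fn(references: list[list[str]], n: int):
    grams = Counter()
    for ref in references:
        for k in (n - 1, n):  # count (n-1)-grams and n-grams
            for i in range(len(ref) - k + 1):
                grams[tuple(ref[i:i + k])] += 1
    def info(ngram: tuple) -> float:
        # Info(w1..wn) = log2( count(w1..w_{n-1}) / count(w1..wn) )
        return math.log2(grams[ngram[:-1]] / grams[ngram])
    return info

refs = [s.split() for s in ("the amendment act was passed",
                            "the house passed the amendment act")]
info = info_weight_fn(refs, 2)
print(info(("the", "amendment")))  # log2(3/2): 'the' continues many ways
print(info(("amendment", "act")))  # log2(2/2) = 0: 'act' always follows here
```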
NIST vs. BLEU
- Handling incorrect words: NIST adds information gain as an additional level of 'matching'; BLEU sees an n-gram mismatch
- Handling incorrect word order: NIST has no additional provision, just the n-gram mismatch (based on information gain); BLEU sees an n-gram mismatch
- Handling recall: NIST uses a brevity-penalty factor based on Lsys and Lref; in BLEU, precision cannot detect 'missing' words, hence the brevity penalty

Outline
- Manual Evaluation
- Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
GTM [10]
- General Text Matcher
- An F-score, using precision and recall
- Does not rely on correlation with 'human judgment': what does a BLEU score of 0.006 mean?
- Comparison between systems is easier
GTM scores: precision and recall are computed from the MMS (Maximum Match Size).
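A rough sketch under stated assumptions: the maximum matching is approximated here with difflib's non-crossing matching blocks, and longer contiguous runs are rewarded through the exponent e. This illustrates the idea, not the exact algorithm of [10].

```python
# GTM-style F-score: match size rewards long contiguous runs.
from difflib import SequenceMatcher

def gtm(reference: list[str], candidate: list[str], e: float = 2.0) -> float:
    blocks = SequenceMatcher(a=reference, b=candidate).get_matching_blocks()
    mms = sum(blk.size ** e for blk in blocks) ** (1 / e)  # generalized MMS
    if mms == 0:
        return 0.0
    precision, recall = mms / len(candidate), mms / len(reference)
    return 2 * precision * recall / (precision + recall)   # F-score

ref = "the king and the queen went to the jungle".split()
cand = "the king and the queen went jungle".split()
print(round(gtm(ref, cand), 3))  # ~0.76: one long run plus one stray word
```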
GTM vs. BLEU
- Handling incorrect words: GTM bases precision on the maximum match size; BLEU sees an n-gram mismatch
- Handling incorrect word order: GTM considers maximum runs; BLEU sees an n-gram mismatch
- Handling recall: GTM bases recall on the maximum match size; in BLEU, precision cannot detect 'missing' words, hence the brevity penalty

Outline
- Manual Evaluation
- Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मझ भख लग रह ह (mujhe bhookh lag rahi hai)
Candidate: म भखा महसस कर रहा ह (main bhookha mehsoos kar raha hoon)
BLEU score: 0, yet the candidate is clearly entailed by the reference translation.

Entailment-based evaluation [12]
- Evaluating MT output for adequacy is an entailment task.
- A candidate translation (i.e. MT system output) should entail the reference translation, and vice versa.

The Stanford RTE system: a three-stage pipeline
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not

RTE for MT evaluation
The same pipeline (analysis, alignment, entailment prediction) is reused, with one change in directionality: the standard entailment recognition task is asymmetric, but for MT evaluation entailment must hold in both directions for the system translation to be fully adequate. A minimal sketch of this bidirectional check follows.
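A minimal sketch of the bidirectional adequacy check; `entails` is a hypothetical stand-in for a real RTE system such as the Stanford pipeline above, not a real API:

```python
# Sketch only: `entails` is a hypothetical placeholder for an RTE system
# (analysis -> alignment -> entailment prediction).
def entails(premise: str, hypothesis: str) -> bool:
    raise NotImplementedError("plug in an RTE system here")

def fully_adequate(candidate: str, reference: str) -> bool:
    # For MT evaluation, entailment must hold in BOTH directions.
    return entails(candidate, reference) and entails(reference, candidate)
```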
Experimental setup
- Single reference translation
- Two MT datasets with English as the target language

NIST MT datasets:
- NIST06: the NIST MT-06 Arabic-to-English dataset
- NIST08A: the NIST MT-08 Arabic-to-English dataset
- NIST08C: the NIST MT-08 Chinese-to-English dataset
- NIST08U: the NIST MT-08 Urdu-to-English dataset

ACL SMT datasets:
- SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
- SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
- SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
- SMT07C: the ACL 2007 SMT workshop Czech-English dataset

Four scoring systems are compared: BLEU-4, MT, RTE, RTE + MT.
- Experiment 1: individual corpora
- Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
- Introduced different evaluation methods
- Formulated the BLEU score
- Analyzed the BLEU score by considering the constituents of the formula and by calculating BLEU for a dummy example
- Compared BLEU with human judgment

Conclusion: Shortcomings of BLEU (2/3)
- Permutation of phrases is possible [3]
- Different items may be drawn from reference sets [3]
- Linguistic divergences in English-Hindi MT [4]
- BLEU's aim of correlating with human judgment does not sit well with the goals of indicative translation [4]

Conclusion (3/3)
- GTM: modifies precision and recall based on the maximum match size
- NIST: information-gain based; weights content and function words differently
- METEOR: considers alignments; allows different types of matching to be added
- ROUGE: a suite of metrics incorporating n-gram recall, skip-bigrams, etc.
- WER: word-level errors, sensitive to word order; also PER
- TER: based on the edits required to obtain the reference translation; also human-targeted TER (HTER)
- Manual evaluation: basic techniques and bottlenecks
- Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM research report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of Bleu in Machine Translation Research. EACL 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of HLT 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
Setup
- Five systems perform translation: 3 automatic MT systems and 2 human translators
- BLEU scores are obtained for each system
- Human judgment (on a scale of 5) is obtained for each system, from two groups:
  - Group 1: ten monolingual speakers of the target language (English)
  - Group 2: ten bilingual speakers of Chinese and English

BLEU vs. human judgment
- Monolingual speakers: correlation coefficient 0.99
- Bilingual speakers: correlation coefficient 0.96
(BLEU and human evaluation for S2 and S3; graph from [2])

Comparison of normalized values (graph from [2])
- High correlation between the monolingual group and the BLEU score
- The bilingual group was lenient on 'fluency' for H1
- The demarcation between S1-S3 and H1-H2 is captured by BLEU

Conclusion
- Introduced different evaluation methods
- Formulated the BLEU score
- Analyzed the BLEU score by considering the constituents of the formula and by calculating BLEU for a dummy example
- Compared BLEU with human judgment

Part II: Shortcomings of BLEU (continued from 7th March 2013)
Presented (as a part of CS 712) by Aditya Joshi, Kashyap Popat and Shubham Gautam, IIT Bombay
Guide: Prof. Pushpak Bhattacharyya, IIT Bombay
BLEU score revisited
- Precision: modified n-gram precision
- Recall: brevity penalty
The formula from [2]: BLEU = BP * exp( sum over n = 1..N of w_n * log p_n ), where p_n is the modified n-gram precision, the weights w_n are set to 1/N, and BP = min(1, e^(1 - r/c)) is the brevity penalty for candidate length c and reference length r.

Outline
- Part I: Evaluation; Formulating the BLEU metric; Understanding the BLEU formula
- Part II: Shortcomings of BLEU in general; Shortcomings in the context of English-Hindi MT

Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of Bleu in Machine Translation Research. EACL 2006.
R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[A domesticated bird wants to learn to fly]
"I can do this. I just have to work out the physics. I have quadrated my vector angles. I have adjusted for wind shear. This is it. Let's fly. Just keep it simple: thrust, lift, drag and wait. Thrust, lift, drag, wait..."
[speeds up to get to the edge of the table]
"Thrust, lift, drag, wai... wai... wait!"
[gets scared and tries to stop himself, but instead falls from the table]
Blu the Macaw, lines from Rio (2011)
Image source: www.fanpop.com
Use of the BLEU metric [3]
- Evaluating incremental system changes, while neglecting actual examples
- Systems are ranked on the basis of BLEU
- Does minimizing error rate with respect to BLEU indeed guarantee translation improvements?

Criticisms of BLEU [4]
- Admits too much variation [3]
- Admits too little variation [4], [3]
- Poor correlation with human judgment [3]

Admits too much variation [3]
- Permuting phrases
- Drawing different items from the reference set
BLEU relies on n-gram matching only, and puts very few constraints on how n-gram matches can be drawn from multiple reference translations.

(Annotations on the formula from [2]: the brevity penalty incorporates recall; the p_n are modified precisions; the weights w_n are set to 1/N.)
Modified n-gram precision in BLEU [2]
- Clip the total count of each candidate word by its maximum reference count:
  Count_clip(n-gram) = min(count, max_ref_count)

Reference: मझपर तरा स र छाया (mujhpar teraa suroor chhaaya)
me-on your spell has-been-cast: 'Your spell has been cast on me'
Candidate 2: तरा तरा तरा स र (tera tera tera suroor)
your your your spell: 'Your your your spell'

Matching unigrams: (तरा: min(3, 1) = 1), (स र: min(1, 1) = 1)
Modified unigram precision: 2/4 = 0.5
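A small sketch of this clipping, using the romanized tokens of the example above (illustrative; not a full BLEU implementation):

```python
# Modified (clipped) n-gram precision: each candidate n-gram count is
# clipped by its maximum count in any single reference.
from collections import Counter

def modified_precision(candidate: list[str],
                       references: list[list[str]], n: int = 1) -> float:
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand = counts(candidate)
    max_ref = Counter()
    for ref in references:
        for g, c in counts(ref).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

ref = "mujhpar teraa suroor chhaaya".split()
cand = "teraa teraa teraa suroor".split()
print(modified_precision(cand, [ref]))  # (1 + 1) / 4 = 0.5
```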
Permuting phrases [3]
- Reordering of unmatched phrases does not affect precision
- Bigram-mismatch sites can be freely permuted
- It is therefore possible to randomly produce other hypothesis translations with the same BLEU score, e.g. a block sequence B1 B2 B3 B4 permuted to B4 B2 B1 B3 at the bigram-mismatch boundaries
Issues with precision (1/2)
Source: The king and the queen went to the jungle to hunt.

Candidate: राजा और रानी शकार करन जगल म गय
(raaja aur raani shikaar karne jungal mein chale gaye; King and queen to-do-hunting to-jungle went)
Reference 1: राजा और रानी जगल को शकार क लए गय
(raaja aur raani jangal ko shikaar ke liye gaye; King and queen to-jungle for-hunting went)
Reference 2: राजा और उनक बीवी शकार करन जगल गय
(raaja aur unki biwi shikaar karne jangal gaye; king and his wife to-do-hunting jungle went)
Matching bi-grams = 4/8

A reordered, grammatically incorrect candidate scores the same:
Candidate: राजा और रानी शकार करन जगल गय म
(raaja aur raani shikaar karne gaye jungle mein; King and queen to-do-hunting went jungle to)
Matching bi-grams = 4/8

Issues with precision (2/2)
With the same references, an even worse scramble again scores the same:
Candidate: शकार करन जगल राजा और रानी म गय
(shikaar karne jungle raaja aur raani mein gaye; to-do-hunting jungle king and queen in went, grammatically incorrect)
Matching bi-grams = 4/8
Permuting phrases in general
- For b bi-gram matches in a candidate translation of length k, there are (k - b)! possible ways to generate similarly scored items using only the words in this translation.
- In our example: (8 - 4)! = 4! = 24 candidate translations.

Derivation, for a sentence of length k:
- total bigrams = k - 1
- matched bigrams = b
- number of mismatched bigrams = k - 1 - b
- number of matched chunks = (k - 1 - b) + 1 = k - b
These k - b chunks can be reordered in (k - b)! ways.
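A quick numeric check of this count for the running example (k = 8, b = 4):

```python
# (k - b)! reorderings of the k - b chunks.
import math
from itertools import permutations

k, b = 8, 4
print(math.factorial(k - b))                         # 24
print(sum(1 for _ in permutations(range(k - b))))    # 24, by enumeration
```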
Permuting phrases: Evaluation
- 2005 NIST evaluation data
- An MT system (2nd best in the task) produced the candidate translation

Drawing different items from the reference set
- If two systems 'recall' x words each from a reference sentence, precision may remain the same and the brevity penalty remains the same,
- but translation quality need not remain the same.

Drawing different items: Example
One may omit lexical items altogether.

Reference: घर क मग मतलब दाल होती ह (Ghar ki murgi matlab daal jaisi hoti hai)
House-of chicken means daal-like is: 'Chicken at home is like daal'

Candidate 1: मग मतलब होती ह (Ki matlab jaisi hai; Of means like is)
Unigram precision: 1; Bigram precision: 2/3

Candidate 2: घर क मग दाल (Ghar murgi daal hoti; House chicken daal is)
Unigram precision: 1; Bigram precision: 2/3

Drawing different items: In general
If there are x unigram matches and y bigram matches in both candidate C1 and candidate C2:
- the BLEU score remains the same
- which words are actually matched is not accounted for
- synonyms also register as a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]
- NIST MT evaluation exercise: failure to correlate with human judgment
- In the 2005 exercise, the system ranked 1st by human judgment was ranked 6th by BLEU score
- Seven systems were compared for adequacy and fluency

Failures in practice: Evaluation 1 (2/2) [3]
The system with the high human judgment score is an outlier for BLEU. Pearson correlation (R²) of BLEU with the human scores, per graph:
- Graph 1: R² = 0.14 with the outlier included, 0.87 with it excluded
- Graph 2: R² = 0.002 with the outlier included, 0.742 with it excluded

Failures in practice: Evaluation 2 (1/2) [3]
- French-English
- System 1: a rule-based system using Systran
- Systems 2, 3: SMT systems using two sets of Europarl data (28 million words per language)
- 300 test sentences

Failures in practice: Evaluation 2 (2/2) [3]
- BLEU underestimates human judgment (see graph)

Admits too much and too little variation [4]
- Indicative translations
- Linguistic divergences & evaluation

Indicative translation
- Draft-quality translations
- For assimilation rather than dissemination
- Virtually all general-purpose MT systems today produce indicative translations

Basic steps to obtain an indicative translation (English-Hindi)
- Structural transfer: S V O to S O V
- Lexical transfer: looking up the corresponding word; adding gender, aspect, tense, etc. information
Indicative translation: Example
'Rohan goes to school by bus'
- Structural transfer, SVO to SOV: (Rohan) (to school) (by bus) (goes)
- Lexical transfer, word translation: (Rohan) (paathshaala) (bus) (jaana)
- Lexical transfer, adding GNP/TAM information: (Rohan) (paathshaala) (bus se) (jaata hai)
Indicative output: Rohan paathshaala bus se jaata hai
Natural Hindi order: Rohan bus se paathshaala jaata hai
A toy sketch of these two transfer steps follows.
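A toy sketch of the structural and lexical transfer steps described above; the lexicon and the pre-parsed chunks are hypothetical stand-ins for a real transfer system:

```python
# Toy indicative translation: SVO -> SOV reordering, then chunk lookup.
lexicon = {"Rohan": "Rohan", "goes": "jaata hai",
           "to school": "paathshaala", "by bus": "bus se"}

def indicative_translation(subject: str, verb: str, objects: list[str]) -> str:
    # Structural transfer: S V O -> S O V
    reordered = [subject] + objects + [verb]
    # Lexical transfer: look up each chunk (GNP/TAM handling omitted)
    return " ".join(lexicon[chunk] for chunk in reordered)

print(indicative_translation("Rohan", "goes", ["to school", "by bus"]))
# Rohan paathshaala bus se jaata hai
```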
Questions in the context of indicative translations [4]
- Can systems be compared with each other using human reference translations?
- Is it wise to track the progress of a system by comparing its output with human translations, when the system only produces indicative translations?
- Is the 'failure of MT' (defined using any measure) simply a failure in relation to inappropriate goals (translating like a human)?

Linguistic Divergences
- Categorical divergence
- Noun-noun compounds
- Cultural differences
- Pleonastic divergence
- Stylistic differences
- WSD errors

Categorical divergence: a change in lexical category.
E: I am feeling hungry
H: मझ भख लग रह ह (mujhe bhookh lag rahi hai; to-me hunger feeling is)
I: म भखा महसस कर रहा ह (main bhookha mehsoos kar raha hoon; I hungry feel doing am)
Unigram precision: 0; Bigram precision: 0; BLEU score: 0
Another example. E: I am sweating
H: मझ पसीना आ रहा ह (Mujhe paseena aa raha hai; to-me sweat coming is)

Noun-noun compounds
E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस (Aamir khaan ki dus sarvottam performances; Aamir Khan-of ten best performances)
I: दस सव तम आ मर ख़ान पफॉम सस (Dus sarvottam aamir khaan performances; ten best Aamir Khan performances)
Unigram precision: 5/5; Bigram precision: 2/4

Cultural differences (e.g. 'maamaa'/'taau' have no exact English equivalents)
E: Food, clothing and shelter are a man's basic needs.
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह (roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain; bread, clothing and house a man of basic needs are)
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह (khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain; food, clothing and shelter a man of basic needs are)
Unigram precision: 8/10; Bigram precision: 6/9

Pleonastic divergence: words with no semantic content in the target language.
E: It is raining
H: बा रश हो रह ह (baarish ho rahi hai; rain happening is)
I: यह बा रश हो रह ह (yeh baarish ho rahi hai; it rain happening is)
Unigram precision: 4/5; Bigram precision: 3/4
Another example. E: One should not trouble the weak
H: दबल को परशान नह करना चा हए (durbalon ko pareshaan nahi karna chahiye; to-weak trouble not do should: 'Should not trouble the weak')

Stylistic differences: typical styles in a language.
E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह (lok sabha mein 545 sadasya hain; Lok Sabha in 545 members are)
I: लोक सभा क पास 545 सद य ह (lok sabha ke paas 545 sadasya hain; Lok Sabha has/near 545 members are)
Unigram precision: 5/7; Bigram precision: 3/6

WSD errors: synonyms vs. incorrect senses.
E: I purchased a bat
H (reference): मन एक ब ला खर दा (maine ek ballaa kharidaa; I bought a cricket bat)
I: मन एक चमगादड़ खर दा (maine ek chamgaadaD kharidaa; I bought a bat, the mammal)
E: The thieves were held
H: चोर को गर तार कया (choron ko giraftaar kiyaa; the thieves were arrested)
I1: चोर को पकड़ा (choron ko pakdaa; the thieves were caught): an acceptable synonym, yet a mismatch for BLEU
I2: चोर को आयोिजत कया (choron ko aayojit kiya; the thieves were 'organized'): an incorrect sense
Evaluation: Precision
- Acceptable translations are rejected by BLEU
- Unacceptable translations are accepted by BLEU

Future of BLEU
- Failed correlation with human judgment [3]
- Suggested changes to the BLEU score:
  - Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating specific language divergences
  - Limiting the use of BLEU [3], [4]: do not use BLEU for radically different systems; evaluate on the basis of the nature of indicative translation
- Do not be overly reliant on BLEU; use it only as an 'evaluation understudy'

Conclusion
- Permutation of phrases is possible [3]
- Different items may be drawn from reference sets [3]
- Linguistic divergences in English-Hindi MT [4]
- BLEU's aim of correlating with human judgment does not sit well with the goals of indicative translation [4]
BLEU vs human judgment
bull Monolingual speakersCorrelation co-efficient 099
bull Bilingual speakersCorrelation co-efficient 096
BLEU and human evaluation for S2 and S3 Graph from [2]
Comparison of normalized values
bull High correlation between monolingual group and BLEU score
bull Bilingual group were lenient on lsquofluencyrsquo for H1
bull Demarcation between S1-S3 and H1-H2 is captured by BLEU
Graph from [2]
Conclusion
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup
• Single reference translation; two MT dataset families with English as the target language.
• NIST MT datasets:
  - NIST06: the NIST MT-06 Arabic-to-English dataset
  - NIST08A: the NIST MT-08 Arabic-to-English dataset
  - NIST08C: the NIST MT-08 Chinese-to-English dataset
  - NIST08U: the NIST MT-08 Urdu-to-English dataset
• ACL SMT datasets:
  - SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
  - SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
  - SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
  - SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
• Four scoring systems are compared: BLEU-4, MT, RTE, and RTE + MT.
• Experiment 1: individual corpora.
• Experiment 2: combined corpora.
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods.
• Formulated the BLEU score.
• Analyzed the BLEU score by considering the constituents of the formula and by calculating BLEU for a dummy example.
• Compared BLEU with human judgment.
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3].
• Different items may be drawn from reference sets [3].
• Linguistic divergences arise in English-Hindi MT [4].
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4].
Conclusion (3/3)
• GTM: modifies precision and recall based on the maximum match size.
• NIST: information gain-based; comparatively weights content and function words.
• METEOR: considers alignments; allows addition of different types of matching.
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words, based on word order; also PER.
• TER: based on the edits required to get the reference translation; also Human TER (HTER).
• Manual evaluation: basic techniques and bottlenecks.
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa.
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of Bleu in machine translation research. EACL 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Part II Shortcomings of BLEU(continued from 7th March 2013)
Presented by(As a part of CS 712)
Aditya JoshiKashyap PopatShubham Gautam
IIT Bombay
GuideProf Pushpak Bhattacharyya
IIT Bombay
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (1/2)
Candidate: राजा और रानी शिकार करने जंगल में गये
raaja aur raani shikaar karne jangal mein gaye
King and queen to-do-hunting to-jungle went
Reference 1: राजा और रानी जंगल को शिकार के लिए गये
raaja aur raani jangal ko shikaar ke liye gaye
King and queen to-jungle for-hunting went
Reference 2: राजा और उनकी बीवी शिकार करने जंगल गये
raaja aur unki biwi shikaar karne jangal gaye
king and his wife to-do-hunting jungle went
Matching bi-grams = 4/8
Candidate: राजा और रानी शिकार करने गये जंगल में
raaja aur raani shikaar karne gaye jangal mein
King and queen to-do-hunting went jungle to (grammatically incorrect)
(The king and the queen went to the jungle to hunt)
Matching bi-grams = 4/8
Issues with precision (2/2)
Candidate: राजा और रानी शिकार करने जंगल में गये
raaja aur raani shikaar karne jangal mein gaye
King and queen to-do-hunting to-jungle went
Reference 1: राजा और रानी जंगल को शिकार के लिए गये
raaja aur raani jangal ko shikaar ke liye gaye
King and queen to-jungle for-hunting went
Reference 2: राजा और उनकी बीवी शिकार करने जंगल गये
raaja aur unki biwi shikaar karne jangal gaye
king and his wife to-do-hunting jungle went
Matching bi-grams = 4/8
Candidate: शिकार करने जंगल राजा और रानी में गये
shikaar karne jangal raaja aur raani mein gaye
to-do-hunting jungle king and queen in went (grammatically incorrect)
(The king and the queen went to the jungle to hunt)
Matching bi-grams = 4/8
Permuting phrases: In general
• For 'b' bigram matches in a candidate translation of length 'k', there are (k - b)! possible ways to generate similarly scored items using only the words in this translation.
• In our example: (8 - 4)! = 24 candidate translations.
Derivation: in a sentence of length k,
  total bigrams = k - 1
  matched bigrams = b
  no. of mismatched bigrams = k - 1 - b
  no. of matched chunks = (k - 1 - b) + 1 = k - b
These (k - b) chunks can be reordered in (k - b)! ways.
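A toy check of this count: a minimal sketch with synthetic tokens (not the Hindi example above), where the fillers x/y/z in the reference guarantee that chunk-boundary bigrams never match:

```python
from itertools import permutations
from math import factorial

def bigram_matches(candidate, reference):
    """Count candidate bigrams that occur in the reference."""
    ref = set(zip(reference, reference[1:]))
    return sum(bg in ref for bg in zip(candidate, candidate[1:]))

# Four matched chunks B1..B4 (b = 4 matched bigrams, candidate length k = 8)
reference = "b1a b1b x b2a b2b y b3a b3b z b4a b4b".split()
chunks = [["b1a", "b1b"], ["b2a", "b2b"], ["b3a", "b3b"], ["b4a", "b4b"]]

counts = {bigram_matches([w for c in p for w in c], reference)
          for p in permutations(chunks)}
print(counts)                    # {4}: every chunk ordering scores the same
print(factorial(len(chunks)))    # 24 = (k - b)! orderings, as derived above
```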
Permuting phrases: Evaluation
• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translation
Drawing different items from the reference set
• If two systems 'recall' x words each from a reference sentence:
  - Precision may remain the same
  - The brevity penalty remains the same
  - But translation quality need not remain the same
Drawing different items: Example
You may omit lexical items altogether.
Reference: घर की मुर्गी मतलब दाल जैसी होती है
Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like is
(Chicken at home is like daal)
Candidate 1: की मतलब जैसी है
Ki matlab jaisi hai
Of means like is
Unigram precision: 1; Bigram precision: 2/3
Candidate 2: घर मुर्गी दाल होती
Ghar murgi daal hoti
House chicken daal is
Unigram precision: 1; Bigram precision: 2/3
Drawing different items: In general
• If there are x unigram matches and y bigram matches in both candidate C1 and candidate C2:
  - The BLEU score remains the same
  - Which words are actually matched is not accounted for
  - Synonyms also count as a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT Evaluation exercise
• Failure to correlate with human judgment
  - In the 2005 exercise, the system ranked 1st by human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]
• The system with the high human judgment score is an outlier for BLEU.
[Graphs: BLEU score vs. human judgment, with and without the outlier]
Pearson correlation (R²) with adequacy: 0.14 with the outlier included, 0.87 with it excluded.
Pearson correlation (R²) with fluency: 0.002 with the outlier included, 0.742 with it excluded.
Failures in practice: Evaluation 2 (1/2) [3]
• French-English
• System 1: a rule-based system using Systran
• Systems 2, 3: SMT systems using two sets of Europarl data (28 million words per language)
• 300 test sentences
Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates human judgment
[Graph: BLEU score vs. human judgment for the three systems]
Admits too much and too little variation [4]
Indicative translations
Linguistic divergences & evaluation
Indicative translation
• Draft-quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi)
• Structural transfer
  - S V O to S O V
• Lexical transfer
  - Looking up the corresponding word
  - Adding gender, aspect, tense, etc. information
Indicative translation: Example
Rohan goes to school by bus
Structural transfer (SVO to SOV):
(Rohan) (to school) (by bus) (goes)
Lexical transfer (word translation):
(Rohan) (paathshaala) (bus) (jaana)
Lexical transfer (adding GNP/TAM information):
(Rohan) (paathshaala) (bus se) (jaata hai)
Indicative output: Rohan paathshaala bus se jaata hai
Natural order: Rohan bus se paathshaala jaata hai
Questions in the context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, when it produces indicative translations?
• Is the "failure of MT" (defined using any measure) simply a "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• Change in lexical category
E: I am feeling hungry
H: मुझे भूख लग रही है
mujhe bhookh lag rahi hai
to-me hunger feeling is
I: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
I hungry feel doing am
Unigram precision: 0; Bigram precision: 0; BLEU score: 0
E: I am sweating
H: मुझे पसीना आ रहा है
Mujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आमिर ख़ान की दस सर्वोत्तम परफॉर्मेंसेस
Aamir khaan ki dus sarvottam performances
Aamir Khan-of ten best performances
I: दस सर्वोत्तम आमिर ख़ान परफॉर्मेंसेस
Dus sarvottam aamir khaan performances
Ten best Aamir Khan performances
Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences
• e.g., 'maamaa'/'taau': kinship distinctions Hindi makes that English does not
E: Food, clothing and shelter are a man's basic needs
H: रोटी कपड़ा और मकान एक मनुष्य की बुनियादी ज़रूरतें हैं
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
bread clothing and house a man-of basic needs are
I: खाना कपड़ा और आश्रय एक मनुष्य की बुनियादी ज़रूरतें हैं
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
food clothing and shelter a man-of basic needs are
Unigram precision: 8/10; Bigram precision: 6/9
Pleonastic divergence
• Words with no semantic content in the target language
E: It is raining
H: बारिश हो रही है
baarish ho rahi hai
rain happening is
I: यह बारिश हो रही है
yeh baarish ho rahi hai
it rain happening is
Unigram precision: 4/5; Bigram precision: 3/4
E: One should not trouble the weak
H: दुर्बलों को परेशान नहीं करना चाहिए
durbalon ko pareshaan nahi karna chahiye
to-weak trouble not do should
(Should not trouble the weak)
Stylistic differences
• Typical styles in a language
E: The Lok Sabha has 545 members
H: लोक सभा में 545 सदस्य हैं
lok sabha mein 545 sadasya hain
Lok Sabha in 545 members are
I: लोक सभा के पास 545 सदस्य हैं
lok sabha ke paas 545 sadasya hain
Lok Sabha has/near 545 members are
Unigram precision: 5/7; Bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses
E: I purchased a bat
H: मैंने एक बल्ला खरीदा (reference)
maine ek ballaa kharidaa
I a cricket-bat bought
(I bought a cricket bat)
I: मैंने एक चमगादड़ खरीदा
maine ek chamgaadaD kharidaa
I a bat (the mammal) bought
E: The thieves were held
H: चोरों को गिरफ़्तार किया
choron ko giraftaar kiyaa
thieves arrest done
(The thieves were arrested)
I1: चोरों को पकड़ा
choron ko pakdaa
thieves caught
(The thieves were caught: acceptable synonym)
I2: चोरों को आयोजित किया
choron ko aayojit kiya
thieves organized done
(The thieves were organized: incorrect sense)
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  - Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating specific language divergences
  - Limiting the use of BLEU [3], [4]: do not use BLEU to compare radically different systems; evaluate on the basis of the nature of the indicative translation
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
Source: www.jaishreeganesha.com
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
Manual Evaluation
Automatic Evaluation
• BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate translation(s): the translation returned by an MT system
  - Also called the hypothesis translation
• Reference translation(s): 'perfect' translations by humans
• Comparing metrics with BLEU on three axes:
  - Handling incorrect words
  - Handling incorrect word order
  - Handling recall
Outline
Automatic Evaluation
• BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable
Adequacy: is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency: is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible
मैं एक व्याख्यान बैठा हूँ
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
I sit a lecture. (Adequate but not fluent)
मैं व्याख्यान हूँ
Main vyakhyan hoon
I lecture am
I am lecture. (Fluent but not adequate)
Outline
Manual Evaluation
Automatic Evaluation
• BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Outline
Manual Evaluation
Automatic Evaluation
• BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
Central idea: the number of edits required to change a hypothesis translation into the reference translation
Source: Prime Minister of India will address the nation today
Reference translation: भारत के प्रधान-मंत्री आज राष्ट्र को संबोधित करेंगे
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: प्रधान-मंत्री भारत के आज को अड्रेस करेंगे आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj
Edits required: one shift, one deletion, one substitution, one insertion
Formula for TER:
TER = # edits / avg. number of reference words
• The 'distance' of a shift is not incorporated in its cost
• Mis-capitalization is also considered an error
For the example above: TER = 4/8 = 0.5
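A small sketch of the computation, assuming the edits have already been counted (finding the optimal shifts itself requires a search and is omitted here):

```python
def ter(num_edits, ref_lengths):
    """TER = edits / average number of reference words (Snover et al. [5]).
    Edits = insertions + deletions + substitutions + shifts, each of cost 1."""
    return num_edits / (sum(ref_lengths) / len(ref_lengths))

# The example above: 1 shift + 1 deletion + 1 substitution + 1 insertion,
# against a single reference of 8 words
print(ter(4, [8]))  # 0.5
```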
HTER: Human-targeted TER
• A semi-supervised technique
• Human annotators make a new reference for each translation, based on the system output
  - They must consider fluency and adequacy while generating the closest target translation
TER vs. BLEU
Handling incorrect words: TER uses substitution; BLEU sees an n-gram mismatch.
Handling incorrect word order: TER incorporates this error via a shift or delete + insert; BLEU sees an n-gram mismatch.
Handling recall: in TER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
TER = # edits / avg. number of reference words
Outline
Manual Evaluation
Automatic Evaluation
• BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• A related measure: the position-independent word error rate (PER)
WER: Example
• The order of words is important
• A dynamic-programming-based implementation finds the 'minimum' errors
Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
# errors = 3 (two substitutions, one deletion)
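A self-contained dynamic-programming sketch of WER for the example above:

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)

print(wer("This looks like the correct sentence",
          "This seems the right sentence"))  # 3 edits / 6 words = 0.5
```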
WER vs. BLEU
Handling incorrect words: WER uses substitution; BLEU sees an n-gram mismatch.
Handling incorrect word order: WER uses delete + insert; BLEU sees an n-gram mismatch.
Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
WER vs. TER
Shifts: TER incorporates shifts (though not weighted on distance); WER does not take shifts into account.
Intuition: WER is based on errors in unigram translation; TER is based on the edits a human would require.
TER = # edits / avg. number of reference words
Outline
Manual Evaluation
Automatic Evaluation
• BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall. Will BLEU be able to understand the quality of long sentences?
Reference translation: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation: लंबे वाक्य
Lambe vaakya
ROUGE-N (bigrams): 1/8, although the modified bigram precision is 1
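A minimal sketch of ROUGE-N as n-gram recall, reproducing the 1/8 above on romanized tokens:

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    """ROUGE-N: n-gram recall of the candidate against the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(1, sum(ref.values()))

# The example above: the candidate covers 1 of the reference's 8 bigrams
print(rouge_n("lambe waakya",
              "kya bloo lambe waakya ki gunvatta ko samajh paaega"))  # 0.125
```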
Other ROUGEs
• ROUGE-L
  - Considers the longest common subsequence
• ROUGE-W
  - Weighted ROUGE-L: all common subsequences are considered, with weights based on length
• ROUGE-S
  - Precision/recall by matching skip bigrams
ROUGE (suite of metrics) vs. BLEU
Handling incorrect words: ROUGE uses skip bigrams and ROUGE-N; BLEU sees an n-gram mismatch.
Handling incorrect word order: ROUGE uses the longest common subsequence; BLEU sees an n-gram mismatch.
Handling recall: ROUGE-N incorporates missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
Outline
Manual Evaluation
Automatic Evaluation
• BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero. Should a score ever be zero?
METEOR: Process
Phase I: find an alignment between words
  - List all possible mappings based on matches
Phase II: select the best possible alignment
  - Select the alignment with the least number of 'alignment crosses/overlaps'
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.
Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced / The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean weighted towards recall
• Penalty: group the matched unigrams into 'as few chunks' as possible
Reference: The bright boy sits on the black bench
Candidate: The intelligent guy sat on the dark bench
More accurate -> fewer chunks -> less penalty
Less accurate -> more chunks -> more penalty
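A sketch of the scoring step, assuming the match statistics have already been produced by the alignment phase. The constants (recall weighted 9:1 over precision, penalty 0.5 * (chunks/matches)^3) are those of [7]; the example numbers are hypothetical:

```python
def meteor_score(matches, cand_len, ref_len, chunks):
    """METEOR-style score from unigram match statistics (Banerjee & Lavie [7]).
    matches: number of aligned unigrams; chunks: contiguous aligned runs."""
    if matches == 0:
        return 0.0
    p, r = matches / cand_len, matches / ref_len
    fmean = 10 * p * r / (r + 9 * p)           # recall weighted 9x over precision
    penalty = 0.5 * (chunks / matches) ** 3    # fewer, longer chunks -> less penalty
    return fmean * (1 - penalty)

# Hypothetical numbers for the 'bright boy' example above: suppose 6 of the
# 8 candidate words align with the 8-word reference, grouped into 4 chunks
print(meteor_score(matches=6, cand_len=8, ref_len=8, chunks=4))  # ~0.64
```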
METEOR vs. BLEU
Handling incorrect words: METEOR uses alignment chunks, and matching can be done using different techniques, so it is adaptable; BLEU sees an n-gram mismatch.
Handling incorrect word order: chunks may be ordered in any manner, and METEOR does not capture this; BLEU sees an n-gram mismatch.
Handling recall: the idea of alignment incorporates missing-word handling; BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
Outline
Manual Evaluation
Automatic Evaluation
• BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology (NIST)
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones)
  - Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
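The heart of the metric is the information weight assigned to each matched n-gram; a minimal sketch over a toy reference corpus:

```python
import math
from collections import Counter

def nist_info(refs_tokens, ngram):
    """NIST information weight [8]:
    Info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn)),
    counted over the reference corpus. Fully predictable continuations score 0."""
    counts = Counter()
    for toks in refs_tokens:
        for n in (len(ngram) - 1, len(ngram)):
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i+n])] += 1
    return math.log2(counts[ngram[:-1]] / counts[ngram])

# Toy corpus: 'of' is always followed by 'the', so 'of the' carries no
# information; 'the amendment' is one of several continuations of 'the'.
refs = [s.split() for s in
        ["the text of the act", "the purpose of the amendment act"]]
print(nist_info(refs, ("of", "the")))         # log2(2/2) = 0.0
print(nist_info(refs, ("the", "amendment")))  # log2(4/1) = 2.0
```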
NIST vs. BLEU
Handling incorrect words: NIST adds an additional level of 'matching' via information gain; BLEU sees an n-gram mismatch.
Handling incorrect word order: no additional provision, just an n-gram mismatch (weighted by information gain); BLEU sees an n-gram mismatch.
Handling recall: NIST uses a brevity penalty factor based on L_sys and L_ref; BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
Outline
Manual Evaluation
Automatic Evaluation
• BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
GTM [10]
• General Text Matcher
• F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  - What does a BLEU score of 0.006 mean?
• Comparison is easier
GTM Scores: Precision and Recall
• MMS: Maximum Match Size; precision = MMS(C, R)/|C|, recall = MMS(C, R)/|R|
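A simplified sketch: here MMS is approximated by bag-of-words overlap, whereas full GTM computes a maximum matching that additionally rewards longer contiguous runs:

```python
from collections import Counter

def gtm_fscore(candidate, reference):
    """Simplified GTM [10]: precision = MMS/|C|, recall = MMS/|R|,
    F = harmonic mean. MMS is approximated here as multiset word overlap."""
    c, r = candidate.split(), reference.split()
    overlap = Counter(c) & Counter(r)          # multiset intersection
    mms = sum(overlap.values())
    if mms == 0:
        return 0.0
    precision, recall = mms / len(c), mms / len(r)
    return 2 * precision * recall / (precision + recall)

print(gtm_fscore("the king went to the jungle",
                 "the king and the queen went to the jungle"))  # 0.8
```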
GTM vs. BLEU
Handling incorrect words: GTM's precision is based on the maximum match size; BLEU sees an n-gram mismatch.
Handling incorrect word order: GTM handles it by considering maximum runs; BLEU sees an n-gram mismatch.
Handling recall: GTM's recall is based on the maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
Outline
Manual Evaluation
Automatic Evaluation
• BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है
mujhe bhookh lag rahi hai
Candidate: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
BLEU score: 0, yet the candidate is clearly entailed by the reference translation.
Entailment-based evaluation [12]
Evaluation of MT output for adequacy is an entailment task:
a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system
Pipeline: Analysis (POS tagging, dependency parsing, NER, etc.) -> Alignment (alignment of words and phrases) -> Entailment (predicting whether entailment holds or not)
RTE for MT evaluation
The same pipeline: Analysis -> Alignment -> Entailment
Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
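A schematic sketch of the bidirectional check; `entails` is a hypothetical stand-in for an RTE scorer (e.g., a wrapper around an RTE/NLI classifier), not an actual API of the Stanford system:

```python
def adequacy_by_entailment(candidate, reference, entails):
    """Bidirectional entailment check for MT adequacy, in the spirit of [12].
    `entails(premise, hypothesis)` is a hypothetical RTE scorer in [0, 1];
    both directions must hold for the candidate to be fully adequate."""
    forward = entails(candidate, reference)    # candidate => reference
    backward = entails(reference, candidate)   # reference => candidate
    return min(forward, backward)

# Usage with a stub scorer (a real system would call an RTE model here):
stub = lambda premise, hypothesis: 0.9
print(adequacy_by_entailment("I am feeling hungry.", "I feel hunger.", stub))
```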
Experimental Setup
• Single reference translation
• Two MT dataset collections, with English as the target language:
NIST MT datasets:
  - NIST06: the NIST MT-06 Arabic-to-English dataset
  - NIST08A: the NIST MT-08 Arabic-to-English dataset
  - NIST08C: the NIST MT-08 Chinese-to-English dataset
  - NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
  - SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
  - SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
  - SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
  - SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Four systems are compared: BLEU-4, MT, RTE, and RTE + MT
Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
  - Considering the constituents of the formula
  - Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
Manual evaluation: basic techniques and bottlenecks
WER: error in words, based on word order; also PER
TER: based on the edits required to get the reference translation; also human-targeted TER (HTER)
ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
METEOR: considers alignments; allows the addition of different types of matching
NIST: information-gain based; weights content and function words differently
GTM: modifies precision and recall based on the maximum match size
Entailment-based evaluation: the reference translation must entail the candidate translation and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM research report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of Bleu in Machine Translation Research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
BLEU score revisitedPrecision -gt Modified n-
gram precisionRecall -gt Brevity Penalty
Formula from [2]
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Outline
bull Evaluationbull Formulating BLEU Metricbull Understanding BLEU formulabull Shortcomings of BLEU in generalbull Shortcomings in context of English-Hindi MT
Part I
Part II
Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
[A domesticated bird wants to learn to fly]
I can do this I just have to work out the physics I have quadrated by vector angles I have adjusted for wind shear
This is it Lets fly Just keep it simple Thrust lift drag and wait Thrust lift drag wait[speeds up to get to the of the table]Thrust lift drag waiwaiwait [gets scared and tries to stop himself
but instead falls from the table]
Lines from Rio (2011)
- Blu the Macaw
ldquo
Image Source wwwfanpopcom
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating language-specific divergences
  – Limiting the use of BLEU [3] [4]: do not use BLEU for radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS712 by
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
Source: www.jaishreeganesha.com
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation
  – BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Preliminaries
• Candidate Translation(s): translation returned by an MT system
  – Also called hypothesis translation
• Reference Translation(s): 'perfect' translation by humans
• Comparing metrics with BLEU:
  – Handling incorrect words
  – Handling incorrect word order
  – Handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (Absolute)
2. Ranking translated sentences relative to each other (Relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Often adequacy/fluency scores correlate, which is undesirable
Adequacy: Is the meaning translated correctly?
  5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency: Is the sentence grammatically valid?
  5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible
मैं एक व्याख्यान बैठा हूँ
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
I sit a lecture
→ Adequate but not fluent
मैं व्याख्यान हूँ
Main vyakhyan hoon
I lecture am
I am lecture
→ Fluent but not adequate
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
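For reference, the score as defined in [2], where p_n is the modified n-gram precision, N is the maximum n-gram order (typically 4, with uniform weights w_n = 1/N), and c and r are the candidate and reference lengths:

    \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right),
    \qquad
    \mathrm{BP} = \begin{cases} 1 & c > r \\ e^{\,1 - r/c} & c \le r \end{cases}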
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translation: भारत के प्रधान-मंत्री आज राष्ट्र को संबोधित करेंगे
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: प्रधान-मंत्री भारत के आज को अड्रेस करेंगे आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj
Edit types marked: shift, deletion, substitution, insertion
Formula for TER
TER = # edits / avg. number of reference words
• Cost of shift 'distance' is not incorporated
• Mis-capitalization is also considered an error
For the example above (1 shift, 1 deletion, 1 substitution, 1 insertion; 8 reference words):
TER = 4 / 8
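Counting the four marked edits against the eight-word reference makes the formula concrete:

    \mathrm{TER} = \frac{\#\,\text{edits}}{\text{avg. \# reference words}}
                 = \frac{1_{\text{shift}} + 1_{\text{deletion}} + 1_{\text{substitution}} + 1_{\text{insertion}}}{8}
                 = 0.5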
HTER: Human TER
• Semi-supervised technique
• Human annotators make a new reference for each translation, based on the system output
  – Must consider fluency and adequacy while generating the closest target translation
TER vs BLEU
• Handling incorrect words: TER uses substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: TER incorporates this error via shift, or delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in TER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
(TER = # edits / avg. number of reference words)
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• Minimum substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• Also: position-independent word error rate (PER)
WER: Example
• Order of words is important
• Dynamic programming-based implementation to find the 'minimum' errors
Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
# errors: 3 (two substitutions + one deletion), so WER = 3/6
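A minimal sketch of that dynamic program (word-level Levenshtein distance divided by reference length; whitespace tokenization assumed, and this is an illustration, not the implementation evaluated in [9]):

    def wer(hyp, ref):
        h, r = hyp.split(), ref.split()
        # d[i][j]: min edits to turn the first i hyp words into the first j ref words
        d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
        for i in range(len(h) + 1): d[i][0] = i   # delete all remaining hyp words
        for j in range(len(r) + 1): d[0][j] = j   # insert all remaining ref words
        for i in range(1, len(h) + 1):
            for j in range(1, len(r) + 1):
                sub = 0 if h[i-1] == r[j-1] else 1
                d[i][j] = min(d[i-1][j] + 1,        # deletion
                              d[i][j-1] + 1,        # insertion
                              d[i-1][j-1] + sub)    # substitution / match
        return d[len(h)][len(r)] / len(r)

    print(wer("This seems the right sentence",
              "This looks like the correct sentence"))  # 3 / 6 = 0.5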
WER vs BLEU
• Handling incorrect words: WER uses substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: WER uses delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
WER vs TER
• Shift: WER does not take shifts into account; TER incorporates shifts, though not weighted by distance
• Intuition: WER is based on errors in unigram translation; TER is based on the edits required by humans
(TER = # edits / avg. number of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall.
Will BLEU be able to understand the quality of long sentences?
Reference translation: क्या ब्लू लम्बे वाक्य की गुणवत्ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation: लम्बे वाक्य
Lambe vaakya
ROUGE-N (bigrams): 1/8, since only 1 of the reference's 8 bigrams is matched
Modified n-gram precision: 1, since the candidate's only bigram is matched
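A minimal sketch of ROUGE-N as n-gram recall against a single reference (whitespace tokenization of the romanized forms assumed):

    from collections import Counter

    def rouge_n(cand, ref, n):
        grams = lambda toks: Counter(zip(*[toks[i:] for i in range(n)]))
        c, r = grams(cand.split()), grams(ref.split())
        overlap = sum(min(count, c[g]) for g, count in r.items())
        return overlap / max(1, sum(r.values()))   # denominator: reference n-grams

    print(rouge_n("lambe waakya",
                  "kya bloo lambe waakya ki gunvatta ko samajh paaega", 2))  # 1/8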
Other ROUGEs
• ROUGE-L
  – Considers the longest common subsequence
• ROUGE-W
  – Weighted ROUGE-L: all common subsequences are considered, with weight based on length
• ROUGE-S
  – Precision/recall by matching skip bigrams
ROUGE (suite of metrics) vs BLEU
• Handling incorrect words: ROUGE uses skip bigrams / ROUGE-N; BLEU sees an n-gram mismatch
• Handling incorrect word order: ROUGE uses the longest common subsequence; BLEU sees an n-gram mismatch
• Handling recall: ROUGE-N incorporates missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
METEOR [7]
Aims to do better than BLEU.
Central idea: have a good unigram matching strategy.
METEOR: Criticisms of BLEU
• Brevity penalty is punitive
• Higher-order n-grams may not indicate grammatical correctness of a sentence
• BLEU is often zero; should a score be zero?
METEOR: Process
Find an alignment between words:
Phase I: list all possible mappings based on matches
Phase II: select the best possible alignment, i.e. the one with the least number of alignment crosses/overlaps
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.
Reference: The intelligent and excited boy jumped
The excited and intelligent guy pounced
Candidate: The intelligent and excited dude jumped
METEOR: The score
• Using unigram mappings, precision and recall are calculated; then a harmonic mean is taken
• Penalty: find 'as many chunks' as match
  The bright boy sits on the black bench
  The intelligent guy sat on the dark bench
• More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
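The combination used in [7], with P and R the unigram precision and recall, m the number of matched unigrams, and ch the number of chunks:

    F_{mean} = \frac{10\,P\,R}{R + 9P},
    \qquad
    \text{Penalty} = 0.5\left(\frac{ch}{m}\right)^{3},
    \qquad
    \text{Score} = F_{mean}\,(1 - \text{Penalty})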
METEOR vs BLEU
• Handling incorrect words: METEOR uses alignment chunks, and matching can be done using different techniques, so it is adaptable; BLEU sees an n-gram mismatch
• Handling incorrect word order: chunks may be ordered in any manner, and METEOR does not capture this; BLEU sees an n-gram mismatch
• Handling recall: METEOR's idea of alignment incorporates missing-word handling; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones)
  – Matching 'Amendment Act' is better than 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
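The information weight from [8] that drives this: the rarer an n-gram is relative to its (n-1)-gram prefix, the higher the weight of a match:

    \mathrm{Info}(w_1 \ldots w_n) = \log_2\!\left(\frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}\right)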
NIST vs BLEU
• Handling incorrect words: NIST adds information gain, an additional level of 'matching'; BLEU sees an n-gram mismatch
• Handling incorrect word order: NIST has no additional provision, just n-gram mismatch (based on info gain); BLEU sees an n-gram mismatch
• Handling recall: NIST uses a brevity penalty factor based on Lsys and Lref; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
GTM [10]
• General Text Matcher
• F-score: uses precision and recall
• Does not rely on 'human judgment' correlation
  – What does a BLEU score of 0.006 mean?
• Comparison is easier
GTM Scores: Precision and Recall
• MMS: Maximum Match Size
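A sketch of how the scores follow from the maximum match size, per [10] (the matching rewards longer contiguous runs; |C| and |R| are the candidate and reference lengths):

    \text{precision} = \frac{\mathrm{MMS}(C,R)}{|C|},
    \qquad
    \text{recall} = \frac{\mathrm{MMS}(C,R)}{|R|},
    \qquad
    F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}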
GTM vs BLEU
• Handling incorrect words: GTM bases precision on maximum match size; BLEU sees an n-gram mismatch
• Handling incorrect word order: GTM considers maximum runs; BLEU sees an n-gram mismatch
• Handling recall: GTM bases recall on maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है
mujhe bhookh lag rahi hai
Candidate: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
BLEU score: 0. Clearly, the candidate is entailed by the reference translation.
Entailment-based evaluation [12]
Evaluation of MT output for adequacy is an entailment task:
a candidate translation (i.e., MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
RTE for MT evaluation
The same pipeline: analysis (POS tagging, dependency parsing, NER, etc.), alignment of words and phrases, then entailment prediction.
Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
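A minimal sketch of the bidirectional requirement; entail() stands in for an RTE system's confidence score, and its name and 0-1 scale are assumptions for illustration, not the Stanford system's API:

    def adequacy(candidate, reference, entail):
        # Fully adequate only if entailment holds in BOTH directions,
        # so take the weaker of the two directional judgments.
        return min(entail(candidate, reference), entail(reference, candidate))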
Experimental setup: single reference translation; two MT dataset families with English as the target language.
NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental setup
Four systems are used: BLEU-4, MT, RTE, RTE + MT.
Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
  – Considering the constituents of the formula
  – Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words based on word order; also PER
• TER: based on the edits required to get the reference translation; also Human TER
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM research report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of BLEU in Machine Translation Research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
[A domesticated bird wants to learn to fly]
"I can do this. I just have to work out the physics. I have quadrated my vector angles. I have adjusted for wind shear. This is it. Let's fly. Just keep it simple: thrust, lift, drag and wait. Thrust, lift, drag, wait. [speeds up to get to the edge of the table] Thrust, lift, drag, wai-wai-wait! [gets scared and tries to stop himself but instead falls from the table]"
Lines from Rio (2011)
- Blu the Macaw
Image Source: www.fanpop.com
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Use of BLEU metric [3]
bull Evaluating incremental system changesndash Neglecting actual examples
bull Systems ranked on basis of BLEU
bull Does minimizing error rate with respect to BLEU indeed guarantee translation improvements
Criticisms of BLEU [4]
bull Admits too much variation [3]
bull Admits too little variation [4] [3]
bull Poor correlation with human judgment [3]
Admits too much variation [3]
Permuting phrasesDrawing different items from reference set
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
• General Text Matcher.
• An F-score: uses precision and recall.
• Does not rely on correlation with 'human judgment': what does a BLEU score of 0.006 mean?
• Comparison across systems is easier.
GTM Scores: Precision and Recall
• MMS: Maximum Match Size. Precision = MMS(C, R) / |C| and recall = MMS(C, R) / |R|, where C is the candidate and R the reference.
GTM vs BLEU
• Handling incorrect words: GTM uses precision based on maximum match size; BLEU relies on n-gram mismatch.
• Handling incorrect word order: GTM handles it by considering maximum runs; BLEU relies on n-gram mismatch.
• Handling recall: GTM uses recall based on maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है (mujhe bhookh lag rahi hai)
Candidate: मैं भूखा महसूस कर रहा हूँ (main bhookha mehsoos kar raha hoon)
BLEU score: 0. Yet the candidate is clearly entailed by the reference translation.
Entailment-based evaluation [12]
Evaluating MT output for adequacy is an entailment task: a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system:
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases.
3. Entailment: predicting whether entailment holds or not.
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental setup
• Single reference translation; two groups of MT datasets with English as the target language.
• NIST MT datasets:
  NIST06: the NIST MT-06 Arabic-to-English dataset
  NIST08A: the NIST MT-08 Arabic-to-English dataset
  NIST08C: the NIST MT-08 Chinese-to-English dataset
  NIST08U: the NIST MT-08 Urdu-to-English dataset
• ACL SMT datasets:
  SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
  SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
  SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
  SMT07C: the ACL 2007 SMT workshop Czech-English dataset
• Four systems are compared: BLEU-4, MT, RTE, and RTE + MT.
• Experiment 1: individual corpora. Experiment 2: combined corpora.
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods.
• Formulated the BLEU score.
• Analyzed the BLEU score by considering the constituents of the formula and by calculating BLEU for a dummy example.
• Compared BLEU with human judgment.
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3].
• Different items may be drawn from reference sets [3].
• Linguistic divergences arise in English-Hindi MT [4].
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4].
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size.
• NIST: information gain-based; weights content and function words differently.
• METEOR: considers alignments; allows the addition of different types of matching.
• ROUGE: a suite of metrics incorporating n-gram recall, skip-bigrams, etc.
• WER: error in words based on word order; also PER.
• TER: based on the edits required to obtain the reference translation; also human TER (HTER).
• Manual evaluation: basic techniques and bottlenecks.
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa.
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of Bleu in Machine Translation Research. European ACL (EACL), 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries, 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology, 2008.
Admits too much variation
bull BLEU relies on n-gram matching only
bull Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
Brevity Penalty(Incorporating recall)
Modified precision
Set to 1N
Modified n-gram precision in BLEU [2]
bull Clip the total count of each candidate word with its maximum reference count
bull Countclip(n-gram) = min (count max_ref_count)
Reference मझपर तरा स र छायाmujhpar teraa suroor chhaayame-on your spell has-been-castYour spell has been cast on me
Candidate 2 तरा तरा तरा स रtera tera tera surooryour your your spellYour your your spell
matching unigrams (तरा min(3 1) = 1 ) (स र min (1 1) = 1)
Modified unigram precision 24 = 05
Permuting phrases [3]
bull Reordering of unmatched phrases does not affect precision
bull Bigram mismatch sites can be freely permuted
Possible to randomly produce other hypothesis translations that have the same BLEU score
B1 B2 B3 B4
B4 B2 B1 B3Bigram Bigrammismatch
Issues with precision (12)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate राजा और रानी शकार करन जगल गय म raaja aur raani shikaar karne gaye jungle meinKing and queen to-do-hunting went jungle to (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
• Draft-quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi)
• Structural transfer
  – S V O to S O V
• Lexical transfer
  – Looking up the corresponding word
  – Adding gender, aspect, tense, etc. information
Indicative translation: Example
Rohan goes to school by bus

Structural transfer (SVO to SOV):
(Rohan) (to school) (by bus) (goes)

Lexical transfer (word translation):
(Rohan) (paathshaala) (bus) (jaana)

Lexical transfer (adding GNP/TAM information):
(Rohan) (paathshaala) (bus se) (jaata hai)

Indicative output: Rohan paathshaala bus se jaata hai
(Natural word order: Rohan bus se paathshaala jaata hai)
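A minimal sketch of these two transfer steps, assuming a toy lexicon (the entries and function name below are illustrative, not from an actual English-Hindi system):

```python
# Toy illustration of indicative translation: structural transfer
# (SVO -> SOV) followed by word-by-word lexical transfer.
# The hypothetical lexicon entries already carry the case marker
# ("bus se") and the GNP/TAM marking ("jaata hai").
lexicon = {
    "Rohan": "Rohan",
    "school": "paathshaala",
    "by bus": "bus se",
    "goes": "jaata hai",
}

def indicative_translation(subject, verb, obj, adjunct):
    # structural transfer: S V O Adjunct -> S O Adjunct V
    reordered = [subject, obj, adjunct, verb]
    # lexical transfer: plain dictionary lookup
    return " ".join(lexicon[w] for w in reordered)

print(indicative_translation("Rohan", "goes", "school", "by bus"))
# -> Rohan paathshaala bus se jaata hai
```

Note how the sketch produces the indicative order rather than the natural order a human would choose, which is exactly the gap the questions below probe.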
Questions in the context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is "failure of MT" (defined using any measure) simply a "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences
(In the examples that follow: E = English source, H = human reference translation, I = indicative MT output.)
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• Change in lexical category

E: I am feeling hungry
H: मझ भख लग रह ह
mujhe bhookh lag rahi hai
to-me hunger feeling is
I: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon
I hungry feel doing am
Unigram precision: 0; Bigram precision: 0; BLEU score: 0

E: I am sweating
H: मझ पसीना आ रहा ह
Mujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds

E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस
Aamir khaan ki dus sarvottam performances
Aamir Khan-of ten best performances
I: दस सव तम आ मर ख़ान पफॉम सस
Dus sarvottam aamir khaan performances
Ten best Aamir Khan performances

Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences
• e.g. 'maamaa'/'taau': distinct kinship terms that English collapses into a single word

E: Food, clothing and shelter are a man's basic needs
H: रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
bread clothing and house a man of basic needs are
I: खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत ह
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
food clothing and shelter a man of basic needs are

Unigram precision: 8/10; Bigram precision: 6/9
Pleonastic divergence
• Words with no semantic content in the target language

E: It is raining
H: बा रश हो रह ह
baarish ho rahi hai
rain happening is
I: यह बा रश हो रह ह
yeh baarish ho rahi hai
it rain happening is
Unigram precision: 4/5; Bigram precision: 3/4

E: One should not trouble the weak
H: दबल को परशान नह करना चा हए
durbalon ko pareshaan nahi karna chahiye
to-weak trouble not do should
(Should not trouble the weak)
Stylistic differences
• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hain
Lok Sabha in 545 members are
I: लोक सभा क पास 545 सद य ह
lok sabha ke paas 545 sadasya hain
Lok Sabha has/near 545 members are

Unigram precision: 5/7; Bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses

E: I purchased a bat
H: मन एक ब ला खर दा (reference)
maine ek ballaa kharidaa
I a cricket-bat bought (I bought a cricket bat)
I: मन एक चमगादड़ खर दा
maine ek chamgaadaD kharidaa
I a bat (mammal) bought (I bought a bat, the mammal)

E: The thieves were held
H: चोर को गर तार कया
choron ko giraftaar kiyaa
thieves arrest done (The thieves were arrested)
I1: चोर को पकड़ा
choron ko pakdaa
thieves caught (The thieves were caught: an acceptable synonym, yet a BLEU mismatch)
I2: चोर को आयोिजत कया
choron ko aayojit kiya
thieves organized done (The thieves were organized: an incorrect sense)
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating specific language divergences
  – Limiting the use of BLEU [3][4]: do not use BLEU to compare radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics

Presented as a part of CS712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
(Source: www.jaishreeganesha.com)

Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."

Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate Translation(s): translation returned by an MT system
  – Also called hypothesis translation
• Reference Translation(s): 'perfect' translation by humans
• Each metric is compared with BLEU on three axes:
  – Handling incorrect words
  – Handling incorrect word order
  – Handling recall
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Manual evaluation [11]

Common techniques:
1. Assigning fluency and adequacy scores on a scale of one to five (Absolute)
2. Ranking translated sentences relative to each other (Relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy: Is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency: Is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

म एक या यान बठा ह
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
"I sit a lecture": adequate but not fluent

म या यान ह
Main vyakhyan hoon
I lecture am
"I am lecture": fluent but not adequate
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
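For reference, the full score as defined in [2], with modified n-gram precisions p_n, weights w_n (typically w_n = 1/N and N = 4), candidate length c, and effective reference length r:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```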
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today

Reference translation: भारत क धान-म ी आज रा को सबो धत करग
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge

Candidate translation: धान-म ी भारत क आज को अ स करग आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj

Edits: one shift, one deletion, one substitution, one insertion
Formula for TER

TER = # edits / average # reference words
• The 'distance' of a shift is not incorporated into its cost
• Mis-capitalization is also considered an error

Here: 1 shift + 1 deletion + 1 substitution + 1 insertion over 8 reference words, so TER = 4/8 = 0.5
HTER: Human TER
• Semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
  – They must consider fluency and adequacy while generating the closest target translation
TER vs. BLEU
• Handling incorrect words: substitution (TER) vs. n-gram mismatch (BLEU)
• Handling incorrect word order: shift, or delete + insert, incorporates this error (TER) vs. n-gram mismatch (BLEU)
• Handling recall: missed words become deleted words (TER) vs. brevity penalty, since precision cannot detect 'missing' words (BLEU)
(TER = # edits / average # reference words)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
• A variant: position-independent word error rate (PER)
WER: Example
• Order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence

# errors = 3 (two substitutions, one deletion), so WER = 3/6
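A compact sketch of the word-level Levenshtein computation behind WER (a generic implementation, not tied to any particular toolkit; the function name is ours):

```python
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j]: min edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("This looks like the correct sentence",
          "This seems the right sentence"))   # 3 errors / 6 words = 0.5
```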
WER vs. BLEU
• Handling incorrect words: substitution (WER) vs. n-gram mismatch (BLEU)
• Handling incorrect word order: delete + insert (WER) vs. n-gram mismatch (BLEU)
• Handling recall: missed words become deleted words (WER) vs. brevity penalty, since precision cannot detect 'missing' words (BLEU)
WER vs. TER
• WER: a shift is not taken into account (it costs a deletion plus an insertion); TER: a shift is incorporated, though not weighted by distance
• WER: the intuition is based on word-level translation errors; TER: the intuition is based on the edits a human would require
(TER = # edits / average # reference words)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
• ROUGE-N incorporates recall: the fraction of reference n-grams that appear in the candidate
• Will BLEU be able to judge the quality of long sentences?

Reference translation: या ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega

Candidate translation: लब वा य
Lambe vaakya

ROUGE-N (N = 2): 1/8, yet the modified n-gram precision is 1
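The contrast is easy to reproduce. A small sketch over the transliterated tokens of the example above (helper name ours):

```python
def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

reference = "kya bloo lambe waakya ki gunvatta ko samajh paaega".split()
candidate = "lambe waakya".split()

n = 2
ref_ng, cand_ng = ngrams(reference, n), ngrams(candidate, n)
matched = sum(1 for g in cand_ng if g in ref_ng)

rouge_n = matched / len(ref_ng)      # recall over reference n-grams: 1/8
precision = matched / len(cand_ng)   # precision over candidate n-grams: 1/1
print(rouge_n, precision)            # 0.125 1.0
```

Dividing by the reference n-gram count is what lets ROUGE-N punish a candidate that drops most of the sentence, which precision alone cannot do.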
Other ROUGEs
• ROUGE-L
  – Considers the longest common subsequence
• ROUGE-W
  – Weighted ROUGE-L: all common subsequences are considered, with weights based on length
• ROUGE-S
  – Precision/recall by matching skip bigrams
ROUGE (suite of metrics) vs. BLEU
• Handling incorrect words: skip bigrams, ROUGE-N (ROUGE) vs. n-gram mismatch (BLEU)
• Handling incorrect word order: longest common subsequence (ROUGE) vs. n-gram mismatch (BLEU)
• Handling recall: ROUGE-N incorporates missing words (ROUGE) vs. brevity penalty, since precision cannot detect 'missing' words (BLEU)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
METEOR [7]
Aims to do better than BLEU
Central idea: have a good unigram matching strategy
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process

Find an alignment between unigrams:
• Phase I: list all possible mappings based on matches
• Phase II: select the best possible alignment: the one with the least number of 'alignment crosses/overlaps'
• Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Reference: The intelligent and excited boy jumped
Candidate 1: The excited and intelligent guy pounced
Candidate 2: The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean
• Penalty: group the matched unigrams into as few contiguous chunks as possible

Reference: The bright boy sits on the black bench
Candidate: The intelligent guy sat on the dark bench

More accurate word order -> fewer chunks -> less penalty
Less accurate word order -> more chunks -> more penalty
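Concretely, in the original formulation of [7] (P = unigram precision, R = unigram recall):

```latex
F_{mean} = \frac{10\,P\,R}{R + 9P}, \qquad
\mathrm{Penalty} = 0.5 \cdot
  \left(\frac{\#\,\mathrm{chunks}}{\#\,\mathrm{matched\ unigrams}}\right)^{3}, \qquad
\mathrm{Score} = F_{mean} \cdot (1 - \mathrm{Penalty})
```

The heavy weight on R in the harmonic mean is METEOR's answer to BLEU's lack of recall, and the cubic chunk penalty is its (coarse) answer to word order.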
METEOR vs. BLEU
• Handling incorrect words: alignment chunks, where matching can be done using different techniques, hence adaptable (METEOR) vs. n-gram mismatch (BLEU)
• Handling incorrect word order: chunks may be ordered in any manner, and METEOR does not capture this (METEOR) vs. n-gram mismatch (BLEU)
• Handling recall: the idea of alignment incorporates missing-word handling (METEOR) vs. brevity penalty, since precision cannot detect 'missing' words (BLEU)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones)
  – Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
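The formula slide is reproduced here from [8] (up to notational details). The information weight of an n-gram comes from corpus counts, and the score combines weighted n-gram matches with NIST's own brevity factor over the system output length L_sys and the mean reference length L̄_ref:

```latex
\mathrm{Info}(w_1 \ldots w_n) = \log_2\!\left(
  \frac{\#\,\text{occurrences of } w_1 \ldots w_{n-1}}
       {\#\,\text{occurrences of } w_1 \ldots w_n}\right)
```

```latex
\mathrm{Score} = \sum_{n=1}^{N}
  \left\{ \frac{\sum_{\text{co-occurring } n\text{-grams}} \mathrm{Info}(w_1 \ldots w_n)}
               {\sum_{\text{output } n\text{-grams}} 1} \right\}
  \cdot \exp\!\left\{ \beta \, \log^2\!\left[\min\!\left(\tfrac{L_{sys}}{\bar{L}_{ref}},\, 1\right)\right] \right\}
```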
NIST vs. BLEU
• Handling incorrect words: information gain adds a level of 'matching' (NIST) vs. n-gram mismatch (BLEU)
• Handling incorrect word order: no additional provision beyond n-gram mismatch, weighted by information gain (NIST) vs. n-gram mismatch (BLEU)
• Handling recall: brevity penalty factor based on Lsys and Lref (NIST) vs. brevity penalty, since precision cannot detect 'missing' words (BLEU)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  – What does a BLEU score of 0.006 mean?
• Comparison across systems is easier
GTM: Scores Precision and Recall
• MMS: Maximum Match Size
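A sketch of the quantities involved, paraphrased from [10]: precision and recall divide the maximum match size MMS between candidate C and reference R by the respective lengths, and the two are combined into the usual F-score:

```latex
\mathrm{precision} = \frac{\mathrm{MMS}(C, R)}{|C|}, \qquad
\mathrm{recall} = \frac{\mathrm{MMS}(C, R)}{|R|}, \qquad
F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}
         {\mathrm{precision} + \mathrm{recall}}
```

In [10] the size of a match is generalized so that a contiguous matching run of length n contributes n^e for an exponent e >= 1 (the match size being the e-th root of the sum), so long runs count for more than the same words scattered about; this is how GTM rewards correct word order.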
GTM vs. BLEU
• Handling incorrect words: precision based on maximum match size (GTM) vs. n-gram mismatch (BLEU)
• Handling incorrect word order: by considering maximum runs (GTM) vs. n-gram mismatch (BLEU)
• Handling recall: recall based on maximum match size (GTM) vs. brevity penalty, since precision cannot detect 'missing' words (BLEU)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Entailment-based evaluation [12]

Why entailment?
E: I am feeling hungry
H (Ref): मझ भख लग रह ह
mujhe bhookh lag rahi hai
Candidate: म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon

BLEU score: 0, yet the candidate is clearly entailed by the reference translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy is an entailment task:
a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system

Analysis (POS tagging, dependency parsing, NER, etc.)
  -> Alignment (of words and phrases)
  -> Entailment (predicting whether entailment holds or not)
RTE for MT evaluation

Analysis (POS tagging, dependency parsing, NER, etc.)
  -> Alignment (of words and phrases)
  -> Entailment (predicting whether entailment holds or not)

Directionality: the standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
Experimental Setup
• Single reference translation
• Two MT datasets with English as the target language

NIST MT datasets:
• NIST06: The NIST MT-06 Arabic-to-English dataset
• NIST08A: The NIST MT-08 Arabic-to-English dataset
• NIST08C: The NIST MT-08 Chinese-to-English dataset
• NIST08U: The NIST MT-08 Urdu-to-English dataset

ACL SMT datasets:
• SMT06E: The NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: The NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: The ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
• Four scoring systems are compared: BLEU-4, MT, RTE, RTE + MT
• Experiment 1: individual corpora
• Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
  – considering the constituents of the formula
  – calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words based on word order; also PER
• TER: based on the edits required to obtain the reference translation; also Human TER
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of BLEU in Machine Translation Research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Issues with precision (1/2)

Source: The king and the queen went to the jungle to hunt

Reference 1: राजा और रानी जंगल को शिकार के लिए गये
raaja aur raani jangal ko shikaar ke liye gaye
(King and queen to-jungle for-hunting went)

Reference 2: राजा और उनकी बीवी शिकार करने जंगल गये
raaja aur unki biwi shikaar karne jangal gaye
(king and his wife to-do-hunting jungle went)

Candidate: राजा और रानी शिकार करने जंगल में चले गये
raaja aur raani shikaar karne jungal mein chale gaye
(King and queen to-do-hunting to-jungle went)
Matching bi-grams = 4/8

Candidate (reordered): राजा और रानी शिकार करने गये जंगल में
raaja aur raani shikaar karne gaye jungle mein
(King and queen to-do-hunting went to-jungle; grammatically incorrect)
Matching bi-grams = 4/8
Issues with precision (2/2)

Same source, references, and original candidate (4/8 matching bi-grams) as on the previous slide.

Candidate (scrambled): शिकार करने जंगल राजा और रानी में गये
shikaar karne jungle raaja aur raani mein gaye
(to-do-hunting jungle king and queen in went; grammatically incorrect)
Matching bi-grams = 4/8
Permuting phrases in general
• For 'b' bi-gram matches in a candidate translation of length 'k', there are (k – b)! possible ways to generate similarly scored items using only the words in this translation.
• In our example: (8 – 4)! = 24 candidate translations.

Why (k – b)!: in a sentence of length k,
total bigrams = k – 1
matched bigrams = b
no. of mismatched bigrams = k – 1 – b
no. of matched chunks = (k – 1 – b) + 1 = k – b
These (k – b) chunks can be reordered in (k – b)! ways (see the sketch below).
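A small Python check of this argument; the toy English sentence and the chunk-splitting code are our own illustration, not taken from [3]:

```python
from itertools import permutations

def bigrams(tokens):
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def matched(candidate, reference):
    ref = set(bigrams(reference))
    return sum(bg in ref for bg in bigrams(candidate))

reference = "the cat sat on the mat today".split()
candidate = "the cat on the mat sat today".split()

# Split the candidate into chunks wherever a bigram fails to match;
# with b matched bigrams among k words this yields k - b chunks.
ref_bg = set(bigrams(reference))
chunks, current = [], [candidate[0]]
for i in range(len(candidate) - 1):
    if (candidate[i], candidate[i + 1]) in ref_bg:
        current.append(candidate[i + 1])
    else:
        chunks.append(current)
        current = [candidate[i + 1]]
chunks.append(current)

b, k = matched(candidate, reference), len(candidate)
scores = {matched([w for ch in perm for w in ch], reference)
          for perm in permutations(chunks)}
# True True: there are (k - b) chunks, and every reordering keeps >= b matches.
print(len(chunks) == k - b, min(scores) >= b)
```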
Permuting phrases: Evaluation
• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translation used for this test
Drawing different items from the reference set
• If two systems 'recall' x words each from a reference sentence:
– Precision may remain the same
– Brevity penalty remains the same
– Translation quality need not remain the same
Drawing different items: Example

Reference: घर की मुर्गी मतलब दाल जैसी होती है
Ghar ki murgi matlab daal jaisi hoti hai
(House-of chicken means daal-like is: "Chicken at home is like daal")

Candidate 1: मुर्गी मतलब होती है
Murgi matlab hoti hai
(Chicken means is)
Unigram precision: 1/1; Bigram precision: 2/3

Candidate 2: घर की मुर्गी दाल
Ghar ki murgi daal
(House-of chicken daal)
Unigram precision: 1/1; Bigram precision: 2/3

You may omit lexical items altogether and still score the same.
Drawing different items: In general
• If there are x unigram matches and y bigram matches in both candidate C1 and candidate C2:
– The BLEU score remains the same
– Which words are actually matched is not accounted for
– Synonyms also count as a 'mismatch'
(A small demonstration follows.)
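A minimal sketch of this effect on the example above, using romanized tokens; the modified_precision helper is our own illustration, not BLEU's official implementation:

```python
from collections import Counter

def modified_precision(candidate, reference, n):
    # Clipped n-gram precision: candidate counts capped at reference counts.
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

reference = "ghar ki murgi matlab daal jaisi hoti hai".split()
candidate1 = "murgi matlab hoti hai".split()
candidate2 = "ghar ki murgi daal".split()

for cand in (candidate1, candidate2):
    print(modified_precision(cand, reference, 1),
          modified_precision(cand, reference, 2))
# Both candidates print 1.0 and 0.666...: BLEU cannot tell them apart.
```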
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT Evaluation exercise
• Failure to correlate with human judgment: in the 2005 exercise, the system ranked 1st by human judgment was ranked 6th by BLEU score
• Seven systems were compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]
• A system with a high human judgment score received a low BLEU score; this outlier strongly distorts the correlation (graphs omitted):
Adequacy: Pearson correlation (R²) = 0.14 with the outlier included, 0.87 without
Fluency: Pearson correlation (R²) = 0.002 with the outlier included, 0.742 without
Failures in practice: Evaluation 2 (1/2) [3]
• French-English
• System 1: rule-based system using Systran
• Systems 2, 3: SMT systems trained on two sets of Europarl data (28 million words per language)
• 300 test sentences
Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates human judgment (graph omitted)
Admits too much and too little variation [4]
Indicative translations; linguistic divergences and evaluation
Indicative translation
• Draft-quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi)
• Structural transfer: S V O to S O V
• Lexical transfer: looking up the corresponding word; adding gender, aspect, tense, etc. information
Indicative translation: Example
Source: Rohan goes to school by bus
Structural transfer (SVO to SOV): (Rohan) (to school) (by bus) (goes)
Lexical transfer, word translation: (Rohan) (paathshaala) (bus) (jaana)
Lexical transfer, adding GNP/TAM information: (Rohan) (paathshaala) (bus se) (jaata hai)
Indicative output: Rohan paathshaala bus se jaata hai
Natural Hindi order: Rohan bus se paathshaala jaata hai
(A toy sketch of these steps follows.)
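A toy sketch of the two transfer steps; the lexicon and the reordering rule are illustrative assumptions, not a real transfer system:

```python
# Toy lexicon: word lookup with GNP/TAM information already folded in
# (an illustrative assumption, not a real dictionary).
LEXICON = {"Rohan": "Rohan", "school": "paathshaala",
           "bus": "bus se", "goes": "jaata hai"}

def indicative_translation(subject, verb, obj, adjunct):
    reordered = [subject, obj, adjunct, verb]       # structural transfer: SVO -> SOV
    return " ".join(LEXICON[w] for w in reordered)  # lexical transfer

print(indicative_translation("Rohan", "goes", "school", "bus"))
# -> Rohan paathshaala bus se jaata hai
```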
Questions in the context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is "failure of MT" (defined using any measure) simply "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• Change in lexical category

E: I am feeling hungry
H: मुझे भूख लग रही है
mujhe bhookh lag rahi hai
(to-me hunger feeling is)
I: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
(I hungry feel doing am)
Unigram precision: 0; Bigram precision: 0; BLEU score: 0

E: I am sweating
H: मुझे पसीना आ रहा है
Mujhe paseena aa raha hai
(To-me sweat coming is)
Noun-noun compounds

E: The ten best Aamir Khan performances
H: आमिर ख़ान की दस सर्वोत्तम परफ़ॉर्मेंसेस
Aamir khaan ki dus sarvottam performances
(Aamir Khan-of ten best performances)
I: दस सर्वोत्तम आमिर ख़ान परफ़ॉर्मेंसेस
Dus sarvottam aamir khaan performances
(Ten best Aamir Khan performances)
Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences

E: Food, clothing and shelter are a man's basic needs
H: रोटी कपड़ा और मकान एक मनुष्य की बुनियादी ज़रूरतें हैं
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
(bread clothing and house a man-of basic needs are)
I: खाना कपड़ा और आश्रय एक मनुष्य की बुनियादी ज़रूरतें हैं
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
(food clothing and shelter a man-of basic needs are)
Unigram precision: 8/10; Bigram precision: 6/9
• Kinship terms such as 'maamaa'/'taau' show the same problem
Pleonastic divergence
• Words with no semantic content in the target language

E: It is raining
H: बारिश हो रही है
baarish ho rahi hai
(rain happening is)
I: यह बारिश हो रही है
yeh baarish ho rahi hai
(it rain happening is)
Unigram precision: 4/5; Bigram precision: 3/4

E: One should not trouble the weak
H: दुर्बलों को परेशान नहीं करना चाहिए
durbalon ko pareshaan nahi karna chahiye
(to-weak trouble not do should: "Should not trouble the weak")
Stylistic differences
• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा में 545 सदस्य हैं
lok sabha mein 545 sadasya hain
(Lok Sabha in 545 members are)
I: लोक सभा के पास 545 सदस्य हैं
lok sabha ke paas 545 sadasya hain
(Lok Sabha has/near 545 members are)
Unigram precision: 5/7; Bigram precision: 3/6
WSD errors
• Synonyms vs incorrect senses

E: I purchased a bat
H (reference): मैंने एक बल्ला खरीदा
maine ek ballaa kharidaa
(I a cricket-bat bought: "I bought a cricket bat")
I: मैंने एक चमगादड़ खरीदा
maine ek chamgaadaD kharidaa
(I a bat (mammal) bought: "I bought a bat (mammal)")

E: The thieves were held
H: चोरों को गिरफ़्तार किया
choron ko giraftaar kiyaa
(thieves arrest done: "The thieves were arrested")
I1: चोरों को पकड़ा
choron ko pakdaa
(thieves caught: "The thieves were caught", an acceptable synonym, yet a mismatch for BLEU)
I2: चोरों को आयोजित किया
choron ko aayojit kiya
(thieves organized done: "The thieves were organized", an incorrect sense, also a mismatch)
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
– Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating specific language divergences
– Limiting the use of BLEU [3][4]: do not use BLEU for radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU; use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics

Presented as a part of CS712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
(Source: www.jaishreeganesha.com)
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Preliminaries
• Candidate translation(s): translation returned by an MT system; also called hypothesis translation
• Reference translation(s): 'perfect' translation by humans
• Each metric is compared with BLEU on three axes: handling incorrect words, handling incorrect word order, handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of one to five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy: is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency: is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

Adequate but not fluent:
मैं एक व्याख्यान बैठा हूँ / Main ek vyaakhyan baitha hoon
(I a lecture sit, present first person: "I sit a lecture")

Fluent but not adequate:
मैं व्याख्यान हूँ / Main vyakhyan hoon
(I lecture am: "I am lecture")
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (for recall); a minimal sketch follows
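A minimal single-reference sketch of the BLEU idea, just to make the two ingredients concrete; the official metric is corpus-level and supports multiple references:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    # Modified n-gram precision: candidate counts are clipped at the
    # corresponding reference counts.
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = max(sum(cand.values()), 1)
        log_precisions.append(math.log(clipped / total) if clipped else float("-inf"))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1 - r / c)
    return brevity * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat".split(),
           "the cat sat on a mat".split()))  # > 0; zero if any n-gram level has no match
```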
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the edits required to change a hypothesis translation into a reference translation

Source: Prime Minister of India will address the nation today
Reference translation: भारत के प्रधान-मंत्री आज राष्ट्र को संबोधित करेंगे
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: प्रधान-मंत्री भारत के आज को एड्रेस करेंगे आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj
Errors: one shift, one deletion, one substitution, one insertion

Formula for TER
TER = # edits / avg. number of reference words
• The 'distance' of a shift is not incorporated into its cost
• Mis-capitalization is also considered an error
Here: 4 edits against 8 reference words, so TER = 4/8 = 0.5
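The arithmetic as a sketch, with the edit counts above taken as given; a real TER implementation searches for the cheapest shift/edit sequence:

```python
def ter(shifts, insertions, deletions, substitutions, ref_lengths):
    # TER = total edits divided by the average reference length.
    edits = shifts + insertions + deletions + substitutions
    avg_ref_words = sum(ref_lengths) / len(ref_lengths)
    return edits / avg_ref_words

print(ter(1, 1, 1, 1, ref_lengths=[8]))  # -> 0.5
```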
HTER: Human TER
• A semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
– They must consider fluency and adequacy while generating this closest target translation
TER vs BLEU
• Handling incorrect words: TER uses substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: TER incorporates this via a shift (or delete + insert); BLEU sees an n-gram mismatch
• Handling recall: in TER, missed words become deleted words (TER = edits / avg. number of reference words); BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions needed to convert the generated text (hyp) into the reference text (ref)
• A position-independent word error rate (PER) also exists
WER: Example
• The order of words is important
• A dynamic-programming implementation (sketched below) finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
# errors = 3 (looks -> seems, 'like' deleted, correct -> right)
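A compact sketch of the standard dynamic program (Levenshtein distance over words):

```python
def wer(hyp, ref):
    """Word error rate via the Levenshtein distance over words."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1] / len(ref)

ref = "This looks like the correct sentence".split()
hyp = "This seems the right sentence".split()
print(wer(hyp, ref))  # 3 errors / 6 reference words = 0.5
```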
WER vs BLEU
• Handling incorrect words: WER uses substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: WER uses delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
WER vs TER
• WER: a shift is not taken into account (it costs a deletion plus an insertion); TER: a shift is incorporated, though not weighted by distance
• WER: the intuition is based on errors in unigram translation; TER: the intuition is based on the edits a human would require (TER = edits / avg. number of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
• ROUGE-N incorporates recall
• Will BLEU be able to judge the quality of long sentences?

Reference translation: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation: लंबे वाक्य / Lambe vaakya
ROUGE-N = 1/8 (for N = 2: one matching bigram out of the 8 reference bigrams)
Modified n-gram precision = 1
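A simplified ROUGE-N sketch for this example, computing recall of reference n-grams; the released ROUGE package additionally handles multiple references and count clipping:

```python
def rouge_n(candidate, reference, n):
    # n-gram recall: fraction of reference n-grams found in the candidate.
    grams = lambda t: [tuple(t[i:i + n]) for i in range(len(t) - n + 1)]
    ref, cand = grams(reference), set(grams(candidate))
    return sum(g in cand for g in ref) / len(ref)

reference = "kya bloo lambe waakya ki gunvatta ko samajh paaega".split()
candidate = "lambe waakya".split()
print(rouge_n(candidate, reference, 2))  # 1/8 = 0.125
```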
Other ROUGEs
• ROUGE-L: considers the longest common subsequence (see the sketch below)
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length
• ROUGE-S: precision/recall by matching skip bigrams
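The core of ROUGE-L is the longest common subsequence; a minimal sketch (the toy sentences are our own):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (basis of ROUGE-L)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if wa == wb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

ref = "the cat sat on the mat".split()
cand = "the cat lay on the mat".split()
lcs = lcs_length(cand, ref)
recall, precision = lcs / len(ref), lcs / len(cand)
print(lcs, recall, precision)  # ROUGE-L combines these into an F-measure
```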
ROUGE (suite of metrics) vs BLEU
• Handling incorrect words: ROUGE uses skip bigrams (ROUGE-S) and ROUGE-N; BLEU sees an n-gram mismatch
• Handling incorrect word order: ROUGE uses the longest common subsequence; BLEU sees an n-gram mismatch
• Handling recall: ROUGE-N incorporates missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy

METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process
Find an alignment between words:
• Phase I: list all possible mappings based on matches
• Phase II: select the best possible alignment, i.e., the one with the least number of 'alignment crosses/overlaps'
• Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced / The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean
• A penalty is then applied based on the number of chunks the matches form:
Reference: The bright boy sits on the black bench
Candidate: The intelligent guy sat on the dark bench
• More accurate: fewer chunks, less penalty; less accurate: more chunks, more penalty
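A sketch of the scoring stage, assuming the alignment phase has already produced the match and chunk counts; the parameters follow the published formula in [7], while the example numbers are illustrative:

```python
# Banerjee & Lavie (2005): Fmean = 10PR / (R + 9P),
# Penalty = 0.5 * (chunks / matches)^3, Score = Fmean * (1 - Penalty).
def meteor_score(matches, cand_len, ref_len, chunks):
    if matches == 0:
        return 0.0
    precision, recall = matches / cand_len, matches / ref_len
    fmean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (chunks / matches) ** 3
    return fmean * (1 - penalty)

# Illustrative counts for the bench example above (not from a real aligner):
# 5 matched unigrams forming 3 chunks, both sentences 8 words long.
print(meteor_score(matches=5, cand_len=8, ref_len=8, chunks=3))  # ~0.557
```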
METEOR vs BLEU
• Handling incorrect words: METEOR uses alignment chunks, and matching can be done with different techniques, so it is adaptable; BLEU sees an n-gram mismatch
• Handling incorrect word order: METEOR's chunks may be ordered in any manner, and METEOR does not capture this; BLEU sees an n-gram mismatch
• Handling recall: METEOR's idea of alignment incorporates missing-word handling; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
NIST Metric: Introduction [8]
• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given

NIST Metric (formula slide; see [8] for the full definition)
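A sketch of the information weight from [8] for n >= 2; the counts below are toy numbers, not real corpus statistics:

```python
import math

# Info(w1..wn) = log2( count(w1..w_{n-1}) / count(w1..wn) ):
# the rarer the continuation, the higher its weight in the NIST score.
def info(ngram, counts):
    return math.log2(counts[ngram[:-1]] / counts[ngram])

# Toy reference-corpus counts (illustrative only):
counts = {("of",): 1000, ("of", "the"): 900,
          ("amendment",): 5, ("amendment", "act"): 1}
print(info(("of", "the"), counts))         # ~0.15: frequent, little credit
print(info(("amendment", "act"), counts))  # ~2.32: rare, much more credit
```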
NIST vs BLEU
• Handling incorrect words: NIST's information gain gives an additional level of 'matching'; BLEU sees an n-gram mismatch
• Handling incorrect word order: NIST has no additional provision, only the n-gram mismatch (weighted by information gain); BLEU sees an n-gram mismatch
• Handling recall: NIST applies a brevity penalty factor based on Lsys and Lref; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment': what does a BLEU score of 0.006 mean?
• Comparison across systems is easier

GTM: Scores precision and recall (formula slide)
• MMS: Maximum Match Size (see the sketch below)
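A simplified, greedy sketch of GTM-style scoring; real GTM computes a maximum matching, and the run-size formula here is one common parameterization, not necessarily the tool's exact default:

```python
def gtm_fscore(candidate, reference, e=2):
    # Greedily take the longest common run first, then size the match as
    # (sum of run_len**e)**(1/e) so longer contiguous runs are rewarded.
    cand, ref, runs = list(candidate), list(reference), []
    while True:
        best = (0, 0, 0)  # (run length, cand index, ref index)
        for i in range(len(cand)):
            for j in range(len(ref)):
                k = 0
                while (i + k < len(cand) and j + k < len(ref)
                       and cand[i + k] is not None
                       and cand[i + k] == ref[j + k]):
                    k += 1
                best = max(best, (k, i, j))
        k, i, j = best
        if k == 0:
            break
        runs.append(k)
        cand[i:i + k] = [None] * k  # consume the matched words
        ref[j:j + k] = [None] * k
    size = sum(r ** e for r in runs) ** (1 / e)
    precision, recall = size / len(cand), size / len(ref)
    return 2 * precision * recall / (precision + recall)

print(gtm_fscore("the cat sat on the mat".split(),
                 "the cat lay on the mat".split()))  # ~0.60
```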
GTM vs BLEU
• Handling incorrect words: GTM bases precision on the Maximum Match Size; BLEU sees an n-gram mismatch
• Handling incorrect word order: GTM handles it by considering maximum runs; BLEU sees an n-gram mismatch
• Handling recall: GTM bases recall on the Maximum Match Size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है / mujhe bhookh lag rahi hai
Candidate: मैं भूखा महसूस कर रहा हूँ / main bhookha mehsoos kar raha hoon
BLEU score: 0, yet the candidate is clearly entailed by the reference translation
Entailment-based evaluation [12]
• Evaluation of MT output for adequacy is an entailment task
• A candidate translation (i.e., MT system output) should entail the reference translation, and vice versa
The Stanford RTE system
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not

RTE for MT evaluation
• The same pipeline is applied: analysis, alignment, entailment prediction
• Directionality: the standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate (a sketch follows)
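A hypothetical sketch of how bidirectional entailment could be turned into an adequacy score; entails() stands in for a real RTE system and is not an actual API:

```python
def adequacy_score(candidate, reference, entails):
    """Hypothetical bidirectional entailment score for MT evaluation.

    `entails(premise, hypothesis)` stands in for a real RTE system
    (e.g., the Stanford one) returning a confidence in [0, 1]."""
    forward = entails(candidate, reference)   # system output => reference
    backward = entails(reference, candidate)  # reference => system output
    return min(forward, backward)  # adequate only if both directions hold
```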
Experimental Setup
• Single reference translation; two MT dataset families with English as the target language
• NIST MT datasets:
– NIST06: the NIST MT-06 Arabic-to-English dataset
– NIST08A: the NIST MT-08 Arabic-to-English dataset
– NIST08C: the NIST MT-08 Chinese-to-English dataset
– NIST08U: the NIST MT-08 Urdu-to-English dataset
• ACL SMT datasets:
– SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
– SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
– SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
– SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup (contd.)
• Four systems are used: BLEU-4, MT, RTE, RTE + MT
• Experiment 1: individual corpora
• Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by considering the constituents of the formula and by calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on the maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words based on word order; also PER
• TER: based on the edits required to obtain the reference translation; also Human TER
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Issues with precision (22)
Candidate राजा और रानी शकार करन जगल म गय
raaja aur raani shikaar karne jungal mein chale gayeKing and queen to-do-hunting to-jungle went
Reference 1 राजा और रानी जगल को शकार क लए गय
raaja aur raani jangal ko shikaar ke liye gayeKing and queen to-jungle for-hunting went
Reference 2 राजा और उनक बीवी शकार करन जगल गय raaja aur unki biwi shikaar karne jangal gayeking and his wife to-do-hunting jungle went
Matching bi-grams = 4 8
Candidate शकार करन जगल राजा और रानी म गयshikaar karne jungle raaja aur raani mein gayeto-do hunting jungle raja and rani in went (grammatically incorrect)
The king and the queen went to the jungle to hunt
Matching bi-grams = 4 8
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Permuting phrases in generalbull For lsquobrsquo bi-gram matches in a candidate translation of length
lsquokrsquo(k ndash b) possible ways to generate similarly score items using only the words
in this translation
In our example (8-4) = 24 candidate translations
In sentence of length ktotal bigrams = k ndash 1
matched bigrams = b
no of mismatched bigrams = k ndash 1 ndash bno of matched chunks = k ndash 1 ndash b + 1
= k ndash b
These (k-b) chunks can be reordered in (k ndash b) ways
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score
• Do not be overly reliant on BLEU; use it only as an 'evaluation understudy'
Re-defining a match [4]:
• Allowing synonyms
• Allowing root forms of words
• Incorporating specific language divergences

Limiting the use of BLEU [3][4]:
• Do not use BLEU for radically different systems
• Evaluate on the basis of the nature of the indicative translation
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS712 by: Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
(Source: www.jaishreeganesha.com)
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate Translation(s): translation returned by an MT system (also called hypothesis translation)
• Reference Translation(s): 'perfect' translation by humans
• Comparing metrics with BLEU along three axes:
– Handling incorrect words
– Handling incorrect word order
– Handling recall
Outline
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
• Manual Evaluation
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy: is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency: is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

मैं एक व्याख्यान बैठा हूँ
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
I sit a lecture
Adequate but not fluent

मैं व्याख्यान हूँ
Main vyakhyan hoon
I lecture am
I am lecture
Fluent but not adequate
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision, and a brevity penalty (for recall)
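For reference, sentence-level BLEU is available off the shelf; a quick usage sketch with NLTK (assuming NLTK is installed; smoothing is used because unsmoothed BLEU collapses to zero whenever some n-gram order has no match):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "this is a small test".split()
    candidate = "this is test".split()

    # BLEU-4 with the default uniform weights over 1- to 4-grams.
    score = sentence_bleu([reference], candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(round(score, 4))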
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the edits required to change a hypothesis translation into a reference translation
E: Prime Minister of India will address the nation today
Reference translation:
भारत के प्रधान-मंत्री आज राष्ट्र को संबोधित करेंगे
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation:
प्रधान-मंत्री भारत के आज को एड्रेस करेंगे आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj
Edit types marked on the candidate: shift, deletion, substitution, insertion
Formula for TER
TER = #edits / average number of reference words
• The 'distance' of a shift is not incorporated in its cost
• Mis-capitalization is also considered an error
For the example above, one shift, one deletion, one substitution and one insertion give TER = 4/8 = 0.5
HTER: Human TER
• Semi-supervised technique
• Human annotators make a new reference for each translation, based on the system output
– They must consider fluency and adequacy while generating this closest target translation
TER vs. BLEU
• Handling incorrect words: TER counts a substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: TER counts a shift (or a delete + insert); BLEU sees an n-gram mismatch
• Handling recall: in TER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence its brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
• A position-independent variant also exists: position-independent word error rate (PER)
WER: Example
• The order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
# errors: 3 (two substitutions: seems/looks and right/correct; one deletion: like)
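The dynamic program is the standard word-level Levenshtein computation; a minimal sketch that reproduces the three errors above (WER then divides by the reference length):

    def word_errors(ref, hyp):
        # d[i][j]: edit distance between the first i reference words
        # and the first j hypothesis words.
        r, h = ref.split(), hyp.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i                      # delete all i reference words
        for j in range(len(h) + 1):
            d[0][j] = j                      # insert all j hypothesis words
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[-1][-1]

    ref = "this looks like the correct sentence"
    hyp = "this seems the right sentence"
    e = word_errors(ref, hyp)
    print(e, e / len(ref.split()))   # 3 edits, WER = 3/6 = 0.5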
WER vs. BLEU
• Handling incorrect words: WER counts a substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: WER counts a delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence its brevity penalty
WER vs. TER
• Shifts: WER does not take shifts into account (a moved word costs a delete + insert); TER incorporates shifts, though the cost is not weighted by distance
• Intuition: WER is based on errors in unigram translation; TER is based on the edits a human would require (TER = #edits / avg. number of reference words)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
• ROUGE-N incorporates recall
• Will BLEU be able to judge the quality of long sentences?

Reference translation:
क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation:
लंबे वाक्य
Lambe vaakya

ROUGE-N (N = 2): 1/8, since only one of the reference's eight bigrams is matched
Modified bigram precision: 1, since the candidate's only bigram occurs in the reference
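ROUGE-N itself is just n-gram recall against the reference; a minimal sketch reproducing the 1/8 above on the transliterated tokens:

    from collections import Counter

    def rouge_n(candidate, reference, n):
        # Fraction of the reference's n-grams that appear in the candidate.
        def ngrams(tokens):
            return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        cand, ref = ngrams(candidate.split()), ngrams(reference.split())
        matched = sum(min(count, cand[gram]) for gram, count in ref.items())
        return matched / sum(ref.values())

    ref = "kya bloo lambe waakya ki gunvatta ko samajh paaega"
    print(rouge_n("lambe waakya", ref, 2))   # 1/8 = 0.125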
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length
• ROUGE-S: precision/recall over matching skip bigrams
ROUGE (suite of metrics) vs. BLEU
• Handling incorrect words: ROUGE uses skip bigrams and ROUGE-N; BLEU sees an n-gram mismatch
• Handling incorrect word order: ROUGE uses the longest common subsequence; BLEU sees an n-gram mismatch
• Handling recall: ROUGE-N incorporates missing words; BLEU's precision cannot detect 'missing' words, hence its brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process
Phase I: find an alignment between words by listing all possible unigram mappings based on matches
Phase II: select the best possible alignment, i.e. the one with the least number of 'alignment crosses/overlaps'
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Reference: The intelligent and excited boy jumped
The excited and intelligent guy pounced
Candidate: The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean
• Penalty: group the matched unigrams into the fewest possible contiguous chunks
The bright boy sits on the black bench
The intelligent guy sat on the dark bench
• More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
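In symbols, with P and R the unigram precision and recall computed over the final alignment (formulas as given in the METEOR paper [7]):

    F_{mean} = \frac{10PR}{R + 9P}
    \mathrm{Penalty} = 0.5\left(\frac{\#\mathrm{chunks}}{\#\mathrm{matched\ unigrams}}\right)^{3}
    \mathrm{Score} = F_{mean} \cdot (1 - \mathrm{Penalty})

Recall is weighted nine times as heavily as precision, and a more fragmented matching (more chunks) is penalized more.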
METEOR vs. BLEU
• Handling incorrect words: METEOR uses alignment chunks, and matching can be done using different techniques (adaptable); BLEU sees an n-gram mismatch
• Handling incorrect word order: METEOR's chunks may be ordered in any manner, so METEOR does not capture this; BLEU sees an n-gram mismatch
• Handling recall: METEOR's idea of alignment incorporates missing-word handling; BLEU's precision cannot detect 'missing' words, hence its brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e. rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
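Concretely, the information weight of a matched n-gram is computed from its counts in the reference data (following Doddington [8]):

    \mathrm{Info}(w_1 \ldots w_n) = \log_2\left(\frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}\right)

A rare continuation makes the ratio large, so 'Amendment Act' earns more credit than 'of the'.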
NIST vs. BLEU
• Handling incorrect words: NIST adds information gain, an additional level of 'matching'; BLEU sees an n-gram mismatch
• Handling incorrect word order: NIST has no additional provision beyond the n-gram mismatch (weighted by information gain); BLEU sees an n-gram mismatch
• Handling recall: NIST uses a brevity penalty factor based on Lsys and Lref; BLEU's precision cannot detect 'missing' words, hence its brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
GTM [10]
• General Text Matcher
• An F-score: uses both precision and recall
• Does not rely on correlation with 'human judgment' (what does a BLEU score of 0.006 mean?)
• Comparison across systems is easier
GTM: Scores precision and recall
• MMS: Maximum Match Size
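Writing MMS(C, R) for the maximum match size between candidate C and reference R, the scores take the usual precision/recall/F form; a sketch of the definitions (GTM's extra weighting of longer contiguous runs is omitted here):

    \mathrm{precision} = \frac{\mathrm{MMS}(C, R)}{|C|}, \qquad
    \mathrm{recall} = \frac{\mathrm{MMS}(C, R)}{|R|}, \qquad
    F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}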
GTM vs. BLEU
• Handling incorrect words: GTM's precision is based on the maximum match size; BLEU sees an n-gram mismatch
• Handling incorrect word order: GTM handles it by considering maximum runs; BLEU sees an n-gram mismatch
• Handling recall: GTM's recall is based on the maximum match size; BLEU's precision cannot detect 'missing' words, hence its brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (ref): मुझे भूख लग रही है
mujhe bhookh lag rahi hai
Candidate: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
BLEU score: 0, yet the candidate is clearly entailed by the reference translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy is an entailment task:
a candidate translation (i.e. the MT system output) should entail the reference translation, and vice versa
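The bidirectional requirement is easy to state in code; entails() below is a hypothetical stand-in for a trained RTE classifier (such as the Stanford system described next), not a real API:

    def entails(premise: str, hypothesis: str) -> bool:
        # Hypothetical RTE predicate: in practice a trained classifier
        # (analysis -> alignment -> entailment decision), not a stub.
        raise NotImplementedError

    def is_adequate(candidate: str, reference: str) -> bool:
        # Fully adequate only if entailment holds in BOTH directions.
        return entails(candidate, reference) and entails(reference, candidate)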
The Stanford RTE system
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not

RTE for MT evaluation
The same three-stage pipeline (analysis, alignment, entailment), with one change.
Directionality: the standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate
Experimental setup: single reference translation; two MT datasets with English as the target language.
NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
Four systems are used: BLEU-4, MT, RTE, and RTE + MT
Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by considering the constituents of the formula and by calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on the maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: errors in words, based on word order; also PER
• TER: based on the edits required to get the reference translation; also human TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM research report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. Proceedings of AMTA 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of HLT 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of machine translation and its evaluation. Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
Permuting phrases Evaluation
bull 2005 NIST evaluation databull MT system (2nd best in the task) produced candidate
translation
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Drawing different items from the reference set
bull If two systems lsquorecallrsquo x words each from a reference sentence
bull Precision may remain the samebull Brevity penalty remains the same
bull Translation quality need not remain the same
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Drawing different items Example
You may omit lexical items altogether
Candidate 2 घर क मग दालGhar murgi daal hotiHouse chicken daal is
Unigram precision 1 1Bigram precision 2 3
Referenceघर क मग मतलब दाल होती ह Ghar ki murgi matlab daal jaisi hoti hai
House-of chicken means daal-like isChicken at home is like daal
Candidate 1मग मतलब होती ह Ki matlab jaisi haiOf means like is
Unigram precision 1 1Bigram precision 2 3
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi)

• Structural transfer
  - S V O to S O V
• Lexical transfer
  - Looking up the corresponding word
  - Adding gender, aspect, tense, etc. information
Indicative translation: Example

E: Rohan goes to school by bus

Structural transfer (SVO to SOV):
(Rohan) (to school) (by bus) (goes)

Lexical transfer (word translation):
(Rohan) (paathshaala) (bus) (jaana)

Lexical transfer (adding GNP/TAM information):
(Rohan) (paathshaala) (bus se) (jaata hai)

Output: Rohan paathshaala bus se jaata hai
(cf. the more natural order: Rohan bus se paathshaala jaata hai)

A toy sketch of these transfer steps follows.
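The lexicon, case_markers table, and indicative_translation function below are all invented for illustration; a real system would parse the source and generate morphology properly:

lexicon = {"Rohan": "Rohan", "school": "paathshaala",
           "bus": "bus", "goes": "jaata hai"}
case_markers = {"by": "se", "to": ""}  # prepositions become postpositions

def indicative_translation(subject, verb, phrases):
    # Structural transfer: SVO -> SOV (the verb moves to the end).
    parts = [subject] + phrases + [verb]
    out = []
    for phrase in parts:
        words = phrase.split()
        if words[0] in case_markers:  # "by bus" -> "bus se"
            prep, noun = words[0], words[1]
            out.append((lexicon.get(noun, noun) + " " + case_markers[prep]).strip())
        else:
            # Lexical transfer: word-by-word lookup.
            out.append(lexicon.get(phrase, phrase))
    return " ".join(out)

print(indicative_translation("Rohan", "goes", ["to school", "by bus"]))
# -> "Rohan paathshaala bus se jaata hai"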
Questions in the context of indicative translations [4]

• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, when the system produces indicative translations?
• Is "failure of MT" (defined using any measure) simply "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences

• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence

• A change in lexical category

E: I am feeling hungry
H: मुझे भूख लग रही है
mujhe bhookh lag rahi hai
to-me hunger feeling is
I: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
I hungry feel doing am
Unigram precision: 0; Bigram precision: 0; BLEU score: 0

E: I am sweating
H: मुझे पसीना आ रहा है
Mujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds

E: The ten best Aamir Khan performances
H: आमिर ख़ान की दस सर्वोत्तम परफॉर्मेंसेस
Aamir khaan ki dus sarvottam performances
Aamir Khan-of ten best performances
I: दस सर्वोत्तम आमिर ख़ान परफॉर्मेंसेस
Dus sarvottam aamir khaan performances
Ten best Aamir Khan performances
Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences

• e.g. 'maamaa'/'taau': distinct kinship terms that both correspond to English 'uncle'

E: Food, clothing and shelter are a man's basic needs
H: रोटी कपड़ा और मकान एक मनुष्य की बुनियादी ज़रूरतें हैं
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
bread clothing and house a man of basic needs are
I: खाना कपड़ा और आश्रय एक मनुष्य की बुनियादी ज़रूरतें हैं
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
food clothing and shelter a man of basic needs are
Unigram precision: 8/10; Bigram precision: 6/9
Pleonastic divergence

• Words with no semantic content in the target language

E: It is raining
H: बारिश हो रही है
baarish ho rahi hai
rain happening is
I: यह बारिश हो रही है
yeh baarish ho rahi hai
it rain happening is
Unigram precision: 4/5; Bigram precision: 3/4

E: One should not trouble the weak
H: दुर्बलों को परेशान नहीं करना चाहिए
durbalon ko pareshaan nahi karna chahiye
to-weak trouble not do should
(Should not trouble the weak)
Stylistic differences

• Typical styles of expression in a language

E: The Lok Sabha has 545 members
H: लोक सभा में 545 सदस्य हैं
lok sabha mein 545 sadasya hain
Lok Sabha in 545 members are
I: लोक सभा के पास 545 सदस्य हैं
lok sabha ke paas 545 sadasya hain
Lok Sabha has/near 545 members are
Unigram precision: 5/7; Bigram precision: 3/6
WSD errors

• Synonyms vs. incorrect senses

E: I purchased a bat
H (reference): मैंने एक बल्ला खरीदा
maine ek ballaa kharidaa
I a cricket-bat bought
(I bought a cricket bat)
I: मैंने एक चमगादड़ खरीदा
maine ek chamgaadaD kharidaa
I a bat (mammal) bought
(I bought a bat, the mammal)

E: The thieves were held
H: चोरों को गिरफ़्तार किया
choron ko giraftaar kiyaa
thieves arrest done
(The thieves were arrested)
I1: चोरों को पकड़ा
choron ko pakdaa
thieves caught
(The thieves were caught)
I2: चोरों को आयोजित किया
choron ko aayojit kiyaa
thieves organized done
(The thieves were organized)
Evaluation: Precision

• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU

• Failed correlation with human judgment [3]
• Changes to the BLEU score have been suggested
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
Re-defining a match [4]:
• Allowing synonyms
• Allowing root forms of words
• Incorporating language-specific divergences

Limiting the use of BLEU [3], [4]:
• Do not use BLEU to compare radically different systems
• Evaluate on the basis of the nature of indicative translation
Conclusion

• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences arise in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics

Presented as a part of CS712 by
Aditya Joshi, Kashyap Popat, Shubham Gautam

21st March 2013

Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.

(Image source: www.jaishreeganesha.com)

Lord Karthikeya: "I won! I went around the earth, the world, once."

Lord Ganesha: "I won! I went around my parents. They are my world."

Who among them performed better? Who won the race?
Outline

• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries

• Candidate Translation(s): the translation returned by an MT system
  - Also called the hypothesis translation
• Reference Translation(s): a 'perfect' translation produced by humans
• In what follows, each metric is compared with BLEU on three aspects:
  - Handling incorrect words
  - Handling incorrect word order
  - Handling recall
Manual evaluation [11]

Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency

• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy: is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency: is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

मैं एक व्याख्यान बैठा हूँ
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
"I sit a lecture": adequate but not fluent

मैं व्याख्यान हूँ
Main vyakhyan hoon
I lecture am
"I am lecture": fluent but not adequate
BLEU [1]

• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (as a stand-in for recall); a minimal sketch follows
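As a refresher of the formulation from Part I, a single-sentence sketch (actual BLEU is computed over a whole test corpus, possibly with multiple references):

import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    # Geometric mean of clipped n-gram precisions times a brevity penalty.
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if total == 0 or clipped == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(clipped / total))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat".split(), "the cat sat on the mat".split()))  # 1.0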
Translation Edit Rate (TER) [5]

• Introduced in the GALE MT task

Central idea: the number of edits required to change a hypothesis translation into a reference translation.

E: Prime Minister of India will address the nation today
Reference translation: भारत के प्रधान-मंत्री आज राष्ट्र को संबोधित करेंगे
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: प्रधान-मंत्री भारत के आज को एड्रेस करेंगे आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj

The candidate needs four edits to match the reference: one shift, one deletion, one substitution, and one insertion.

Formula for TER:

TER = # edits / avg. number of reference words

• The 'distance' of a shift is not incorporated in its cost
• Mis-capitalization is also considered an error

Here, TER = 4/8 = 0.5. A simplified implementation sketch follows.
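In this sketch, shifts are not implemented, so a moved block is paid for as a deletion plus an insertion, and the score over-counts relative to true TER (5 edits instead of 4 for the example above):

def simple_ter(hyp, ref):
    # Word-level edit distance (insert/delete/substitute) / reference length.
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n] / n

hyp = "pradhaan-mantri bhaarat ke aaj ko address karenge aaj".split()
ref = "bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge".split()
print(simple_ter(hyp, ref))  # 5/8 = 0.625, vs. TER's 4/8 with a one-edit shift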
HTER: Human-targeted TER

• A semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
  - They must preserve fluency and adequacy while generating the closest possible target translation
TER vs. BLEU

Handling incorrect words:
  TER: substitution
  BLEU: n-gram mismatch

Handling incorrect word order:
  TER: shift, or delete + insert, incorporates this error
  BLEU: n-gram mismatch

Handling recall:
  TER: missed words become deleted words (TER = # edits / avg. number of reference words)
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
Word Error Rate (WER) [9]

• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions, and insertions needed to convert the generated text (hyp) into the reference text (ref)
• There is also a position-independent word error rate (PER)
WER: Example

• The order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors (see the sketch below)

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence

# errors = 3 (two substitutions and one deletion), so WER = 3/6 = 0.5
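A compact sketch of both WER and its position-independent variant PER (the per formulation below is one common way of counting position-independent errors; treat it as an assumption):

from collections import Counter

def wer(hyp, ref):
    # Levenshtein distance over words, normalized by reference length.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (h != r)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def per(hyp, ref):
    # Position-independent: only word counts matter, not order.
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)

ref = "this looks like the correct sentence".split()
hyp = "this seems the right sentence".split()
print(wer(hyp, ref))  # 3/6 = 0.5
print(per(hyp, ref))  # also 0.5 here: the errors are not ordering errors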
WER vs. BLEU

Handling incorrect words:
  WER: substitution
  BLEU: n-gram mismatch

Handling incorrect word order:
  WER: delete + insert
  BLEU: n-gram mismatch

Handling recall:
  WER: missed words become deleted words
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
WER vs. TER

Handling shifts:
  WER: a shift is not taken into account (it costs a deletion plus an insertion)
  TER: a shift is incorporated as a single edit, though not weighted by its distance

Intuition:
  WER: based on word-level errors in the translation
  TER: based on the edits a human would need to make (TER = # edits / avg. number of reference words)
ROUGE [6]

• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S
ROUGE-N

ROUGE-N incorporates recall (a sketch follows below).

E: Will BLEU be able to understand the quality of long sentences?
Reference translation: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation: लंबे वाक्य
Lambe waakya

ROUGE-2 (bigram recall): 1/8 (the reference has eight bigrams; the candidate's single bigram matches)
Modified bigram precision: 1
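A minimal sketch of ROUGE-N as n-gram recall (single reference, whitespace tokenization):

from collections import Counter

def rouge_n(candidate, reference, n=2):
    # Fraction of the reference's n-grams that appear in the candidate.
    ngrams = lambda words: Counter(tuple(words[i:i + n])
                                   for i in range(len(words) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

ref = "kya bloo lambe waakya ki gunvatta ko samajh paaega".split()
cand = "lambe waakya".split()
print(rouge_n(cand, ref, n=2))  # 1/8 = 0.125: perfect precision, terrible recall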
Other ROUGE variants

• ROUGE-L
  - Considers the longest common subsequence
• ROUGE-W
  - Weighted ROUGE-L: all common subsequences are considered, with weights based on length
• ROUGE-S
  - Precision/recall over matching skip-bigrams
ROUGE (suite of metrics) vs. BLEU

Handling incorrect words:
  ROUGE: skip-bigrams, ROUGE-N
  BLEU: n-gram mismatch

Handling incorrect word order:
  ROUGE: longest common subsequence
  BLEU: n-gram mismatch

Handling recall:
  ROUGE: ROUGE-N incorporates missing words
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
METEOR [7]

• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy

METEOR: Criticisms of BLEU

• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Finding an alignment between words

Phase I: list all possible unigram mappings based on matches.
Phase II: select the best possible alignment, i.e. the one with the least number of 'alignment crosses/overlaps'.

Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Reference: The intelligent and excited boy jumped
Candidate: The intelligent and excited dude jumped
(stem/synonym matching can also align a candidate such as: The excited and intelligent guy pounced)
METEOR: The score

• Using the unigram mappings, precision and recall are calculated, and their harmonic mean (weighted towards recall) is taken
• A penalty is then applied based on chunks: find the fewest 'chunks' of contiguous matched unigrams

The bright boy sits on the black bench
The intelligent guy sat on the dark bench

More accurate -> fewer chunks -> lower penalty
Less accurate -> more chunks -> higher penalty

(The scoring formula is sketched below.)
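The scoring step itself is small; a sketch using the constants from Banerjee and Lavie (2005), where recall is weighted nine times more than precision and the chunk penalty is cubic:

def meteor_score(matches, cand_len, ref_len, chunks):
    # matches: matched unigrams; chunks: maximal runs of contiguous matches.
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    f_mean = (10 * precision * recall) / (recall + 9 * precision)
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# 'The bright boy sits on the black bench' vs.
# 'The intelligent guy sat on the dark bench':
# exact matches 'the', 'on the', 'bench' -> 4 matched words in 3 chunks.
print(meteor_score(matches=4, cand_len=8, ref_len=8, chunks=3))  # ~0.39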
METEOR vs. BLEU

Handling incorrect words:
  METEOR: alignment chunks; matching can be done with different techniques, so it is adaptable
  BLEU: n-gram mismatch

Handling incorrect word order:
  METEOR: chunks may be ordered in any manner; METEOR does not capture this
  BLEU: n-gram mismatch

Handling recall:
  METEOR: the idea of alignment incorporates the handling of missing words
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
NIST Metric: Introduction [8]

• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e. the rarer ones)
  - Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given (a sketch of the information weight follows)
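The information weight can be sketched for the unigram case as follows (n-gram weights use the count of the (n-1)-gram prefix in place of the corpus size; the toy corpus here is invented for illustration):

import math
from collections import Counter

ref_words = ("the amendment act was passed by the house and "
             "the bill of the act went to the president").split()
counts = Counter(ref_words)
total = len(ref_words)

def unigram_info(word):
    # Info(w) = log2(total words / count of w): rare words weigh more.
    return math.log2(total / counts[word])

print(unigram_info("the"))        # frequent word: low information (~1.85)
print(unigram_info("amendment"))  # rare word: high information (~4.17)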
NIST vs. BLEU

Handling incorrect words:
  NIST: information gain provides an additional level of 'matching'
  BLEU: n-gram mismatch

Handling incorrect word order:
  NIST: n-gram mismatch (weighted by information gain); no additional provision
  BLEU: n-gram mismatch

Handling recall:
  NIST: brevity penalty factor based on L_sys and L_ref
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
GTM [10]

• General Text Matcher
• An F-score: it uses both precision and recall
• Does not rely on correlation with 'human judgment'
  - What does a BLEU score of 0.006 mean?
• This makes comparison across systems easier
GTM: Scores, Precision and Recall

• MMS: Maximum Match Size, computed from runs of contiguous matching words (a rough sketch follows)
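A rough sketch of the idea, assuming whitespace tokens; this greedy version repeatedly grabs the longest common run and so only approximates the true maximum matching:

def match_size(cand, ref, e=2.0):
    # Greedily take the longest common run; a run of length r scores r**e,
    # so longer (order-preserving) runs are rewarded.
    best_len, best = 0, (0, 0)
    for i in range(len(cand)):
        for j in range(len(ref)):
            r = 0
            while (i + r < len(cand) and j + r < len(ref)
                   and cand[i + r] == ref[j + r]):
                r += 1
            if r > best_len:
                best_len, best = r, (i, j)
    if best_len == 0:
        return 0.0
    i, j = best
    return best_len ** e + match_size(cand[:i] + cand[i + best_len:],
                                      ref[:j] + ref[j + best_len:], e)

def gtm_fscore(cand, ref, e=2.0):
    mms = match_size(cand, ref, e) ** (1 / e)  # generalized match size
    p, r = mms / len(cand), mms / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

cand = "the boy with green eyes smiled".split()
ref = "the green eyed boy smiled".split()
print(gtm_fscore(cand, ref))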
GTM vs. BLEU

Handling incorrect words:
  GTM: precision based on Maximum Match Size
  BLEU: n-gram mismatch

Handling incorrect word order:
  GTM: by considering maximum runs
  BLEU: n-gram mismatch

Handling recall:
  GTM: recall based on Maximum Match Size
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
Entailment-based evaluation [12]

Why entailment?

E: I am feeling hungry
H (Ref): मुझे भूख लग रही है
mujhe bhookh lag rahi hai
Candidate: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon

BLEU score: 0, yet the candidate is clearly entailed by the reference translation.
Entailment-based evaluation [12]

Evaluating MT output for adequacy is treated as an entailment task: a candidate translation (i.e. the MT system output) should entail the reference translation, and vice versa.

The Stanford RTE system pipeline:
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
RTE for MT evaluation

The same pipeline is applied: analysis (POS tagging, dependency parsing, NER, etc.), alignment of words and phrases, and entailment prediction.

Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate (a sketch follows).
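A sketch of the bidirectional check; entails() below stands in for an RTE system's confidence score (e.g. the pipeline above) and is an assumed interface, not a real API:

def adequacy_score(candidate, reference, entails):
    # entails(premise, hypothesis) -> confidence in [0, 1] (hypothetical).
    forward = entails(candidate, reference)   # candidate => reference
    backward = entails(reference, candidate)  # reference => candidate
    return min(forward, backward)             # both directions must hold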
Experimental Setup

• Single reference translation
• Two MT dataset collections, with English as the target language

NIST MT datasets:
  NIST06: the NIST MT-06 Arabic-to-English dataset
  NIST08A: the NIST MT-08 Arabic-to-English dataset
  NIST08C: the NIST MT-08 Chinese-to-English dataset
  NIST08U: the NIST MT-08 Urdu-to-English dataset

ACL SMT datasets:
  SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
  SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
  SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
  SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup (contd.)

Four scoring systems are compared:
• BLEU-4
• MT
• RTE
• RTE + MT

Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)

• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by
  - considering the constituents of the formula
  - calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)

• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences arise in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)

Manual evaluation: basic techniques and their bottlenecks
TER: based on the edits required to reach the reference translation; also human-targeted TER (HTER)
WER: word-level errors, sensitive to word order; also PER
ROUGE: a suite of metrics incorporating n-gram recall, skip-bigrams, etc.
METEOR: considers alignments; allows different types of matching to be plugged in
NIST: information-gain based; weights content and function words differently
GTM: modifies precision and recall based on maximum match size
Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References

[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in Machine Translation Research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) Evaluation of Machine Translation. In Proceedings of the ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Drawing different items In general
bull If there are x unigram matches y bigram matches in candidate C1 and C2 bothndash BLEU score remains the same
ndash What words are actually matched is not accounted for
ndash Synonyms also indicate lsquomismatchrsquo
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Failures in practice Evaluation 1 (12) [3]
bull NIST MT Evaluation exercisebull Failure to correlate with human judgment
ndash In 2005 exercise System ranked 1st in human judgment ranked 6th by BLEU score
bull Seven systems compared for adequacy and fluency
Failures in practice Evaluation 1 (22) [3]
bull System with high human judgment score
Graph
Outlierincluded
Pearson correlation (R2)
Yes 014
No 087
Outlierincluded
Pearson correlation (R2)
Yes 0002
No 0742
Graph
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Failures in practice: Evaluation 1 (2/2) [3]
• A system with a high human judgment score acts as an outlier: BLEU ranks it far lower than humans do
[Graphs omitted: BLEU score plotted against human judgment, with and without the outlier system]
Outlier included?   Pearson correlation (R²)
Yes                 0.14 (adequacy); 0.002 (fluency)
No                  0.87 (adequacy); 0.742 (fluency)
Failures in practice: Evaluation 2 (1/2) [3]
• French-English translation task
• System 1: a rule-based system (Systran)
• Systems 2 and 3: SMT systems trained on two sets of Europarl data (28 million words per language)
• 300 test sentences
Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates human judgment of the rule-based system
[Graph omitted: BLEU score vs. human judgment for the three systems]
Admits too much and too little variation [4]
Indicative translations; linguistic divergences and evaluation
Indicative translation
• Draft-quality translations
• Meant for assimilation rather than dissemination
• Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain an indicative translation (English-Hindi)
• Structural transfer: S V O to S O V
• Lexical transfer: looking up the corresponding word; adding gender, aspect, tense, etc. information
Indicative translation: Example
E: Rohan goes to school by bus
Structural transfer (SVO to SOV): (Rohan) (to school) (by bus) (goes)
Lexical transfer, word translation: (Rohan) (paathshaala) (bus) (jaana)
Lexical transfer, adding GNP/TAM information: (Rohan) (paathshaala) (bus se) (jaata hai)
Indicative output: Rohan paathshaala bus se jaata hai
(a more natural human order: Rohan bus se paathshaala jaata hai)
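To make the two transfer steps concrete, here is a toy sketch in Python; the three-slot 'parse' and the mini-dictionary are illustrative assumptions, not part of any actual system described in [4]:

```python
# Toy English-to-Hindi indicative translation: structural transfer (SVO -> SOV)
# followed by lexical transfer. The lexicon entries come from the slide above.
LEXICON = {
    "Rohan": "Rohan",
    "to school": "paathshaala",   # word lookup
    "by bus": "bus se",           # 'se' marks the instrumental case
    "goes": "jaata hai",          # GNP/TAM information: 3sg masculine, habitual
}

def indicative_translate(subject, verb, objects):
    """Structural transfer (S V O -> S O V), then lexical transfer."""
    reordered = [subject] + objects + [verb]   # SOV order
    return " ".join(LEXICON[phrase] for phrase in reordered)

print(indicative_translate("Rohan", "goes", ["to school", "by bus"]))
# -> Rohan paathshaala bus se jaata hai
```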
Questions in the context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, when the system produces indicative translations?
• Is "failure of MT" (defined using any measure) simply "failure in relation to inappropriate goals" (translating like a human)?
Linguistic divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• A change in lexical category
E: I am feeling hungry
H: मुझे भूख लग रही है (mujhe bhookh lag rahi hai; gloss: to-me hunger feeling is)
I: मैं भूखा महसूस कर रहा हूँ (main bhookha mehsoos kar raha hoon; gloss: I hungry feel doing am)
Unigram precision: 0; bigram precision: 0; BLEU score: 0

E: I am sweating
H: मुझे पसीना आ रहा है (mujhe paseena aa raha hai; gloss: to-me sweat coming is)
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आमिर ख़ान की दस सर्वोत्तम परफॉर्मेंसेस (aamir khaan ki dus sarvottam performances; gloss: Aamir Khan-of ten best performances)
I: दस सर्वोत्तम आमिर ख़ान परफॉर्मेंसेस (dus sarvottam aamir khaan performances; gloss: ten best Aamir Khan performances)
Unigram precision: 5/5; bigram precision: 2/4
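These fractions can be verified mechanically. A minimal sketch of modified (clipped) n-gram precision, run on the transliterated tokens of the example above:

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    def ngrams(tokens, k):
        return Counter(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped, sum(cand.values())

reference = "aamir khaan ki dus sarvottam performances".split()
candidate = "dus sarvottam aamir khaan performances".split()
print(modified_ngram_precision(candidate, reference, 1))  # (5, 5): unigram precision 5/5
print(modified_ngram_precision(candidate, reference, 2))  # (2, 4): bigram precision 2/4
```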
Cultural differences
E: Food, clothing and shelter are a man's basic needs
H: रोटी कपड़ा और मकान एक मनुष्य की बुनियादी ज़रूरतें हैं (roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain; gloss: bread, clothing and house a man's basic needs are)
I: खाना कपड़ा और आश्रय एक मनुष्य की बुनियादी ज़रूरतें हैं (khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain; gloss: food, clothing and shelter a man's basic needs are)
Unigram precision: 8/10; bigram precision: 6/9
• Similar cases: kinship terms such as 'maamaa' and 'taau'
Pleonastic divergence
• Words with no semantic content in the target language
E: It is raining
H: बारिश हो रही है (baarish ho rahi hai; gloss: rain happening is)
I: यह बारिश हो रही है (yeh baarish ho rahi hai; gloss: it rain happening is)
Unigram precision: 4/5; bigram precision: 3/4

E: One should not trouble the weak
H: दुर्बलों को परेशान नहीं करना चाहिए (durbalon ko pareshaan nahi karna chahiye; gloss: to-weak trouble not do should, i.e. "should not trouble the weak")
Stylistic differences
• Typical styles in a language
E: The Lok Sabha has 545 members
H: लोक सभा में 545 सदस्य हैं (lok sabha mein 545 sadasya hain; gloss: Lok Sabha in 545 members are)
I: लोक सभा के पास 545 सदस्य हैं (lok sabha ke paas 545 sadasya hain; gloss: Lok Sabha has/near 545 members are)
Unigram precision: 5/7; bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses
E: I purchased a bat
H (reference): मैंने एक बल्ला खरीदा (maine ek ballaa kharidaa; "I bought a cricket bat")
I: मैंने एक चमगादड़ खरीदा (maine ek chamgaadaD kharidaa; "I bought a bat", the mammal)

E: The thieves were held
H: चोरों को गिरफ़्तार किया (choron ko giraftaar kiyaa; "the thieves were arrested")
I1: चोरों को पकड़ा (choron ko pakdaa; "the thieves were caught")
I2: चोरों को आयोजित किया (choron ko aayojit kiyaa; "the thieves were organized")
Evaluation precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating language-specific divergences
  – Limiting the use of BLEU [3][4]: do not use BLEU to compare radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU; use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences exist in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS712 by Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world. (Source: www.jaishreeganesha.com)
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual evaluation
• Automatic evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate translation(s): the translation returned by an MT system (also called the hypothesis translation)
• Reference translation(s): 'perfect' translations produced by humans
• Each metric below is compared with BLEU on three counts: handling incorrect words, handling incorrect word order, and handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable
Adequacy (is the meaning translated correctly?): 5 = all, 4 = most, 3 = much, 2 = little, 1 = none
Fluency (is the sentence grammatically valid?): 5 = flawless English, 4 = good English, 3 = non-native English, 2 = disfluent English, 1 = incomprehensible
मैं एक व्याख्यान बैठा हूँ (main ek vyaakhyan baitha hoon; "I sit a lecture"): adequate but not fluent
मैं व्याख्यान हूँ (main vyakhyan hoon; "I am lecture"): fluent but not adequate
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (a stand-in for recall)
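As a refresher, a minimal single-sentence, single-reference sketch of this computation (unsmoothed, without multi-reference clipping; the helper names are ours):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of modified 1..max_n-gram
    precisions, for one candidate against one reference (token lists)."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        if total == 0 or clipped == 0:
            return 0.0                       # unsmoothed BLEU collapses to zero
        precisions.append(clipped / total)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```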
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the edits required to change a hypothesis translation into a reference translation
E: Prime Minister of India will address the nation today
Reference translation: भारत के प्रधान-मंत्री आज राष्ट्र को संबोधित करेंगे (bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge)
Candidate translation: प्रधान-मंत्री भारत के आज को अड्रेस करेंगे आज (pradhaan-mantri bhaarat ke aaj ko address karenge aaj)
Edits marked on the slide: one shift, one deletion, one substitution, one insertion
Formula for TER
TER = number of edits / average number of reference words
• The cost of a shift's 'distance' is not incorporated
• Mis-capitalization is also considered an error
For the example above: TER = 4/8
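Substituting the edit counts of the example (one shift, one deletion, one substitution, one insertion, against an 8-word reference):

```latex
\mathrm{TER}
  = \frac{\#\,\mathrm{edits}}{\mathrm{avg.}\ \#\,\mathrm{reference\ words}}
  = \frac{1_{\mathrm{shift}} + 1_{\mathrm{deletion}} + 1_{\mathrm{substitution}} + 1_{\mathrm{insertion}}}{8}
  = \frac{4}{8} = 0.5
```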
HTER: Human-targeted TER
• A semi-automated technique
• Human annotators create a new reference for each translation, based on the system output
  – They must consider fluency and adequacy while generating the closest target translation
TER vs. BLEU
• Handling incorrect words: TER uses substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: TER incorporates this error via a shift, or a delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in TER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
(Recall: TER = edits / average number of reference words)
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions needed to convert the generated text (hyp) into the reference text (ref)
• A variant also exists: the position-independent word error rate (PER)
WER: Example
• The order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors
Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
Number of errors: 3
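A minimal dynamic-programming sketch of word-level Levenshtein distance, reproducing the 3 errors above (two substitutions, seems/looks and right/correct, plus the deletion of 'like'):

```python
def wer(reference, candidate):
    """Word Error Rate: word-level Levenshtein distance (substitutions,
    deletions, insertions; no shifts, unlike TER) over reference length."""
    r, c = reference.split(), candidate.split()
    # d[i][j] = min edits turning the first i reference words
    # into the first j candidate words
    d = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                          # i deletions
    for j in range(len(c) + 1):
        d[0][j] = j                          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            sub = 0 if r[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(r)][len(c)] / len(r)

print(wer("This looks like the correct sentence",
          "This seems the right sentence"))       # 3 edits / 6 words = 0.5
```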
WER vs. BLEU
• Handling incorrect words: WER uses substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: WER uses delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
WER vs. TER
• WER: shift is not taken into account; the intuition is based on errors in unigram translation
• TER: shift is incorporated, though not weighted by distance; the intuition is based on the edits humans would require
(Recall: TER = edits / average number of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
• ROUGE-N incorporates recall
• Will BLEU be able to judge the quality of long sentences?
Reference translation: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा (kya bloo lambe waakya ki guNvatta ko samajh paaega)
Candidate translation: लंबे वाक्य (lambe vaakya)
ROUGE-N: 1/8, whereas the modified n-gram precision is 1
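Reading the slide's "ROUGE-N: 1/8" as bigram recall, a minimal sketch; the candidate's only bigram matches (precision 1), but it covers just 1 of the reference's 8 bigrams:

```python
from collections import Counter

def rouge_n(candidate, reference, n):
    """ROUGE-N: clipped n-gram matches divided by the REFERENCE n-gram count
    (recall), where BLEU's precision divides by the candidate count instead."""
    def ngrams(text, k):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / sum(ref.values())

reference = "kya bloo lambe waakya ki guNvatta ko samajh paaega"
print(rouge_n("lambe waakya", reference, 2))   # 1/8 = 0.125
```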
Other ROUGE variants
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on their lengths
• ROUGE-S: precision/recall over matching skip-bigrams
ROUGE (a suite of metrics) vs. BLEU
• Handling incorrect words: ROUGE uses skip-bigrams and ROUGE-N; BLEU sees an n-gram mismatch
• Handling incorrect word order: ROUGE uses the longest common subsequence; BLEU sees an n-gram mismatch
• Handling recall: ROUGE-N incorporates missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
METEOR [7]
• Aims to do better than BLEU
• Central idea: a good unigram matching strategy
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process
Phase I: find an alignment between words; list all possible mappings based on matches
Phase II: select the best possible alignment, i.e. the one with the fewest 'alignment crosses/overlaps'
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.
Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced; The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated, then combined into a harmonic mean
• Penalty: group the matched unigrams into as few contiguous chunks as possible, e.g. "The bright boy sits on the black bench" vs. "The intelligent guy sat on the dark bench"
• More accurate -> fewer chunks, less penalty; less accurate -> more chunks, more penalty
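A sketch of the scoring step with the constants published in [7] (a strongly recall-weighted harmonic mean and a cubic fragmentation penalty); the match and chunk counts for the bench example, under exact matching only, are our own tally:

```python
def meteor_score(matches, cand_len, ref_len, chunks):
    """METEOR sentence score from unigram-mapping statistics [7]."""
    if matches == 0:
        return 0.0
    p, r = matches / cand_len, matches / ref_len
    f_mean = 10 * p * r / (r + 9 * p)           # recall weighted 9:1 over precision
    penalty = 0.5 * (chunks / matches) ** 3     # fewer chunks -> smaller penalty
    return f_mean * (1 - penalty)

# 'The bright boy sits on the black bench' vs 'The intelligent guy sat on the
# dark bench': 4 exact unigram matches (The, on, the, bench) in 3 chunks.
print(meteor_score(matches=4, cand_len=8, ref_len=8, chunks=3))   # ~0.39
```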
METEOR vs. BLEU
• Handling incorrect words: METEOR uses alignment chunks, and matching can be done with different techniques, so it is adaptable; BLEU sees an n-gram mismatch
• Handling incorrect word order: chunks may be ordered in any manner, and METEOR does not capture this; BLEU sees an n-gram mismatch
• Handling recall: the idea of alignment incorporates the handling of missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
NIST Metric: Introduction [8]
• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., the rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
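The formula on this slide appears to have been an image lost in extraction; the following is a reconstruction from [8], where N = 5 and β is chosen so that the brevity factor equals 0.5 when the system output is 2/3 of the average reference length:

```latex
\mathrm{Info}(w_1 \ldots w_n)
  = \log_2 \frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}

\mathrm{NIST}
  = \sum_{n=1}^{N}
      \frac{\sum_{\mathrm{co\text{-}occurring}\; w_1 \ldots w_n} \mathrm{Info}(w_1 \ldots w_n)}
           {\sum_{w_1 \ldots w_n \,\in\, \mathrm{sys}} 1}
    \;\cdot\;
    \exp\!\left\{ \beta \, \log^2 \min\!\left( \frac{L_{\mathrm{sys}}}{\bar{L}_{\mathrm{ref}}},\, 1 \right) \right\}
```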
NIST vs. BLEU
• Handling incorrect words: NIST adds information gain, an additional level of 'matching'; BLEU sees an n-gram mismatch
• Handling incorrect word order: NIST has no additional provision, only the n-gram mismatch (weighted by information gain); BLEU sees an n-gram mismatch
• Handling recall: NIST uses a brevity-penalty factor based on Lsys and Lref; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment' (what does a BLEU score of 0.006 mean?)
• Comparison across systems is easier
GTM: Scores
• Precision and recall are defined via the MMS, the Maximum Match Size
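The definitions from [10], reconstructed here: MMS(C, R) is the size of a maximum matching between candidate C and reference R, where the size of a matching rewards contiguous runs of matched words, roughly size(M) = (Σ_r |r|^e)^(1/e) over runs r; e = 1 simply counts matched words, while e > 1 favors longer runs:

```latex
\mathrm{precision}(C, R) = \frac{\mathrm{MMS}(C, R)}{|C|}, \qquad
\mathrm{recall}(C, R) = \frac{\mathrm{MMS}(C, R)}{|R|}, \qquad
F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```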
GTM vs. BLEU
• Handling incorrect words: GTM bases precision on the maximum match size; BLEU sees an n-gram mismatch
• Handling incorrect word order: GTM considers maximum runs; BLEU sees an n-gram mismatch
• Handling recall: GTM bases recall on the maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है (mujhe bhookh lag rahi hai)
Candidate: मैं भूखा महसूस कर रहा हूँ (main bhookha mehsoos kar raha hoon)
BLEU score: 0. Yet the candidate is clearly entailed by the reference translation.
Entailment-based evaluation [12]
• Evaluating MT output for adequacy is an entailment task
• A candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa
The Stanford RTE system
Pipeline: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (alignment of words and phrases) → Entailment (predicting whether entailment holds or not)
RTE for MT evaluation
The same pipeline is used: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (alignment of words and phrases) → Entailment (predicting whether entailment holds or not)
Directionality: the standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
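A sketch of how the bidirectional requirement turns an RTE engine into an adequacy metric. `rte_entails` is a hypothetical stand-in for the analysis → alignment → entailment pipeline above, and taking the minimum of the two directions is an illustrative combination, not necessarily the one used in [12]:

```python
def rte_entails(premise, hypothesis):
    """Hypothetical wrapper around an RTE pipeline (analysis, alignment,
    entailment classification). Returns an entailment confidence in [0, 1]."""
    raise NotImplementedError("stand-in for the Stanford RTE system")

def adequacy_score(candidate, reference):
    """MT adequacy via entailment [12]: unlike the standard (asymmetric)
    RTE task, entailment must hold in BOTH directions for full adequacy."""
    forward = rte_entails(candidate, reference)    # candidate => reference
    backward = rte_entails(reference, candidate)   # reference => candidate
    return min(forward, backward)                  # weakest direction dominates
```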
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by considering the constituents of the formula and by calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences exist in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on the maximum match size
• NIST: information-gain based; weights content and function words differently
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip-bigrams, etc.
• WER: errors in words, taking word order into account; also PER
• TER: based on the edits required to reach the reference translation; also human-targeted TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of Bleu in machine translation research. EACL 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA), 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Failures in practice Evaluation 2 (12) [3]
bull French-Englishbull System 1 Rule-based system using Systranbull System 23 SMT system using two sets of
Europarl data (28 million words per language)bull 300 test sentences
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Failures in practice Evaluation 2 (22) [3]
bull BLEU underestimates human judgment
Graph
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Admits too much and too Admits too much and too less variation less variation [[44]]
Indicative translationsLinguistic divergences amp evaluation
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation (English-Hindi):
• Structural transfer
  – S V O to S O V
• Lexical transfer
  – Looking up the corresponding word
  – Adding gender, aspect, tense, etc. information
Indicative translation: Example
Rohan goes to school by bus
Structural transfer (SVO to SOV):
(Rohan) (to school) (by bus) (goes)
Lexical transfer (word translation):
(Rohan) (paathshaala) (bus) (jaana)
Lexical transfer (adding GNP/TAM information):
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata hai / Rohan bus se paathshaala jaata hai
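To make these two steps concrete, here is a minimal illustrative sketch (not the authors' system) of dictionary-based transfer in Python; the LEXICON and CASE_MARKERS tables are hypothetical toy entries covering only this example.

# Toy transfer-based "indicative translation", assuming the input is already
# parsed into subject, verb, and prepositional object phrases.
LEXICON = {"Rohan": "Rohan", "school": "paathshaala",
           "bus": "bus", "goes": "jaata hai"}          # hypothetical entries
CASE_MARKERS = {"by": "se", "to": ""}                  # 'by bus' -> 'bus se'

def indicative_translate(subject, verb, obj_phrases):
    """Structural transfer (S V O -> S O V), then word-by-word lexical transfer."""
    parts = [LEXICON.get(subject, subject)]
    for prep, noun in obj_phrases:                     # objects move before the verb
        marker = CASE_MARKERS.get(prep, prep)
        parts.append(f"{LEXICON.get(noun, noun)} {marker}".strip())
    parts.append(LEXICON.get(verb, verb))              # verb goes to the end
    return " ".join(parts)

print(indicative_translate("Rohan", "goes", [("to", "school"), ("by", "bus")]))
# -> Rohan paathshaala bus se jaata hai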
Questions in context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is "failure of MT" (defined using any measure) simply "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• Change in lexical category
E: I am feeling hungry
H: मुझे भूख लग रही है
mujhe bhookh lag rahi hai
to-me hunger feeling is
I: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
I hungry feel doing am
Unigram precision: 0; Bigram precision: 0; BLEU score: 0
E: I am sweating
H: मुझे पसीना आ रहा है
Mujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आमिर ख़ान की दस सर्वोत्तम परफॉर्मेंसेस
Aamir khaan ki dus sarvottam performances
Aamir Khan-of ten best performances
I: दस सर्वोत्तम आमिर ख़ान परफॉर्मेंसेस
Dus sarvottam aamir khaan performances
Ten best Aamir Khan performances
Unigram precision: 5/5; Bigram precision: 2/4
Cultural differences
• 'maamaa'/'taau' (kinship terms with no exact English equivalent)
E: Food, clothing and shelter are a man's basic needs
H: रोटी कपड़ा और मकान एक मनुष्य की बुनियादी ज़रूरतें हैं
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain
bread clothing and house a man of basic needs are
I: खाना कपड़ा और आश्रय एक मनुष्य की बुनियादी ज़रूरतें हैं
khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain
food clothing and shelter a man of basic needs are
Unigram precision: 8/10; Bigram precision: 6/9
Pleonastic divergence
• Words with no semantic content in the target language
E: It is raining
H: बारिश हो रही है
baarish ho rahi hai
rain happening is
I: यह बारिश हो रही है
yeh baarish ho rahi hai
it rain happening is
Unigram precision: 4/5; Bigram precision: 3/4
E: One should not trouble the weak
H: दुर्बलों को परेशान नहीं करना चाहिए
durbalon ko pareshaan nahi karna chahiye
to-weak trouble not do should (Should not trouble the weak)
Stylistic differences
• Typical styles in a language
E: The Lok Sabha has 545 members
H: लोक सभा में 545 सदस्य हैं
lok sabha mein 545 sadasya hain
Lok Sabha in 545 members are
I: लोक सभा के पास 545 सदस्य हैं
lok sabha ke paas 545 sadasya hain
Lok Sabha has/near 545 members are
Unigram precision: 5/7; Bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses
E: I purchased a bat
H: मैंने एक बल्ला खरीदा (reference)
maine ek ballaa kharidaa
I a cricket-bat bought (I bought a cricket bat)
I: मैंने एक चमगादड़ खरीदा
maine ek chamgaadaD kharidaa
I a bat (mammal) bought (I bought a bat, the mammal)
E: The thieves were held
H: चोरों को गिरफ्तार किया
choron ko giraftaar kiyaa
thieves arrest done (The thieves were arrested)
I1: चोरों को पकड़ा
choron ko pakdaa
thieves caught (The thieves were caught)
I2: चोरों को आयोजित किया
choron ko aayojit kiya
thieves organized done (The thieves were organized)
Evaluation by precision:
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allow synonyms, allow root forms of words, incorporate language-specific divergences
  – Limiting the use of BLEU [3][4]: do not use BLEU to compare radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS712 by
Aditya Joshi
Kashyap Popat
Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
(Source: www.jaishreeganesha.com)
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate translation(s): translation returned by an MT system
  – Also called hypothesis translation
• Reference translation(s): 'perfect' translation by humans
• Comparing metrics with BLEU on three criteria:
  – Handling incorrect words
  – Handling incorrect word order
  – Handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a five-point scale (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable
Adequacy: Is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None
Fluency: Is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible
मैं एक व्याख्यान बैठा हूँ
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person) / I sit a lecture
Adequate but not fluent
मैं व्याख्यान हूँ
Main vyakhyan hoon
I lecture am / I am lecture
Fluent but not adequate
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (to account for recall)
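As a recap of the formulation from Part I, here is a minimal BLEU sketch. It assumes whitespace-tokenized sentences and, as a simplification, uses the shortest reference length for the brevity penalty instead of the closest-length reference.

import math
from collections import Counter

def modified_precision(candidate, references, n):
    """Clipped (modified) n-gram precision."""
    cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1)).items():
            max_ref[g] = max(max_ref[g], c)
    return sum(min(c, max_ref[g]) for g, c in cand.items()) / sum(cand.values())

def bleu(candidate, references, max_n=4):
    ps = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(ps) == 0:
        return 0.0                 # a single zero precision zeroes the geometric mean
    c = len(candidate)
    r = min(len(ref) for ref in references)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in ps) / max_n)

print(bleu("the cat sat on the mat".split(),
           ["the cat sat on the mat".split()]))  # identical sentences -> 1.0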
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation
Source: Prime Minister of India will address the nation today
Reference translation: भारत के प्रधान-मंत्री आज राष्ट्र को संबोधित करेंगे
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation: प्रधान-मंत्री भारत के आज को अड्रेस करेंगे आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj
Edit types marked on the slide: shift, deletion, substitution, insertion
Formula for TER:
TER = # edits / average number of reference words
• Cost of a shift's 'distance' is not incorporated
• Mis-capitalization is also considered an error
For this example: TER = 4/8
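In code, the score itself is just this normalization; the edit counting, including TER's greedy search over block shifts, is assumed to have been done separately.

def ter(num_edits, ref_lengths):
    """TER [5]: total edits divided by the average number of reference words."""
    return num_edits / (sum(ref_lengths) / len(ref_lengths))

# Slide example: one shift, one deletion, one substitution, one insertion = 4 edits
# against a single 8-word reference:
print(ter(4, [8]))  # 0.5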
HTER: Human-targeted TER
• A semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
  – They must preserve fluency and adequacy while generating the closest possible target translation
TER vs BLEU
• Handling incorrect words: TER counts a substitution; BLEU counts an n-gram mismatch
• Handling incorrect word order: TER incorporates this error as a shift (or a delete + insert); BLEU counts an n-gram mismatch
• Handling recall: in TER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
(TER = # edits / average number of reference words)
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• Minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant also exists: position-independent word error rate (PER)
WER: Example
• Order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors
Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
Number of errors: 3 (substitute 'seems' with 'looks', insert 'like', substitute 'right' with 'correct'), so WER = 3/6 = 0.5
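A minimal sketch of that dynamic program: standard word-level Levenshtein distance, normalized by the reference length.

def wer(hyp, ref):
    """Word error rate: word-level Levenshtein distance / reference length."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all hypothesis words
    for j in range(n + 1):
        d[0][j] = j                       # insert all reference words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + sub)    # substitution or match
    return d[m][n] / n

ref = "This looks like the correct sentence".split()
hyp = "This seems the right sentence".split()
print(wer(hyp, ref))  # 3 errors / 6 reference words = 0.5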
WER vs BLEU
• Handling incorrect words: WER counts a substitution; BLEU counts an n-gram mismatch
• Handling incorrect word order: WER counts a deletion + insertion; BLEU counts an n-gram mismatch
• Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
WER vs TER
• Shifts: TER incorporates shifts (although not weighted by distance); WER does not take shifts into account
• Intuition: WER is based on errors in unigram translation; TER is based on the edits a human would require
(TER = # edits / average number of reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
• ROUGE-N incorporates recall
Will BLEU be able to judge the quality of long sentences?
Reference translation: क्या ब्लू लम्बे वाक्य की गुणवत्ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation: लम्बे वाक्य
Lambe vaakya
ROUGE-N: 1/8 (one matching bigram out of the 8 bigrams in the reference)
Modified n-gram precision: 1 (perfect precision, yet the translation is clearly inadequate)
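A minimal ROUGE-N sketch over the same example, assuming whitespace tokenization of the transliterated sentences.

from collections import Counter

def rouge_n(candidate, reference, n):
    """ROUGE-N: n-gram recall of the candidate against the reference."""
    cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(ref.values()), 1)

reference = "kya bloo lambe waakya ki guNvatta ko samajh paaega".split()
candidate = "lambe waakya".split()
print(rouge_n(candidate, reference, 2))  # 1 matching bigram / 8 reference bigrams = 0.125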
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weight based on length
• ROUGE-S: precision/recall over matching skip-bigrams
ROUGE (suite of metrics) vs BLEU
• Handling incorrect words: ROUGE uses skip-bigrams and ROUGE-N; BLEU counts an n-gram mismatch
• Handling incorrect word order: ROUGE uses the longest common subsequence; BLEU counts an n-gram mismatch
• Handling recall: ROUGE-N incorporates missing words via recall; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process
Find an alignment between words:
• Phase I: list all possible unigram mappings based on matches
• Phase II: select the best possible alignment, i.e., the one with the least number of 'alignment crosses/overlaps'
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.
Reference: The intelligent and excited boy jumped
The excited and intelligent guy pounced
Candidate: The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated, then combined as a harmonic mean
• Penalty: group the matched unigrams into as few contiguous chunks as possible
The bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate → fewer chunks → smaller penalty; less accurate → more chunks → larger penalty
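A sketch of the scoring step, using the commonly cited METEOR defaults (recall-weighted harmonic mean; penalty = gamma * (chunks/matches)^beta with gamma = 0.5, beta = 3). The counts for the bench example are hand-tallied here assuming exact matches only ('The', 'on', 'the', 'bench': 4 matches falling into 3 contiguous chunks), so treat the numbers as illustrative.

def meteor_score(matches, hyp_len, ref_len, chunks, alpha=0.9, beta=3.0, gamma=0.5):
    """Recall-weighted harmonic mean of unigram precision and recall,
    discounted by a fragmentation penalty over matched chunks."""
    if matches == 0:
        return 0.0
    p, r = matches / hyp_len, matches / ref_len
    f_mean = p * r / (alpha * p + (1 - alpha) * r)    # weights recall over precision
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)

# 'The bright boy sits on the black bench' (8 words) vs
# 'The intelligent guy sat on the dark bench' (8 words):
print(meteor_score(matches=4, hyp_len=8, ref_len=8, chunks=3))  # ~0.39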
METEOR vs BLEU
• Handling incorrect words: METEOR uses alignment chunks, and matching can be done using different techniques (adaptable); BLEU counts an n-gram mismatch
• Handling incorrect word order: chunks may be ordered in any manner, and METEOR does not capture this; BLEU counts an n-gram mismatch
• Handling recall: METEOR's idea of alignment incorporates missing-word handling; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
NIST metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones)
  – Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
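The formula slide is not reproduced here, but the information weight behind the idea can be sketched as follows: Info(w1..wn) = log2(count(w1..wn-1) / count(w1..wn)), with counts taken over the reference data, so rare continuations score high. The counts below are made up purely for illustration.

import math
from collections import Counter

def nist_info(ngram, counts):
    """NIST information weight [8] of an n-gram, given corpus n-gram counts
    (a Counter keyed by tuples, covering all n-gram orders)."""
    whole = counts[ngram]
    if len(ngram) > 1:
        prefix = counts[ngram[:-1]]
    else:
        prefix = sum(c for g, c in counts.items() if len(g) == 1)  # total unigrams
    return math.log2(prefix / whole)

counts = Counter({("of",): 100, ("the",): 120, ("of", "the"): 80,
                  ("Amendment",): 2, ("Act",): 3, ("Amendment", "Act"): 1})
print(nist_info(("of", "the"), counts))         # common bigram -> low weight (~0.32)
print(nist_info(("Amendment", "Act"), counts))  # rare bigram -> high weight (1.0)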
NIST vs BLEU
• Handling incorrect words: NIST adds information gain, an additional level of 'matching'; BLEU counts an n-gram mismatch
• Handling incorrect word order: no additional provision, only the n-gram mismatch (weighted by information gain); BLEU counts an n-gram mismatch
• Handling recall: NIST uses a brevity-penalty factor based on L_sys and L_ref; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  – What does a BLEU score of 0.006 mean?
• Comparison is easier
GTM scores: Precision and recall
• MMS: Maximum Match Size
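A sketch of how the GTM scores combine; the MMS computation itself (a maximum matching that rewards long contiguous runs) is assumed given, and the numbers in the usage line are hypothetical.

def gtm_f_score(mms, cand_len, ref_len):
    """GTM [10]: precision = MMS/|candidate|, recall = MMS/|reference|,
    combined as the harmonic mean (F-measure)."""
    if mms == 0:
        return 0.0
    p, r = mms / cand_len, mms / ref_len
    return 2 * p * r / (p + r)

print(gtm_f_score(5, 6, 8))  # match size 5, 6-word candidate, 8-word reference -> ~0.71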
GTM vs BLEU
• Handling incorrect words: GTM bases precision on the maximum match size; BLEU counts an n-gram mismatch
• Handling incorrect word order: GTM considers maximum runs; BLEU counts an n-gram mismatch
• Handling recall: GTM bases recall on the maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है
mujhe bhookh lag rahi hai
Candidate: मैं भूखा महसूस कर रहा हूँ
main bhookha mehsoos kar raha hoon
BLEU score: 0, yet the candidate is clearly entailed by the reference translation
Entailment-based evaluation [12]
• Evaluating MT output for adequacy is an entailment task
• A candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa
The Stanford RTE system:
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
RTE for MT evaluation uses the same pipeline:
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
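In pseudocode form, with entails standing in for a full RTE decision (a hypothetical callable, not an actual API):

def fully_adequate(candidate, reference, entails):
    """Bidirectional entailment check for adequacy [12]: the candidate must
    entail the reference, and the reference must entail the candidate."""
    return entails(candidate, reference) and entails(reference, candidate)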
Experimental setup
• Single reference translation
• Two MT dataset families, with English as the target language
NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental setup (continued)
Four systems are compared:
• BLEU-4
• MT
• RTE
• RTE + MT
Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
  – Considering the constituents of the formula
  – Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information-gain based; comparatively weights content and function words
• METEOR: considers alignments; allows addition of different types of matching
• ROUGE: suite of metrics incorporating n-gram recall, skip-bigrams, etc.
• WER: error in words based on word order; also PER
• TER: based on the edits required to reach the reference translation; also Human TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
Indicative translation
bull Draft quality translationsbull For assimilation rather than dissemination
bull Virtually all general-purpose MT systems today produce indicative translations
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Basic steps to obtain indicative translation English-Hindi
bull Structural transferndash S V O to S O V
bull Lexical transferndash Looking up corresponding wordndash Adding gender aspect tense etc information
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Indicative translation ExampleRohan goes to school by bus
Structural transfer SVO to SOV
(Rohan) (to school) (by bus) (goes)
Lexical transfer Word translation
(Rohan) (paathshaala) (bus) (jaana)Lexical transfer Adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)
Rohan paathshaala bus se jaata haiRohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
• Words with no semantic content in the target language

Unigram precision: 4/5; Bigram precision: 3/4
E: It is raining
H: बारिश हो रही है (baarish ho rahi hai; gloss: rain happening is)
I: यह बारिश हो रही है (yeh baarish ho rahi hai; gloss: it rain happening is)

E: One should not trouble the weak
H: दुर्बलों को परेशान नहीं करना चाहिए (durbalon ko pareshaan nahi karna chahiye; gloss: to-weak trouble not do should, i.e., "should not trouble the weak")
Stylistic differences
• Typical styles in a language

Unigram precision: 5/7; Bigram precision: 3/6
E: The Lok Sabha has 545 members
H: लोक सभा में 545 सदस्य हैं (lok sabha mein 545 sadasya hain; gloss: Lok Sabha in 545 members are)
I: लोक सभा के पास 545 सदस्य हैं (lok sabha ke paas 545 sadasya hain; gloss: Lok Sabha has/near 545 members are)
WSD errors
• Synonyms vs. incorrect senses
E: I purchased a bat
H (reference): मैंने एक बल्ला खरीदा (maine ek ballaa kharidaa; gloss: I a cricket-bat bought)
I: मैंने एक चमगादड़ खरीदा (maine ek chamgaadaD kharidaa; gloss: I a bat, the mammal, bought)

E: The thieves were held
H: चोरों को गिरफ़्तार किया (choron ko giraftaar kiyaa; gloss: thieves arrested, i.e., "the thieves were arrested")
I1: चोरों को पकड़ा (choron ko pakdaa; gloss: thieves caught, an acceptable synonym)
I2: चोरों को आयोजित किया (choron ko aayojit kiya; gloss: thieves organized, an incorrect sense)
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
Re-defining a match [4]:
• Allowing synonyms
• Allowing root forms of words
• Incorporating language-specific divergences

Limiting the use of BLEU [3][4]:
• Do not use BLEU for radically different systems
• Evaluate on the basis of the nature of the indicative translation
Conclusion
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences exist in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Part III: Overview of MT Evaluation Metrics

Presented as a part of CS712 by Aditya Joshi, Kashyap Popat and Shubham Gautam
21st March 2013
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
(Source: www.jaishreeganesha.com)
Lord Karthikeya: "I won! I went around the earth, the world, once."
Lord Ganesha: "I won! I went around my parents. They are my world."
Who among them performed better? Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Preliminaries
• Candidate translation(s): the translation returned by an MT system (also called the hypothesis translation)
• Reference translation(s): 'perfect' translations by humans
• Each metric is compared with BLEU along three dimensions: handling incorrect words, handling incorrect word order, and handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate, which is undesirable

Adequacy: Is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency: Is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible
मैं एक व्याख्यान बैठा हूँ (Main ek vyaakhyan baitha hoon)
Gloss: I a lecture sit (present, first person) -> "I sit a lecture"
Adequate but not fluent

मैं व्याख्यान हूँ (Main vyakhyan hoon)
Gloss: I lecture am -> "I am lecture"
Fluent but not adequate
BLEU [2]
• Proposed by IBM; a popular metric
• Uses modified n-gram precision and a brevity penalty (standing in for recall)
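Since every later metric is contrasted with BLEU, a compact reference implementation is handy. Below is a minimal sentence-level sketch (single reference, uniform weights up to 4-grams, no smoothing), not the official evaluation script:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """Geometric mean of modified n-gram precisions times the brevity
    penalty. With no smoothing, any zero precision zeroes the score."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # penalize short candidates
    return bp * math.exp(log_avg)
```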
Translation Edit Rate (TER) [5]
• Introduced in the GALE MT task
• Central idea: the number of edits required to change a hypothesis translation into a reference translation

E: Prime Minister of India will address the nation today
Reference translation: भारत के प्रधान-मंत्री आज राष्ट्र को संबोधित करेंगे (bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge)
Candidate translation: प्रधान-मंत्री भारत के आज को अड्रेस करेंगे आज (pradhaan-mantri bhaarat ke aaj ko address karenge aaj)
Edit types: shift, deletion, substitution, insertion
Formula for TER
TER = # Edits / Avg. number of reference words
• The cost of a shift's 'distance' is not incorporated
• Mis-capitalization is also considered an error
For the example above, 1 shift, 1 deletion, 1 substitution and 1 insertion over 8 reference words give TER = 4/8 = 0.5
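Finding the minimum-cost edit sequence (shifts included) is itself a search problem, which the TER tool approximates greedily; given the edits, the score is a plain ratio. A sketch with the counts from the example:

```python
def ter(shifts, insertions, deletions, substitutions, avg_ref_len):
    """TER = total edits / average reference length. Every edit type,
    including a shift of any distance, costs 1."""
    edits = shifts + insertions + deletions + substitutions
    return edits / avg_ref_len

# 1 shift, 1 insertion, 1 deletion, 1 substitution; 8 reference words.
print(ter(1, 1, 1, 1, 8))  # 0.5
```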
HTER: Human-targeted TER
• A semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
• They must consider fluency and adequacy while generating this closest target translation
TER vs BLEU
• Handling incorrect words: TER uses substitution; BLEU sees an n-gram mismatch.
• Handling incorrect word order: TER incorporates this error via a shift (or a delete + insert); BLEU sees an n-gram mismatch.
• Handling recall: in TER, missed words become deleted words (TER = # Edits / Avg. number of reference words); BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions needed to convert the generated text (hyp) into the reference text (ref)
• A position-independent word error rate (PER) also exists
WER: Example
• The order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
# errors = 3 (two substitutions, one deletion), so WER = 3/6 = 0.5
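A minimal sketch of that dynamic program, assuming uniform edit costs and whitespace tokenization:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by the
    reference length, via the classic DP table."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j]: min edits turning the first i reference words into the
    # first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(r)][len(h)] / len(r)

print(wer("This looks like the correct sentence",
          "This seems the right sentence"))      # 3 edits / 6 words = 0.5
```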
WER vs BLEU
• Handling incorrect words: WER uses substitution; BLEU sees an n-gram mismatch.
• Handling incorrect word order: WER uses a delete + insert; BLEU sees an n-gram mismatch.
• Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
WER vs TER
• Shifts: WER does not take shifts into account; TER incorporates them, though not weighted by distance.
• Intuition: WER is based on errors in unigram translation; TER is based on the edits a human would require (TER = # Edits / Avg. number of reference words).
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall: matched n-grams are counted against the n-grams of the reference rather than of the candidate.
Will BLEU be able to judge the quality of long sentences?

Reference translation: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा (kya bloo lambe waakya ki gunvatta ko samajh paaega)
Candidate translation: लंबे वाक्य (lambe waakya)
ROUGE-2 = 1/8 (one matched bigram out of eight reference bigrams), whereas the modified bigram precision is a perfect 1/1 = 1.
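A minimal sketch of ROUGE-N recall (transliterated tokens, spellings normalized), reproducing the 1/8 above:

```python
from collections import Counter

def rouge_n(candidate, reference, n):
    """ROUGE-N recall: matched n-grams / n-grams in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0

ref = "kya bloo lambe waakya ki gunvatta ko samajh paaega"
cand = "lambe waakya"
print(rouge_n(cand, ref, 2))  # 0.125: 1 matched bigram / 8 ref bigrams
```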
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length
• ROUGE-S: precision/recall over matching skip bigrams (sketched below)
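A sketch of the skip-bigram idea, assuming no cap on the skip distance and counting each distinct pair once (Lin [6] also describes variants with a capped skip distance, such as ROUGE-S4):

```python
from itertools import combinations

def skip_bigrams(tokens):
    """All in-order word pairs, at any distance."""
    return set(combinations(tokens, 2))

def rouge_s(candidate, reference):
    """Precision and recall over matching skip bigrams."""
    cand = skip_bigrams(candidate.split())
    ref = skip_bigrams(reference.split())
    matched = len(cand & ref)
    precision = matched / len(cand) if cand else 0.0
    recall = matched / len(ref) if ref else 0.0
    return precision, recall

print(rouge_s("the cat was found", "the cat was under the bed"))
```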
ROUGE vs BLEU
• Handling incorrect words: ROUGE (a suite of metrics) uses skip bigrams and ROUGE-N; BLEU sees an n-gram mismatch.
• Handling incorrect word order: ROUGE uses the longest common subsequence; BLEU sees an n-gram mismatch.
• Handling recall: ROUGE-N incorporates missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process
Find an alignment between the words of the candidate and the reference, then repeat with different matching strategies (incorporating stemmers, considering synonyms, etc.):
• Phase I: list all possible mappings based on matches
• Phase II: select the best possible alignment, i.e., the one with the least number of 'alignment crosses/overlaps'
Example: reference "The intelligent and excited boy jumped" against the candidates "The excited and intelligent guy pounced" and "The intelligent and excited dude jumped"
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean.
• Penalty: group the matched unigrams into the fewest possible chunks, e.g. for "The bright boy sits on the black bench" vs. "The intelligent guy sat on the dark bench".
• More accurate -> fewer chunks -> less penalty; less accurate -> more chunks -> more penalty.
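The transcript loses the formula, so here is a sketch using the original parameters from [7] (Fmean = 10PR/(R + 9P), penalty = 0.5 * (chunks/matches)^3); the counts below come from a by-hand exact-match alignment of the example pair, so treat them as illustrative:

```python
def meteor_score(matches, num_chunks, cand_len, ref_len):
    """METEOR as in Banerjee & Lavie (2005): recall-weighted harmonic
    mean of unigram precision and recall, discounted by a chunk penalty."""
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (num_chunks / matches) ** 3
    return f_mean * (1 - penalty)

# Exact matches between the two example sentences: "the", "on", "the",
# "bench" (4 matches) falling into 3 contiguous chunks.
print(meteor_score(matches=4, num_chunks=3, cand_len=8, ref_len=8))  # ~0.39
```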
METEOR vs BLEU
• Handling incorrect words: METEOR aligns chunks, and matching can be done using different techniques (adaptable); BLEU sees an n-gram mismatch.
• Handling incorrect word order: chunks may be ordered in any manner, which METEOR does not capture; BLEU sees an n-gram mismatch.
• Handling recall: METEOR's alignment incorporates missing-word handling; BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
NIST Metric: Introduction [8]
• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
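The formula slide does not survive the transcript. Per [8], each matched n-gram is weighted by its information, Info(w1..wn) = log2(count(w1..wn-1) / count(w1..wn)), estimated over reference data; a toy sketch of that weight:

```python
import math
from collections import Counter

def info_weight_fn(reference_corpus, max_n=5):
    """Return Info(w1..wn) = log2(count(w1..wn-1) / count(w1..wn)),
    with n-gram counts taken over a (toy) reference corpus."""
    counts = Counter()
    for sentence in reference_corpus:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    total_words = sum(c for g, c in counts.items() if len(g) == 1)

    def info(gram):
        numerator = total_words if len(gram) == 1 else counts[gram[:-1]]
        return math.log2(numerator / counts[gram])
    return info

info = info_weight_fn(["the act was passed", "the amendment act", "of the act"])
print(info(("the", "amendment")))  # ~1.58: a rare continuation of "the"
print(info(("of", "the")))         # 0.0: "the" always follows "of" here
```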
NIST vs BLEU
• Handling incorrect words: NIST adds information gain as an additional level of 'matching'; BLEU sees an n-gram mismatch.
• Handling incorrect word order: NIST has no additional provision beyond the n-gram mismatch (weighted by information gain); BLEU sees an n-gram mismatch.
• Handling recall: NIST uses a brevity-penalty factor based on Lsys and Lref; BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment': what does a BLEU score of 0.006 mean?
• Comparison across systems is easier
GTM: Scores precision and recall
• MMS: Maximum Match Size
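The scoring formulas are lost in the transcript; the idea is that precision = MMS/|candidate| and recall = MMS/|reference|, combined as an F-measure. The sketch below approximates MMS with a greedy longest-run search (the real GTM of [10] solves a maximum-matching problem, and an exponent e > 1 rewards longer contiguous runs):

```python
def longest_common_run(cand, ref):
    """Longest contiguous run of unconsumed matching words."""
    best = (0, 0, 0)  # (length, cand_start, ref_start)
    for i in range(len(cand)):
        for j in range(len(ref)):
            k = 0
            while (i + k < len(cand) and j + k < len(ref)
                   and cand[i + k] is not None
                   and cand[i + k] == ref[j + k]):
                k += 1
            if k > best[0]:
                best = (k, i, j)
    return best

def gtm(candidate, reference, e=1.0):
    cand, ref = candidate.split(), reference.split()
    c, r, run_sizes = list(cand), list(ref), []
    while True:
        k, i, j = longest_common_run(c, r)
        if k == 0:
            break
        run_sizes.append(k)
        for x in range(k):            # consume matched words
            c[i + x] = r[j + x] = None
    mms = sum(s ** e for s in run_sizes) ** (1.0 / e)
    p, rec = mms / len(cand), mms / len(ref)
    return 2 * p * rec / (p + rec) if p + rec else 0.0

print(gtm("the boy sat on the bench", "the boy was sitting on the bench"))
```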
GTM vs BLEU
• Handling incorrect words: GTM bases precision on the maximum match size; BLEU sees an n-gram mismatch.
• Handling incorrect word order: GTM considers maximum runs; BLEU sees an n-gram mismatch.
• Handling recall: GTM bases recall on the maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty.
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है (mujhe bhookh lag rahi hai)
Candidate: मैं भूखा महसूस कर रहा हूँ (main bhookha mehsoos kar raha hoon)
BLEU score: 0, yet the candidate is clearly entailed by the reference translation.
Entailment-based evaluation [12]
Evaluating MT output for adequacy is an entailment task:
a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system
Pipeline: Analysis (POS tagging, dependency parsing, NER, etc.) -> Alignment (of words and phrases) -> Entailment (predicting whether entailment holds or not)
RTE for MT evaluation
Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
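A sketch of the bidirectional check; `entails` below is a placeholder (a crude lexical-overlap proxy standing in for a real RTE system such as Stanford's), and taking the minimum over the two directions is an illustrative choice, not the feature-based combination used in [12]:

```python
def entails(premise: str, hypothesis: str) -> float:
    """Placeholder entailment confidence in [0, 1]. Swap in a real
    RTE/NLI model; this lexical-overlap proxy is illustration only."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / len(h) if h else 0.0

def adequacy_score(candidate: str, reference: str) -> float:
    """Entailment must hold in both directions, so score both
    directions and take the weaker one."""
    return min(entails(candidate, reference),
               entails(reference, candidate))

print(adequacy_score("The thieves were arrested", "The thieves were held"))
```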
Experimental setup: a single reference translation; two MT dataset families with English as the target language.
NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
Four systems are compared: BLEU-4, MT, RTE, and RTE + MT.
Experiment 1: individual corpora. Experiment 2: combined corpora.
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by considering the constituents of the formula and calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences exist in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows addition of different types of matching
• ROUGE: suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words based on word order; also PER
• TER: based on the edits required to reach the reference translation; also Human TER
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM research report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of Bleu in Machine Translation Research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. In Proceedings of the ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
Questions in context of indicative translations [4]
bull Can systems be compared with each other using human reference translations
bull Is it wise to track the progress of a system by comparing its output with human translations in case of indicative translations
Is ldquofailure of MTrdquo (defined using any measure) simply because of ldquofailure in relation to inappropriate goalsrdquo (translating like a human)
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Linguistic Divergences
bull Categorical divergencebull Noun-noun compoundsbull Cultural differencesbull Pleonastic divergencebull Stylistic differencesbull WSD errors
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Categorical divergence
bull Change in lexical category
Unigram precision 0Bigram precision 0BLEU score 0
E I am feeling hungry H मझ भख लग रह ह
mujhe bhookh lag rahi haito-me hunger feeling is
I म भखा महसस कर रहा ह main bhookha mehsoos kar raha hoon I hungry feel doing am
E I am sweating
H मझ पसीना आ रहा हMujhe paseena aa raha hai
To-me sweat coming is
Noun-noun compounds
The ten best Aamir Khan performances H
आ मर ख़ान क दस सव तम पफॉम सस Aamir khaan ki dus sarvottam performancesAamir khan-of ten best performances
I
दस सव तम आ मर ख़ान पफॉम ससDus sarvottam aamir khaan performancesTen best Aamir Khan performances
Unigrams precision 55Bi-grams precision 24
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = #Edits / Avg. number of reference words
• The cost of a shift's 'distance' is not incorporated
• Mis-capitalization is also considered an error
For the example above: four edits (one shift, one deletion, one substitution, one insertion) over eight reference words, so TER = 4 / 8
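Plugging the example in (a trivial sketch; the edit counts are the four marked above):

```python
edits = {"shift": 1, "deletion": 1, "substitution": 1, "insertion": 1}
avg_ref_words = 8                       # the reference has eight words
ter = sum(edits.values()) / avg_ref_words
print(ter)                              # 4 / 8 = 0.5
```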
HTER: Human-targeted TER
• A semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
• They must preserve fluency and adequacy while generating this closest target translation
TER vs. BLEU
• Handling incorrect words: TER counts a substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: TER incorporates this error as a shift (or a delete + insert); BLEU sees an n-gram mismatch
• Handling recall: in TER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
(TER = #Edits / Avg. number of reference words)
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions, and insertions needed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant also exists: the position-independent word error rate (PER)
WER: Example
• The order of words is important
• A dynamic programming-based implementation finds the 'minimum' number of errors
Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
Number of errors: 3
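A minimal sketch of the dynamic program; on the example above it counts the three errors (looks -> seems and correct -> right as substitutions, like as a deletion):

```python
def wer(reference, hypothesis):
    """Word Error Rate via word-level edit distance (dynamic programming)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn hyp[:j] into ref[:i]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i-1] == hyp[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,           # deletion
                          d[i][j-1] + 1,           # insertion
                          d[i-1][j-1] + sub)       # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("This looks like the correct sentence",
          "This seems the right sentence"))        # 3 errors / 6 words = 0.5
```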
WER vs. BLEU
• Handling incorrect words: WER counts a substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: WER counts a delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
WER vs. TER
• Shifts: WER does not take shifts into account; TER incorporates shifts, though the shift cost is not weighted by distance
• Intuition: WER is based on errors in unigram translation; TER is based on the edits a human would require to fix the output
(TER = #Edits / Avg. number of reference words)
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall
Will BLEU be able to judge the quality of long sentences?
Reference translation: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा (Kya bloo lambe waakya ki guNvatta ko samajh paaega)
Candidate translation: लंबे वाक्य (Lambe vaakya)
ROUGE-2 = 1/8 (only one of the reference's eight bigrams is matched)
Modified bigram precision = 1/1 = 1, even though the candidate drops most of the sentence
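A minimal sketch of ROUGE-N as n-gram recall; on the transliterated tokens above it reproduces the 1/8 bigram figure:

```python
from collections import Counter

def rouge_n(candidate, reference, n):
    """ROUGE-N: fraction of the reference's n-grams found in the candidate."""
    grams = lambda t: Counter(tuple(t[i:i+n]) for i in range(len(t) - n + 1))
    cand, ref = grams(candidate.split()), grams(reference.split())
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / sum(ref.values())

ref = "kya bloo lambe waakya ki gunvatta ko samajh paaega"
cand = "lambe waakya"
print(rouge_n(cand, ref, 2))   # 1 matched bigram / 8 reference bigrams = 0.125
```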
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, weighted by length
• ROUGE-S: precision/recall over matching skip-bigrams
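ROUGE-S's skip-bigrams are just ordered word pairs within a window; a minimal sketch, with sentences similar to the examples in [6]:

```python
from itertools import combinations

def skip_bigrams(tokens, max_gap=4):
    """All ordered word pairs whose positions are within max_gap of each other."""
    return {(tokens[i], tokens[j])
            for i, j in combinations(range(len(tokens)), 2)
            if j - i <= max_gap}

ref  = "police killed the gunman".split()
cand = "police kill the gunman".split()
common = skip_bigrams(cand) & skip_bigrams(ref)
# 3 common skip-bigrams: ('police','the'), ('police','gunman'), ('the','gunman')
print(common)
```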
ROUGE (a suite of metrics) vs. BLEU
• Handling incorrect words: ROUGE uses skip-bigrams and ROUGE-N; BLEU sees an n-gram mismatch
• Handling incorrect word order: ROUGE uses the longest common subsequence; BLEU sees an n-gram mismatch
• Handling recall: ROUGE-N incorporates missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea: have a good unigram matching strategy
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process
Phase I: find an alignment between words
• List all possible mappings based on matches
Phase II: select the best possible alignment
• Select the alignment with the least number of alignment crosses/overlaps
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.
Reference: The intelligent and excited boy jumped
Candidate: The intelligent and excited dude jumped
A candidate whose alignment crosses: The excited and intelligent guy pounced
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean
• Penalty: group the matched unigrams into as few chunks as possible
Example: The bright boy sits on the black bench / The intelligent guy sat on the dark bench
More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
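A sketch of the scoring step with the parameterization from the METEOR paper [7] (recall weighted nine to one, cubic chunk penalty); the match and chunk counts are assumed to come from the alignment phase above:

```python
def meteor_score(matches, chunks, cand_len, ref_len):
    """METEOR final score from alignment statistics.

    matches  : number of aligned unigrams
    chunks   : number of contiguous chunks the matches form
    cand_len : candidate length in words
    ref_len  : reference length in words
    """
    if matches == 0:
        return 0.0
    p, r = matches / cand_len, matches / ref_len
    f_mean = 10 * p * r / (r + 9 * p)          # harmonic mean, recall-weighted
    penalty = 0.5 * (chunks / matches) ** 3    # fewer, longer chunks -> smaller penalty
    return f_mean * (1 - penalty)
```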
METEOR vs. BLEU
• Handling incorrect words: METEOR uses alignment chunks, and matching can be done with different techniques, so it is adaptable; BLEU sees an n-gram mismatch
• Handling incorrect word order: chunks may be ordered in any manner, and METEOR does not capture this; BLEU sees an n-gram mismatch
• Handling recall: the idea of alignment incorporates missing-word handling; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
NIST Metric: Introduction [8]
• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
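The formula on this slide did not survive extraction; the information weight NIST [8] gives a matched n-gram can be sketched from corpus counts as follows:

```python
from math import log2

def nist_info(ngram, counts):
    """Information weight of a matched n-gram:
    Info(w1..wn) = log2( count(w1..w_{n-1}) / count(w1..wn) ).

    `counts` maps token tuples to their reference-corpus frequencies,
    with counts[()] taken as the total number of tokens.
    Rare continuations (small count relative to their prefix) get more weight.
    """
    prefix = ngram[:-1]
    return log2(counts[prefix] / counts[ngram])

# e.g. if 'of the' occurs almost every time 'of' occurs, its weight is near 0,
# while a rare continuation like ('Amendment', 'Act') gets a large weight.
```

These per-n-gram weights are summed over the co-occurring n-grams, normalized by the number of candidate n-grams, and multiplied by a brevity factor computed from Lsys and Lref.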
NIST vs. BLEU
• Handling incorrect words: NIST adds an additional level of 'matching' via information gain; BLEU sees an n-gram mismatch
• Handling incorrect word order: NIST has no additional provision beyond the (information-gain-weighted) n-gram mismatch; BLEU sees an n-gram mismatch
• Handling recall: NIST uses a brevity penalty factor based on Lsys and Lref; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment' (what does a BLEU score of 0.006 mean?)
• Comparison across systems is easier
GTM scores: precision and recall
• MMS: Maximum Match Size
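A minimal sketch of GTM's F-score under the simplifying assumption of exponent e = 1, where the maximum match size reduces to clipped word overlap (the full metric searches the alignment grid for non-conflicting runs and rewards longer ones):

```python
from collections import Counter

def gtm_f_score(candidate, reference):
    """GTM with exponent e = 1: match size is plain clipped word overlap.

    The full metric generalizes the match size to reward long contiguous
    runs of matched words (e > 1); that search is omitted here.
    """
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())   # MMS for e = 1
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

print(gtm_f_score("the boy sat on the bench",
                  "the boy sits on the bench"))   # 5/6 ≈ 0.833
```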
GTM vs. BLEU
• Handling incorrect words: GTM bases precision on the maximum match size; BLEU sees an n-gram mismatch
• Handling incorrect word order: GTM considers maximum runs; BLEU sees an n-gram mismatch
• Handling recall: GTM bases recall on the maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline
Manual Evaluation
BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, Entailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है (mujhe bhookh lag rahi hai)
Candidate: मैं भूखा महसूस कर रहा हूँ (main bhookha mehsoos kar raha hoon)
BLEU score: 0, yet the candidate is clearly entailed by the reference translation
Entailment-based evaluation [12]
Evaluating MT output for adequacy is an entailment task:
a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa
The Stanford RTE system
Three stages:
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
RTE for MT evaluation
The same three-stage pipeline (analysis, alignment, entailment prediction) is applied to MT output.
Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
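A sketch of the decision rule; `entails` stands in for a hypothetical RTE classifier (its name and confidence output are illustrative, not the API of [12]):

```python
def entails(premise: str, hypothesis: str) -> float:
    """Hypothetical RTE classifier: returns a confidence in [0, 1] that
    `premise` entails `hypothesis`. Stands in for the full
    analysis -> alignment -> entailment pipeline described above."""
    raise NotImplementedError

def adequacy_score(candidate: str, reference: str) -> float:
    # Entailment must hold in BOTH directions for full adequacy,
    # so take the weaker of the two directional judgments.
    return min(entails(candidate, reference),
               entails(reference, candidate))
```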
Experimental setup: single reference translation; two MT dataset families with English as the target language
NIST MT datasets:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
Four scoring systems are compared: BLEU-4, MT, RTE, and RTE + MT
Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by considering the constituents of the formula and by calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from the reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on the maximum match size
• NIST: information gain-based; weights content and function words differently
• METEOR: considers alignments; allows different types of matching to be added
• ROUGE: a suite of metrics incorporating n-gram recall, skip-bigrams, etc.
• WER: word errors based on word order; also PER
• TER: based on the edits required to reach the reference translation; also human-targeted TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. In Proceedings of EACL, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of HLT, 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Noun-noun compounds
E: The ten best Aamir Khan performances
H: आमिर ख़ान की दस सर्वोत्तम परफ़ॉर्मेंसेस (Aamir khaan ki dus sarvottam performances; Aamir Khan-of ten best performances)
I: दस सर्वोत्तम आमिर ख़ान परफ़ॉर्मेंसेस (Dus sarvottam aamir khaan performances; ten best Aamir Khan performances)
Unigram precision: 5/5; bigram precision: 2/4
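A quick check of those figures with clipped n-gram precision over the transliterated tokens (a sketch; real BLEU tokenization differs):

```python
from collections import Counter

def modified_precision(candidate, reference, n):
    """Returns (clipped matches, total candidate n-grams)."""
    grams = lambda t: Counter(tuple(t[i:i+n]) for i in range(len(t) - n + 1))
    cand, ref = grams(candidate.split()), grams(reference.split())
    return sum(min(c, ref[g]) for g, c in cand.items()), sum(cand.values())

ref  = "aamir khaan ki dus sarvottam performances"
cand = "dus sarvottam aamir khaan performances"
print(modified_precision(cand, ref, 1))   # (5, 5): unigram precision 5/5
print(modified_precision(cand, ref, 2))   # (2, 4): bigram precision 2/4
```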
Cultural differences
• Culture-specific kinship terms such as 'maamaa' and 'taau' have no exact equivalents
E: Food, clothing and shelter are a man's basic needs
H: रोटी कपड़ा और मकान एक मनुष्य की बुनियादी ज़रूरतें हैं (roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain; bread, clothing and house a man of basic needs are)
I: खाना कपड़ा और आश्रय एक मनुष्य की बुनियादी ज़रूरतें हैं (khaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain; food, clothing and shelter a man of basic needs are)
Unigram precision: 8/10; bigram precision: 6/9
Pleonastic divergence
• Words with no semantic content in the target language
E: It is raining
H: बारिश हो रही है (baarish ho rahi hai; rain happening is)
I: यह बारिश हो रही है (yeh baarish ho rahi hai; it rain happening is)
Unigram precision: 4/5; bigram precision: 3/4
E: One should not trouble the weak
H: दुर्बलों को परेशान नहीं करना चाहिए (durbalon ko pareshaan nahi karna chahiye; to-weak trouble not do should; should not trouble the weak)
Stylistic differences
• Typical styles in a language
E: The Lok Sabha has 545 members
H: लोक सभा में 545 सदस्य हैं (lok sabha mein 545 sadasya hain; Lok Sabha in 545 members are)
I: लोक सभा के पास 545 सदस्य हैं (lok sabha ke paas 545 sadasya hain; Lok Sabha has/near 545 members are)
Unigram precision: 5/7; bigram precision: 3/6
WSD errors
• Synonyms vs. incorrect senses
E: I purchased a bat
H (reference): मैंने एक बल्ला ख़रीदा (maine ek ballaa kharidaa; I a cricket-bat bought; I bought a cricket bat)
I: मैंने एक चमगादड़ ख़रीदा (maine ek chamgaadaD kharidaa; I a bat (the mammal) bought; I bought a bat (the mammal))
E: The thieves were held
H: चोरों को गिरफ़्तार किया (choron ko giraftaar kiyaa; thieves arrest done; the thieves were arrested)
I1: चोरों को पकड़ा (choron ko pakdaa; thieves caught; the thieves were caught)
I2: चोरों को आयोजित किया (choron ko aayojit kiya; thieves organized done; the thieves were organized)
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Cultural differences
Unigram precision 810Bigram precision 69bull lsquomaamaarsquorsquotaaursquo
E Food clothing and shelter are a mans basic needs H रोट कपड़ा और मकान एक मन य क ब नयाद ज़ रत ह
roti kapda aur makaan ek manushya ki buniyaadi zarooratein hain bread clothing and house a man of basic needs are
I खाना कपड़ा और आ य एक मन य क ब नयाद ज़ रत हkhaana kapdaa aur aashray ek manushya ki buniyaadi zarooratein hainfood clothing and shelter a man of basic needs are
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Pleonastic divergence
bull Words with no semantic content in target language
Unigram precision 45Bigram precision 34
E It is raining H बा रश हो रह ह
baarish ho rahi hairain happening is
I यह बा रश हो रह ह yeh baarish ho rahi haiit rain happening is
E One should not trouble the weak
H दबल को परशान नह करना चा हए durbalon ko pareshaan nahi krana
chahiyeto-weak trouble not do shouldShould not trouble the weak
Stylistic differences
bull Typical styles in a language
Unigram precision 57Bigram precision 36
E The Lok Sabha has 545 membersH लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hainLok Sabha in 545 members are
I लोक सभा क पास 545 सद य हlok sabha ke paas 545 sadasya hainLok Sabha hasnear 545 members are
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
– considering the constituents of the formula
– calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutations of phrases are possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on Maximum Match Size
• NIST: information gain-based; weights content and function words differently
• METEOR: considers alignments; allows addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words based on word order; also PER
• TER: based on the edits required to obtain the reference translation; also Human TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-)evaluation of machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
Stylistic differences
• Every language has its typical styles; a candidate may be correct yet phrased differently from the reference
E: The Lok Sabha has 545 members
H: लोक सभा में 545 सदस्य हैं (lok sabha mein 545 sadasya hain) — 'Lok Sabha in 545 members are'
I: लोक सभा के पास 545 सदस्य हैं (lok sabha ke paas 545 sadasya hain) — 'Lok Sabha has/near 545 members are'
Unigram precision: 5/7; bigram precision: 3/6
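A quick check of those figures (a plain clipped n-gram precision over the transliterated tokens; an illustrative snippet, not the full BLEU computation):

```python
from collections import Counter

def clipped_ngram_precision(cand, ref, n):
    # Count candidate n-grams, clipping each count by its count in the reference.
    cand_ngrams = Counter(zip(*[cand[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return clipped, sum(cand_ngrams.values())

ref = "lok sabha mein 545 sadasya hain".split()
cand = "lok sabha ke paas 545 sadasya hain".split()
print(clipped_ngram_precision(cand, ref, 1))  # -> (5, 7)
print(clipped_ngram_precision(cand, ref, 2))  # -> (3, 6)
```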
WSD errors
• Synonyms vs. incorrect senses
E: I purchased a bat
H (reference): मैंने एक बल्ला खरीदा (maine ek ballaa kharidaa) — 'I bought a cricket bat'
I: मैंने एक चमगादड़ खरीदा (maine ek chamgaadaD kharidaa) — 'I bought a bat (the mammal)'
E: The thieves were held
H: चोरों को गिरफ़्तार किया (choron ko giraftaar kiyaa) — 'The thieves were arrested'
I1: चोरों को पकड़ा (choron ko pakdaa) — 'The thieves were caught' (acceptable synonym)
I2: चोरों को आयोजित किया (choron ko aayojit kiya) — 'The thieves were organized' (incorrect sense)
Evaluation: Precision
BLEU makes both kinds of errors:
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
– Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating language-specific divergences (see the sketch after this list)
– Limiting the use of BLEU [3][4]: do not use BLEU to compare radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU: use it only as an 'evaluation understudy'
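A minimal sketch of the 'redefined match' idea under stated assumptions: NLTK with its Porter stemmer and WordNet data installed, whitespace tokenization, and a match policy (surface form, then root form, then WordNet synonym) chosen here for illustration rather than taken from [4]:

```python
# Requires: pip install nltk; then nltk.download('wordnet') once.
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn

stem = PorterStemmer().stem

def relaxed_match(cand_word, ref_words):
    ref_stems = {stem(w) for w in ref_words}
    if cand_word in ref_words:                 # exact surface match
        return True
    if stem(cand_word) in ref_stems:           # root-form match
        return True
    synonyms = {lemma.name().lower()           # WordNet synonym match
                for synset in wn.synsets(cand_word)
                for lemma in synset.lemmas()}
    return bool({stem(s) for s in synonyms} & ref_stems)

cand = "the boys bought a wicket".split()
ref = set("the boy purchased a bat".split())
# Relaxed unigram precision; with standard WordNet data, 'wicket' is the
# only unmatched word ('bought' matches 'purchased' via the synonym check).
print(sum(relaxed_match(w, ref) for w in cand) / len(cand))
```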
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
WSD errors
bull Synonyms vs incorrect sensesE I purchased a batH मन एक ब ला खर दा (reference)
maine ek ballaa kharidaaI a cricket-bat boughtI bought a cricket bat
I मन एक चमगादड़ खर दाmaine ek chamgaadaD kharidaaI a bat (mammal) bought I bought a bat (mammal)
E The thieves were heldH चोर को गर तार कया
choron ko giraftaar kiyaathieves arrest doneThe thieves were arrested
I1 चोर को पकड़ाchoron ko pakdaathieves caughtThe thieves were caught
I2 चोर को आयोिजत कयाchoron ko aayojit kiyathieves organized doneThe thieves were organized
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Evaluation Precision
bull Acceptable translations are rejected by BLEUbull Unacceptable translations are accepted by BLEU
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Future of BLEU
bull Failed correlation with human judgment [3]
bull Suggested changes to BLEU score
bull Not be overly reliant on BLEU Use it only as an lsquoevaluation understudyrsquo
Re-defining a match [4]Allowing synonyms
Allowing root forms of wordsIncorporating specific language
divergences
Limiting use of BLEU [3] [4]Not use BLEU for radically
different systems
Evaluate on the basis of nature of indicative translation
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric: Introduction [8]
- MT evaluation metric proposed by the National Institute of Standards and Technology
- Are all n-gram matches the same? NIST weights more heavily those n-grams that are more informative (i.e., rarer ones)
- Matching 'Amendment Act' is better than matching 'of the'
- When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
The weight of an n-gram is its information over the reference corpus: Info(w1 ... wn) = log2(count(w1 ... wn−1) / count(w1 ... wn)).
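A sketch of just this information weighting, following the Doddington (2002) definition; the full NIST score then sums these weights over co-occurring n-grams and applies its own brevity factor:

```python
import math
from collections import Counter

def nist_info(reference_corpus: list[str], max_n: int = 5):
    """Returns info(ngram) = log2(count(w1..w_{n-1}) / count(w1..wn)),
    counted over the reference corpus: rarer continuations weigh more."""
    counts, total = Counter(), 0
    for sent in reference_corpus:
        words = sent.split()
        total += len(words)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    def info(ngram: tuple) -> float:
        history = counts[ngram[:-1]] if len(ngram) > 1 else total
        return math.log2(history / counts[ngram])
    return info

# 'of the' is frequent, so info(('of', 'the')) is low;
# a rare bigram like ('Amendment', 'Act') gets a high weight.
```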
NIST vs BLEU
- Handling incorrect words: NIST adds an additional level of 'matching' via information gain; BLEU sees an n-gram mismatch
- Handling incorrect word order: no additional provision beyond n-gram mismatch (weighted by information gain); BLEU sees an n-gram mismatch
- Handling recall: NIST uses a brevity penalty factor based on Lsys and Lref; BLEU's precision cannot detect 'missing' words, hence its brevity penalty
Outline
- Manual Evaluation
- Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
GTM [10]
- General Text Matcher
- An F-score: uses precision and recall
- Does not rely on correlation with 'human judgment': what does a BLEU score of 0.006 mean?
- Comparison between systems is easier
GTM: Scores Precision and Recall
- MMS: Maximum Match Size
- Precision = MMS(C, R) / |C| and Recall = MMS(C, R) / |R|, where C is the candidate and R the reference; the two are combined into an F-score
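A sketch of the resulting F-score with the exponent fixed at e = 1, where MMS reduces to clipped word overlap (real GTM uses e > 1, so longer contiguous runs count for more):

```python
from collections import Counter

def gtm_fscore(reference: str, candidate: str) -> float:
    """GTM F-score sketch with e = 1: MMS(C, R) is plain clipped word overlap."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    mms = sum(min(c, ref[w]) for w, c in cand.items())    # maximum match size
    if mms == 0:
        return 0.0
    precision = mms / sum(cand.values())                  # MMS / |C|
    recall = mms / sum(ref.values())                      # MMS / |R|
    return 2 * precision * recall / (precision + recall)
```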
GTM vs BLEU
- Handling incorrect words: GTM bases precision on the maximum match size; BLEU sees an n-gram mismatch
- Handling incorrect word order: GTM considers maximum runs; BLEU sees an n-gram mismatch
- Handling recall: GTM bases recall on the maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline
- Manual Evaluation
- Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मझ भख लग रह ह (mujhe bhookh lag rahi hai)
Candidate: म भखा महसस कर रहा ह (main bhookha mehsoos kar raha hoon)
BLEU score: 0, yet the candidate is clearly entailed by the reference translation.
Entailment-based evaluation [12]
Evaluating MT output for adequacy is an entailment task: a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system
Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not)
RTE for MT evaluation
The same pipeline is used: Analysis (POS tagging, dependency parsing, NER, etc.) → Alignment (of words and phrases) → Entailment (predicting whether entailment holds or not).
Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
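The bidirectional check can be stated in a few lines; here entails() is a placeholder for any RTE/NLI classifier (such as the Stanford RTE system used in [12]), which we assume rather than implement:

```python
from typing import Callable

def adequate(reference: str, candidate: str,
             entails: Callable[[str, str], bool]) -> bool:
    """Fully adequate only if entailment holds in both directions."""
    return entails(candidate, reference) and entails(reference, candidate)
```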
Experimental Setup
Single reference translation; two families of MT datasets with English as the target language.
NIST MT datasets:
- NIST06: the NIST MT-06 Arabic-to-English dataset
- NIST08A: the NIST MT-08 Arabic-to-English dataset
- NIST08C: the NIST MT-08 Chinese-to-English dataset
- NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
- SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
- SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
- SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
- SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
Four systems are compared: BLEU-4, MT, RTE, and RTE + MT.
Experiment 1: individual corpora.
Experiment 2: combined corpora.
Conclusion: Formulation of BLEU (1/3)
- Introduced different evaluation methods
- Formulated the BLEU score
- Analyzed the BLEU score by considering the constituents of the formula and by calculating BLEU for a dummy example
- Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
- Permutation of phrases is possible [3]
- Different items may be drawn from reference sets [3]
- Linguistic divergences in English-Hindi MT [4]
- BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
- GTM: modifies precision and recall based on the maximum match size
- NIST: information gain-based; comparatively weights content and function words
- METEOR: considers alignments; allows the addition of different types of matching
- ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
- WER: word-level errors based on word order; also PER
- TER: based on the edits required to reach the reference translation; also Human TER (HTER)
- Manual evaluation: basic techniques and bottlenecks
- Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries, 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology, 2008.
Conclusion
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Part III Overview of MT Part III Overview of MT Evaluation MetricsEvaluation Metrics
Presented as a part of CS712 byAditya JoshiKashyap PopatShubham Gautam
21st March 2013
GuideProf Pushpak Bhattacharyya
IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Lord Ganesha and Lord Karthikeya set out on a race to go around the world
Source wwwjaishreeganeshacom
Lord Karthikeya ldquoI won I went around the earth the world oncerdquo
Lord Ganesha ldquoI won I went around my parents They are my worldrdquo
Who among them performed better Who won the race
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
OutlineOutline
Manual Evaluation
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Preliminaries
bull Candidate Translation(s) Translation returned by an MT systemndash Also called hypothesis translation
bull Reference Translation(s) lsquoPerfectrsquo translation by humans
bull Comparing metrics with BLEU
Handling incorrect words
Handling incorrect word order
Handling recall
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy is essentially an entailment task: a candidate translation (i.e., MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system pipeline:
• Analysis: POS tagging, dependency parsing, NER, etc.
• Alignment: alignment of words and phrases
• Entailment: predicting whether entailment holds or not
RTE for MT evaluation reuses the same three-stage pipeline (analysis, alignment, entailment prediction) on candidate-reference pairs.
Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate; a minimal sketch follows below.
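A minimal sketch of this bidirectional check, assuming a hypothetical entails(premise, hypothesis) predictor that stands in for an RTE system; this is not the actual API of the Stanford system or any library:

def adequate(candidate: str, reference: str, entails) -> bool:
    """entails is a hypothetical RTE predictor:
    entails(premise, hypothesis) -> bool."""
    # Unlike standard (asymmetric) RTE, MT adequacy needs both directions
    return entails(candidate, reference) and entails(reference, candidate)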
Experimental Setup
• Single reference translation
• Two MT dataset families, with English as the target language
• NIST MT datasets:
  – NIST06: the NIST MT-06 Arabic-to-English dataset
  – NIST08A: the NIST MT-08 Arabic-to-English dataset
  – NIST08C: the NIST MT-08 Chinese-to-English dataset
  – NIST08U: the NIST MT-08 Urdu-to-English dataset
• ACL SMT datasets:
  – SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
  – SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
  – SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
  – SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup (contd.)
• Four systems are compared: BLEU-4, MT, RTE, and RTE + MT
• Experiment 1: individual corpora
• Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
  – considering the constituents of the formula
  – calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not sit well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; weights content and function words differently
• METEOR: considers alignments; allows addition of different types of matching
• ROUGE: suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words, based on word order; also PER
• TER: based on edits required to obtain the reference translation; also Human TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: reference translation must entail candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
OutlineOutline
Automatic Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Manual Evaluation
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Manual evaluation [11]
Common techniques1 Assigning fluency and adequacy scores on
five (Absolute)
2 Ranking translated sentences relative to each other (Relative)
3 Ranking translations of syntactic constituents drawn from the source sentence (Relative)
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Manual evaluation Assigning Adequacy and fluency
bull Evaluators use their own perception to ratebull Often adequacyfluency scores correlate undesirable
Adequacyis the meaning translated correctly5 = All4 = Most3 = Much2 = Little1 = None
FluencyIs the sentence grammatically valid
5 = Flawless English4 = Good English
3 = Non-native English2 = Disfluent English
1 = Incomprehensible
म एक या यान बठा ह Main ek vyaakhyan baitha hoon
I a lecture sit (Present-first person)I sit a lecture
Adequate but not fluent
म या यान ह Main vyakhyan hoon
I lecture amI am lecture
Fluent but not adequate
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = #edits / (average #reference words)
• The 'distance' of a shift carries no extra cost
• Mis-capitalization is also counted as an error
For the example above (1 shift + 1 deletion + 1 substitution + 1 insertion = 4 edits, against an 8-word reference):
TER = 4 / 8 = 0.5
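The score itself is a simple ratio; the hard part of a real implementation is the search over shifts (Snover et al. [5] use a greedy search). Below is a minimal sketch of just the scoring step, with the example's hand-counted edits plugged in.

```python
def ter(num_edits, references):
    """TER = edits / average reference length (in words).
    `num_edits` must already include shifts, which a real TER
    implementation finds with a greedy search; here we assume
    the edits were counted by hand."""
    avg_ref_len = sum(len(r.split()) for r in references) / len(references)
    return num_edits / avg_ref_len

# The slide's example: 1 shift + 1 deletion + 1 substitution + 1 insertion
# against an 8-word reference.
print(ter(4, ["Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge"]))  # 0.5
```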
HTER: Human-targeted TER
• A semi-automatic (human-in-the-loop) technique
• Human annotators create a new reference for each translation, based on the system output
– They must preserve fluency and adequacy while generating the closest possible target translation
TER vs BLEU
• Handling incorrect words: TER uses substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: TER uses a shift (or a delete + insert); BLEU sees an n-gram mismatch
• Handling recall: in TER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
(Recall: TER = #edits / average #reference words)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions and insertions needed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant, PER (position-independent word error rate), also exists
WER Example
• The order of words is important
• A dynamic-programming implementation finds the 'minimum' number of errors
Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
#errors = 3 (looks→seems, 'like' deleted, correct→right), so WER = 3/6 = 0.5
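A word-level Levenshtein distance is a few lines of dynamic programming; this sketch (our own, with unit costs) reproduces the 3 errors of the example.

```python
def word_errors(reference, candidate):
    """Minimum substitutions, deletions and insertions turning
    `candidate` into `reference`, via dynamic programming."""
    ref, hyp = reference.split(), candidate.split()
    # dp[i][j] = edits to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # insert i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # delete j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i-1] == hyp[j-1] else 1
            dp[i][j] = min(dp[i-1][j-1] + sub,   # match / substitution
                           dp[i-1][j] + 1,       # insertion
                           dp[i][j-1] + 1)       # deletion
    return dp[-1][-1]

errors = word_errors("This looks like the correct sentence",
                     "This seems the right sentence")
print(errors, errors / 6)   # 3 edits over a 6-word reference: WER = 0.5
```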
WER vs BLEU
• Handling incorrect words: WER uses substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: WER uses a delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
WER vs TER
• TER incorporates shifts (though a shift's distance is not weighted); WER does not account for shifts at all, so a moved phrase costs a deletion plus an insertion
• WER's intuition is based on errors at the word level; TER's intuition is based on the edits a human would require
(Recall: TER = #edits / average #reference words)
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall. Will BLEU be able to judge the quality of long sentences?
Reference translation: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा (Kya bloo lambe waakya ki guNvatta ko samajh paaega)
Candidate translation: लंबे वाक्य (Lambe vaakya)
ROUGE-2 = 1/8 (one matching bigram out of the eight reference bigrams), even though the modified bigram precision is 1.
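A sketch of ROUGE-N as reference-side n-gram recall (our own simplification); run on the transliterated example it reproduces the 1/8.

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    """ROUGE-N: matching n-grams / n-grams in the reference (recall)."""
    def ngrams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i+n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n("lambe waakya",
              "kya bloo lambe waakya ki gunvatta ko samajh paaega"))  # 0.125 = 1/8
```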
Other ROUGEs
• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, weighted by length
• ROUGE-S: precision/recall over matching skip-bigrams (see the sketch below)
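As a flavour of ROUGE-S, here is a deliberately simplified skip-bigram recall. Real ROUGE-S counts pairs with multiplicity and can cap the skip distance; this set-based version with unlimited skips is only illustrative.

```python
from itertools import combinations

def skip_bigrams(text):
    """All in-order word pairs, with any gap allowed (set-based)."""
    return set(combinations(text.split(), 2))

def rouge_s_recall(candidate, reference):
    """Fraction of the reference's skip-bigrams found in the candidate."""
    cand, ref = skip_bigrams(candidate), skip_bigrams(reference)
    return len(cand & ref) / max(len(ref), 1)

print(rouge_s_recall("the cat was found", "the cat sat on the mat"))
```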
ROUGE (suite of metrics) vs BLEU
• Handling incorrect words: ROUGE uses skip-bigrams and ROUGE-N; BLEU sees an n-gram mismatch
• Handling incorrect word order: ROUGE-L uses the longest common subsequence; BLEU sees an n-gram mismatch
• Handling recall: ROUGE-N incorporates missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
METEOR [7]
• Aims to do better than BLEU
• Central idea: have a good unigram matching strategy
METEOR: Criticisms of BLEU
• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process
Phase I: find an alignment between words
– List all possible mappings based on matches
Phase II: select the best possible alignment
– Select the alignment with the least number of alignment crosses/overlaps
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.
Example:
Reference(s): The intelligent and excited boy jumped / The excited and intelligent guy pounced
Candidate: The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean
• Penalty: find 'as many chunks' as match
Reference: The bright boy sits on the black bench
Candidate: The intelligent guy sat on the dark bench
More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
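Concretely, Banerjee and Lavie [7] combine unigram precision P and recall R into a recall-weighted harmonic mean and discount it by a fragmentation penalty. The sketch below assumes the alignment phase has already produced the match and chunk counts; the numbers in the usage line are hand-counted for the bench example above.

```python
def meteor_score(matches, cand_len, ref_len, chunks):
    """METEOR: Fmean = 10PR / (R + 9P), discounted by a fragmentation
    penalty of 0.5 * (chunks / matches)^3 (Banerjee & Lavie, 2005)."""
    if matches == 0:
        return 0.0
    p, r = matches / cand_len, matches / ref_len
    fmean = 10 * p * r / (r + 9 * p)
    penalty = 0.5 * (chunks / matches) ** 3
    return fmean * (1 - penalty)

# Bench example: 4 exactly matched unigrams ('The', 'on', 'the', 'bench')
# fall into 3 chunks, with 8 words on each side.
print(meteor_score(matches=4, cand_len=8, ref_len=8, chunks=3))  # ~0.39
```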
METEOR vs BLEU
• Handling incorrect words: METEOR's matching within alignment chunks can be done using different techniques, so it is adaptable; BLEU sees an n-gram mismatch
• Handling incorrect word order: chunks may be ordered in any manner, and METEOR does not capture this; BLEU sees an n-gram mismatch
• Handling recall: the idea of alignment incorporates the handling of missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
NIST Metric: Introduction [8]
• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., the rarer ones): matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
The weight of an n-gram is its information gain over the corpus: Info(w1 … wn) = log2( count(w1 … wn−1) / count(w1 … wn) ).
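A sketch of that information weighting (after Doddington [8]); the toy corpus counts are invented purely for illustration.

```python
import math
from collections import Counter

def info_weight(ngram, counts):
    """Info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn)):
    the rarer the n-gram given its history, the higher the weight."""
    history = ngram[:-1]
    # For unigrams, fall back to the total token count (approximated
    # here by summing the unigram entries of the toy table).
    hist_count = counts[history] if history else sum(
        c for g, c in counts.items() if len(g) == 1)
    return math.log2(hist_count / counts[ngram])

# Toy statistics: 'of the' is frequent, 'Amendment Act' is rare,
# so the latter earns far more credit when matched.
counts = Counter({("of",): 1000, ("of", "the"): 900,
                  ("Amendment",): 10, ("Amendment", "Act"): 2})
print(info_weight(("of", "the"), counts))        # ~0.15 bits
print(info_weight(("Amendment", "Act"), counts)) # ~2.32 bits
```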
NIST vs BLEU
• Handling incorrect words: NIST adds an extra level of 'matching' via information gain; BLEU sees an n-gram mismatch
• Handling incorrect word order: no additional provision, only the n-gram mismatch (weighted by information gain); BLEU sees an n-gram mismatch
• Handling recall: NIST uses a brevity-penalty factor based on Lsys and Lref; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
GTM [10]
• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment' (what does a BLEU score of 0.006 mean?)
• Comparison across systems is therefore easier
GTM Scores: Precision and Recall
• MMS: Maximum Match Size, the size of a maximum matching between candidate and reference that rewards longer contiguous runs of words
• Precision = MMS / |candidate|; Recall = MMS / |reference| (see the sketch below)
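A rough sketch of the idea (after Turian et al. [10]): find non-overlapping common runs, reward longer runs with an exponent e, and read precision and recall off the match size. Real GTM solves a maximum-matching problem exactly; the greedy longest-run-first search below is our simplification.

```python
def gtm(candidate, reference, e=2.0):
    """Greedy GTM sketch: repeatedly take the longest common run of
    words, then score match size as (sum(len^e))^(1/e)."""
    cand, ref = candidate.split(), reference.split()
    used_c, used_r, runs = set(), set(), []
    while True:
        best = None                      # (length, cand index, ref index)
        for ci in range(len(cand)):
            for ri in range(len(ref)):
                k = 0
                while (ci + k < len(cand) and ri + k < len(ref)
                       and cand[ci+k] == ref[ri+k]
                       and ci + k not in used_c and ri + k not in used_r):
                    k += 1
                if k and (best is None or k > best[0]):
                    best = (k, ci, ri)
        if best is None:
            break
        k, ci, ri = best
        runs.append(k)
        used_c.update(range(ci, ci + k))
        used_r.update(range(ri, ri + k))
    mms = sum(k ** e for k in runs) ** (1 / e)
    return mms / len(cand), mms / len(ref)   # (precision, recall)

print(gtm("This seems the right sentence",
          "This looks like the correct sentence"))
```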
GTM vs BLEU
• Handling incorrect words: GTM's precision is based on the maximum match size; BLEU sees an n-gram mismatch
• Handling incorrect word order: GTM considers maximum runs; BLEU sees an n-gram mismatch
• Handling recall: GTM's recall is based on the maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Entailment-based evaluation [12]
Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है (mujhe bhookh lag rahi hai)
Candidate: मैं भूखा महसूस कर रहा हूँ (main bhookha mehsoos kar raha hoon)
BLEU score: 0. Yet the candidate is clearly entailed by the reference translation.
Entailment-based evaluation [12]
Evaluating MT output for adequacy is an entailment task: a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
RTE for MT evaluation
The same pipeline is reused: analysis (POS tagging, dependency parsing, NER, etc.), alignment of words and phrases, and entailment prediction.
Directionality: the standard entailment recognition task is asymmetric; for MT evaluation, entailment must hold in both directions for the system translation to be fully adequate (see the sketch below).
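In code, the bidirectional requirement is just a conjunction over the two directions. The sketch below assumes a hypothetical scorer `rte_score(premise, hypothesis)` returning an entailment confidence in [0, 1]; the Stanford RTE system exposes no such one-line API, so treat this purely as pseudocode for the idea.

```python
def adequacy_score(candidate, reference, rte_score):
    """Entailment-based adequacy: the candidate must entail the
    reference AND the reference must entail the candidate, so take
    the weaker of the two directional confidences.
    `rte_score` is a hypothetical entailment scorer, not a real API."""
    forward = rte_score(premise=candidate, hypothesis=reference)
    backward = rte_score(premise=reference, hypothesis=candidate)
    return min(forward, backward)
```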
Experimental Setup
Single reference translation; two MT datasets with English as the target language.
NIST MT datasets:
– NIST06: the NIST MT-06 Arabic-to-English dataset
– NIST08A: the NIST MT-08 Arabic-to-English dataset
– NIST08C: the NIST MT-08 Chinese-to-English dataset
– NIST08U: the NIST MT-08 Urdu-to-English dataset
ACL SMT datasets:
– SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
– SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
– SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
– SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
Four scoring systems are compared: BLEU-4; MT (a combination of standard MT metrics); RTE (entailment-based features); RTE + MT.
Experiment 1: individual corpora. Experiment 2: combined corpora.
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by considering the constituents of the formula and by calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutations of phrases are possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not sit well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on the maximum match size
• NIST: information-gain based; weights content and function words differently
• METEOR: considers alignments; allows the addition of different types of matching
• ROUGE: a suite of metrics incorporating n-gram recall, skip-bigrams, etc.
• WER: errors in words, sensitive to word order; also PER
• TER: based on the edits required to reach the reference translation; also human-targeted TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of HLT, 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-)evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.
BLEU [1]
bull Proposed by IBM A popular metricbull Uses modified n-gram precision and brevity
penalty (for recall)
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Translation edit rate[5] (TER)
bull Introduced in GALE MT task
Central idea Edits required to change a hypothesis translation into a reference translation
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Shift Deletion Substitution Insertion
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Prime Minister of India will address the nation today
Reference translationभारत क धान-म ी आज रा को सबो धत करग Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge
Candidate translation धान-म ी भारत क आज को अ स करग आजPradhaan-mantri bhaarat ke aaj ko address karenge aaj
Formula for TER
TER = Edits Avg number of reference wordsbull Cost of shift lsquodistancersquo not incorporatedbull Mis-capitalization also considered an error
Shift Deletion Substitution Insertion
TER = 4 8
HTER Human TER
bull Semi-supervised techniquebull Human annotators make new reference for
each translation based on system outputndash Must consider fluency and adequacy while
generating the closest target translation
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
HTER: Human TER

• A semi-supervised technique
• Human annotators create a new reference for each translation, based on the system output
  – They must consider fluency and adequacy while generating the closest target translation
TER vs BLEU

Handling incorrect words:
  TER: substitution
  BLEU: n-gram mismatch
Handling incorrect word order:
  TER: a shift, or a delete + insert, incorporates this error
  BLEU: n-gram mismatch
Handling recall:
  TER: missed words become deleted words
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty

TER = #Edits / Average number of reference words
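As a quick worked instance of that formula: the hypothesis in the WER example below needs 3 edits against a single 6-word reference, giving TER = 3/6 = 0.5; with multiple references, the denominator is the average reference length.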
Outline

Manual Evaluation
Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Word Error Rate (WER) [9]

• Based on the Levenshtein distance (Levenshtein, 1966)
• The minimum number of substitutions, deletions, and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• There is also a position-independent word error rate (PER)
WER: Example

• Order of words is important
• A dynamic programming implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence
Candidate translation: This seems the right sentence
Number of errors: 3 (looks→seems, delete 'like', correct→right)
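The dynamic program the slide alludes to is compact enough to show in full. A minimal illustrative Python sketch (names and structure are our own, not a standard API) that reproduces the example's three edits:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning hyp[:j] into ref[:i]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a reference word
                          d[i][j - 1] + 1,        # insert a hypothesis word
                          d[i - 1][j - 1] + sub)  # substitute (or match)
    return d[-1][-1] / len(ref)

# The slide's example: 3 edits against a 6-word reference
print(wer("This looks like the correct sentence",
          "This seems the right sentence"))  # 0.5
```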
WER vs BLEU

Handling incorrect words:
  WER: substitution
  BLEU: n-gram mismatch
Handling incorrect word order:
  WER: delete + insert
  BLEU: n-gram mismatch
Handling recall:
  WER: missed words become deleted words
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
WER vs TER

Handling shifts:
  WER: a shift is not taken into account (it costs a deletion plus an insertion)
  TER: a shift is incorporated as a single edit, though it is not weighted by distance
Intuition:
  WER: based on errors in unigram translation
  TER: based on the edits required by humans

TER = #Edits / Average number of reference words
ROUGE [6]

• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S
ROUGE-N

ROUGE-N incorporates recall: will BLEU be able to understand the quality of long sentences?

Reference translation: क्या ब्लू लंबे वाक्य की गुणवत्ता को समझ पाएगा
(Kya bloo lambe waakya ki guNvatta ko samajh paaega: "Will BLEU be able to understand the quality of long sentences?")
Candidate translation: लंबे वाक्य (Lambe vaakya: "long sentences")

ROUGE-N (bigrams): 1/8, since the candidate recovers only one of the reference's eight bigrams
Modified n-gram precision: 1
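To make the contrast concrete, here is a small illustrative Python sketch of ROUGE-N as n-gram recall (our own helper names, not the official ROUGE package); on the transliterated example it yields the 1/8 above:

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference: str, candidate: str, n: int = 2) -> float:
    """ROUGE-N: n-gram recall of the candidate against the reference."""
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    overlap = sum((ref & cand).values())  # clipped n-gram overlap
    total = sum(ref.values())
    return overlap / total if total else 0.0

# The slide's example, on the transliterated words:
ref = "kya bloo lambe waakya ki gunvatta ko samajh paaega"
print(rouge_n(ref, "lambe waakya", n=2))  # 1/8 = 0.125
```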
Other ROUGEs

• ROUGE-L: considers the longest common subsequence
• ROUGE-W: weighted ROUGE-L; all common subsequences are considered, with weights based on length
• ROUGE-S: precision/recall by matching skip-bigrams
ROUGE vs BLEU

Handling incorrect words:
  ROUGE (suite of metrics): skip-bigrams, ROUGE-N
  BLEU: n-gram mismatch
Handling incorrect word order:
  ROUGE: longest common subsequence
  BLEU: n-gram mismatch
Handling recall:
  ROUGE: ROUGE-N incorporates missing words
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
METEOR [7]

Aims to do better than BLEU. Central idea: have a good unigram matching strategy.

METEOR: Criticisms of BLEU

• The brevity penalty is punitive
• Higher-order n-grams may not indicate the grammatical correctness of a sentence
• BLEU is often zero; should a score ever be zero?
METEOR: Process

Find an alignment between words:
  Phase I: list all possible mappings based on matches
  Phase II: select the best possible alignment, i.e., the one with the least number of 'alignment crosses/overlaps'
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

References: The intelligent and excited boy jumped / The excited and intelligent guy pounced
Candidate: The intelligent and excited dude jumped
METEOR: The score

• Using the unigram mappings, precision and recall are calculated and combined into a harmonic mean
• Penalty: group the matched unigrams into as few chunks as possible (a chunk is a run of matches that is contiguous in both sentences)
  Example: "The intelligent guy sat on the dark bench" against the reference "The bright boy sits on the black bench"
• More accurate → fewer chunks → less penalty; less accurate → more chunks → more penalty
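A minimal sketch of the scoring step, assuming the alignment phase has already produced the match and chunk counts, and using the original parameters of Banerjee and Lavie (2005) [7]:

```python
def meteor(matches: int, hyp_len: int, ref_len: int, chunks: int) -> float:
    """METEOR sentence score from unigram-mapping statistics
    (original parameters from Banerjee & Lavie, 2005)."""
    if matches == 0:
        return 0.0
    p = matches / hyp_len
    r = matches / ref_len
    fmean = 10 * p * r / (r + 9 * p)          # recall-weighted harmonic mean
    penalty = 0.5 * (chunks / matches) ** 3   # fragmentation penalty
    return fmean * (1 - penalty)

# Slide example with exact matching only:
# hyp "The intelligent guy sat on the dark bench" vs
# ref "The bright boy sits on the black bench"
# matches: the, on, the, bench -> 4 matches falling into 3 chunks
print(meteor(matches=4, hyp_len=8, ref_len=8, chunks=3))  # ~0.395
```

With stemming and synonym matching, more of the eight words would align, the matches would collapse into fewer chunks, and the score would rise, which is exactly the adaptability the slide describes.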
METEOR vs BLEU

Handling incorrect words:
  METEOR: alignment chunks; matching can be done using different techniques, so it is adaptable
  BLEU: n-gram mismatch
Handling incorrect word order:
  METEOR: chunks may be ordered in any manner; METEOR does not capture this
  BLEU: n-gram mismatch
Handling recall:
  METEOR: the idea of alignment incorporates missing-word handling
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
NIST Metric: Introduction [8]

• An MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones)
  – Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
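For reference, Doddington (2002) [8] weights each matched n-gram by its information gain, estimated from the reference corpus:

Info(w1 ... wn) = log2( count(w1 ... w(n-1)) / count(w1 ... wn) )

so a rare continuation such as 'Amendment Act' earns more credit than a frequent one such as 'of the'.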
NIST vs BLEU

Handling incorrect words:
  NIST: information gain; an additional level of 'matching'
  BLEU: n-gram mismatch
Handling incorrect word order:
  NIST: no additional provision; n-gram mismatch (based on information gain)
  BLEU: n-gram mismatch
Handling recall:
  NIST: brevity penalty factor based on L_sys and L_ref (system and reference lengths)
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
GTM [10]

• General Text Matcher
• An F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  – What does a BLEU score of 0.006 mean?
• Comparison between systems is easier
GTM Scores: Precision and Recall

• MMS: Maximum Match Size
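A sketch of how these scores are computed, under the assumption that the maximum matching between candidate and reference has already been found and is summarized by the lengths of its contiguous runs (GTM proper searches over matchings; the sqrt-of-squared-run-lengths size follows Turian et al. [10]):

```python
import math

def gtm(run_lengths, cand_len: int, ref_len: int):
    """GTM precision, recall, and F-score from a given word matching.

    Assumes the maximum matching has already been found; `run_lengths`
    lists the lengths of its contiguous runs of matched words.
    """
    mms = math.sqrt(sum(r * r for r in run_lengths))  # longer runs weigh more
    precision = mms / cand_len
    recall = mms / ref_len
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# e.g. two matched runs of lengths 3 and 1, an 8-word candidate, a 9-word reference:
print(gtm([3, 1], cand_len=8, ref_len=9))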
GTM vs BLEU

Handling incorrect words:
  GTM: precision based on Maximum Match Size
  BLEU: n-gram mismatch
Handling incorrect word order:
  GTM: by considering maximum runs
  BLEU: n-gram mismatch
Handling recall:
  GTM: recall based on Maximum Match Size
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
Entailment-based evaluation [12]

Why entailment?

E: I am feeling hungry
H (Ref): मुझे भूख लग रही है (mujhe bhookh lag rahi hai)
Candidate: मैं भूखा महसूस कर रहा हूँ (main bhookha mehsoos kar raha hoon, "I am feeling hungry")

BLEU score: 0. Yet the candidate is clearly entailed by the reference translation.
Entailment-based evaluation [12]

Evaluation of MT output for adequacy is an entailment task: a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.

The Stanford RTE system: a three-stage pipeline

1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether the entailment holds or not
RTE for MT evaluation

The same pipeline (analysis, alignment, entailment prediction) is run over the candidate-reference pair.

Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
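A sketch of the bidirectional check; `entails` is a hypothetical stand-in for any RTE classifier, not the actual Stanford RTE API:

```python
def adequacy_by_entailment(candidate: str, reference: str, entails) -> bool:
    """Fully adequate only if entailment holds in both directions."""
    return entails(candidate, reference) and entails(reference, candidate)

# Toy stand-in classifier (word overlap), only to make the sketch runnable;
# a real system would plug an RTE model in here.
def toy_entails(premise: str, hypothesis: str) -> bool:
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h) > 0.5

print(adequacy_by_entailment("I am feeling hungry", "I am very hungry",
                             toy_entails))  # True
```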
Experimental Setup

Single reference translation; two MT dataset families with English as the target language.

NIST MT datasets:
  NIST06: the NIST MT-06 Arabic-to-English dataset
  NIST08A: the NIST MT-08 Arabic-to-English dataset
  NIST08C: the NIST MT-08 Chinese-to-English dataset
  NIST08U: the NIST MT-08 Urdu-to-English dataset

ACL SMT datasets:
  SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
  SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
  SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
  SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup

Four systems are used:
  BLEU-4
  MT
  RTE
  RTE + MT

Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)

• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by
  – considering the constituents of the formula
  – calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)

• Permutation of phrases is possible [3]
• Different items may be drawn from the reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)

GTM: modifies precision and recall based on maximum match size
NIST: information gain-based; comparatively weights content and function words
METEOR: considers alignments; allows the addition of different types of matching
ROUGE: a suite of metrics incorporating n-gram recall, skip-bigrams, etc.
WER: error in words, based on word order; also PER
TER: based on the edits required to get the reference translation; also Human TER (HTER)
Manual evaluation: basic techniques and bottlenecks
Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References

[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA), 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
TER vs BLEUTER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Shift or delete + insertincorporates this error
N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Word Error Rate (WER) [9]
bull Based on the Levenshtein distance (Levenshtein 1966)
bull Minimum substitutions deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref
bull Also position-independent word error rate (PER)
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
WER Example
bull Order of words is importantbull Dynamic programming-based implementation
to find lsquominimumrsquo errors
Reference translationThis looks like the correct sentence
Candidate translation This seems the right sentence
errors 3
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
WER vs BLEUWER BLEU
Handling incorrect words Substitution N-gram mismatch
Handling incorrect word order
Delete + Insert N-gram mismatch
Handling recall Missed words becomedeleted words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
WER vs TER
WER TER
Shift is incorporatedHowever not weighted on distance
Shift is not taken into account
Intuition is based on errors in unigram translation
Intuition is based on editsrequired for humans
TER = Edits Avg number of ref words
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
ROUGE [6]
bull Recall-Oriented Understudy for Gisting Evaluation
bull ROUGE is a package of metrics ROUGE-N ROUGE-L ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates Recall
Will BLEU be able to understand quality of long sentences
Reference translationया ल लब वा य क गणव ता को समझ पाएगा
Kya bloo lambe waakya ki guNvatta ko samajh paaega
Candidate translation लब वा यLambe vaakya ROUGE-N 1 8
Modified n-gram Precision 1
Other ROUGEs
bull ROUGE-Lndash Considers longest common subsequence
bull ROUGE-Wndash Weighted ROUGE-L All common subsequences
are considered with weight based on length
bull ROUGE-Sndash PrecisionRecall by matching skip bigrams
ROUGE vs BLEUROUGE
(suite of metrics)BLEU
Handling incorrect words Skip bigrams ROUGE-N N-gram mismatch
Handling incorrect word order
Longest common sub-sequence
N-gram mismatch
Handling recall ROUGE-N incorporates missing words
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
• Analysis: POS tagging, dependency parsing, NER, etc.
• Alignment: alignment of words and phrases.
• Entailment: predicting whether entailment holds or not.

RTE for MT evaluation
The same pipeline (analysis, alignment, entailment decision) is reused, with one change.
Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate.
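How the directionality requirement might be wired up, as a sketch: `entail_prob` below is a stand-in for a real RTE scorer such as the Stanford system in [12] (here a toy lexical-coverage proxy), and the adequacy judgment takes the weaker of the two directions.

```python
def entail_prob(premise: str, hypothesis: str) -> float:
    """Stand-in for an RTE scorer (analysis -> alignment -> entailment
    decision, as in [12]). Toy proxy: fraction of hypothesis words
    covered by the premise."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(h & p) / len(h) if h else 0.0

def adequacy(reference: str, candidate: str) -> float:
    """Entailment must hold in BOTH directions for a fully adequate
    translation, so score by the weaker direction."""
    return min(entail_prob(reference, candidate),
               entail_prob(candidate, reference))

print(adequacy("the boy jumped", "the boy jumped high"))   # 0.75
```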
Experimental Setup
• Single reference translation; two MT dataset collections with English as the target language.
• NIST MT datasets:
  – NIST06: the NIST MT-06 Arabic-to-English dataset
  – NIST08A: the NIST MT-08 Arabic-to-English dataset
  – NIST08C: the NIST MT-08 Chinese-to-English dataset
  – NIST08U: the NIST MT-08 Urdu-to-English dataset
• ACL SMT datasets:
  – SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
  – SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
  – SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
  – SMT07C: the ACL 2007 SMT workshop Czech-English dataset
• Four systems are used: BLEU-4, MT, RTE, and RTE + MT.
• Experiment 1: individual corpora. Experiment 2: combined corpora.
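The 'RTE + MT' system suggests a feature combination; one plausible sketch (not the paper's actual model) regresses human adequacy judgments on both an MT metric score and an entailment probability. All feature values below are made up for illustration.

```python
from sklearn.linear_model import LinearRegression

# Per-segment features: [BLEU-4 score, bidirectional entailment prob]
X = [[0.31, 0.85],
     [0.08, 0.90],
     [0.45, 0.40],
     [0.12, 0.15]]
y = [4.2, 4.0, 3.1, 1.8]   # human adequacy ratings (1-5), hypothetical

model = LinearRegression().fit(X, y)
print(model.predict([[0.20, 0.70]]))   # predicted adequacy for a new segment
```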
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods.
• Formulated the BLEU score.
• Analyzed the BLEU score by considering the constituents of the formula and by calculating BLEU for a dummy example.
• Compared BLEU with human judgment.
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3].
• Different items may be drawn from reference sets [3].
• Linguistic divergences in English-Hindi MT [4].
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4].
Conclusion (3/3)
• GTM: modifies precision and recall based on the maximum match size.
• NIST: information gain-based; weights content and function words differently.
• METEOR: considers alignments; allows the addition of different types of matching.
• ROUGE: a suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words, based on word order; also PER.
• TER: based on the edits required to reach the reference translation; also Human TER.
• Manual evaluation: basic techniques and bottlenecks.
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa.
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1-27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386-393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
METEOR [7]
Aims to do better than BLEU
Central idea Have a good unigram matching strategy
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
METEOR Criticisms of BLEU
bull Brevity penalty is punitivebull Higher order n-grams may not indicate
grammatical correctness of a sentencebull BLEU is often zero Should a score be zero
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Find alignment between words
Phase I
List all possible mappings based on matches
Select best possible alignment
Phase II
Select alignment with least number of lsquoalignment
crossesoverlapsrsquo
Repeat with different matching strategies
Incorporate stemmersConsider synonyms etc
ReferenceThe intelligent and excited boy jumped
The excited and intelligent guy pounced
CandidateThe intelligent and excited dude jumped
METEOR Process
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
METEOR The score
bull Using unigram mappings precision and recall are calculated Then
harmonic mean Penalty Find lsquoas many chunksrsquo that matchThe bright boy sits on the black bench
The intelligent guy sat on the dark bench
More accurate -gt Less chunks Less penaltyLess accurate -gt More chunks more penalty
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
METEOR vs BLEUMETEOR BLEU
Handling incorrect words Alignment chunks Matching can be done using different techniques Adaptable
N-gram mismatch
Handling incorrect word order
Chunks may be ordered in any manner METEOR does not capture this
N-gram mismatch
Handling recall Idea of alignment incorporates missing word handling
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
NIST Metric Introduction [8]
bull MT Evaluation metric proposed by National Institute of Standards and Technology
bull Are all n-gram matches the samebull Weights more heavily those n-grams that are
more informative (ie rarer ones)bull Matching lsquoAmendment Actrsquo is better than lsquoof thersquo
bull When a correct n-gram is found the rarer that n-gram is the more weight it is given
NIST Metric
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
NIST Metric
NIST vs BLEU

Handling incorrect words
  NIST: information gain provides an additional level of 'matching'
  BLEU: n-gram mismatch

Handling incorrect word order
  NIST: no additional provision; n-gram mismatch (based on information gain)
  BLEU: n-gram mismatch

Handling recall
  NIST: brevity penalty factor based on Lsys and Lref
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
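For concreteness, the information-gain weighting and brevity factor the table refers to are defined in [8] roughly as follows (counts are taken over the reference data; N = 5 in the standard setting, and beta is chosen so the brevity factor equals 0.5 when Lsys/Lref = 2/3):

\mathrm{Info}(w_1 \ldots w_n) = \log_2 \frac{\#(w_1 \ldots w_{n-1})}{\#(w_1 \ldots w_n)}

\mathrm{NIST} = \sum_{n=1}^{N} \frac{\sum_{\text{co-occurring } n\text{-grams}} \mathrm{Info}(w_1 \ldots w_n)}{\sum_{n\text{-grams in sys output}} 1} \cdot \exp\!\left\{ \beta \, \log^2\!\left[ \min\!\left( \frac{L_{\mathrm{sys}}}{L_{\mathrm{ref}}},\, 1 \right) \right] \right\}

Rare n-grams carry high information gain, so content words count for more than frequent function-word n-grams.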
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
GTM [10]
• General Text Matcher
• F-score: uses precision and recall
• Does not rely on 'human judgment' correlation
  – What does a BLEU score of 0.006 mean?
• Comparison is easier
GTM Scores: Precision and Recall
• MMS: Maximum Match Size (see the formulas below)
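The slide's formulas do not survive extraction; in the notation of [10] they take roughly this shape, with MMS(C, R) the maximum match size between a candidate C and a reference R (a sketch, not the paper's exact presentation):

P(C, R) = \frac{\mathrm{MMS}(C, R)}{|C|}, \qquad R(C, R) = \frac{\mathrm{MMS}(C, R)}{|R|}, \qquad F = \frac{2PR}{P + R}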
GTM vs BLEU

Handling incorrect words
  GTM: precision based on maximum match size
  BLEU: n-gram mismatch

Handling incorrect word order
  GTM: by considering maximum runs
  BLEU: n-gram mismatch

Handling recall
  GTM: recall based on maximum match size
  BLEU: precision cannot detect 'missing' words; hence the brevity penalty
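To make 'maximum match size' and 'maximum runs' concrete, here is a minimal greedy sketch in Python. It is only an approximation: the real GTM [10] computes a maximum matching over non-conflicting runs and can reward longer runs with an exponent e > 1, neither of which is done here.

def longest_common_run(cand, ref):
    """Return (length, cand_start, ref_start) of the longest common run."""
    best = (0, 0, 0)
    for i in range(len(cand)):
        for j in range(len(ref)):
            k = 0
            while (i + k < len(cand) and j + k < len(ref)
                   and cand[i + k] == ref[j + k]):
                k += 1
            if k > best[0]:
                best = (k, i, j)
    return best

def match_size(cand, ref):
    """Greedily take out the longest common run until none remains."""
    cand, ref = list(cand), list(ref)
    total = 0
    while True:
        k, i, j = longest_common_run(cand, ref)
        if k == 0:
            return total
        total += k
        del cand[i:i + k]
        del ref[j:j + k]

def gtm_f1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    mms = match_size(cand, ref)
    p, r = mms / len(cand), mms / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

print(gtm_f1("the boy with green eyes smiled",
             "the green-eyed boy smiled"))  # 0.6

Because matched runs are removed as wholes, scrambled word order fragments the matches and lowers the score, which is how GTM penalizes incorrect order without BLEU's hard n-gram cutoff.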
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM, entailment-based MT evaluation
Entailment-based evaluation [12]

Why entailment?
E: I am feeling hungry
H (Ref): मुझे भूख लग रही है (mujhe bhookh lag rahi hai)
Candidate: मैं भूखा महसूस कर रहा हूँ (main bhookha mehsoos kar raha hoon)
BLEU score: 0. Yet the candidate is clearly entailed by the reference translation.
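BLEU is 0 here because the two sentences share no surface n-grams at all, so already the modified unigram precision is 0 and the geometric mean collapses. A quick check with whitespace tokenization:

ref = "मुझे भूख लग रही है".split()
cand = "मैं भूखा महसूस कर रहा हूँ".split()

# Modified (clipped) unigram precision of the candidate against the reference.
overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
print(overlap / len(cand))  # 0.0 -> BLEU, a geometric mean of n-gram precisions, is 0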
Entailment-based evaluation [12]
Evaluation of MT output for adequacy is an entailment task: a candidate translation (i.e., the MT system output) should entail the reference translation, and vice versa.
The Stanford RTE system
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
RTE for MT evaluation
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not

Directionality: the standard entailment recognition task is asymmetric. For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate (a sketch of this check follows).
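A minimal sketch of the bidirectional check, assuming a hypothetical entails(premise, hypothesis) scorer in [0, 1] (in [12] this role is played by the Stanford RTE system; taking the minimum of the two directions is a design choice here, not taken from the paper):

def entails(premise: str, hypothesis: str) -> float:
    """Hypothetical entailment scorer in [0, 1]; stands in for an RTE system."""
    raise NotImplementedError

def adequacy(candidate: str, reference: str) -> float:
    # Standard RTE is asymmetric; full adequacy needs entailment BOTH ways,
    # so take the weaker of the two directional scores.
    return min(entails(candidate, reference), entails(reference, candidate))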
Experimental Setup
• Single reference translation
• Two MT datasets with English as the target language

NIST MT dataset:
• NIST06: the NIST MT-06 Arabic-to-English dataset
• NIST08A: the NIST MT-08 Arabic-to-English dataset
• NIST08C: the NIST MT-08 Chinese-to-English dataset
• NIST08U: the NIST MT-08 Urdu-to-English dataset

ACL SMT dataset:
• SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
Four systems are compared: BLEU-4, MT, RTE, and RTE + MT.
• Experiment 1: individual corpora
• Experiment 2: combined corpora
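The RTE + MT combination suggests a learned metric: metric scores and entailment features feed a model fitted to human adequacy judgments. The sketch below is an assumption about that setup, not the exact method of [12]; the feature values and the scikit-learn regressor are purely illustrative.

from sklearn.linear_model import LinearRegression

# Hypothetical per-segment features: MT-metric scores alone vs. MT + RTE features.
X_mt = [[0.21], [0.35], [0.10]]                      # e.g. BLEU-4 per segment
X_rte_mt = [[0.21, 0.8], [0.35, 0.4], [0.10, 0.9]]   # BLEU-4 + an entailment score
y = [3.0, 4.0, 4.5]                                  # human adequacy judgments (1-5)

model = LinearRegression().fit(X_rte_mt, y)          # the "RTE + MT" system
print(model.predict([[0.30, 0.7]]))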
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by
  – considering the constituents of the formula
  – calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows addition of different types of matching
• ROUGE: suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words based on word order; also PER
• TER: based on edits required to get the reference translation; also human TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation, and vice versa
References

[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. EACL 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON 2007, Hyderabad, India, January 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA), 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen, and I. Dan Melamed. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, and Chris Manning. Evaluating MT output with entailment technology. 2008.
NIST vs BLEUNIST BLEU
Handling incorrect words Information gain additional level of lsquomatchingrsquo
N-gram mismatch
Handling incorrect word order
No additional provisionN-gram mismatch (based on info Gain)
N-gram mismatch
Handling recall Brevity penalty factor based on Lsys and Lref
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
GTM [10]
bull General Text Matcherbull F-score uses precision and recallbull Does not rely on lsquohuman judgmentrsquo
correlationndash What does BLEU score of 0006 mean
bull Comparison is easier
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
GTM Scores Precision and Recall
bull MMS Maximum Match Size
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
GTM vs BLEUGTM BLEU
Handling incorrect words Precision based on maximum Match Size
N-gram mismatch
Handling incorrect word order
By considering maximum runs
N-gram mismatch
Handling recall Recall based on maximum match size
Precision cannot detect lsquomissingrsquo words Hence brevity penalty
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
OutlineOutline
Manual Evaluation
BLEUTERWERROUGEMETEORNISTGTMEntailment-based MT evaluation
Automatic Evaluation
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Entailment-based evaluation [12]
Why entailment
E I am feeling hungry H(Ref) मझ भख लग रह ह
mujhe bhookh lag rahi haiCandidate म भखा महसस कर रहा ह
main bhookha mehsoos kar raha hoon BLEU Score 0Clearly candidate is entailed by reference
translation
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Entailment-based evaluation [12]
Evaluation of MT output for adequacy - an entailment task
A candidate translation (ie MT system output) should entail the reference translation and vice versa
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
The Stanford RTE system
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
RTE for MT evaluation
AnalysisPOS
TaggingDependencyNER etc
Alignment
Alignment of words and phrases
Entailment
Predicting whether entailment or not
Directionality The standard entailment
recognition task is asymmetric For MT evaluation
entailment must hold in both directions for the system translation to be fully adequate
Experimental Setup Single reference translation Two MT datasets with English as target language
NIST MT dataset NIST06 The NIST MT-06 Arabic-to-English dataset NIST08A The NIST MT-08 Arabic-to-English dataset NIST08C The NIST MT-08 Chinese-to-English dataset NIST08U The NIST MT-08 Urdu-to-English dataset
ACL SMT dataset SMT06E The NAACL 2006 SMT workshop EUROPARL dataset SMT06C The NAACL 2006 SMT workshop Czech-English dataset SMT07E The ACL 2007 SMT workshop EUROPARL dataset SMT07C The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
4 systems are used BLEU-4 MT
RTE
RTE + MT
Experiment 1
Individual corpora
Experiment 2
Combined corpora
Conclusion Formulation of BLEU (13)
bull Introduced different evaluation methodsbull Formulated BLEU scorebull Analyzed the BLEU score by
ndash Considering constituents of formulandash Calculating BLEU for a dummy example
bull Compared BLEU with human judgment
Conclusion Shortcomings of BLEU (23)
bull Permutation of phrases possible [3]
bull Different items may be drawn from reference sets [3]
bull Linguistic divergences in English-Hindi MT [4]
bull BLEUrsquos aim of correlating with human judgment does not go well with goals of indicative translation [4]
Conclusion (33)
GTM
Modifies precision and recall based on maximum match
size
NIST
Information gain-based Comparatively weights
content and function words
METEOR
Considers alignments Allows addition of different
types of matching
ROUGE
Suite of metrics incorporating n-gram
recall skip bigrams etc
WER
Error in words based on word order
Also PER
TER
Based on edits required to get reference translation
Also Human TER
Manual Evaluation
Basic techniques and bottlenecks
Entailment-based evaluation
Reference translation must entail candidate translation
and vice versa
References[1] Doug Arnold Louisa Sadler and R Lee Humphreys Evaluation an
assessment Machine Translation Volume 8 pages 1ndash27 1993
[2] K Papineni S Roukos T Ward and W Zhu Bleu a method for automatic evaluation of machine translation IBM research report rc22176 (w0109-022) Technical report IBM Research Division Thomas J Watson Research Center 2001
[3] Chris Callison-Burch Miles Osborne Phillipp Koehn Re-evaluating the role of Bleu in Machine Translation Research European ACL (EACL) 2006 2006
[4] R Ananthakrishnan Pushpak Bhattacharyya M Sasikumar and Ritesh M Shah Some Issues in Automatic Evaluation of English-Hindi MT More Blues for BLEU ICON 2007 Hyderabad India Jan 2007
References[5] Matthew Snover and Bonnie Dorr and Richard Schwartz and Linnea Micciulla and John Makhoul A
study of translation edit rate with targeted human annotationIn Proceedings of Association for Machine Translation in the Americas2006
[6] Chin-yew Lin Rouge a package for automatic evaluation of summaries 2004 [7] Satanjeev Banerjee and Alon Lavie METEOR An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT andor Summarization 2005
[8] Doddington George Automatic evaluation of machine translation quality using n-gram co-occurrence statistics Proceedings of the second international conference on Human Language Technology Research HLT 2002
[9] Maja and Ney Hermann Word error rates decomposition over Pos classes and applications for error analysis Proceedings of the Second Workshop on Statistical Machine Translation StatMT2007
[10] Joseph Turian and Luke Shen and I Dan Melamed Evaluation of Machine Translation and its Evaluation In Proceedings of MT Summit IX pages 386-393 2003
[11] Chris Callison-Burch Cameron Fordyce Philipp Koehn Christof Monz and Josh Schroeder (Meta-) Evaluation of Machine Translation ACL Workshop on Statistical Machine Translation 2007
[12] Michel Galley Dan Jurafsky Chris Manning Evaluating MT output with entailment technology 2008
Experimental Setup
• Single reference translation
• Two MT datasets with English as the target language

NIST MT datasets:
- NIST06: the NIST MT-06 Arabic-to-English dataset
- NIST08A: the NIST MT-08 Arabic-to-English dataset
- NIST08C: the NIST MT-08 Chinese-to-English dataset
- NIST08U: the NIST MT-08 Urdu-to-English dataset

ACL SMT datasets:
- SMT06E: the NAACL 2006 SMT workshop EUROPARL dataset
- SMT06C: the NAACL 2006 SMT workshop Czech-English dataset
- SMT07E: the ACL 2007 SMT workshop EUROPARL dataset
- SMT07C: the ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
• Four systems are compared:
- BLEU-4
- MT
- RTE
- RTE + MT
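To make the BLEU-4 baseline concrete, here is a minimal sketch of sentence-level BLEU-4 under the single-reference condition above. NLTK and the toy sentence pair are illustrative assumptions; the original experiments do not specify an implementation.

```python
# Minimal sentence-level BLEU-4 sketch (single reference, as in the setup above).
# NLTK and the toy sentences are illustrative assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the boy with green eyes smiled".split()  # single reference translation
candidate = "the green eyed boy smiled".split()       # candidate MT output

score = sentence_bleu(
    [reference],
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),  # BLEU-4: uniform weights over 1- to 4-gram precisions
    smoothing_function=SmoothingFunction().method1,  # avoids a zero score when a higher-order n-gram has no match
)
print(f"BLEU-4 = {score:.3f}")
```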
Experiment 1: Individual corpora
Experiment 2: Combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated the BLEU score
• Analyzed the BLEU score by:
- considering the constituents of the formula
- calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases is possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU's aim of correlating with human judgment does not go well with the goals of indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; comparatively weights content and function words
• METEOR: considers alignments; allows addition of different types of matching
• ROUGE: suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: error in words based on word order; also PER (see the sketch after this list)
• TER: based on the edits required to obtain the reference translation; also Human TER (HTER)
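As a rough illustration of the WER and PER entries above, the sketch below computes WER as word-level edit distance normalized by reference length, and PER as its position-independent (bag-of-words) counterpart. This is a simplified reading of the two metrics, not a reference implementation; the toy sentences are invented.

```python
# Simplified WER/PER sketch: WER is word-level Levenshtein distance over
# reference length; PER drops word order and compares bags of words.
from collections import Counter

def wer(reference, candidate):
    """Word Error Rate: edit_distance(ref, cand) / len(ref)."""
    m, n = len(reference), len(candidate)
    # dp[i][j] = edits needed to turn reference[:i] into candidate[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(n + 1):
        dp[0][j] = j  # insert all remaining candidate words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if reference[i - 1] == candidate[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[m][n] / m

def per(reference, candidate):
    """Position-independent Error Rate (simplified): 1 - matches / len(ref)."""
    matches = sum((Counter(reference) & Counter(candidate)).values())
    return 1 - matches / len(reference)

ref = "the boy with green eyes smiled".split()
cand = "the green eyed boy smiled".split()
print(f"WER = {wer(ref, cand):.3f}, PER = {per(ref, cand):.3f}")
```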
Manual Evaluation
• Basic techniques and bottlenecks

Entailment-based evaluation
• The reference translation must entail the candidate translation, and vice versa (see the sketch below)
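Schematically, the bidirectional check of [12] can be written as follows. The entails() predicate is a hypothetical stand-in for a trained RTE (recognizing textual entailment) system, which is the hard part and is not shown here.

```python
# Schematic bidirectional entailment check, after [12].
# entails() is a hypothetical placeholder for a real RTE system.
def entails(text: str, hypothesis: str) -> bool:
    raise NotImplementedError("plug in a textual-entailment (RTE) classifier here")

def mutually_entail(reference: str, candidate: str) -> bool:
    # The candidate is judged adequate only if reference and candidate
    # entail each other, i.e., the candidate neither loses nor adds content.
    return entails(reference, candidate) and entails(candidate, reference)
```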
References
[1] Doug Arnold, Louisa Sadler and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM research report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne and Philipp Koehn. Re-evaluating the role of Bleu in Machine Translation Research. European ACL (EACL) 2006, 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. 2004.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research (HLT), 2002.
[9] Maja Popović and Hermann Ney. Word error rates: decomposition over POS classes and applications for error analysis. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. (Meta-) Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky and Chris Manning. Evaluating MT output with entailment technology. 2008.