IQMT
Outline Introduction Approach Experimental Work Results for IWSLT 2005 Conclusions and Further Work
Machine Translation Evaluation Inside QARLA
International Workshop on Spoken Language Translation (IWSLT’05)
Jesus Gimenez (UPC)
Enrique Amigo (UNED)
Chiori Hori (CMU)
October 24-25, 2005
1 Introduction
    The ’2008 MT Evaluation Challenge’
    What can we do?
2 Approach
    The QARLA Framework
    QARLA for MT evaluation
3 Experimental Work
    MT Evaluation Metrics outside QARLA
    MT Evaluation Metrics inside QARLA
    Metric Combinations
4 Results for IWSLT 2005
5 Conclusions and Further Work
1 Introduction
The ’2008 MT Evaluation Challenge’
“By 2008 there will probably exist MT systems scoring higher than humans, according to current MT evaluation metrics”.
-- Franz Och (ACL’05)
Why?
Most metrics consider only lexical similarities (BLEU, NIST, WER, PER, GTM)
Little effort has been devoted to combining different metrics
Some Exceptions
Use of additional knowledge
ROUGE (stemming)
METEOR (WordNet lookup)
WNM (frequency weighting)
Use of syntactic knowledge
Liu and Gildea (2005)
Combining different metrics
Kulesza and Shieber (2004)
What can we do?
Trust a new SuperMetric XXX?
Build a new metric XXX which considers information at different levels:
lexical
syntactic
semantic
Or... divide and conquer
Why don’t we build a set of specialized metrics which work at different levels and combine their scores into a single measure of MT quality?
2 Approach
The QARLA Framework
3 measures
QUEEN determines the quality of a set of systems
KING determines the quality of a set of metrics
JACK determines the quality of a test set
(Amigo et al., ACL 2005)
The QUEEN assumption
“A good candidate must be similar to all models according to all metrics.”
...
Given a set of metrics X, a set of models M, and a candidate a:
$$\mathrm{QUEEN}_X(a, M) \equiv \mathrm{Prob}\left(\forall x \in X : x(a, m) \geq x(m', m'')\right)$$
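The definition above can be read as a counting estimate. The following is only a minimal sketch under stated assumptions: the probability is estimated over all triples (m, m', m'') of models with m' different from m'', and `jaccard` is a toy stand-in for a real MT metric. Neither the function names nor the toy data come from the slides.

```python
from itertools import product


def jaccard(a, b):
    """Toy word-overlap similarity (illustrative stand-in for a real metric)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def queen(candidate, models, metrics):
    """Fraction of triples (m, m', m'') for which every metric rates the
    candidate at least as close to m as m' is to m''.

    Assumption: triples range over the model set with m' != m''.
    """
    n = len(models)
    triples = [(i, j, k) for i, j, k in product(range(n), repeat=3) if j != k]
    if not triples:
        return 0.0
    hits = sum(
        all(x(candidate, models[i]) >= x(models[j], models[k]) for x in metrics)
        for i, j, k in triples
    )
    return hits / len(triples)


# Hypothetical reference translations (models); any two of them share 2 of 4 words.
models = ["a b c", "a b d", "a c d"]
print(queen("a b c", models, [jaccard]))  # candidate identical to one model
print(queen("x y z", models, [jaccard]))  # candidate sharing no word with any model
```

With real metrics, each entry of `metrics` would wrap a pairwise similarity such as a BLEU or ROUGE variant computed between two sentences.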
The QUEEN properties
(i) It is able to combine different similarity metrics into a single evaluation measure.
(ii) It is not affected by the scale properties of individual metrics, i.e. it does not require metric normalisation and it is not affected by metric weighting.
(iii) Candidates which are very far from the set of models all receive QUEEN=0.
(iv) The value of QUEEN is maximised for candidates that “merge” with the references according to all metrics in X.
(v) The universal quantifier on the metric parameter x implies that adding redundant metrics does not bias the result of QUEEN.
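Properties (ii), (iii) and (v) can be checked on a toy case. This is a hedged sketch, not the authors' code: `queen` is a compact counting estimate that assumes triples with m' different from m'', and the numeric "texts" and the `sim` metric are invented for illustration.

```python
from itertools import product


def queen(candidate, models, metrics):
    # Counting estimate of QUEEN; assumption: triples (m, m', m'') with m' != m''.
    idx = range(len(models))
    triples = [(i, j, k) for i, j, k in product(idx, repeat=3) if j != k]
    ok = sum(
        all(x(candidate, models[i]) >= x(models[j], models[k]) for x in metrics)
        for i, j, k in triples
    )
    return ok / len(triples)


def sim(a, b):
    """Toy similarity between two numbers (closer means more similar)."""
    return -abs(a - b)


def sim_rescaled(a, b):
    """The same metric on a different scale (positive factor plus offset)."""
    return 1000.0 * sim(a, b) + 7.0


models = [1.0, 2.0, 4.0]  # hypothetical "references"
base = queen(2.5, models, [sim])

assert base == queen(2.5, models, [sim_rescaled])  # (ii) scale does not matter
assert base == queen(2.5, models, [sim, sim])      # (v) a redundant metric adds no bias
assert queen(100.0, models, [sim]) == 0.0          # (iii) far-away candidate scores 0
```

The invariance holds because QUEEN only compares values of the same metric with each other, so any order-preserving rescaling leaves every comparison unchanged.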
QARLA for MT evaluation
Goals
To study the applicability of the QARLA Framework for MT:
Study of the QUEEN measure:
1 Test individual metrics inside QARLA
2 Try combinations of metrics
3 Experimental Work
Experimental Setting
IWSLT 2004 (Akiba et al., 2004)
BTEC Corpus
Chinese-to-English track
500 short sentences
Outputs by 20 systems
16 human references
Human assessments for adequacy and fluency
Metric Set
BLEU: accumulated BLEU scores for several n-gram levels (n = 1, 2, 3, 4).
NIST: accumulated NIST scores for several n-gram levels (n = 1, 2, 3, 4, 5).
GTM: for several values of the e parameter (e = 1, 2, 3).
mWER: (default).
mPER: (default).
METEOR: we used 4 variants running all different modules (exact, Porter stemming, WordNet stemming and WordNet synonymy).
ROUGE: for several n-grams (n = 1, 2, 3, 4), and 4 other variants at the 4-gram level, always with stemming (ROUGE-L, ROUGE-S*, ROUGE-SU*, ROUGE-W).
MT Evaluation Metrics outside QARLA
Metric          Adequacy   Fluency
BLEU.n3           0.8449    0.8326
BLEU.n4           0.7407    0.8600
GTM.e3            0.7022    0.6906
METEOR.exact      0.8899    0.7463
NIST.n5           0.6820    0.5950
ROUGE.n2          0.9287    0.8435
ROUGE.n3          0.9190    0.8646
ROUGE-S*          0.9376    0.8164
mWER             -0.6427   -0.7214
mPER             -0.5779   -0.6010
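The slides do not say which coefficient this table reports. Assuming it is a system-level Pearson correlation between a metric's scores and the human assessments, a minimal sketch looks as follows. All numbers below are invented toy values, not data from the evaluation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical scores for four MT systems (one value per system).
human_adequacy = [2.0, 2.5, 3.5, 4.0]
similarity_metric = [0.10, 0.20, 0.30, 0.40]  # higher is better
error_rate_metric = [0.90, 0.80, 0.70, 0.60]  # lower is better

print(pearson(similarity_metric, human_adequacy))
print(pearson(error_rate_metric, human_adequacy))
```

Under this reading, an error rate such as mWER or mPER falls as quality rises, which is consistent with the negative values in the last two rows.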
MT Evaluation Metrics inside QARLA
Using QUEEN
Metric             Ad.       Fl.      Ad.Q     Fl.Q
BLEU.n2          0.8442    0.8002    0.8770   0.8215
GTM.e2           0.6784    0.6566    0.6687   0.6140
METEOR.exact     0.8899    0.7463    0.7836   0.6888
METEOR.wn1       0.8784    0.7147    0.7886   0.6709
NIST.n5          0.6820    0.5950    0.8440   0.7036
ROUGE.n2         0.9287    0.8435    0.9238   0.8421
ROUGE.n3         0.9190    0.8646    0.9076   0.8630
ROUGE-L          0.9153    0.7644    0.9325   0.8112
ROUGE-S*         0.9376    0.8164    0.9357   0.8119
mPER/1-PER      -0.5779   -0.6010    0.4212   0.3662
mWER/1-WER      -0.6427   -0.7214    0.4507   0.4209
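The row labels mPER/1-PER and mWER/1-WER suggest that error rates are turned into similarities before being used inside QUEEN. A hedged sketch for WER follows. The normalisation by reference length and the helper names are assumptions for illustration, not details taken from the slides.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,             # drop a hypothesis word
                cur[j - 1] + 1,          # insert a reference word
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def wer(hyp, ref):
    """Word error rate against one non-empty reference (assumed normalisation)."""
    ref_tokens = ref.split()
    return edit_distance(hyp.split(), ref_tokens) / len(ref_tokens)


def mwer(hyp, refs):
    """Multi-reference WER: the lowest WER over all references."""
    return min(wer(hyp, r) for r in refs)


def one_minus_wer(a, b):
    """Pairwise similarity x(a, b) derived from WER, usable inside QUEEN."""
    return 1.0 - wer(a, b)


print(mwer("a b c", ["a b c d", "a x c"]))
print(one_minus_wer("a b c", "a b c d"))
```

Note that this similarity can drop below zero when the hypothesis is much longer than the reference. That does no harm inside QUEEN, which only compares values of the same metric.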
Refining the QUEEN measure for MT
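“A good translation must be at least as similar to one of the references as the rest of references are to each other.”
$$\mathrm{IQ}_X(a, M) \equiv \max_{m \in M} \mathrm{iq}_{X,M}(a, m)$$
$$\mathrm{iq}_{X,M}(a, m) = \begin{cases} 1 & \text{if } \forall x \in X : \forall m', m'' \in M : x(a, m) \geq x(m', m'') \\ 0 & \text{otherwise} \end{cases}$$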
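A minimal sketch of the IQ definition, under stated assumptions: the pairs (m', m'') are taken over distinct references only, since a reference compared with itself would always score maximally, and `jaccard` is a toy stand-in for a real metric. The names and data are illustrative.

```python
from itertools import permutations


def jaccard(a, b):
    """Toy word-overlap similarity (illustrative stand-in for a real metric)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def iq(candidate, models, metrics):
    """Return 1 if some model m is, for every metric, at least as similar to
    the candidate as any two distinct models are to each other; else 0."""
    pairs = list(permutations(models, 2))  # assumption: m' != m''
    for m in models:
        if all(x(candidate, m) >= x(m1, m2) for x in metrics for m1, m2 in pairs):
            return 1
    return 0


# Hypothetical references; any two of them share 2 of 4 words.
models = ["a b c", "a b d", "a c d"]
print(iq("a b", models, [jaccard]))    # close enough to at least one reference
print(iq("x y z", models, [jaccard]))  # shares no word with any reference
```

Unlike QUEEN, which asks for similarity to all models, this variant is satisfied by a single sufficiently close reference.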
IQ: Innovated QUEEN
Using IQ
Metric            Ad.       Fl.      Ad.Q     Fl.Q     Ad.IQ    Fl.IQ
BLEU.n3         0.8449    0.8326    0.8499   0.8212   0.3923   0.4064
GTM.e1          0.5136    0.5214    0.6204   0.5452   0.8293   0.7715
GTM.e2          0.6784    0.6566    0.6687   0.6140   0.8015   0.8126
MET.porter      0.8837    0.7265    0.7800   0.6706   0.9494   0.8599
NIST.n5         0.6820    0.5950    0.8440   0.7036   0.8768   0.8650
ROUGE.n2        0.9287    0.8435    0.9238   0.8421   0.9673   0.9142
ROUGE.n3        0.9190    0.8646    0.9076   0.8630   0.9588   0.9180
ROUGE-L         0.9153    0.7644    0.9325   0.8112   0.9713   0.8979
ROUGE-S*        0.9376    0.8164    0.9357   0.8119   0.9663   0.9062
mPER/1-PER     -0.5779   -0.6010    0.4212   0.3662   0.0242   0.0421
mWER/1-WER     -0.6427   -0.7214    0.4507   0.4209   0.0880   0.0770
Metric Combinations
Clustering
Cluster id   Metrics
1            {BLEU.n1, GTM.e1}
2            {BLEU.n2, ROUGE.n2}
3            {BLEU.n3, ROUGE.n3}
4            {BLEU.n4, ROUGE.n4}
5            {1-WER, 1-PER}
6            {METEOR.exact, METEOR.porter, METEOR.wn1, METEOR.wn2}
7            {NIST.n1, NIST.n2, NIST.n3, NIST.n4, NIST.n5}
8            {GTM.e2, GTM.e3}
9            {ROUGE.n1, ROUGE-L, ROUGE-S*, ROUGE-SU*, ROUGE-W}
Adequacy
Metric Combination                          Adequacy (IQ)
ROUGE-L                                     0.9713
ROUGE.n2 ROUGE-L                            0.9701
ROUGE.n3 ROUGE-L                            0.9681
ROUGE.n2 ROUGE.n3 ROUGE-L                   0.9661
ROUGE.n3 ROUGE.n4 ROUGE-L                   0.9621
ROUGE.n2 ROUGE.n4 ROUGE-L                   0.9619
ROUGE.n2 ROUGE.n3                           0.9608
ROUGE.n2 ROUGE.n3 ROUGE.n4 ROUGE-L          0.9593
ROUGE.n3 METEOR.porter ROUGE-L              0.9525
ROUGE.n3 ROUGE.n4 METEOR.porter ROUGE-L     0.9495
Fluency
Metric Combination                    Fluency (IQ)
ROUGE.n3 ROUGE-SU*                    0.9251
ROUGE.n2 ROUGE.n3                     0.9206
ROUGE.n2 ROUGE-SU*                    0.9184
ROUGE.n3                              0.9180
ROUGE.n4 ROUGE-SU*                    0.9124
ROUGE.n2 ROUGE-L                      0.9121
ROUGE.n3 ROUGE.n4 ROUGE-SU*           0.9096
ROUGE.n2 ROUGE.n4                     0.9090
ROUGE.n3 METEOR.porter                0.8951
ROUGE.n3 METEOR.porter ROUGE-SU*      0.8916
4 Results for IWSLT 2005
Setting
Chinese-to-English track
506 very short sentences
Outputs by 11 systems
16 human references
Human assessments for adequacy, fluency and meaning maintenance
Outside QARLA
Metric          Adequacy   Fluency   Meaning
BLEU.n4           0.70       0.95      0.75
GTM.e3            0.73       0.95      0.77
METEOR.exact      0.98       0.59      0.98
NIST.n5           0.90       0.48      0.86
ROUGE.n1          0.98       0.63      0.99
ROUGE-W           0.97       0.72      0.99
mPER             -0.90       0.83      0.93
mWER             -0.72       0.90      0.79
IQ: Fluency
Metric Combination                        Fluency (IQ)
BLEU.n4                                   0.8549
GTM.e2 METEOR.exact NIST.n5               0.8512
BLEU.n4 ROUGE.n4                          0.8505
METEOR.exact NIST.n5 ROUGE.n4             0.8458
BLEU.n4 GTM.e2                            0.8446
BLEU.n4 METEOR.exact NIST.n5              0.8430
METEOR.exact NIST.n5                      0.8418
GTM.e2 METEOR.exact NIST.n5 ROUGE.n4      0.8414
GTM.e2 ROUGE.n4                           0.8408
GTM.e2 METEOR.exact NIST.n5 1-WER         0.8407
IQ: Adequacy
Metric Combination              Adequacy (IQ)
NIST.n1                         0.9826
NIST.n1 ROUGE.n1                0.9805
GTM.e1 NIST.n1 ROUGE.n1         0.9794
NIST.n1 1-PER                   0.9786
BLEU.n1 NIST.n1                 0.9784
GTM.e1 NIST.n1                  0.9775
GTM.e1 NIST.n1 1-PER            0.9757
BLEU.n1 GTM.e1 NIST.n1          0.9755
BLEU.n1 NIST.n1 ROUGE.n1        0.9751
NIST.n1 ROUGE.n1 1-PER          0.9727
IQ: Meaning Maintenance
Metric Combination              Meaning Maintenance (IQ)
NIST.n1 1-PER                   0.9766
NIST.n1                         0.9745
GTM.e1 NIST.n1 1-PER            0.9740
BLEU.n1 NIST.n1                 0.9739
NIST.n1 ROUGE.n1 1-PER          0.9733
BLEU.n1 NIST.n1 ROUGE.n1        0.9728
NIST.n1 ROUGE.n1                0.9725
BLEU.n1 GTM.e1 NIST.n1          0.9710
GTM.e1 NIST.n1 ROUGE.n1         0.9708
5 Conclusions and Further Work
Conclusions
Most individual metrics improved substantially inside QARLA
Metric combinations did not significantly improve results any further
IWSLT sentences are too short
Current MT evaluation metrics do not seem to meet the QARLA conjectures
Further Work
What if fewer references are available?
Shouldn’t we work on longer sentences?
What about sentence-level evaluation?
IQMT is not yet properly a framework!!
KING? (quality of a metric set)
JACK? (quality of a test set)
Thanks
coming very soon...
http://www.lsi.upc.edu/~nlp/IQMT