
IQMT


Machine Translation Evaluation Inside QARLA
International Workshop on Spoken Language Translation (IWSLT’05)

Jesus Gimenez (UPC)
Enrique Amigo (UNED)
Chiori Hori (CMU)

October 24-25, 2005



Outline

1 Introduction
    The ’2008 MT Evaluation Challenge’
    What can we do?

2 Approach
    The QARLA Framework
    QARLA for MT evaluation

3 Experimental Work
    MT Evaluation Metrics outside QARLA
    MT Evaluation Metrics inside QARLA
    Metric Combinations

4 Results for IWSLT 2005

5 Conclusions and Further Work



The ’2008 MT Evaluation Challenge’





“By 2008 there will probably exist MT systems scoring higher than humans, according to current MT evaluation metrics.”

-- Franz Och (ACL’05)

Why?

Most metrics consider only lexical similarities (BLEU, NIST, WER, PER, GTM)

Little effort has been devoted to combining different metrics




Some Exceptions

Use of additional knowledge

ROUGE (stemming)
METEOR (WordNet lookup)
WNM (frequency weighting)

Use of syntactic knowledge

Liu and Gildea (2005)

Combining different metrics

Kulesza and Shieber (2004)



What can we do?

Trust a new SuperMetric XXX?

Build a new metric XXX which considers information at different levels:

lexical

syntactic

semantic



What can we do?

Or... divide and conquer

Why don’t we build a set of specialized metrics which work at different levels and combine their scores into a single measure of MT quality?

2 Approach


The QARLA Framework

3 measures

QUEEN determines the quality of a set of systems

KING determines the quality of a set of metrics

JACK determines the quality of a test set

(Amigo et al., ACL 2005)



The QARLA Framework

The QUEEN assumption

“A good candidate must be similar to all models according to all metrics.”
...
Given a set of metrics X, a set of models M, and a candidate a:

$$\mathrm{QUEEN}_{X}(a, M) \equiv \mathrm{Prob}\big(\forall x \in X : x(a, m) \ge x(m', m'')\big)$$
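A minimal sketch of how this probability could be estimated by exhaustive counting. It assumes that m, m′ and m′′ are drawn uniformly from M with m′ ≠ m′′ (the slide does not say how they are sampled), and it uses a toy word-overlap function as a stand-in for a real metric. The names `queen` and `unigram_overlap` and the example sentences are hypothetical.

```python
from itertools import permutations


def unigram_overlap(a: str, b: str) -> float:
    """Toy similarity metric: Jaccard overlap of the word sets of a and b."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def queen(candidate, models, metrics):
    """Fraction of triples (m, m', m'') for which every metric x satisfies
    x(candidate, m) >= x(m', m'').

    Assumption: m' and m'' are distinct positions in `models`,
    so at least two models are required.
    """
    pairs = list(permutations(models, 2))
    hits = sum(
        all(x(candidate, m) >= x(m1, m2) for x in metrics)
        for m in models
        for m1, m2 in pairs
    )
    return hits / (len(models) * len(pairs))


# Hypothetical references and candidates, for illustration only.
models = ["the cat sat", "the cat sat down", "a cat sat"]
print(queen("the cat sat", models, [unigram_overlap]))
print(queen("dogs bark loudly", models, [unigram_overlap]))  # 0.0
```

The second candidate shares no word with any reference, so no triple satisfies the condition and the estimate is 0.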



The QARLA Framework

The QUEEN properties

(i) It is able to combine different similarity metrics into a single evaluation measure.

(ii) It is not affected by the scale properties of individual metrics, i.e. it does not require metric normalisation and it is not affected by metric weighting.

(iii) Candidates which are very far from the set of models all receive QUEEN=0.

(iv) The value of QUEEN is maximised for candidates that “merge” with the references according to all metrics in X.

(v) The universal quantifier on the metric parameter x implies that adding redundant metrics does not bias the result of QUEEN.
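Properties (ii), (iii) and (v) can be checked on a toy example. The sketch below is an illustration under stated assumptions, not the authors' implementation: `queen` counts triples with m′ ≠ m′′, the "references" are hypothetical points on a line, and `closeness` and `rescaled` are made-up metrics that rank pairs identically but on different scales.

```python
from itertools import permutations


def queen(candidate, models, metrics):
    """Counting estimate of QUEEN; assumes m' != m'' (by position)."""
    pairs = list(permutations(models, 2))
    hits = sum(
        all(x(candidate, m) >= x(m1, m2) for x in metrics)
        for m in models
        for m1, m2 in pairs
    )
    return hits / (len(models) * len(pairs))


def closeness(a, b):
    """Toy metric: higher means more similar."""
    return -abs(a - b)


def rescaled(a, b):
    """Same ranking as `closeness`, on a different scale and offset."""
    return 100.0 * closeness(a, b) + 7.0


models = [1.0, 2.0, 4.0]  # hypothetical one-dimensional "references"
base = queen(2.0, models, [closeness])

# (ii) rescaling a metric leaves QUEEN unchanged
assert queen(2.0, models, [rescaled]) == base
# (v) adding a redundant metric leaves QUEEN unchanged
assert queen(2.0, models, [closeness, rescaled]) == base
# (iii) a candidate far from all models receives 0
assert queen(100.0, models, [closeness]) == 0.0
```

Only the ordering of similarity values enters the comparison, which is why any order-preserving rescaling has no effect.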



QARLA for MT evaluation

Goals

To study the applicability of the QARLA Framework for MT:

Study of the QUEEN measure:
1 Test individual metrics inside QARLA
2 Try combinations of metrics

3 Experimental Work


Experimental Setting

IWSLT 2004 (Akiba et al., 2004)

BTEC Corpus

Chinese-to-English track
500 short sentences
Outputs by 20 systems
16 human references
Human assessments for adequacy and fluency



Metric Set

BLEU: accumulated BLEU scores for several n-gram levels (n = 1, 2, 3, 4).

NIST: accumulated NIST scores for several n-gram levels (n = 1, 2, 3, 4, 5).

GTM: several values of the e parameter (e = 1, 2, 3).

mWER: default.

mPER: default.

METEOR: 4 variants running the different modules (exact, Porter stemming, WordNet stemming and WordNet synonymy).

ROUGE: several n-gram levels (n = 1, 2, 3, 4), and 4 other variants at the 4-gram level (ROUGE-L, ROUGE-S*, ROUGE-SU*, ROUGE-W), always with stemming.
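For reference, the variants listed above can be enumerated in a few lines. The identifiers follow the naming used elsewhere in the slides (BLEU.n3, GTM.e2, METEOR.wn1 and so on); which METEOR module `wn1` and `wn2` each denote is not stated, so the sketch only lists the names.

```python
# Enumerate the metric variants named in the slides.
metric_set = (
    [f"BLEU.n{n}" for n in range(1, 5)]        # n = 1..4
    + [f"NIST.n{n}" for n in range(1, 6)]      # n = 1..5
    + [f"GTM.e{e}" for e in range(1, 4)]       # e = 1..3
    + ["mWER", "mPER"]
    + ["METEOR.exact", "METEOR.porter", "METEOR.wn1", "METEOR.wn2"]
    + [f"ROUGE.n{n}" for n in range(1, 5)]     # n = 1..4
    + ["ROUGE-L", "ROUGE-S*", "ROUGE-SU*", "ROUGE-W"]
)

print(len(metric_set))  # 26
```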



MT Evaluation Metrics outside QARLA

Metric         Adequacy   Fluency
BLEU.n3          0.8449    0.8326
BLEU.n4          0.7407    0.8600
GTM.e3           0.7022    0.6906
METEOR.exact     0.8899    0.7463
NIST.n5          0.6820    0.5950
ROUGE.n2         0.9287    0.8435
ROUGE.n3         0.9190    0.8646
ROUGE-S*         0.9376    0.8164
mWER            -0.6427   -0.7214
mPER            -0.5779   -0.6010
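The figures above read as correlations between metric scores and human assessments. The slides do not name the coefficient; the sketch below assumes Pearson correlation computed over per-system scores, and all numbers in it are invented for illustration. It also shows why error rates such as mWER and mPER correlate negatively, and that using 1 - error only flips the sign.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-system scores (one entry per MT system).
human_adequacy = [2.1, 3.4, 2.9, 4.0]
similarity_metric = [0.20, 0.35, 0.31, 0.42]   # higher is better
error_rate = [0.80, 0.62, 0.70, 0.55]          # lower is better

print(pearson(human_adequacy, similarity_metric))             # positive
print(pearson(human_adequacy, error_rate))                    # negative
print(pearson(human_adequacy, [1 - e for e in error_rate]))   # sign flipped
```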



MT Evaluation Metrics inside QARLA




Using QUEEN

Ad. = adequacy, Fl. = fluency; the Q suffix marks the same metric used inside QUEEN.

Metric            Ad.       Fl.      Ad.Q     Fl.Q
BLEU.n2         0.8442    0.8002    0.8770   0.8215
GTM.e2          0.6784    0.6566    0.6687   0.6140
METEOR.exact    0.8899    0.7463    0.7836   0.6888
METEOR.wn1      0.8784    0.7147    0.7886   0.6709
NIST.n5         0.6820    0.5950    0.8440   0.7036
ROUGE.n2        0.9287    0.8435    0.9238   0.8421
ROUGE.n3        0.9190    0.8646    0.9076   0.8630
ROUGE-L         0.9153    0.7644    0.9325   0.8112
ROUGE-S*        0.9376    0.8164    0.9357   0.8119
mPER/1-PER     -0.5779   -0.6010    0.4212   0.3662
mWER/1-WER     -0.6427   -0.7214    0.4507   0.4209




Refining the QUEEN measure for MT

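“A good translation must be at least as similar to one of the references as the rest of references are to each other.”

$$\mathrm{IQ}_{X}(a, M) \equiv \max_{m \in M} \mathrm{iq}_{X,M}(a, m)$$

$$\mathrm{iq}_{X,M}(a, m) = \begin{cases} 1 & \text{if } \forall x \in X : \forall m', m'' \in M : x(a, m) \ge x(m', m'') \\ 0 & \text{otherwise} \end{cases}$$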
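The definition can be sketched directly in code. This is an illustration under assumptions, not the authors' implementation: the pairs (m′, m′′) are taken over distinct references only, since allowing m′ = m′′ would compare against a reference's similarity to itself, and `unigram_overlap` is a toy stand-in for a real metric. The function names and example sentences are hypothetical.

```python
from itertools import permutations


def unigram_overlap(a: str, b: str) -> float:
    """Toy similarity metric: Jaccard overlap of the word sets of a and b."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def iq(candidate, models, metrics):
    """Return 1 if some reference m is at least as similar to the candidate
    as any two distinct references are to each other, under every metric;
    otherwise return 0."""
    pairs = list(permutations(models, 2))  # assumption: m' != m''

    def passes(m):
        return all(
            x(candidate, m) >= x(m1, m2) for x in metrics for m1, m2 in pairs
        )

    return int(any(passes(m) for m in models))


# Hypothetical references and candidates, for illustration only.
models = ["the cat sat", "the cat sat down", "a cat sat"]
print(iq("the cat sat down now", models, [unigram_overlap]))  # 1
print(iq("dogs bark loudly", models, [unigram_overlap]))      # 0
```

In the first call the candidate overlaps 0.8 with "the cat sat down", which is higher than the largest overlap between any two references (0.75), so the condition holds for that reference.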




IQ: Innovated QUEEN




Using IQ

Ad. = adequacy, Fl. = fluency; the Q and IQ suffixes mark the same metric used inside QUEEN and inside IQ.

Metric            Ad.       Fl.      Ad.Q     Fl.Q     Ad.IQ    Fl.IQ
BLEU.n3         0.8449    0.8326    0.8499   0.8212   0.3923   0.4064
GTM.e1          0.5136    0.5214    0.6204   0.5452   0.8293   0.7715
GTM.e2          0.6784    0.6566    0.6687   0.6140   0.8015   0.8126
MET.porter      0.8837    0.7265    0.7800   0.6706   0.9494   0.8599
NIST.n5         0.6820    0.5950    0.8440   0.7036   0.8768   0.8650
ROUGE.n2        0.9287    0.8435    0.9238   0.8421   0.9673   0.9142
ROUGE.n3        0.9190    0.8646    0.9076   0.8630   0.9588   0.9180
ROUGE-L         0.9153    0.7644    0.9325   0.8112   0.9713   0.8979
ROUGE-S*        0.9376    0.8164    0.9357   0.8119   0.9663   0.9062
mPER/1-PER     -0.5779   -0.6010    0.4212   0.3662   0.0242   0.0421
mWER/1-WER     -0.6427   -0.7214    0.4507   0.4209   0.0880   0.0770



Metric Combinations

Clustering

Cluster id   Metrics
1            {BLEU.n1, GTM.e1}
2            {BLEU.n2, ROUGE.n2}
3            {BLEU.n3, ROUGE.n3}
4            {BLEU.n4, ROUGE.n4}
5            {1-WER, 1-PER}
6            {METEOR.exact, METEOR.porter, METEOR.wn1, METEOR.wn2}
7            {NIST.n1, NIST.n2, NIST.n3, NIST.n4, NIST.n5}
8            {GTM.e2, GTM.e3}
9            {ROUGE.n1, ROUGE-L, ROUGE-S*, ROUGE-SU*, ROUGE-W}



Metric Combinations

Adequacy

Metric Combination                          Adequacy IQ
ROUGE-L                                     0.9713
ROUGE.n2 ROUGE-L                            0.9701
ROUGE.n3 ROUGE-L                            0.9681
ROUGE.n2 ROUGE.n3 ROUGE-L                   0.9661
ROUGE.n2 ROUGE.n3                           0.9608
ROUGE.n2 ROUGE.n3 ROUGE.n4 ROUGE-L          0.9593
ROUGE.n3 METEOR.porter ROUGE-L              0.9525
ROUGE.n3 ROUGE.n4 METEOR.porter ROUGE-L     0.9495



Metric Combinations

Fluency

Metric Combination                    Fluency IQ
ROUGE.n3 ROUGE-SU*                    0.9251
ROUGE.n2 ROUGE.n3                     0.9206
ROUGE.n2 ROUGE-SU*                    0.9184
ROUGE.n3                              0.9180
ROUGE.n4 ROUGE-SU*                    0.9124
ROUGE.n2 ROUGE-L                      0.9121
ROUGE.n3 ROUGE.n4 ROUGE-SU*           0.9096
ROUGE.n2 ROUGE.n4                     0.9090
ROUGE.n3 METEOR.porter                0.8951
ROUGE.n3 METEOR.porter ROUGE-SU*      0.8916
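Ranked lists like the ones above could be produced by scoring every small subset of metrics and sorting. The slides do not describe how the combinations were searched, so the sketch below is only one plausible procedure: `rank_combinations` and its `score` callback are hypothetical, and the toy score is a placeholder for "correlation of IQ with human judgements".

```python
from itertools import combinations


def rank_combinations(metric_names, score, max_size=3):
    """Score every combination of up to `max_size` metrics and return
    (score, combination) pairs, best first."""
    results = [
        (score(combo), combo)
        for k in range(1, max_size + 1)
        for combo in combinations(metric_names, k)
    ]
    return sorted(results, reverse=True)


# Placeholder scoring: the mean of made-up per-metric values.
toy_values = {"A": 0.9, "B": 0.8, "C": 0.5}


def toy_score(combo):
    return sum(toy_values[name] for name in combo) / len(combo)


ranked = rank_combinations(list(toy_values), toy_score)
for value, combo in ranked[:3]:
    print(" ".join(combo), round(value, 4))
```

With n metrics and no size limit the search covers 2^n - 1 subsets, so a cap such as `max_size` keeps it tractable.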

4 Results for IWSLT 2005


Setting

Chinese-to-English track
506 very short sentences
Outputs by 11 systems
16 human references
Human assessments for adequacy, fluency and meaning maintenance



Outside QARLA

Metric         Adequacy   Fluency   Meaning
BLEU.n4          0.70      0.95      0.75
GTM.e3           0.73      0.95      0.77
METEOR.exact     0.98      0.59      0.98
NIST.n5          0.90      0.48      0.86
ROUGE.n1         0.98      0.63      0.99
ROUGE-W          0.97      0.72      0.99
mPER            -0.90      0.83      0.93
mWER            -0.72      0.90      0.79



IQ: Fluency

Metric Combination                        Fluency IQ
BLEU.n4                                   0.8549
GTM.e2 METEOR.exact NIST.n5               0.8512
BLEU.n4 ROUGE.n4                          0.8505
METEOR.exact NIST.n5 ROUGE.n4             0.8458
BLEU.n4 GTM.e2                            0.8446
BLEU.n4 METEOR.exact NIST.n5              0.8430
METEOR.exact NIST.n5                      0.8418
GTM.e2 METEOR.exact NIST.n5 ROUGE.n4      0.8414
GTM.e2 ROUGE.n4                           0.8408
GTM.e2 METEOR.exact NIST.n5 1-WER         0.8407



IQ: Adequacy

Metric Combination            Adequacy IQ
NIST.n1                       0.9826
NIST.n1 ROUGE.n1              0.9805
GTM.e1 NIST.n1 ROUGE.n1       0.9794
NIST.n1 1-PER                 0.9786
BLEU.n1 NIST.n1               0.9784
GTM.e1 NIST.n1                0.9775
GTM.e1 NIST.n1 1-PER          0.9757
BLEU.n1 GTM.e1 NIST.n1        0.9755
BLEU.n1 NIST.n1 ROUGE.n1      0.9751
NIST.n1 ROUGE.n1 1-PER        0.9727



IQ: Meaning Maintenance

Metric Combination            Meaning Maintenance IQ
NIST.n1 1-PER                 0.9766
NIST.n1                       0.9745
GTM.e1 NIST.n1 1-PER          0.9740
BLEU.n1 NIST.n1               0.9739
NIST.n1 ROUGE.n1 1-PER        0.9733
BLEU.n1 NIST.n1 ROUGE.n1      0.9728
NIST.n1 ROUGE.n1              0.9725
BLEU.n1 GTM.e1 NIST.n1        0.9710
GTM.e1 NIST.n1 ROUGE.n1       0.9708

5 Conclusions and Further Work


Conclusions

Most individual metrics improved substantially inside QARLA

Metric combinations did not further significantly improve results

IWSLT sentences are too short

Current MT evaluation metrics do not seem to meet the QARLA conjectures



Further Work

What if fewer references are available?

Shouldn’t we work on longer sentences?

What about sentence level evaluation?

IQMT is not yet properly a framework!!

KING? (quality of a metric set)
JACK? (quality of a test set)



Thanks

coming very soon...

http://www.lsi.upc.edu/~nlp/IQMT