Machine Translation Evaluation Inside QARLA
IQMT

International Workshop on Spoken Language Translation (IWSLT’05)

Jesús Giménez (UPC)
Enrique Amigó (UNED)
Chiori Hori (CMU)

October 24-25, 2005

Outline

1 Introduction
    The ’2008 MT Evaluation Challenge’
    What can we do?
2 Approach
    The QARLA Framework
    QARLA for MT evaluation
3 Experimental Work
    MT Evaluation Metrics outside QARLA
    MT Evaluation Metrics inside QARLA
    Metric Combinations
4 Results for IWSLT 2005
5 Conclusions and Further Work

1 Introduction

The ’2008 MT Evaluation Challenge’

“By 2008 there will probably exist MT systems scoring higher than humans, according to current MT evaluation metrics”.

-- Franz Och (ACL’05)

Why?

Most metrics consider only lexical similarities (BLEU, NIST, WER, PER, GTM)

Little effort has been devoted to combining different metrics

Some Exceptions

Use of additional knowledge
    ROUGE (stemming)
    METEOR (WordNet lookup)
    WNM (frequency weighting)

Use of syntactic knowledge
    Liu and Gildea (2005)

Combining different metrics
    Kulesza and Shieber (2004)

What can we do?

Trust a new SuperMetric XXX?

Build a new metric XXX which considers information at different levels:

    lexical
    syntactic
    semantic

Or... divide and conquer

Why don’t we build a set of specialized metrics which work at different levels and combine their scores into a single measure of MT quality?

2 Approach

The QARLA Framework

3 measures

QUEEN determines the quality of a set of systems

KING determines the quality of a set of metrics

JACK determines the quality of a test set

(Amigó et al., ACL 2005)

The QUEEN assumption

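“A good candidate must be similar to all models according to all metrics.”

...

Given a set of metrics X, a set of models M, and a candidate a:

\[
\mathrm{QUEEN}_{X}(a, M) \equiv \mathrm{Prob}\left(\forall x \in X : x(a, m) \geq x(m', m'')\right)
\]

where the probability is taken over models m, m', m'' drawn from M.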
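The QUEEN definition can be sketched in Python as follows. This is a minimal illustration, not the IQMT implementation: the two toy similarity functions `word_overlap` and `length_ratio` are hypothetical stand-ins for real MT metrics, and the probability is estimated as the fraction of triples (m, m', m'') that satisfy the condition, assuming m' and m'' are distinct models.

```python
from itertools import permutations
from typing import Callable, Sequence

Metric = Callable[[str, str], float]


def word_overlap(a: str, b: str) -> float:
    """Toy similarity (hypothetical): Jaccard overlap of the two word sets."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


def length_ratio(a: str, b: str) -> float:
    """Toy similarity (hypothetical): shorter length divided by longer length."""
    la, lb = len(a.split()), len(b.split())
    longest = max(la, lb)
    return min(la, lb) / longest if longest else 1.0


def queen(candidate: str, models: Sequence[str], metrics: Sequence[Metric]) -> float:
    """Fraction of triples (m, m', m'') for which every metric x satisfies
    x(candidate, m) >= x(m', m''). Assumption: m' and m'' are distinct."""
    pairs = list(permutations(range(len(models)), 2))
    if not pairs:
        raise ValueError("QUEEN needs at least two models")
    hits = 0
    for m in models:
        for i, j in pairs:
            if all(x(candidate, m) >= x(models[i], models[j]) for x in metrics):
                hits += 1
    return hits / (len(models) * len(pairs))


references = ["the cat sat", "a cat sat", "the cat sat down"]
print(queen("the cat sat", references, [word_overlap]))
print(queen("the cat sat", references, [word_overlap, length_ratio]))
print(queen("dogs bark loudly", references, [word_overlap]))
```

Adding a second metric can only keep or lower the score, because the universal quantifier requires every metric to agree on each triple.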

The QUEEN properties

(i) It is able to combine different similarity metrics into a single evaluation measure.

(ii) It is not affected by the scale properties of individual metrics, i.e. it does not require metric normalisation and it is not affected by metric weighting.

(iii) Candidates which are very far from the set of models all receive QUEEN=0.

(iv) The value of QUEEN is maximised for candidates that “merge” with the references according to all metrics in X.

(v) The universal quantifier on the metric parameter x implies that adding redundant metrics does not bias the result of QUEEN.
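Properties (ii) and (v) can be read off the QUEEN condition directly. A short worked argument follows; the strictly increasing function f and the metric names x_j, x_k are notation introduced here for illustration, not taken from the slides.

\[
f\bigl(x(a, m)\bigr) \geq f\bigl(x(m', m'')\bigr) \iff x(a, m) \geq x(m', m'') \quad \text{for any strictly increasing } f
\]

so rescaling or weighting a metric by a positive factor leaves every comparison, and therefore QUEEN, unchanged. Likewise, if a metric x_k merely duplicates some x_j already in X (the extreme case of redundancy):

\[
\forall x \in X \cup \{x_k\} : x(a, m) \geq x(m', m'') \iff \forall x \in X : x(a, m) \geq x(m', m'')
\]

so the set of triples that satisfy the condition, and hence the probability, stays the same.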

QARLA for MT evaluation

Goals

To study the applicability of the QARLA Framework for MT:

Study of the QUEEN measure:
    1 Test individual metrics inside QARLA
    2 Try combinations of metrics

3 Experimental Work

Experimental Setting

IWSLT 2004 (Akiba et al., 2004)

    BTEC Corpus
    Chinese-to-English track
    500 short sentences
    Outputs by 20 systems
    16 human references
    Human assessments for adequacy and fluency

Metric Set

BLEU     accumulated BLEU scores for several n-gram levels (n = 1, 2, 3, 4).

NIST     accumulated NIST scores for several n-gram levels (n = 1, 2, 3, 4, 5).

GTM      for several values of the e parameter (e = 1, 2, 3).

mWER     (default).

mPER     (default).

METEOR   We used 4 variants running all different modules (exact, Porter stemming, WordNet stemming and WordNet synonymy).

ROUGE    for several n-grams (n = 1, 2, 3, 4), and 4 other variants at the 4-gram level, always with stemming (ROUGE-L, ROUGE-S*, ROUGE-SU*, ROUGE-W).

MT Evaluation Metrics outside QARLA

Metric          Adequacy   Fluency
BLEU.n3           0.8449    0.8326
BLEU.n4           0.7407    0.8600
GTM.e3            0.7022    0.6906
METEOR.exact      0.8899    0.7463
NIST.n5           0.6820    0.5950
ROUGE.n2          0.9287    0.8435
ROUGE.n3          0.9190    0.8646
ROUGE-S*          0.9376    0.8164
mWER             -0.6427   -0.7214
mPER             -0.5779   -0.6010
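The values in this table read naturally as correlation coefficients between system-level metric scores and human assessments. The slides do not name the coefficient, so the sketch below assumes Pearson correlation; the score lists are made-up numbers for five hypothetical systems, not data from the evaluation. It also shows why error rates such as mWER and mPER come out negative: lower error goes with higher human scores.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (spread_x * spread_y)


# Made-up system-level scores for five hypothetical systems (illustration only).
human_adequacy = [2.1, 2.8, 3.0, 3.4, 3.9]
similarity_metric = [0.18, 0.25, 0.24, 0.31, 0.36]  # higher is better
error_rate = [0.71, 0.62, 0.64, 0.55, 0.48]         # lower is better

print(pearson(human_adequacy, similarity_metric))  # positive
print(pearson(human_adequacy, error_rate))         # negative
```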

MT Evaluation Metrics inside QARLA

Using QUEEN

Metric              Ad.       Fl.      Ad.Q     Fl.Q
BLEU.n2           0.8442    0.8002    0.8770   0.8215
GTM.e2            0.6784    0.6566    0.6687   0.6140
METEOR.exact      0.8899    0.7463    0.7836   0.6888
METEOR.wn1        0.8784    0.7147    0.7886   0.6709
NIST.n5           0.6820    0.5950    0.8440   0.7036
ROUGE.n2          0.9287    0.8435    0.9238   0.8421
ROUGE.n3          0.9190    0.8646    0.9076   0.8630
ROUGE-L           0.9153    0.7644    0.9325   0.8112
ROUGE-S*          0.9376    0.8164    0.9357   0.8119
mPER/1-PER       -0.5779   -0.6010    0.4212   0.3662
mWER/1-WER       -0.6427   -0.7214    0.4507   0.4209

Refining the QUEEN measure for MT

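“A good translation must be at least as similar to one of the references as the rest of references are to each other.”

\[
\mathrm{IQ}_{X}(a, M) \equiv \max_{m \in M} \mathrm{iq}_{X,M}(a, m)
\]

\[
\mathrm{iq}_{X,M}(a, m) =
\begin{cases}
1 & \text{if } \forall x \in X : \forall m', m'' \in M : x(a, m) \geq x(m', m'') \\
0 & \text{otherwise}
\end{cases}
\]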
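A minimal Python sketch of the IQ measure may help make the definition concrete. It is an illustration under stated assumptions, not the IQMT code: `word_overlap` is a hypothetical toy metric, m' and m'' are assumed to be distinct references, and the system-level score in `system_iq` is assumed to be the average of the per-sentence 0/1 values.

```python
from itertools import permutations
from typing import Callable, Sequence

Metric = Callable[[str, str], float]


def word_overlap(a: str, b: str) -> float:
    """Toy similarity (hypothetical): Jaccard overlap of the two word sets."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


def iq(candidate: str, models: Sequence[str], metrics: Sequence[Metric]) -> int:
    """Return 1 if some reference m satisfies x(candidate, m) >= x(m', m'')
    for every metric x and every pair of distinct references, else 0."""
    pairs = list(permutations(models, 2))
    if not pairs:
        raise ValueError("IQ needs at least two references")
    for m in models:
        if all(x(candidate, m) >= x(m1, m2) for x in metrics for m1, m2 in pairs):
            return 1
    return 0


def system_iq(candidates: Sequence[str],
              reference_sets: Sequence[Sequence[str]],
              metrics: Sequence[Metric]) -> float:
    """Assumed aggregation: mean of per-sentence IQ values over the test set."""
    scores = [iq(c, refs, metrics) for c, refs in zip(candidates, reference_sets)]
    return sum(scores) / len(scores)


references = ["the cat sat", "a cat sat", "the cat sat down"]
print(iq("the cat sat", references, [word_overlap]))
print(iq("a dog sat", references, [word_overlap]))
print(system_iq(["the cat sat", "a dog sat"], [references, references], [word_overlap]))
```

Unlike QUEEN, a single sufficiently close reference is enough for a candidate to score 1, which matches the intuition that a translation only needs to agree with one valid rendering.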

IQ: Innovated QUEEN

Using IQ

Metric              Ad.       Fl.      Ad.Q     Fl.Q     Ad.IQ    Fl.IQ
BLEU.n3           0.8449    0.8326    0.8499   0.8212   0.3923   0.4064
GTM.e1            0.5136    0.5214    0.6204   0.5452   0.8293   0.7715
GTM.e2            0.6784    0.6566    0.6687   0.6140   0.8015   0.8126
METEOR.porter     0.8837    0.7265    0.7800   0.6706   0.9494   0.8599
NIST.n5           0.6820    0.5950    0.8440   0.7036   0.8768   0.8650
ROUGE.n2          0.9287    0.8435    0.9238   0.8421   0.9673   0.9142
ROUGE.n3          0.9190    0.8646    0.9076   0.8630   0.9588   0.9180
ROUGE-L           0.9153    0.7644    0.9325   0.8112   0.9713   0.8979
ROUGE-S*          0.9376    0.8164    0.9357   0.8119   0.9663   0.9062
mPER/1-PER       -0.5779   -0.6010    0.4212   0.3662   0.0242   0.0421
mWER/1-WER       -0.6427   -0.7214    0.4507   0.4209   0.0880   0.0770

Metric Combinations

Clustering

Cluster id   Metrics
1            {BLEU.n1, GTM.e1}
2            {BLEU.n2, ROUGE.n2}
3            {BLEU.n3, ROUGE.n3}
4            {BLEU.n4, ROUGE.n4}
5            {1-WER, 1-PER}
6            {METEOR.exact, METEOR.porter, METEOR.wn1, METEOR.wn2}
7            {NIST.n1, NIST.n2, NIST.n3, NIST.n4, NIST.n5}
8            {GTM.e2, GTM.e3}
9            {ROUGE.n1, ROUGE-L, ROUGE-S*, ROUGE-SU*, ROUGE-W}

Adequacy

Metric Combination                          Adequacy IQ
ROUGE-L                                     0.9713
ROUGE.n2 ROUGE-L                            0.9701
ROUGE.n3 ROUGE-L                            0.9681
ROUGE.n2 ROUGE.n3 ROUGE-L                   0.9661
ROUGE.n3 ROUGE.n4 ROUGE-L                   0.9621
ROUGE.n2 ROUGE.n4 ROUGE-L                   0.9619
ROUGE.n2 ROUGE.n3                           0.9608
ROUGE.n2 ROUGE.n3 ROUGE.n4 ROUGE-L          0.9593
ROUGE.n3 METEOR.porter ROUGE-L              0.9525
ROUGE.n3 ROUGE.n4 METEOR.porter ROUGE-L     0.9495

Fluency

Metric Combination                          Fluency IQ
ROUGE.n3 ROUGE-SU*                          0.9251
ROUGE.n2 ROUGE.n3                           0.9206
ROUGE.n2 ROUGE-SU*                          0.9184
ROUGE.n3                                    0.9180
ROUGE.n4 ROUGE-SU*                          0.9124
ROUGE.n2 ROUGE-L                            0.9121
ROUGE.n3 ROUGE.n4 ROUGE-SU*                 0.9096
ROUGE.n2 ROUGE.n4                           0.9090
ROUGE.n3 METEOR.porter                      0.8951
ROUGE.n3 METEOR.porter ROUGE-SU*            0.8916

4 Results for IWSLT 2005

Setting

    Chinese-to-English track
    506 very short sentences
    Outputs by 11 systems
    16 human references
    Human assessments for adequacy, fluency and meaning maintenance

Outside QARLA

Metric          Adequacy   Fluency   Meaning
BLEU.n4             0.70      0.95      0.75
GTM.e3              0.73      0.95      0.77
METEOR.exact        0.98      0.59      0.98
NIST.n5             0.90      0.48      0.86
ROUGE.n1            0.98      0.63      0.99
ROUGE-W             0.97      0.72      0.99
mPER               -0.90      0.83      0.93
mWER               -0.72      0.90      0.79

IQ: Fluency

Metric Combination                          Fluency IQ
BLEU.n4                                     0.8549
GTM.e2 METEOR.exact NIST.n5                 0.8512
BLEU.n4 ROUGE.n4                            0.8505
METEOR.exact NIST.n5 ROUGE.n4               0.8458
BLEU.n4 GTM.e2                              0.8446
BLEU.n4 METEOR.exact NIST.n5                0.8430
METEOR.exact NIST.n5                        0.8418
GTM.e2 METEOR.exact NIST.n5 ROUGE.n4        0.8414
GTM.e2 ROUGE.n4                             0.8408
GTM.e2 METEOR.exact NIST.n5 1-WER           0.8407

IQ: Adequacy

Metric Combination                          Adequacy IQ
NIST.n1                                     0.9826
NIST.n1 ROUGE.n1                            0.9805
GTM.e1 NIST.n1 ROUGE.n1                     0.9794
NIST.n1 1-PER                               0.9786
BLEU.n1 NIST.n1                             0.9784
GTM.e1 NIST.n1                              0.9775
GTM.e1 NIST.n1 1-PER                        0.9757
BLEU.n1 GTM.e1 NIST.n1                      0.9755
BLEU.n1 NIST.n1 ROUGE.n1                    0.9751
NIST.n1 ROUGE.n1 1-PER                      0.9727

IQ: Meaning Maintenance

Metric Combination                          Meaning Maintenance IQ
NIST.n1 1-PER                               0.9766
NIST.n1                                     0.9745
GTM.e1 NIST.n1 1-PER                        0.9740
BLEU.n1 NIST.n1                             0.9739
NIST.n1 ROUGE.n1 1-PER                      0.9733
BLEU.n1 NIST.n1 ROUGE.n1                    0.9728
NIST.n1 ROUGE.n1                            0.9725
BLEU.n1 GTM.e1 NIST.n1                      0.9710
GTM.e1 NIST.n1 ROUGE.n1                     0.9708

5 Conclusions and Further Work

Conclusions

Most individual metrics improved substantially inside QARLA

Metric combinations did not significantly improve results further

IWSLT sentences are too short

Current MT evaluation metrics do not seem to meet the QARLA conjectures

Further Work

What if fewer references are available?

Shouldn’t we work on longer sentences?

What about sentence-level evaluation?

IQMT is not yet properly a framework!!

    KING? (quality of a metric set)
    JACK? (quality of a test set)

Thanks

coming very soon...

http://www.lsi.upc.edu/~nlp/IQMT