
Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation


Page 1: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Chin-Yew Lin & Franz Josef Och

(presented by Bilmes)

or: Orange: a Method for Automatically Evaluating Automatic Evaluation Metrics for Machine Translation

Page 2: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Summary

• 1. Introduces ORANGE, a way to automatically evaluate automatic evaluation methods

• 2. Introduces 3 new ways to automatically evaluate MT systems: ROUGE-L, ROUGE-W, and ROUGE-S

• 3. Uses ORANGE to evaluate many different evaluation methods, and finds that their new one, ROUGE-S4, is the best evaluator

Page 3: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Reminder: Adequacy & Fluency

• Adequacy refers to the degree to which the translation communicates the information present in the original. Roughly, a translation using the same words (1-grams) as the reference tends to satisfy adequacy.

• Fluency refers to the degree to which the translation is well-formed according to the grammar of the target language. Roughly, the longer a translation's n-gram matches with the reference, the better the fluency tends to be.

Page 4: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Reminder: BLEU

• unigram precision = number of candidate unigrams that appear in the reference translation / candidate translation length

• modified unigram precision = clipped count of candidate unigrams appearing in the reference / candidate translation length
– clipping caps each unigram's count at its maximum count in any reference translation

• modified n-gram precision: the same idea applied to n-grams

• On blocks of text: the counts are summed over all candidate sentences in the corpus before dividing

• Brevity penalty (since overly short candidates can otherwise get high n-gram precision)

• Finally: the n-gram precisions are combined with the brevity penalty into a single score (standard formulas below)
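For reference, the formulas these bullets point to are the standard BLEU definitions from Papineni et al. (2002) (the slide images are not reproduced in this transcript); written out:

  p_n = \frac{\sum_{C \in \{\mathrm{Candidates}\}} \sum_{\text{n-gram} \in C} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{C' \in \{\mathrm{Candidates}\}} \sum_{\text{n-gram}' \in C'} \mathrm{Count}(\text{n-gram}')}

  \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}
  \qquad
  \mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big)

with c the total candidate length, r the effective reference length, and weights w_n (typically uniform with N = 4).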

Page 5: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Still Other Reminders

• Pearson's product-moment correlation coefficient:
– just the usual correlation coefficient; for centered X and Y, r^2 = (E[XY])^2 / (E[X^2] E[Y^2])

• Spearman's rank-order correlation coefficient:
– the same thing computed on ranks; with D_i = rank_A(i) - rank_B(i),

  r_s = 1 - \frac{6 \sum_i D_i^2}{N (N^2 - 1)}

• Bootstrap method to compute confidence intervals:
– resample with replacement from the data N times, compute the mean each time, and take val +/- 2 se(val) as the 95% confidence interval (a sketch follows below)
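A minimal sketch of the bootstrap confidence interval described above (the function name and the toy scores are illustrative, not from the paper):

```python
import random
import statistics

def bootstrap_ci(values, num_resamples=1000, seed=0):
    """95% CI for the mean as on the slide: resample with replacement,
    recompute the mean each time, and report mean +/- 2 * standard error."""
    rng = random.Random(seed)
    resampled_means = []
    for _ in range(num_resamples):
        sample = [rng.choice(values) for _ in values]   # resample with replacement
        resampled_means.append(sum(sample) / len(sample))
    center = sum(values) / len(values)
    se = statistics.stdev(resampled_means)              # bootstrap standard error
    return center - 2 * se, center + 2 * se

# Toy example: CI for an average metric score over a handful of sentences.
print(bootstrap_ci([0.31, 0.42, 0.37, 0.29, 0.44, 0.35]))
```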

Page 6: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

On to the paper: lots of ways to evaluate MT quality

• BLEU

• RED

• WER – length-normalized edit distance

• PER – position independent word error rate (bag of words approach)

• GTM – general text matcher, based on a balance of recall, precision, & their F-measure combination (we should do this paper)

• This paper introduces three more such metrics: ROUGE-L, ROUGE-W, and ROUGE-S (which we shall define).

Page 7: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Corr Coeff & 95% CIs of 8 MT systems in NIST03 Chinese-English, using various MT evaluation methods

Page 8: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Problem: we need a way to automatically evaluate these automatic evaluation methods.

• Since we don't know which one is best, which one to use, how and when to choose, etc.

• Try to break out of the region of insignificant difference.

• Question: what about the meta-regress: do we need a way to automatically evaluate automatic evaluations of automatic evaluation methods?

• Anyway, the goal of this paper (other than introducing new automatic evaluation methods) is to introduce ORANGE: Oracle Ranking for Gisting Evaluation (the first automatic evaluation of automatic MT evaluation methods).

Page 9: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

ORANGE

• Intuitively: uses a translation's rank as scored by an MT evaluation metric (so good translations should rank high, i.e., near the top of the list, and poor ones low).

• Reference translations should rank higher than machine translations.

• Key quantity: average rank of the reference translations within the combined list of machine and reference translations.

• ORANGE = average rank / N in the N-best list. Example list, ordered by a metric's score:

1. The bank was visited by me yesterday.
2. I went to the bank yesterday.
3. Yesterday, I went to the bank.
4. Yesterday, the bank had the opportunity to be visited by me, and in fact this did indeed occur.
5. There was once this bank that at least as of yesterday existed, and so did I, and a funny thing happened …

So the reference translations were ranked 2 and 3 in this list; avg rank = 2.5. The smaller, the better (see the sketch below).
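A minimal sketch of the average-rank computation in the example above (the helper name, the scores, and the pool are illustrative, not the paper's implementation):

```python
def average_reference_rank(scored_pool, is_reference):
    """Rank a combined pool of machine + reference translations by a metric's
    score (higher = better, rank 1 = top) and return the average rank of the
    reference translations."""
    ranked = sorted(scored_pool, key=lambda item: item[1], reverse=True)
    ref_ranks = [rank for rank, (sent, _) in enumerate(ranked, start=1)
                 if is_reference(sent)]
    return sum(ref_ranks) / len(ref_ranks)

# Toy pool of (sentence, metric score); the two references score 2nd and 3rd.
pool = [("The bank was visited by me yesterday.", 0.80),
        ("I went to the bank yesterday.", 0.75),        # reference
        ("Yesterday, I went to the bank.", 0.70),       # reference
        ("Yesterday, the bank had the opportunity ...", 0.40),
        ("There was once this bank that ...", 0.10)]
refs = {"I went to the bank yesterday.", "Yesterday, I went to the bank."}
print(average_reference_rank(pool, lambda s: s in refs))  # -> 2.5
```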

Page 10: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

ORANGE

• The way they calculate ORANGE in this work:

• Oracle_i = the reference translation(s) for source sentence i

• N = size of the N-best list

• S = number of source sentences in the corpus

• Rank(Oracle_i) = average rank of source sentence i's reference translations in N-best list i (the formula below puts these together).
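Putting the slide's pieces together, one reading of "ORANGE = average rank / N", averaged over the S source sentences (the paper may typeset it differently):

  \mathrm{ORANGE} = \frac{1}{S \cdot N} \sum_{i=1}^{S} \mathrm{Rank}(\mathrm{Oracle}_i)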

Page 11: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Three new metrics

• ROUGE-L

• ROUGE-W

• ROUGE-S

Page 12: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

LCS: Longest Common Subsequences

Page 13: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Computing: Longest Common Subsequences


Key thing: this does not require consecutive matches in the strings.

Ex: LCS(X, Y) = 3 for

- X: police killed the gunman

- Y: police kill the gunman
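A minimal sketch of the standard LCS dynamic program on word sequences, reproducing the slide's example:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists via the
    standard dynamic program; matches need not be consecutive."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

print(lcs_length("police killed the gunman".split(),
                 "police kill the gunman".split()))  # -> 3
```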

Page 14: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

ROUGE-L

• Basically, an "F-measure" (or combination) of two normalized LCSs; for a reference X of length m and a candidate Y of length n:

  R_{lcs} = \frac{LCS(X,Y)}{m}, \quad P_{lcs} = \frac{LCS(X,Y)}{n}, \quad F_{lcs} = \frac{(1+\beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}, \quad \text{with } \beta = P_{lcs}/R_{lcs} \text{ when } \partial F_{lcs}/\partial R_{lcs} = \partial F_{lcs}/\partial P_{lcs}

1. Again, no consecutive matches necessary.
2. Automatically includes the longest in-sequence common n-gram.

Page 15: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

ROUGE-L

(Example on the slide: one reference and two candidate translations; the first candidate scores ROUGE-L = 3/5, the second ROUGE-L = 1/2.)

Page 16: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

ROUGE-L

• Basically, an "F-measure" (or combination) of two normalized LCSs, as defined above.

1. Again, no consecutive matches necessary.
2. Automatically includes the longest in-sequence common n-gram.
3. Problem: counts only the main in-sequence words; other LCSs and shorter common subsequences are not counted.

Page 17: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Computing: ROUGE-W score

Uses a weighting function f(k) on the length k of a run of consecutive matches (e.g., f(k) = k^α with α > 1), so that consecutive matches are awarded more than non-consecutive matches (see the sketch below).
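A minimal sketch of a weighted-LCS dynamic program in this spirit, assuming the polynomial weight f(k) = k**alpha; the recurrence follows the WLCS formulation used for ROUGE-W, while the value of alpha and the toy strings are illustrative:

```python
def wlcs_score(x, y, alpha=1.2):
    """Weighted LCS: a run of k consecutive matches contributes f(k) = k**alpha,
    so longer consecutive runs are rewarded more than scattered matches."""
    f = lambda k: k ** alpha
    m, n = len(x), len(y)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]  # weighted-LCS score so far
    w = [[0] * (n + 1) for _ in range(m + 1)]    # length of the consecutive run ending here
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
                w[i][j] = 0
    return c[m][n]

# Both candidates share an LCS of length 4 (A B C D) with the reference,
# but the first matches it consecutively and therefore scores higher.
reference = list("ABCDEFG")
print(wlcs_score(reference, list("ABCDHIK")))  # consecutive matches
print(wlcs_score(reference, list("AHBKCID")))  # scattered matches
```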

Page 18: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

ROUGE-S

• Another "F-measure", but here using skip-bigram co-occurrence statistics (i.e., ordered word pairs that need not be consecutive). The goal is to measure the overlap of skip-bigrams.

• The function SKIP2(X, Y) measures the number of skip-bigrams common to X and Y.

Page 19: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

ROUGE-S

• Using the SKIP2() function, for a reference X of length m and a candidate Y of length n:

  R_{skip2} = \frac{SKIP2(X,Y)}{C(m,2)}, \quad P_{skip2} = \frac{SKIP2(X,Y)}{C(n,2)}

  where C(m,2) is the number of skip-bigrams in a sequence of m words; R and P are then combined into an F-measure as for ROUGE-L.

• No consecutive matches required, but word order is still respected.

• Counts *all* in-order matching word pairs (LCS only counts the longest common subsequence).

• Can impose a limit on the maximum skip distance:
– ROUGE-Sn has a maximum skip distance of n (e.g., ROUGE-S4) (see the sketch below).
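A minimal sketch of skip-bigram counting with an optional distance limit; the limit is taken here as the number of words allowed between the pair, and the names and distance convention are illustrative rather than the paper's code:

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens, max_skip=None):
    """All in-order word pairs; max_skip, if given, caps how many words may
    appear between the two (a ROUGE-Sn-style limit)."""
    return Counter((a, b)
                   for (i, a), (j, b) in combinations(enumerate(tokens), 2)
                   if max_skip is None or j - i - 1 <= max_skip)

def skip2(x, y, max_skip=None):
    """SKIP2(X, Y): number of skip-bigrams common to x and y."""
    cx, cy = skip_bigrams(x, max_skip), skip_bigrams(y, max_skip)
    return sum(min(cx[pair], cy[pair]) for pair in cx)

x = "police killed the gunman".split()
y = "police kill the gunman".split()
print(skip2(x, y))               # 3: "police the", "police gunman", "the gunman"
print(skip2(x, y, max_skip=4))   # same here, with a ROUGE-S4-style limit
```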

Page 20: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Setup

• ISI’s AlTemp SMT system

• 2002 NIST Chinese-English evaluation corpus

• 872 source sentences, 4 ref trans. each

• 1024-best lists used

Page 21: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Evaluating BLEU with ORANGE

• smoothed BLEU: add one to the numerator and denominator of each modified n-gram precision (sketch below):

  p_n = \frac{\Big( \sum_{C \in \{\mathrm{Candidates}\}} \sum_{\text{n-gram} \in C} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram}) \Big) + 1}{\Big( \sum_{C' \in \{\mathrm{Candidates}\}} \sum_{\text{n-gram}' \in C'} \mathrm{Count}(\text{n-gram}') \Big) + 1}
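A minimal sketch of an add-one-smoothed modified n-gram precision, assuming this reading of the formula (illustrative only, not the paper's code):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def smoothed_modified_precision(candidate, references, n):
    """Clipped n-gram precision with add-one smoothing in numerator and
    denominator, so a candidate never scores exactly zero."""
    cand_counts = Counter(ngrams(candidate, n))
    # Clip each candidate n-gram count at its maximum count in any reference.
    max_ref = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
    total = sum(cand_counts.values())
    return (clipped + 1) / (total + 1)

cand = "police kill the gunman".split()
refs = ["police killed the gunman".split()]
print(smoothed_modified_precision(cand, refs, 2))  # smoothed bigram precision
```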

Page 22: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Evaluating BLEU with CC (correlation coefficients)

Page 23: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Evaluating ROUGE-L/W with ORANGE

Page 24: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Evaluating ROUGE-S with ORANGE

Page 25: Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Summary of metrics with ORANGE