June 2004, DARPA TIDES MT Workshop
Measuring Confidence Intervals for MT Evaluation Metrics
Ying Zhang, Stephan Vogel
Language Technologies Institute, Carnegie Mellon University
MT Evaluation Metrics
• Human Evaluations (LDC)
– Fluency and Adequacy
• Automatic Evaluation Metrics
– mWER: edit distance between the hypothesis and the closest reference translation
– mPER: position-independent error rate
– BLEU: BLEU = BP · exp( Σ_{n=1..N} w_n · log p_n )
– Modified BLEU: M-BLEU = BP · exp( Σ_{n=1..N} w_n · log p_n ), with modified n-gram precisions p_n
– NIST: NIST = BP · Σ_{n=1..N} [ Σ_{all w_1…w_n that co-occur} Info(w_1…w_n) / Σ_{all w_1…w_n in hyp} 1 ]
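A minimal numeric sketch of the BLEU formula (the n-gram precisions below are made-up values, not from the talk; uniform weights w_n = 1/N and the standard brevity penalty are assumed):

```python
import math

def bleu(p, hyp_len, ref_len):
    """BLEU = BP * exp(sum_n w_n * log p_n), with uniform weights w_n = 1/N."""
    N = len(p)
    # Brevity penalty: 1 if the hypothesis is longer than the closest reference
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return bp * math.exp(sum(math.log(pn) for pn in p) / N)

# Hypothetical 1..4-gram precisions, for illustration only
score = bleu([0.82, 0.55, 0.38, 0.26], hyp_len=56, ref_len=56)
```

With equal hypothesis and reference lengths the brevity penalty is 1 and the score is just the geometric mean of the n-gram precisions.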
Measuring the Confidence Intervals
• One score per test set
• How accurate is this score?
• To measure the confidence interval a population is required
• Building a test set with multiple human reference translations is expensive
• Bootstrapping (Efron 1986)– Introduced in 1979 as a computer-based method for estimating
the standard errors of a statistical estimation
– Resampling: creating an artificial population by sampling with replacement
– Proposed by Franz Och (2003) to measure the confidence intervals for automatic MT evaluation metrics
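The resampling step can be sketched as a percentile bootstrap over per-sentence scores (an illustrative sketch, not the authors' implementation; function and variable names are hypothetical):

```python
import random

def bootstrap_interval(scores, B=2000, alpha=0.05, rng=None):
    """Percentile confidence interval for the mean of per-sentence scores.

    Each bootstrap replicate is an artificial test set of the same size,
    drawn from the original sentences with replacement."""
    rng = rng or random.Random(0)
    n = len(scores)
    means = sorted(sum(rng.choice(scores) for _ in range(n)) / n
                   for _ in range(B))
    return means[int(B * alpha / 2)], means[int(B * (1 - alpha / 2)) - 1]
```

With B=2000 and alpha=0.05 this reads off roughly the 2.5th and 97.5th percentiles of the sorted replicate means.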
A Schematic of the Bootstrapping Process
[Figure: schematic of the bootstrapping process (Score_0 = score on the original test set)]
An Efficient Implementation
• Translate and evaluate on 2,000 test sets?– No Way!
• Resample the n-gram precision information for the sentences– Most MT systems are context independent at the sentence level;– MT evaluation metrics are based on information collected for each testing
sentences– E.g. for BLEU and NIST
RefLen: 61 52 56 59ClosestRefLen 561-gram: 56 46 428.41
– Similar for human judgment and other MT metrics
• Approximation for NIST information gain• Scripts available at: http://projectile.is.cs.cmu.edu
/research/public/tools/bootStrap/tutorial.htm
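The trick can be sketched as follows: compute per-sentence sufficient statistics once, then each bootstrap replicate only re-aggregates cached counts (hypothetical data layout; the actual scripts differ):

```python
import math
import random

def corpus_bleu(stats):
    """stats: per-sentence tuples
    (matched_counts[4], total_counts[4], hyp_len, closest_ref_len)."""
    matched, total = [0] * 4, [0] * 4
    hyp_len = ref_len = 0
    for m, t, hl, rl in stats:
        for n in range(4):
            matched[n] += m[n]
            total[n] += t[n]
        hyp_len += hl
        ref_len += rl
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / max(hyp_len, 1))
    log_p = sum(math.log(max(matched[n], 1e-9) / total[n]) for n in range(4)) / 4
    return bp * math.exp(log_p)

def bootstrap_bleu(stats, B=2000, rng=None):
    """Resample sentences (no re-translation needed) and re-aggregate counts."""
    rng = rng or random.Random(0)
    n = len(stats)
    return [corpus_bleu([rng.choice(stats) for _ in range(n)]) for _ in range(B)]
```

Because only cached counts are summed, each of the B replicates costs a linear pass over the sentences rather than a new translation run.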
Confidence Intervals
• 7 MT systems from the June 2002 evaluation
• Observations:
– Relative confidence interval: NIST < M-BLEU < BLEU
– I.e., NIST scores have more discriminative power than BLEU
Are Two MT Systems Different?
• Comparing two MT systems' performance
– Using the same method as for a single system
– E.g. Diff(Sys1-Sys2): Median = -1.7355, interval [-1.9056, -1.5453]
– If the confidence interval overlaps 0, the two systems are not significantly different
– M-BLEU and NIST have more discriminative power than BLEU
– Automatic metrics correlate highly with the human ranking
– Human judges prefer system E (syntactic system) over B (statistical system), but automatic metrics do not
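Such a two-system comparison can be sketched as a paired bootstrap (illustrative; the same sentence indices are resampled for both systems so the per-sentence differences stay paired):

```python
import random

def paired_diff_interval(scores1, scores2, B=2000, alpha=0.05, rng=None):
    """CI for mean(scores1) - mean(scores2), resampling the SAME sentence
    indices for both systems. If the interval contains 0, the difference
    is not significant at this level."""
    assert len(scores1) == len(scores2)
    rng = rng or random.Random(0)
    n = len(scores1)
    diffs = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores1[i] - scores2[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(B * alpha / 2)], diffs[int(B * (1 - alpha / 2)) - 1]
```

Resampling shared indices keeps the comparison paired, which is what makes the difference interval tighter than comparing two independent intervals.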
How much testing data is needed
[Figures: NIST scores, BLEU scores, M-BLEU scores, and F+A human judgments for systems A-G as a function of testing data size (10%-100%)]
How much testing data is needed
• NIST scores increase steadily with growing test set size
• The distance between the scores of the different systems remains stable when using 40% or more of the test set
• The confidence intervals become narrower for larger test sets
* System A (bootstrap size B=2000)
How many reference translations are sufficient?
• Confidence intervals become narrower with more reference translations
• 100% of the data (1 ref) ~ 80-90% (2 refs) ~ 70-80% (3 refs) ~ 60-70% (4 refs)
• One additional reference translation compensates for 10-15% of testing data
* System A (bootstrap size B=2000)
Bootstrap-t interval vs. normal/t interval
• Normal distribution / t-distribution
– Assuming Z = (θ̂ − θ) / ŝe ~ N(0, 1), the interval is [ θ̂ − z^(1−α) · ŝe, θ̂ − z^(α) · ŝe ]
• Student's t-interval (when n is small)
– Assuming Z = (θ̂ − θ) / ŝe ~ t_{n−1}, the interval is [ θ̂ − t_{n−1}^(1−α) · ŝe, θ̂ − t_{n−1}^(α) · ŝe ]
• Bootstrap-t interval
– For each bootstrap sample b, calculate Z*(b) = ( θ̂*(b) − θ̂ ) / ŝe*(b)
– The α-th percentile is estimated by the value t̂^(α) such that #{ Z*(b) ≤ t̂^(α) } / B = α
– The bootstrap-t interval is [ θ̂ − t̂^(1−α) · ŝe, θ̂ − t̂^(α) · ŝe ]
– E.g. if B=1000, the 50th largest value and the 950th largest value give the bootstrap-t interval
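A sketch of the bootstrap-t procedure for the mean of per-sentence scores (illustrative; here the replicate standard error ŝe*(b) is the plug-in standard error computed on each resample):

```python
import math
import random

def bootstrap_t_interval(x, B=1000, alpha=0.05, rng=None):
    """Bootstrap-t: studentize each replicate with its own standard error,
    take the alpha and (1 - alpha) percentiles of Z*(b), then invert
    around the original estimate."""
    rng = rng or random.Random(0)
    n = len(x)
    mean = sum(x) / n
    se = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1) / n)
    z = []
    for _ in range(B):
        xb = [rng.choice(x) for _ in range(n)]
        mb = sum(xb) / n
        sb = math.sqrt(sum((v - mb) ** 2 for v in xb) / (n - 1) / n)
        if sb > 0:  # skip degenerate resamples
            z.append((mb - mean) / sb)
    z.sort()
    # With B=1000 and alpha=0.05 these are roughly the 50th and 950th
    # sorted values, as on the slide
    t_lo, t_hi = z[int(len(z) * alpha)], z[int(len(z) * (1 - alpha)) - 1]
    return mean - t_hi * se, mean - t_lo * se
```

Note the inversion: the upper percentile of Z* produces the lower interval endpoint and vice versa.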
Bootstrap-t interval vs. Normal/t interval (Cont.)
• The bootstrap-t interval assumes no particular distribution, but
– It can give erratic results
– It can be heavily influenced by a few outlying data points
• When B is large, the bootstrap sample scores are quite close to a normal distribution
• Assuming a normal distribution gives more reliable intervals, e.g. for BLEU's relative confidence interval (B=500):
– STDEV = 0.27 for the bootstrap-t interval
– STDEV = 0.14 for the normal/Student's t interval
[Figure: histogram of 2,000 bootstrap BLEU scores (frequency vs. BLEU score)]
The Number of Bootstrap Replications B
• The ideal bootstrap estimate of the confidence interval takes B → ∞
• Computational time increases linearly with B
• The greater B is, the smaller the standard deviation of the estimated confidence intervals. E.g. for BLEU's relative confidence interval:
– STDEV = 0.60 when B=100
– STDEV = 0.27 when B=500
• Two rules of thumb:
– Even a small B, say B=100, is usually informative
– B > 1000 gives quite satisfactory results
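The effect of B can be illustrated by repeating the whole bootstrap several times at each B and measuring how much an interval endpoint wanders (a simulation sketch with fake scores, not the paper's numbers):

```python
import random
import statistics

def ci_upper(scores, B, rng):
    """Upper endpoint of a 95% percentile interval for the mean."""
    n = len(scores)
    means = sorted(sum(rng.choice(scores) for _ in range(n)) / n
                   for _ in range(B))
    return means[int(B * 0.975) - 1]

rng = random.Random(0)
# Fake per-sentence scores, for illustration only
scores = [rng.gauss(0.25, 0.05) for _ in range(80)]

# Repeat the whole bootstrap at each B; the endpoint's Monte Carlo
# noise shrinks roughly like 1/sqrt(B)
spread = {B: statistics.stdev(ci_upper(scores, B, rng) for _ in range(25))
          for B in (100, 1000)}
```

The residual variability at large B reflects the data itself; only the Monte Carlo component of the error vanishes as B grows.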
References
• Efron, B. and Tibshirani, R.: 1986, 'Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy', Statistical Science 1, pp. 54-77.
• Och, F. J.: 2003, 'Minimum Error Rate Training in Statistical Machine Translation', In Proc. of ACL, Sapporo, Japan.
• Bisani, M. and Ney, H.: 2004, 'Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation', In Proc. of ICASSP, Montreal, Canada, Vol. 1, pp. 409-412.
• Leusch, G., Ueffing, N. and Ney, H.: 2003, 'A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA.
• Melamed, I. D., Green, R. and Turian, J. P.: 2003, 'Precision and Recall of Machine Translation', In Proc. of NAACL/HLT 2003, Edmonton, Canada.
• King, M., Popescu-Belis, A. and Hovy, E.: 2003, 'FEMTI: Creating and Using a Framework for MT Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA, USA.
• Nießen, S., Och, F. J., Leusch, G. and Ney, H.: 2000, 'An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research', In Proc. of LREC 2000, Athens, Greece.
• NIST Report: 2002, 'Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics', http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf
• Papineni, K., Roukos, S., et al.: 2002, 'BLEU: A Method for Automatic Evaluation of Machine Translation', In Proc. of the 40th ACL.
• Zhang, Y., Vogel, S. and Waibel, A.: 2004, 'Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?', In Proc. of LREC 2004, Lisbon, Portugal.