BLEU, Its Variants & Its Critics

Arthur Chan

Prepared for Advanced MT Seminar

This Talk

- Original BLEU score (Papineni 2002): motivation and procedure
- NIST: as a major BLEU variant
- Critics of BLEU:
  - From alternate evaluation metrics: METEOR (Lavie 2004, Banerjee 2005)
  - From analysis of BLEU (Culy 2002)
- METEOR will be covered by Alon (next talk)

Bilingual Evaluation Understudy (BLEU)

Motivation of Automatic Evaluation in MT

- Human evaluations of MT weigh many aspects, such as adequacy, fidelity, and fluency
- Human evaluation is expensive
- Human evaluation can take a long time, while systems need to change daily
- Good automatic evaluation could save human effort

BLEU – Why is it Important?

Some reasons:

- It was proposed by IBM, which has a long history of proposing evaluation standards
- It was verified and improved by NIST, so a variant of it is used in official evaluations
- It is widely used: it appears everywhere in the MT literature after 2001
- It is quite useful: it does give good feedback on the adequacy and fluency of translation results
- It is not perfect:
  - It is a subject of criticism (and the critics make some sense in this case)
  - It is a subject of extension

BLEU – Its Motivation

Central idea: “The closer a machine translation is to a professional human translation, the better it is.”

Implication: an evaluation metric can itself be evaluated; if it correlates well with human evaluation, it is a useful metric.

BLEU was proposed as an aid: a quick substitute for human evaluators when needed.

BLEU – What is it? A Big Picture

- Requires multiple good reference translations
- Depends on modified n-gram precision (or co-occurrence)
  - Co-occurrence: an n-gram in the translated sentence scores a hit if it appears in any reference sentence
  - Per-corpus n-gram co-occurrence is computed
  - n can take several values, and a weighted sum of the log precisions is computed (as shown below)
- Brevity of translation is penalized
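
Putting these pieces together, Papineni (2002) combines the per-corpus modified n-gram precisions p_n (typically up to N = 4, with uniform weights w_n = 1/N) as a weighted geometric mean, scaled by the brevity penalty BP:

    \mathrm{BLEU} = \mathit{BP} \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big)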

BLEU – N-gram Precision: a Motivating Example

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.

Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.
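
Below is a minimal sketch of modified (clipped) unigram precision on this example. It assumes naive tokenization (lowercase, periods stripped) rather than a real BLEU tokenizer, and the function names are illustrative; it reproduces the counts reported in the paper: 17/18 for Candidate 1 versus only 8/14 for Candidate 2.

    from collections import Counter

    def ngram_counts(text, n):
        # Naive tokenization: lowercase, strip periods, then count n-grams.
        tokens = text.lower().replace(".", "").split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def modified_precision(candidate, references, n=1):
        # Each candidate n-gram is credited at most as many times as it
        # occurs in the single reference where it is most frequent.
        cand = ngram_counts(candidate, n)
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        return clipped / sum(cand.values())

    refs = [
        "It is a guide to action that ensures that the military will forever heed Party commands.",
        "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
        "It is the practical guide for the army always to heed the directions of the party.",
    ]
    cand1 = ("It is a guide to action which ensures that the military "
             "always obeys the commands of the party.")
    cand2 = "It is to insure the troops forever hearing the activity guidebook that party direct."

    print(modified_precision(cand1, refs))  # 17/18 = 0.944...
    print(modified_precision(cand2, refs))  # 8/14  = 0.571...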

BLEU – Modified N-gram Precision

Issue with plain n-gram precision: it gives a very good score to over-generated n-grams. A candidate that simply repeats a common reference word, such as “the the the the the the the”, reaches unigram precision 7/7. Modified precision fixes this by clipping each n-gram’s credit at the maximum number of times it occurs in any single reference, as the sketch below shows.
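
This is the classic illustration from Papineni (2002), here as a standalone sketch: plain precision rewards pathological over-generation, while clipping does not.

    from collections import Counter

    candidate = "the the the the the the the".split()
    reference = "the cat is on the mat".split()

    cand_counts = Counter(candidate)  # {'the': 7}
    ref_counts = Counter(reference)   # 'the' occurs only twice

    # Plain precision: every candidate token occurs somewhere in the reference.
    plain = sum(c for w, c in cand_counts.items() if w in ref_counts) / len(candidate)

    # Modified precision: clip each token's credit at its reference count.
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items()) / len(candidate)

    print(plain)    # 1.0 (7/7) -- over-generation scores perfectly
    print(clipped)  # 0.2857... (2/7)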

BLEU – Brevity Penalty
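
For reference, the brevity penalty of Papineni (2002) compares the total candidate length c with the effective reference length r; it is 1 when the candidate is longer than the references and decays exponentially when it is shorter:

    \mathit{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}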

BLEU – The “Trouble” with Recall

BLEU – Recall and Brevity Penalty

BLEU – Paradigm of Evaluation

BLEU – Evaluation of the Metric

BLEU – The Human Evaluation

BLEU – BLEU vs Human Evaluation

NIST – As a BLEU Variant

Usage of BLEU on Character-based Languages

Critics of BLEU – From Analysis of BLEU

Critics of BLEU – A Glance at Metrics Beyond BLEU

Critics of BLEU – Summary of BLEU’s Issues

Discussion - Should BLEU be the Standard Metric of MT?

References

Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL 2002.

George Doddington. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In HLT 2002.

Etienne Denoual and Yves Lepage. BLEU in Characters: Towards Automatic MT Evaluation in Languages without Word Delimiters. In IJCNLP 2005.

Alon Lavie, Kenji Sagae and Shyamsundar Jayaraman. The Significance of Recall in Automatic Metrics for MT Evaluation. In AMTA 2004.

Christopher Culy and Susanne Z. Riehemann. The Limits of N-Gram Translation Evaluation Metrics. In MT Summit IX, 2003.

Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.