62
Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions Translation Quality Assessment: Evaluation and Estimation Lucia Specia University of Sheffield [email protected] 9 September 2013 Translation Quality Assessment: Evaluation and Estimation 1 / 33

Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Translation Quality Assessment:

Evaluation and Estimation

Lucia Specia

University of [email protected]

9 September 2013

Translation Quality Assessment: Evaluation and Estimation 1 / 33

Page 2: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

“MT Evaluation is better understood than MT” (Carbonelland Wilks, 1991)

Translation Quality Assessment: Evaluation and Estimation 2 / 33

Page 3: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Outline

1 Translation Quality

2 QTLaunchPad Project

3 Quality Estimation

4 State of the art in QE

5 Open issues

6 Conclusions

Translation Quality Assessment: Evaluation and Estimation 3 / 33

Page 4: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Outline

1 Translation Quality

2 QTLaunchPad Project

3 Quality Estimation

4 State of the art in QE

5 Open issues

6 Conclusions

Translation Quality Assessment: Evaluation and Estimation 4 / 33

Page 5: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

What does quality mean?Fluent?Adequate?Easy to post-edit?

Quality for whom?End-userMT-system (tuning)Post-editorOther applications (e.g. CLIR)

Quality for what?Internal communicationsDissemination (publishing)Gisting (Google Translate)Draft translations (light vs heavy post-editing)MT system improvement (diagnosis)

Translation Quality Assessment: Evaluation and Estimation 5 / 33

Page 6: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

What does quality mean?Fluent?Adequate?Easy to post-edit?

Quality for whom?End-userMT-system (tuning)Post-editorOther applications (e.g. CLIR)

Quality for what?Internal communicationsDissemination (publishing)Gisting (Google Translate)Draft translations (light vs heavy post-editing)MT system improvement (diagnosis)

Translation Quality Assessment: Evaluation and Estimation 5 / 33

Page 7: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

What does quality mean?Fluent?Adequate?Easy to post-edit?

Quality for whom?End-userMT-system (tuning)Post-editorOther applications (e.g. CLIR)

Quality for what?Internal communicationsDissemination (publishing)Gisting (Google Translate)Draft translations (light vs heavy post-editing)MT system improvement (diagnosis)

Translation Quality Assessment: Evaluation and Estimation 5 / 33

Page 8: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

ref Do not buy this product, it’s their craziest invention!

sys Do buy this product, it’s their craziest invention!

Severe if end-user does not speak source language

Trivial to post-edit by translators

ref The battery lasts 6 hours and it can be fully rechargedin 30 minutes.

sys Six-hours battery, 30 minutes to full charge last.

Ok for gisting - meaning preserved

Very costly for post-editing if style is to be preserved

Translation Quality Assessment: Evaluation and Estimation 6 / 33

Page 9: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

ref Do not buy this product, it’s their craziest invention!

sys Do buy this product, it’s their craziest invention!

Severe if end-user does not speak source language

Trivial to post-edit by translators

ref The battery lasts 6 hours and it can be fully rechargedin 30 minutes.

sys Six-hours battery, 30 minutes to full charge last.

Ok for gisting - meaning preserved

Very costly for post-editing if style is to be preserved

Translation Quality Assessment: Evaluation and Estimation 6 / 33

Page 10: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

ref Do not buy this product, it’s their craziest invention!

sys Do buy this product, it’s their craziest invention!

Severe if end-user does not speak source language

Trivial to post-edit by translators

ref The battery lasts 6 hours and it can be fully rechargedin 30 minutes.

sys Six-hours battery, 30 minutes to full charge last.

Ok for gisting - meaning preserved

Very costly for post-editing if style is to be preserved

Translation Quality Assessment: Evaluation and Estimation 6 / 33

Page 11: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

ref Do not buy this product, it’s their craziest invention!

sys Do buy this product, it’s their craziest invention!

Severe if end-user does not speak source language

Trivial to post-edit by translators

ref The battery lasts 6 hours and it can be fully rechargedin 30 minutes.

sys Six-hours battery, 30 minutes to full charge last.

Ok for gisting - meaning preserved

Very costly for post-editing if style is to be preserved

Translation Quality Assessment: Evaluation and Estimation 6 / 33

Page 12: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

How do we measure quality?

Human metrics: error counts (which?), ranking,acceptability, 1-N fluency/adequacy

Automatic metrics based on human references: (BLEU,METEOR, TER, etc.

Semi-automatic metrics based on post-editions: HTER,PE time, eye-tracking, etc.

Automatic metrics without references: qualityestimation

Translation Quality Assessment: Evaluation and Estimation 7 / 33

Page 13: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

How do we measure quality?

Human metrics: error counts (which?), ranking,acceptability, 1-N fluency/adequacy

Automatic metrics based on human references: (BLEU,METEOR, TER, etc.

Semi-automatic metrics based on post-editions: HTER,PE time, eye-tracking, etc.

Automatic metrics without references: qualityestimation

Translation Quality Assessment: Evaluation and Estimation 7 / 33

Page 14: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Outline

1 Translation Quality

2 QTLaunchPad Project

3 Quality Estimation

4 State of the art in QE

5 Open issues

6 Conclusions

Translation Quality Assessment: Evaluation and Estimation 8 / 33

Page 15: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

QTLaunchPad project

http://www.qt21.eu/launchpad/

Multidimensional Quality Metrics (MQM) based on aspecification

Machine and human translation quality

Manual and (semi-)automatic assessment

Takes quality of source text into account

MT system improvement, gisting, dissemination, etc.

Translation Quality Assessment: Evaluation and Estimation 9 / 33

Page 16: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Multidimensional Quality Metrics (MQM)

Issues selected based on a given specification (dimensions):

Language/locale

Subject field/domain

Terminology

Text Type

Audience

Purpose

Register

Style

Content correspondence

Output modality, ...

Translation Quality Assessment: Evaluation and Estimation 10 / 33

Page 17: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Multidimensional Quality Metrics (MQM)

Issue types (core):

Translation Quality Assessment: Evaluation and Estimation 11 / 33

Page 18: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Multidimensional Quality Metrics (MQM)

Issue types: http://www.qt21.eu/launchpad/content/

high-level-structure-0

Combining issue types:TQ = 100− AccP − (FluPT − FluPS)− (VerPT − VerPS)

Translation Quality Assessment: Evaluation and Estimation 12 / 33

Page 19: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Outline

1 Translation Quality

2 QTLaunchPad Project

3 Quality Estimation

4 State of the art in QE

5 Open issues

6 Conclusions

Translation Quality Assessment: Evaluation and Estimation 13 / 33

Page 20: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

Quality estimation (QE): metrics that provide anestimate on the quality of unseen translations

Quality defined by the data

Quality = Can we publish it as is?

Quality = Can a reader get the gist?

Quality = Is it worth post-editing it?

Quality = How much effort to fix it?

Quality = What’s this translation’s MQM score?

Translation Quality Assessment: Evaluation and Estimation 14 / 33

Page 21: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

Quality estimation (QE): metrics that provide anestimate on the quality of unseen translations

Quality defined by the data

Quality = Can we publish it as is?

Quality = Can a reader get the gist?

Quality = Is it worth post-editing it?

Quality = How much effort to fix it?

Quality = What’s this translation’s MQM score?

Translation Quality Assessment: Evaluation and Estimation 14 / 33

Page 22: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

Quality estimation (QE): metrics that provide anestimate on the quality of unseen translations

Quality defined by the data

Quality = Can we publish it as is?

Quality = Can a reader get the gist?

Quality = Is it worth post-editing it?

Quality = How much effort to fix it?

Quality = What’s this translation’s MQM score?

Translation Quality Assessment: Evaluation and Estimation 14 / 33

Page 23: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

Quality estimation (QE): metrics that provide anestimate on the quality of unseen translations

Quality defined by the data

Quality = Can we publish it as is?

Quality = Can a reader get the gist?

Quality = Is it worth post-editing it?

Quality = How much effort to fix it?

Quality = What’s this translation’s MQM score?

Translation Quality Assessment: Evaluation and Estimation 14 / 33

Page 24: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

Quality estimation (QE): metrics that provide anestimate on the quality of unseen translations

Quality defined by the data

Quality = Can we publish it as is?

Quality = Can a reader get the gist?

Quality = Is it worth post-editing it?

Quality = How much effort to fix it?

Quality = What’s this translation’s MQM score?

Translation Quality Assessment: Evaluation and Estimation 14 / 33

Page 25: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

Quality estimation (QE): metrics that provide anestimate on the quality of unseen translations

Quality defined by the data

Quality = Can we publish it as is?

Quality = Can a reader get the gist?

Quality = Is it worth post-editing it?

Quality = How much effort to fix it?

Quality = What’s this translation’s MQM score?

Translation Quality Assessment: Evaluation and Estimation 14 / 33

Page 26: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Overview

Quality estimation (QE): metrics that provide anestimate on the quality of unseen translations

Quality defined by the data

Quality = Can we publish it as is?

Quality = Can a reader get the gist?

Quality = Is it worth post-editing it?

Quality = How much effort to fix it?

Quality = What’s this translation’s MQM score?

Translation Quality Assessment: Evaluation and Estimation 14 / 33

Page 27: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Framework

MachineLearning

X: examples of source &

translations

QE modelY: Quality scores for

examples in X

Feature extraction

Features

Translation Quality Assessment: Evaluation and Estimation 15 / 33

Page 28: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Framework

MT systemTranslation

for xt'

QE modelQuality score

y'

Features

Feature extraction

SourceText x

s'

Translation Quality Assessment: Evaluation and Estimation 15 / 33

Page 29: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Framework

Main components to build a QE system:

1 Definition of quality: what to predict

2 (Human) labelled data (for quality)

3 Features

4 Machine learning algorithm

Translation Quality Assessment: Evaluation and Estimation 16 / 33

Page 30: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Definition of quality

Predict 1-N absolute scores for adequacy/fluency

Predict 1-N absolute scores for post-editing effort

Predict average post-editing time per word

Predict relative rankings

Predict relative rankings for same source

Predict percentage of edits needed for sentence

Predict word-level edits and its types

Predict BLEU, etc. scores for document

Translation Quality Assessment: Evaluation and Estimation 17 / 33

Page 31: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Datasets

SHEF (several): http://staffwww.dcs.shef.ac.uk/

people/L.Specia/resources.html

LIG (10K, fr-en): http://www-clips.imag.fr/geod/

User/marion.potet/index.php?page=download

LMSI (14K, fr-en, en-fr, 2 post-editors):http://web.limsi.fr/Individu/wisniews/

recherche/index.html

Translation Quality Assessment: Evaluation and Estimation 18 / 33

Page 32: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Features

Source text TranslationMT system

Confidence indicators

Complexity indicators

Fluency indicators

Adequacyindicators

Translation Quality Assessment: Evaluation and Estimation 19 / 33

Page 33: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

QuEst

Goal: framework to explore features for QE

Feature extractors for 150+ features of all types: Java

Machine learning: Gaussian Processes & scikit-learntoolkit (Python), with wrappers for a number ofalgorithms, grid search, feature selection

Open source:http://www.quest.dcs.shef.ac.uk/

Translation Quality Assessment: Evaluation and Estimation 20 / 33

Page 34: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Outline

1 Translation Quality

2 QTLaunchPad Project

3 Quality Estimation

4 State of the art in QE

5 Open issues

6 Conclusions

Translation Quality Assessment: Evaluation and Estimation 21 / 33

Page 35: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Shared Task

WMT12-13 – with Radu Soricut & Christian Buck

Sentence- and word-level estimation of PE effortDatasets and language pairs:

Quality Year Languages1-5 subjective scores WMT12 en-esRanking all sentences best-worst WMT12/13 en-esHTER scores WMT13 en-esPost-editing time WMT13 en-esWord-level edits: change/keep WMT13 en-esWord-level edits: keep/delete/replace WMT13 en-esRanking 5 MTs per source WMT13 en-es; de-en

Evaluation metric:

MAE =

∑Ni=1 |H(si)− V (si)|

N

Translation Quality Assessment: Evaluation and Estimation 22 / 33

Page 36: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Shared Task

WMT12-13 – with Radu Soricut & Christian BuckSentence- and word-level estimation of PE effort

Datasets and language pairs:

Quality Year Languages1-5 subjective scores WMT12 en-esRanking all sentences best-worst WMT12/13 en-esHTER scores WMT13 en-esPost-editing time WMT13 en-esWord-level edits: change/keep WMT13 en-esWord-level edits: keep/delete/replace WMT13 en-esRanking 5 MTs per source WMT13 en-es; de-en

Evaluation metric:

MAE =

∑Ni=1 |H(si)− V (si)|

N

Translation Quality Assessment: Evaluation and Estimation 22 / 33

Page 37: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Shared Task

WMT12-13 – with Radu Soricut & Christian BuckSentence- and word-level estimation of PE effortDatasets and language pairs:

Quality Year Languages1-5 subjective scores WMT12 en-esRanking all sentences best-worst WMT12/13 en-esHTER scores WMT13 en-esPost-editing time WMT13 en-esWord-level edits: change/keep WMT13 en-esWord-level edits: keep/delete/replace WMT13 en-esRanking 5 MTs per source WMT13 en-es; de-en

Evaluation metric:

MAE =

∑Ni=1 |H(si)− V (si)|

N

Translation Quality Assessment: Evaluation and Estimation 22 / 33

Page 38: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Shared Task

WMT12-13 – with Radu Soricut & Christian BuckSentence- and word-level estimation of PE effortDatasets and language pairs:

Quality Year Languages1-5 subjective scores WMT12 en-esRanking all sentences best-worst WMT12/13 en-esHTER scores WMT13 en-esPost-editing time WMT13 en-esWord-level edits: change/keep WMT13 en-esWord-level edits: keep/delete/replace WMT13 en-esRanking 5 MTs per source WMT13 en-es; de-en

Evaluation metric:

MAE =

∑Ni=1 |H(si)− V (si)|

NTranslation Quality Assessment: Evaluation and Estimation 22 / 33

Page 39: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Baseline system

Features:

number of tokens in the source and target sentences

average source token length

average number of occurrences of words in the target

number of punctuation marks in source and target sentences

LM probability of source and target sentences

average number of translations per source word

% of source 1-grams, 2-grams and 3-grams in frequencyquartiles 1 and 4

% of seen source unigrams

SVM regression with RBF kernel with the parameters γ, ε and Coptimised using a grid-search and 5-fold cross validation on thetraining set

Translation Quality Assessment: Evaluation and Estimation 23 / 33

Page 40: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Baseline system

Features:

number of tokens in the source and target sentences

average source token length

average number of occurrences of words in the target

number of punctuation marks in source and target sentences

LM probability of source and target sentences

average number of translations per source word

% of source 1-grams, 2-grams and 3-grams in frequencyquartiles 1 and 4

% of seen source unigrams

SVM regression with RBF kernel with the parameters γ, ε and Coptimised using a grid-search and 5-fold cross validation on thetraining set

Translation Quality Assessment: Evaluation and Estimation 23 / 33

Page 41: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Results - scoring sub-task (WMT12)

System ID MAE RMSE• SDLLW M5PbestDeltaAvg 0.61 0.75

UU best 0.64 0.79SDLLW SVM 0.64 0.78

UU bltk 0.64 0.79Loria SVMlinear 0.68 0.82

UEdin 0.68 0.82TCD M5P-resources-only* 0.68 0.82

Baseline bb17 SVR 0.69 0.82Loria SVMrbf 0.69 0.83

SJTU 0.69 0.83WLV-SHEF FS 0.69 0.85

PRHLT-UPV 0.70 0.85WLV-SHEF BL 0.72 0.86

DCU-SYMC unconstrained 0.75 0.97DFKI grcfs-mars 0.82 0.98DFKI cfs-plsreg 0.82 0.99

UPC 1 0.84 1.01DCU-SYMC constrained 0.86 1.12

UPC 2 0.87 1.04TCD M5P-all 2.09 2.32

Translation Quality Assessment: Evaluation and Estimation 24 / 33

Page 42: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Results - scoring sub-task (WMT13)

System ID MAE RMSE• SHEF FS 12.42 15.74

SHEF FS-AL 13.02 17.03CNGL SVRPLS 13.26 16.82

LIMSI 13.32 17.22DCU-SYMC combine 13.45 16.64DCU-SYMC alltypes 13.51 17.14

CMU noB 13.84 17.46CNGL SVR 13.85 17.28

FBK-UEdin extra 14.38 17.68FBK-UEdin rand-svr 14.50 17.73

LORIA inctrain 14.79 18.34Baseline bb17 SVR 14.81 18.22

TCD-CNGL open 14.81 19.00LORIA inctraincont 14.83 18.17

TCD-CNGL restricted 15.20 19.59CMU full 15.25 18.97

UMAC 16.97 21.94

Translation Quality Assessment: Evaluation and Estimation 25 / 33

Page 43: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Outline

1 Translation Quality

2 QTLaunchPad Project

3 Quality Estimation

4 State of the art in QE

5 Open issues

6 Conclusions

Translation Quality Assessment: Evaluation and Estimation 26 / 33

Page 44: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Agreement between annotators

Absolute value judgements: difficult to achieveconsistency even in highly controlled settings

30% of initial dataset discardedRemaining annotations had to be scaled

More objective absolute scores

Post-editing time, HTER, editsAlso subject to huge variance (WPTP 2013, Wisniewskiet. al, MT-Summit 2013)Multi-task learning to address this variance (Cohn andSpecia, ACL 2013)

Relative scores

Different task altogetherWMT13: better results than reference-based metrics

Translation Quality Assessment: Evaluation and Estimation 27 / 33

Page 45: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Agreement between annotators

Absolute value judgements: difficult to achieveconsistency even in highly controlled settings

30% of initial dataset discardedRemaining annotations had to be scaled

More objective absolute scores

Post-editing time, HTER, editsAlso subject to huge variance (WPTP 2013, Wisniewskiet. al, MT-Summit 2013)Multi-task learning to address this variance (Cohn andSpecia, ACL 2013)

Relative scores

Different task altogetherWMT13: better results than reference-based metrics

Translation Quality Assessment: Evaluation and Estimation 27 / 33

Page 46: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Agreement between annotators

Absolute value judgements: difficult to achieveconsistency even in highly controlled settings

30% of initial dataset discardedRemaining annotations had to be scaled

More objective absolute scores

Post-editing time, HTER, editsAlso subject to huge variance (WPTP 2013, Wisniewskiet. al, MT-Summit 2013)Multi-task learning to address this variance (Cohn andSpecia, ACL 2013)

Relative scores

Different task altogetherWMT13: better results than reference-based metrics

Translation Quality Assessment: Evaluation and Estimation 27 / 33

Page 47: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Annotation costs

Active learning to select subset of instances to be annotated(Beck et al., ACL 2013)

Translation Quality Assessment: Evaluation and Estimation 28 / 33

Page 48: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Curse of dimensionality

Feature selection to identify relevant info for dataset (Shahet al., MT Summit 2013)

0.68

21'

0.48

57'

0.44

01'

0.5313'

0.4614' 0.

5339'

0.35

91'

0.5462'

0.554'

0.67

17'

0.47

19'

0.4292

'

0.52

65'

0.4741' 0.5437'

0.35

78'

0.5399'

0.5401'

0.629'

0.449'

0.4183'

0.5123'

0.44

93' 0.51

13'

0.34

01'

0.5301'

0.5401'

0.63

24'

0.4471'

0.41

69'

0.51

09'

0.46

09' 0.5309'

0.34

09'

0.52

49'

0.52

49'

0.61

31'

0.43

2'

0.41

1'

0.5025'

0.441'

0.506'

0.33

7'

0.52

1'

0.51

94'

0.3'

0.35'

0.4'

0.45'

0.5'

0.55'

0.6'

0.65'

0.7'

WMT12'

EAMT211'(en2es)'

EAMT211'(fr2en)'

EAMT2092s1'

EAMT2092s2'

EAMT2092s3'

EAMT2092s4'

GALE112s1'

GALE112s2''

BL' AF' BL+PR' AF+PR' FS'

Common feature set identified, but nuanced subsets forspecific datasets

Translation Quality Assessment: Evaluation and Estimation 29 / 33

Page 49: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Curse of dimensionality

Feature selection to identify relevant info for dataset (Shahet al., MT Summit 2013)

0.68

21'

0.48

57'

0.44

01'

0.5313'

0.4614' 0.

5339'

0.35

91'

0.5462'

0.554'

0.67

17'

0.47

19'

0.4292

'

0.52

65'

0.4741' 0.5437'

0.35

78'

0.5399'

0.5401'

0.629'

0.449'

0.4183'

0.5123'

0.44

93' 0.51

13'

0.34

01'

0.5301'

0.5401'

0.63

24'

0.4471'

0.41

69'

0.51

09'

0.46

09' 0.5309'

0.34

09'

0.52

49'

0.52

49'

0.61

31'

0.43

2'

0.41

1'

0.5025'

0.441'

0.506'

0.33

7'

0.52

1'

0.51

94'

0.3'

0.35'

0.4'

0.45'

0.5'

0.55'

0.6'

0.65'

0.7'

WMT12'

EAMT211'(en2es)'

EAMT211'(fr2en)'

EAMT2092s1'

EAMT2092s2'

EAMT2092s3'

EAMT2092s4'

GALE112s1'

GALE112s2''

BL' AF' BL+PR' AF+PR' FS'

Common feature set identified, but nuanced subsets forspecific datasets

Translation Quality Assessment: Evaluation and Estimation 29 / 33

Page 50: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

How to use estimated PE effort scores?

Do users prefer detailed estimates (sub-sentence level) or anoverall estimate for the complete sentence or not seeingbad sentences at all?

Too much information vs hard-to-interpret scores

IBM’s Goodness metric

MATECAT project investigating it

Translation Quality Assessment: Evaluation and Estimation 30 / 33

Page 51: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

How to use estimated PE effort scores?

Do users prefer detailed estimates (sub-sentence level) or anoverall estimate for the complete sentence or not seeingbad sentences at all?

Too much information vs hard-to-interpret scores

IBM’s Goodness metric

MATECAT project investigating it

Translation Quality Assessment: Evaluation and Estimation 30 / 33

Page 52: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

How to use estimated PE effort scores?

Do users prefer detailed estimates (sub-sentence level) or anoverall estimate for the complete sentence or not seeingbad sentences at all?

Too much information vs hard-to-interpret scores

IBM’s Goodness metric

MATECAT project investigating it

Translation Quality Assessment: Evaluation and Estimation 30 / 33

Page 53: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Feature engineering

Two families of features missing in current work:

Can we benefit from contextual, document-wideinformation?

Can we predict human translation quality?

Don’t panic, you can help!

Two (sub-) QuEst projects at MTM-2013 :-)

Translation Quality Assessment: Evaluation and Estimation 31 / 33

Page 54: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Feature engineering

Two families of features missing in current work:

Can we benefit from contextual, document-wideinformation?

Can we predict human translation quality?

Don’t panic, you can help!

Two (sub-) QuEst projects at MTM-2013 :-)

Translation Quality Assessment: Evaluation and Estimation 31 / 33

Page 55: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Outline

1 Translation Quality

2 QTLaunchPad Project

3 Quality Estimation

4 State of the art in QE

5 Open issues

6 Conclusions

Translation Quality Assessment: Evaluation and Estimation 32 / 33

Page 56: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Conclusions

(Machine) Translation evaluation is still an open problem

Different purposes/users, different needs, different notionsof quality

Quality estimation: learning of these different notions

Estimates have been used in real applications

Ranking translations: filter out bad quality translationsSelecting translations from multiple MT systems

Commercial interest

SDL LW: TrustScoreMultilizer: MT-Qualifier

Interesting open issues: join the QuEst projects!

Translation Quality Assessment: Evaluation and Estimation 33 / 33

Page 57: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Conclusions

(Machine) Translation evaluation is still an open problem

Different purposes/users, different needs, different notionsof quality

Quality estimation: learning of these different notions

Estimates have been used in real applications

Ranking translations: filter out bad quality translationsSelecting translations from multiple MT systems

Commercial interest

SDL LW: TrustScoreMultilizer: MT-Qualifier

Interesting open issues: join the QuEst projects!

Translation Quality Assessment: Evaluation and Estimation 33 / 33

Page 58: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Conclusions

(Machine) Translation evaluation is still an open problem

Different purposes/users, different needs, different notionsof quality

Quality estimation: learning of these different notions

Estimates have been used in real applications

Ranking translations: filter out bad quality translationsSelecting translations from multiple MT systems

Commercial interest

SDL LW: TrustScoreMultilizer: MT-Qualifier

Interesting open issues: join the QuEst projects!

Translation Quality Assessment: Evaluation and Estimation 33 / 33

Page 59: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Conclusions

(Machine) Translation evaluation is still an open problem

Different purposes/users, different needs, different notionsof quality

Quality estimation: learning of these different notions

Estimates have been used in real applications

Ranking translations: filter out bad quality translationsSelecting translations from multiple MT systems

Commercial interest

SDL LW: TrustScoreMultilizer: MT-Qualifier

Interesting open issues: join the QuEst projects!

Translation Quality Assessment: Evaluation and Estimation 33 / 33

Page 60: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Conclusions

(Machine) Translation evaluation is still an open problem

Different purposes/users, different needs, different notionsof quality

Quality estimation: learning of these different notions

Estimates have been used in real applications

Ranking translations: filter out bad quality translationsSelecting translations from multiple MT systems

Commercial interest

SDL LW: TrustScoreMultilizer: MT-Qualifier

Interesting open issues: join the QuEst projects!

Translation Quality Assessment: Evaluation and Estimation 33 / 33

Page 61: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Conclusions

(Machine) Translation evaluation is still an open problem

Different purposes/users, different needs, different notionsof quality

Quality estimation: learning of these different notions

Estimates have been used in real applications

Ranking translations: filter out bad quality translationsSelecting translations from multiple MT systems

Commercial interest

SDL LW: TrustScoreMultilizer: MT-Qualifier

Interesting open issues: join the QuEst projects!

Translation Quality Assessment: Evaluation and Estimation 33 / 33

Page 62: Translation Quality Assessment: Evaluation and Estimationufal.mff.cuni.cz/mtm13/files/02-qe-lucia-specia.pdf · Translation QualityQTLaunchPad ProjectQuality EstimationState of the

Translation Quality QTLaunchPad Project Quality Estimation State of the art in QE Open issues Conclusions

Conclusions

(Machine) Translation evaluation is still an open problem

Different purposes/users, different needs, different notionsof quality

Quality estimation: learning of these different notions

Estimates have been used in real applications

Ranking translations: filter out bad quality translationsSelecting translations from multiple MT systems

Commercial interest

SDL LW: TrustScoreMultilizer: MT-Qualifier

Interesting open issues: join the QuEst projects!

Translation Quality Assessment: Evaluation and Estimation 33 / 33