101
Overview Quality Estimation Shared Task Open issues Conclusions Estimating post-editing effort State-of-the-art systems and open issues Lucia Specia University of Sheffield [email protected] 17 August 2012 Estimating post-editing effort 1 / 40

Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld [email protected] 17 August

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Estimating post-editing effort

State-of-the-art systems and open issues

Lucia Specia

University of [email protected]

17 August 2012

Estimating post-editing effort 1 / 40

Page 2: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Outline

1 Overview

2 Quality Estimation

3 Shared Task

4 Open issues

5 Conclusions

Estimating post-editing effort 2 / 40

Page 3: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Outline

1 Overview

2 Quality Estimation

3 Shared Task

4 Open issues

5 Conclusions

Estimating post-editing effort 3 / 40

Page 4: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Overview

Why don’t translators use (more) MT?

Translations are not good enough!What about TMs? Aren’t fuzzy matches useful?

Estimating post-editing effort 4 / 40

Page 5: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Overview

Why don’t translators use (more) MT?Translations are not good enough!

What about TMs? Aren’t fuzzy matches useful?

Estimating post-editing effort 4 / 40

Page 6: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Overview

Why don’t translators use (more) MT?Translations are not good enough!What about TMs? Aren’t fuzzy matches useful?

Estimating post-editing effort 4 / 40

Page 7: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Overview

Why don’t translators use (more) MT?Translations are not good enough!What about TMs? Aren’t fuzzy matches useful?

Estimating post-editing effort 4 / 40

Page 8: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Outline

1 Overview

2 Quality Estimation

3 Shared Task

4 Open issues

5 Conclusions

Estimating post-editing effort 5 / 40

Page 9: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Framework

Quality estimation (QE): provide an estimate ofquality for new translated text

Quality = post-editing effort

No access to reference translations: machine learningtechniques to predict post-editing effort scores

Sourcesentence

MT system

Translation sentence

QE system

PE effort

Examples of translations + PE scores

Qualityindicators

Estimating post-editing effort 6 / 40

Page 10: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Framework

Quality estimation (QE): provide an estimate ofquality for new translated text

Quality = post-editing effort

No access to reference translations: machine learningtechniques to predict post-editing effort scores

Sourcesentence

MT system

Translation sentence

QE system

PE effort

Examples of translations + PE scores

Qualityindicators

Estimating post-editing effort 6 / 40

Page 11: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Framework

Quality estimation (QE): provide an estimate ofquality for new translated text

Quality = post-editing effort

No access to reference translations: machine learningtechniques to predict post-editing effort scores

Sourcesentence

MT system

Translation sentence

QE system

PE effort

Examples of translations + PE scores

Qualityindicators

Estimating post-editing effort 6 / 40

Page 12: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Progress so far

QE, also called confidence estimation, started in 2003

Estimate BLEU: difficult to interpret outputHard to beat the baseline: MT is always badNo success in any application

Semi-dormant until 2009

Better MT systemsMT used in translation industryEstimate more interpretable metrics: post-editing effort(human scores, time, edit distance)Positive results

Estimating post-editing effort 7 / 40

Page 13: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Progress so far

QE, also called confidence estimation, started in 2003

Estimate BLEU: difficult to interpret outputHard to beat the baseline: MT is always badNo success in any application

Semi-dormant until 2009

Better MT systemsMT used in translation industryEstimate more interpretable metrics: post-editing effort(human scores, time, edit distance)Positive results

Estimating post-editing effort 7 / 40

Page 14: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Examples of positive results

Time to post-edit subset of sentences predicted as“good” (low effort) vs time to post-edit random subset ofsentences

Language no QE QEfr-en 0.75 words/sec 1.09 words/secen-es 0.32 words/sec 0.57 words/sec

Accuracy in selecting best translation among 4 MTsystems

Best MT system Highest QE score54% 77%

Estimating post-editing effort 8 / 40

Page 15: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Examples of positive results

Time to post-edit subset of sentences predicted as“good” (low effort) vs time to post-edit random subset ofsentences

Language no QE QEfr-en 0.75 words/sec 1.09 words/secen-es 0.32 words/sec 0.57 words/sec

Accuracy in selecting best translation among 4 MTsystems

Best MT system Highest QE score54% 77%

Estimating post-editing effort 8 / 40

Page 16: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Examples of positive results

Time to post-edit subset of sentences predicted as“good” (low effort) vs time to post-edit random subset ofsentences

Language no QE QEfr-en 0.75 words/sec 1.09 words/secen-es 0.32 words/sec 0.57 words/sec

Accuracy in selecting best translation among 4 MTsystems

Best MT system Highest QE score54% 77%

Estimating post-editing effort 8 / 40

Page 17: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

State-of-the-art (before WMT-12)

Quality indicators:

Source text TranslationMT system

Confidence indicators

Complexity indicators

Fluency indicators

Adequacyindicators

Learning algorithms: wide range

Datasets: few with absolute human scores (1-4 scores,PE time, edit distance), WMT data (relative scores)

Estimating post-editing effort 9 / 40

Page 18: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

State-of-the-art (before WMT-12)

Quality indicators:

Source text TranslationMT system

Confidence indicators

Complexity indicators

Fluency indicators

Adequacyindicators

Learning algorithms: wide range

Datasets: few with absolute human scores (1-4 scores,PE time, edit distance), WMT data (relative scores)

Estimating post-editing effort 9 / 40

Page 19: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

State-of-the-art (before WMT-12)

Quality indicators:

Source text TranslationMT system

Confidence indicators

Complexity indicators

Fluency indicators

Adequacyindicators

Learning algorithms: wide range

Datasets: few with absolute human scores (1-4 scores,PE time, edit distance), WMT data (relative scores)

Estimating post-editing effort 9 / 40

Page 20: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Outline

1 Overview

2 Quality Estimation

3 Shared Task

4 Open issues

5 Conclusions

Estimating post-editing effort 10 / 40

Page 21: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Goal

WMT-12, joint work with Radu Soricut

First common ground for development and comparison ofQE systems, focusing on sentence-level estimation ofPE effort:

Identify (new) effective quality indicatorsIdentify most suitable machine learning techniquesTest (new) automatic evaluation metricsEstablish the state of the art performance in the fieldContrast the performance of regression and rankingtechniques

Estimating post-editing effort 11 / 40

Page 22: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Goal

WMT-12, joint work with Radu Soricut

First common ground for development and comparison ofQE systems, focusing on sentence-level estimation ofPE effort:

Identify (new) effective quality indicatorsIdentify most suitable machine learning techniquesTest (new) automatic evaluation metricsEstablish the state of the art performance in the fieldContrast the performance of regression and rankingtechniques

Estimating post-editing effort 11 / 40

Page 23: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Goal

WMT-12, joint work with Radu Soricut

First common ground for development and comparison ofQE systems, focusing on sentence-level estimation ofPE effort:

Identify (new) effective quality indicators

Identify most suitable machine learning techniquesTest (new) automatic evaluation metricsEstablish the state of the art performance in the fieldContrast the performance of regression and rankingtechniques

Estimating post-editing effort 11 / 40

Page 24: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Goal

WMT-12, joint work with Radu Soricut

First common ground for development and comparison ofQE systems, focusing on sentence-level estimation ofPE effort:

Identify (new) effective quality indicatorsIdentify most suitable machine learning techniques

Test (new) automatic evaluation metricsEstablish the state of the art performance in the fieldContrast the performance of regression and rankingtechniques

Estimating post-editing effort 11 / 40

Page 25: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Goal

WMT-12, joint work with Radu Soricut

First common ground for development and comparison ofQE systems, focusing on sentence-level estimation ofPE effort:

Identify (new) effective quality indicatorsIdentify most suitable machine learning techniquesTest (new) automatic evaluation metrics

Establish the state of the art performance in the fieldContrast the performance of regression and rankingtechniques

Estimating post-editing effort 11 / 40

Page 26: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Goal

WMT-12, joint work with Radu Soricut

First common ground for development and comparison ofQE systems, focusing on sentence-level estimation ofPE effort:

Identify (new) effective quality indicatorsIdentify most suitable machine learning techniquesTest (new) automatic evaluation metricsEstablish the state of the art performance in the field

Contrast the performance of regression and rankingtechniques

Estimating post-editing effort 11 / 40

Page 27: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Goal

WMT-12, joint work with Radu Soricut

First common ground for development and comparison ofQE systems, focusing on sentence-level estimation ofPE effort:

Identify (new) effective quality indicatorsIdentify most suitable machine learning techniquesTest (new) automatic evaluation metricsEstablish the state of the art performance in the fieldContrast the performance of regression and rankingtechniques

Estimating post-editing effort 11 / 40

Page 28: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Datasets

English source sentences

Spanish MT outputs (PBSMT Moses)

Effort scores by 3 human judges, scale 1-5, averaged

Post-edited output

Human Spanish translation (original references)

Training set: 1832

Blind test set: 422

Estimating post-editing effort 12 / 40

Page 29: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Datasets

English source sentences

Spanish MT outputs (PBSMT Moses)

Effort scores by 3 human judges, scale 1-5, averaged

Post-edited output

Human Spanish translation (original references)

Training set: 1832

Blind test set: 422

Estimating post-editing effort 12 / 40

Page 30: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Datasets

English source sentences

Spanish MT outputs (PBSMT Moses)

Effort scores by 3 human judges, scale 1-5, averaged

Post-edited output

Human Spanish translation (original references)

Training set: 1832

Blind test set: 422

Estimating post-editing effort 12 / 40

Page 31: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Datasets

English source sentences

Spanish MT outputs (PBSMT Moses)

Effort scores by 3 human judges, scale 1-5, averaged

Post-edited output

Human Spanish translation (original references)

Training set: 1832

Blind test set: 422

Estimating post-editing effort 12 / 40

Page 32: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Datasets

English source sentences

Spanish MT outputs (PBSMT Moses)

Effort scores by 3 human judges, scale 1-5, averaged

Post-edited output

Human Spanish translation (original references)

Training set: 1832

Blind test set: 422

Estimating post-editing effort 12 / 40

Page 33: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Datasets

English source sentences

Spanish MT outputs (PBSMT Moses)

Effort scores by 3 human judges, scale 1-5, averaged

Post-edited output

Human Spanish translation (original references)

Training set: 1832

Blind test set: 422

Estimating post-editing effort 12 / 40

Page 34: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Datasets

English source sentences

Spanish MT outputs (PBSMT Moses)

Effort scores by 3 human judges, scale 1-5, averaged

Post-edited output

Human Spanish translation (original references)

Training set: 1832

Blind test set: 422

Estimating post-editing effort 12 / 40

Page 35: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Datasets

Annotation guidelines3 human judges for PE effort assigning 1-5 scores for〈source, MT output, PE output〉

[1] The MT output is incomprehensible, with little or no information transferredaccurately. It cannot be edited, needs to be translated from scratch.

[2] About 50-70% of the MT output needs to be edited. It requires a significantediting effort in order to reach publishable level.

[3] About 25-50% of the MT output needs to be edited. It contains different errorsand mistranslations that need to be corrected.

[4] About 10-25% of the MT output needs to be edited. It is generally clear andintelligible.

[5] The MT output is perfectly clear and intelligible. It is not necessarily a perfecttranslation, but requires little to no editing.

Estimating post-editing effort 13 / 40

Page 36: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Resources

SMT resources for training and test sets:

SMT training corpus (Europarl and News-bommentaries)

LMs: 5-gram LM; 3-gram LM and 1-3-gram counts

IBM Model 1 table (Giza)

Word-alignment file as produced by grow-diag-final

Phrase table with word alignment information

Moses configuration file used for decoding

Moses run-time log: model component values, wordgraph, etc.

Estimating post-editing effort 14 / 40

Page 37: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Evaluation metrics

Scoring metrics - standard MAE and RMSE

MAE =

∑Ni=1 |H(si)− V (si)|

N

RMSE =

√∑Ni=1(H(si)− V (si))2

N

N = |S |H(si) is the predicted score for siV (si) the is human score for si

Estimating post-editing effort 15 / 40

Page 38: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Evaluation metrics

Ranking metrics Spearman’s rank correlation and newmetric: DeltaAvg

For S1, S2, . . . , Sn quantiles:

DeltaAvgV [n] =

∑n−1k=1 V (S1,k)

n − 1− V (S)

V (S): extrinsic function measuring the “quality” of set S

Average human scores (1-5) of set S

Estimating post-editing effort 16 / 40

Page 39: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Evaluation metrics

Ranking metrics Spearman’s rank correlation and newmetric: DeltaAvg

For S1, S2, . . . , Sn quantiles:

DeltaAvgV [n] =

∑n−1k=1 V (S1,k)

n − 1− V (S)

V (S): extrinsic function measuring the “quality” of set S

Average human scores (1-5) of set S

Estimating post-editing effort 16 / 40

Page 40: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Evaluation metrics

Example 1: n=2, quantiles S1, S2

DeltaAvg[2] = V (S1)− V (S)“Quality of the top half compared to the overall quality”

Average human scores of top half compared to averagehuman scores of complete set

Example 2: n=3, quantiles S1, S2, S3

DeltaAvg[3] =(V (S1)−V (S))+(V (S1,2)−V (S))

2

Average human scores of top third compared to averagehuman scores of complete set; average human scores of top

two thirds compared to average human scores of completeset, averaged

Estimating post-editing effort 17 / 40

Page 41: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Evaluation metrics

Example 1: n=2, quantiles S1, S2

DeltaAvg[2] = V (S1)− V (S)“Quality of the top half compared to the overall quality”

Average human scores of top half compared to averagehuman scores of complete set

Example 2: n=3, quantiles S1, S2, S3

DeltaAvg[3] =(V (S1)−V (S))+(V (S1,2)−V (S))

2

Average human scores of top third compared to averagehuman scores of complete set; average human scores of top

two thirds compared to average human scores of completeset, averaged

Estimating post-editing effort 17 / 40

Page 42: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Evaluation metrics

Final DeltaAvg metric

DeltaAvgV =

∑Nn=2 DeltaAvgV [n]

N − 1

where N = |S |/2

Average DeltaAvg[n] for all n, 2 ≤ n ≤ |S |/2

Estimating post-editing effort 18 / 40

Page 43: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Evaluation metrics

Final DeltaAvg metric

DeltaAvgV =

∑Nn=2 DeltaAvgV [n]

N − 1

where N = |S |/2

Average DeltaAvg[n] for all n, 2 ≤ n ≤ |S |/2

Estimating post-editing effort 18 / 40

Page 44: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Participants

ID Participating teamPRHLT-UPV Universitat Politecnica de Valencia, Spain

UU Uppsala University, SwedenSDLLW SDL Language Weaver, USA

Loria LORIA Institute, FranceUPC Universitat Politecnica de Catalunya, Spain

DFKI DFKI, GermanyWLV-SHEF Univ of Wolverhampton & Univ of Sheffield, UK

SJTU Shanghai Jiao Tong University, ChinaDCU-SYMC Dublin City University, Ireland & Symantec, Ireland

UEdin University of Edinburgh, UKTCD Trinity College Dublin, Ireland

One or two systems per team, most teams submitting for rankingand scoring sub-tasks

Estimating post-editing effort 19 / 40

Page 45: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Baseline system

Feature extraction software – system-independent features:

number of tokens in the source and target sentences

average source token length

average number of occurrences of words in the target

number of punctuation marks in source and target sentences

LM probability of source and target sentences

average number of translations per source word

% of source 1-grams, 2-grams and 3-grams in frequencyquartiles 1 and 4

% of seen source unigrams

SVM regression with RBF kernel with the parameters γ, ε and Coptimized using a grid-search and 5-fold cross validation on thetraining set

Estimating post-editing effort 20 / 40

Page 46: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Baseline system

Feature extraction software – system-independent features:

number of tokens in the source and target sentences

average source token length

average number of occurrences of words in the target

number of punctuation marks in source and target sentences

LM probability of source and target sentences

average number of translations per source word

% of source 1-grams, 2-grams and 3-grams in frequencyquartiles 1 and 4

% of seen source unigrams

SVM regression with RBF kernel with the parameters γ, ε and Coptimized using a grid-search and 5-fold cross validation on thetraining set

Estimating post-editing effort 20 / 40

Page 47: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Results - ranking sub-task

System ID DeltaAvg Spearman Corr• SDLLW M5PbestDeltaAvg 0.63 0.64

• SDLLW SVM 0.61 0.60UU bltk 0.58 0.61UU best 0.56 0.62

TCD M5P-resources-only* 0.56 0.56Baseline (17FFs SVM) 0.55 0.58

PRHLT-UPV 0.55 0.55UEdin 0.54 0.58SJTU 0.53 0.53

WLV-SHEF FS 0.51 0.52WLV-SHEF BL 0.50 0.49

DFKI morphPOSibm1LM 0.46 0.46DCU-SYMC unconstrained 0.44 0.41

DCU-SYMC constrained 0.43 0.41TCD M5P-all* 0.42 0.41

UPC 1 0.22 0.26UPC 2 0.15 0.19

• = winning submissionsgray area = not different from baseline* = bug-fix was applied after the submission

Estimating post-editing effort 21 / 40

Page 48: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Results - ranking sub-task

Oracle methods: associate various metrics in a oraclemanner to the test input:

Oracle Effort: the gold-label Effort

Oracle HTER: the HTER metric against the post-editedtranslations as reference

System ID DeltaAvg Spearman CorrOracle Effort 0.95 1.00

Oracle HTER 0.77 0.70

Estimating post-editing effort 22 / 40

Page 49: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Results - scoring sub-task

System ID MAE RMSE• SDLLW M5PbestDeltaAvg 0.61 0.75

UU best 0.64 0.79SDLLW SVM 0.64 0.78

UU bltk 0.64 0.79Loria SVMlinear 0.68 0.82

UEdin 0.68 0.82TCD M5P-resources-only* 0.68 0.82

Baseline (17FFs SVM) 0.69 0.82Loria SVMrbf 0.69 0.83

SJTU 0.69 0.83WLV-SHEF FS 0.69 0.85

PRHLT-UPV 0.70 0.85WLV-SHEF BL 0.72 0.86

DCU-SYMC unconstrained 0.75 0.97DFKI grcfs-mars 0.82 0.98DFKI cfs-plsreg 0.82 0.99

UPC 1 0.84 1.01DCU-SYMC constrained 0.86 1.12

UPC 2 0.87 1.04TCD M5P-all 2.09 2.32

Estimating post-editing effort 23 / 40

Page 50: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

New and effective quality indicators (features)

Most participating systems use external resources:parsers, POS taggers, NER, etc. → wide variety offeatures

Many tried to exploit linguistically-oriented features

none or modest improvements (e.g. WLV-SHEF)high performance (e.g. “UU” with constituency anddependency trees)

Previously overlooked features: SMT decoder featurevalues (e.g. SDLLW)

A powerful single feature: agreement between twodifferent SMT systems (e.g. SDLLW)

Estimating post-editing effort 24 / 40

Page 51: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

New and effective quality indicators (features)

Most participating systems use external resources:parsers, POS taggers, NER, etc. → wide variety offeatures

Many tried to exploit linguistically-oriented features

none or modest improvements (e.g. WLV-SHEF)high performance (e.g. “UU” with constituency anddependency trees)

Previously overlooked features: SMT decoder featurevalues (e.g. SDLLW)

A powerful single feature: agreement between twodifferent SMT systems (e.g. SDLLW)

Estimating post-editing effort 24 / 40

Page 52: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

New and effective quality indicators (features)

Most participating systems use external resources:parsers, POS taggers, NER, etc. → wide variety offeatures

Many tried to exploit linguistically-oriented features

none or modest improvements (e.g. WLV-SHEF)

high performance (e.g. “UU” with constituency anddependency trees)

Previously overlooked features: SMT decoder featurevalues (e.g. SDLLW)

A powerful single feature: agreement between twodifferent SMT systems (e.g. SDLLW)

Estimating post-editing effort 24 / 40

Page 53: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

New and effective quality indicators (features)

Most participating systems use external resources:parsers, POS taggers, NER, etc. → wide variety offeatures

Many tried to exploit linguistically-oriented features

none or modest improvements (e.g. WLV-SHEF)high performance (e.g. “UU” with constituency anddependency trees)

Previously overlooked features: SMT decoder featurevalues (e.g. SDLLW)

A powerful single feature: agreement between twodifferent SMT systems (e.g. SDLLW)

Estimating post-editing effort 24 / 40

Page 54: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

New and effective quality indicators (features)

Most participating systems use external resources:parsers, POS taggers, NER, etc. → wide variety offeatures

Many tried to exploit linguistically-oriented features

none or modest improvements (e.g. WLV-SHEF)high performance (e.g. “UU” with constituency anddependency trees)

Previously overlooked features: SMT decoder featurevalues (e.g. SDLLW)

A powerful single feature: agreement between twodifferent SMT systems (e.g. SDLLW)

Estimating post-editing effort 24 / 40

Page 55: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

New and effective quality indicators (features)

Most participating systems use external resources:parsers, POS taggers, NER, etc. → wide variety offeatures

Many tried to exploit linguistically-oriented features

none or modest improvements (e.g. WLV-SHEF)high performance (e.g. “UU” with constituency anddependency trees)

Previously overlooked features: SMT decoder featurevalues (e.g. SDLLW)

A powerful single feature: agreement between twodifferent SMT systems (e.g. SDLLW)

Estimating post-editing effort 24 / 40

Page 56: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Machine Learning techniques

Best performing: Regression Trees (M5P) and SVR

M5P Regression Trees: compact models, less overfitting,“readable”SVRs: easily overfit with small training data and largefeature set

Feature selection crucial in this setup

Structured learning techniques: “UU” submissions (treekernels)

Estimating post-editing effort 25 / 40

Page 57: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Machine Learning techniques

Best performing: Regression Trees (M5P) and SVR

M5P Regression Trees: compact models, less overfitting,“readable”

SVRs: easily overfit with small training data and largefeature set

Feature selection crucial in this setup

Structured learning techniques: “UU” submissions (treekernels)

Estimating post-editing effort 25 / 40

Page 58: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Machine Learning techniques

Best performing: Regression Trees (M5P) and SVR

M5P Regression Trees: compact models, less overfitting,“readable”SVRs: easily overfit with small training data and largefeature set

Feature selection crucial in this setup

Structured learning techniques: “UU” submissions (treekernels)

Estimating post-editing effort 25 / 40

Page 59: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Machine Learning techniques

Best performing: Regression Trees (M5P) and SVR

M5P Regression Trees: compact models, less overfitting,“readable”SVRs: easily overfit with small training data and largefeature set

Feature selection crucial in this setup

Structured learning techniques: “UU” submissions (treekernels)

Estimating post-editing effort 25 / 40

Page 60: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Machine Learning techniques

Best performing: Regression Trees (M5P) and SVR

M5P Regression Trees: compact models, less overfitting,“readable”SVRs: easily overfit with small training data and largefeature set

Feature selection crucial in this setup

Structured learning techniques: “UU” submissions (treekernels)

Estimating post-editing effort 25 / 40

Page 61: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Evaluation metrics

DeltaAvg → suitable for the ranking task

automatic and deterministic (and therefore consistent)Extrinsic interpretability, e.g.:

Average quality in [1-5] = 2.5Quality of top 25% = 3.1Delta [1-5] = 0.6

Versatile: valuation function V can changeHigh correlation with Spearman, but less strict

MAE, RMSE → difficult task, values stubbornly high

Regression vs ranking

Most submissions: regression results to infer ranking

Ranking approach is simpler, directly useful in manyapplications

Estimating post-editing effort 26 / 40

Page 62: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Evaluation metrics

DeltaAvg → suitable for the ranking taskautomatic and deterministic (and therefore consistent)

Extrinsic interpretability, e.g.:Average quality in [1-5] = 2.5Quality of top 25% = 3.1Delta [1-5] = 0.6

Versatile: valuation function V can changeHigh correlation with Spearman, but less strict

MAE, RMSE → difficult task, values stubbornly high

Regression vs ranking

Most submissions: regression results to infer ranking

Ranking approach is simpler, directly useful in manyapplications

Estimating post-editing effort 26 / 40

Page 63: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Evaluation metrics

DeltaAvg → suitable for the ranking taskautomatic and deterministic (and therefore consistent)Extrinsic interpretability, e.g.:

Average quality in [1-5] = 2.5Quality of top 25% = 3.1Delta [1-5] = 0.6

Versatile: valuation function V can changeHigh correlation with Spearman, but less strict

MAE, RMSE → difficult task, values stubbornly high

Regression vs ranking

Most submissions: regression results to infer ranking

Ranking approach is simpler, directly useful in manyapplications

Estimating post-editing effort 26 / 40

Page 64: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Evaluation metrics

DeltaAvg → suitable for the ranking taskautomatic and deterministic (and therefore consistent)Extrinsic interpretability, e.g.:

Average quality in [1-5] = 2.5Quality of top 25% = 3.1Delta [1-5] = 0.6

Versatile: valuation function V can change

High correlation with Spearman, but less strict

MAE, RMSE → difficult task, values stubbornly high

Regression vs ranking

Most submissions: regression results to infer ranking

Ranking approach is simpler, directly useful in manyapplications

Estimating post-editing effort 26 / 40

Page 65: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Evaluation metrics

DeltaAvg → suitable for the ranking taskautomatic and deterministic (and therefore consistent)Extrinsic interpretability, e.g.:

Average quality in [1-5] = 2.5Quality of top 25% = 3.1Delta [1-5] = 0.6

Versatile: valuation function V can changeHigh correlation with Spearman, but less strict

MAE, RMSE → difficult task, values stubbornly high

Regression vs ranking

Most submissions: regression results to infer ranking

Ranking approach is simpler, directly useful in manyapplications

Estimating post-editing effort 26 / 40

Page 66: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Evaluation metrics

DeltaAvg → suitable for the ranking taskautomatic and deterministic (and therefore consistent)Extrinsic interpretability, e.g.:

Average quality in [1-5] = 2.5Quality of top 25% = 3.1Delta [1-5] = 0.6

Versatile: valuation function V can changeHigh correlation with Spearman, but less strict

MAE, RMSE → difficult task, values stubbornly high

Regression vs ranking

Most submissions: regression results to infer ranking

Ranking approach is simpler, directly useful in manyapplications

Estimating post-editing effort 26 / 40

Page 67: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Evaluation metrics

DeltaAvg → suitable for the ranking taskautomatic and deterministic (and therefore consistent)Extrinsic interpretability, e.g.:

Average quality in [1-5] = 2.5Quality of top 25% = 3.1Delta [1-5] = 0.6

Versatile: valuation function V can changeHigh correlation with Spearman, but less strict

MAE, RMSE → difficult task, values stubbornly high

Regression vs ranking

Most submissions: regression results to infer ranking

Ranking approach is simpler, directly useful in manyapplications

Estimating post-editing effort 26 / 40

Page 68: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Evaluation metrics

DeltaAvg → suitable for the ranking taskautomatic and deterministic (and therefore consistent)Extrinsic interpretability, e.g.:

Average quality in [1-5] = 2.5Quality of top 25% = 3.1Delta [1-5] = 0.6

Versatile: valuation function V can changeHigh correlation with Spearman, but less strict

MAE, RMSE → difficult task, values stubbornly high

Regression vs ranking

Most submissions: regression results to infer ranking

Ranking approach is simpler, directly useful in manyapplications

Estimating post-editing effort 26 / 40

Page 69: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Establish state-of-the-art performance

“Baseline” - hard to beat, previous state-of-the-art

Metrics, data sets, and performance points available

Known values for oracle-based upperbounds

Estimating post-editing effort 27 / 40

Page 70: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Establish state-of-the-art performance

“Baseline” - hard to beat, previous state-of-the-art

Metrics, data sets, and performance points available

Known values for oracle-based upperbounds

Estimating post-editing effort 27 / 40

Page 71: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Discussion

Establish state-of-the-art performance

“Baseline” - hard to beat, previous state-of-the-art

Metrics, data sets, and performance points available

Known values for oracle-based upperbounds

Estimating post-editing effort 27 / 40

Page 72: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Outline

1 Overview

2 Quality Estimation

3 Shared Task

4 Open issues

5 Conclusions

Estimating post-editing effort 28 / 40

Page 73: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Agreement between translators

Absolute value judgements: difficult to achieveconsistency across annotators even in highly controlledsetup

30% of initial dataset discarded: annotators disagreed bymore than one categoryRemaining annotations had to be scaled

Estimating post-editing effort 29 / 40

Page 74: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Agreement between translators

Absolute value judgements: difficult to achieveconsistency across annotators even in highly controlledsetup

30% of initial dataset discarded: annotators disagreed bymore than one categoryRemaining annotations had to be scaled

Estimating post-editing effort 29 / 40

Page 75: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Agreement between translators

en-pt subtitles of TV series: 3 non-professionals annotators,1-4 scores

351 cases (41%): full agreement

445 cases (52%): partial agreement

54 cases (7%): null agreement

Agreement by score:

Score Full Partial/Null4 59% 41%3 35% 65%2 23% 77%1 50% 50%

Estimating post-editing effort 30 / 40

Page 76: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Agreement between translators

en-pt subtitles of TV series: 3 non-professionals annotators,1-4 scores

351 cases (41%): full agreement

445 cases (52%): partial agreement

54 cases (7%): null agreement

Agreement by score:

Score Full Partial/Null4 59% 41%3 35% 65%2 23% 77%1 50% 50%

Estimating post-editing effort 30 / 40

Page 77: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

More objective ways of generating absolute scores

TIME: varies considerably across translators (expected). E.g.:seconds per word

Can we normalise this variation?

Dedicated QE systems?

Estimating post-editing effort 31 / 40

Page 78: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

More objective ways of generating absolute scores

TIME: varies considerably across translators (expected). E.g.:seconds per word

Can we normalise this variation?

Dedicated QE systems?

Estimating post-editing effort 31 / 40

Page 79: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

More objective ways of generating absolute scores

TIME: varies considerably across translators (expected). E.g.:seconds per word

Can we normalise this variation?

Dedicated QE systems?

Estimating post-editing effort 31 / 40

Page 80: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

More objective ways of generating absolute scores

HTER: Edit distance between MT output and its minimallypost-edited version

HTER =#edits

#words postedited version

Edits: substitute, delete, insert, shift

Analysis by Maarit Koponen (WMT-12) on post-editedtranslations with HTER and 1-5 scores

A number of cases where translations with low HTER(few edits) were assigned low quality scores (highpost-editing effort), and vice-versaCertain edits seem to require more cognitive effort thanothers - not captured by HTER

Estimating post-editing effort 32 / 40

Page 81: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

More objective ways of generating absolute scores

HTER: Edit distance between MT output and its minimallypost-edited version

HTER =#edits

#words postedited version

Edits: substitute, delete, insert, shift

Analysis by Maarit Koponen (WMT-12) on post-editedtranslations with HTER and 1-5 scores

A number of cases where translations with low HTER(few edits) were assigned low quality scores (highpost-editing effort), and vice-versaCertain edits seem to require more cognitive effort thanothers - not captured by HTER

Estimating post-editing effort 32 / 40

Page 82: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

More objective ways of generating absolute scores

HTER: Edit distance between MT output and its minimallypost-edited version

HTER =#edits

#words postedited version

Edits: substitute, delete, insert, shift

Analysis by Maarit Koponen (WMT-12) on post-editedtranslations with HTER and 1-5 scores

A number of cases where translations with low HTER(few edits) were assigned low quality scores (highpost-editing effort), and vice-versa

Certain edits seem to require more cognitive effort thanothers - not captured by HTER

Estimating post-editing effort 32 / 40

Page 83: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

More objective ways of generating absolute scores

HTER: Edit distance between MT output and its minimallypost-edited version

HTER =#edits

#words postedited version

Edits: substitute, delete, insert, shift

Analysis by Maarit Koponen (WMT-12) on post-editedtranslations with HTER and 1-5 scores

A number of cases where translations with low HTER(few edits) were assigned low quality scores (highpost-editing effort), and vice-versaCertain edits seem to require more cognitive effort thanothers - not captured by HTER

Estimating post-editing effort 32 / 40

Page 84: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

More objective ways of generating absolute scores

Keystrokes: different PE strategies - data from 8 translators(joint work with Maarit Koponen and Wilker Aziz):

Estimating post-editing effort 33 / 40

Page 85: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

More objective ways of generating absolute scores

Keystrokes: different PE strategies - data from 8 translators(joint work with Maarit Koponen and Wilker Aziz):

Estimating post-editing effort 33 / 40

Page 86: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Use of relative scores

Ranking of translations: Suitable if the final application isto compare alternative translations of same source sentence

N-best list re-ranking

System combination

MT system evaluation

Estimating post-editing effort 34 / 40

Page 87: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Use of relative scores

Ranking of translations: Suitable if the final application isto compare alternative translations of same source sentence

N-best list re-ranking

System combination

MT system evaluation

Estimating post-editing effort 34 / 40

Page 88: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

What is the best metric to estimate PE effort?

Effort/HTER seem to lack “cognitive load”

Time varies too much across post-editors

Keystrokes seems to capture PE strategies, but do notcorrelate well with PE effort

Eye-tracking data can be useful, but not always feasible

Estimating post-editing effort 35 / 40

Page 89: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

What is the best metric to estimate PE effort?

Effort/HTER seem to lack “cognitive load”

Time varies too much across post-editors

Keystrokes seems to capture PE strategies, but do notcorrelate well with PE effort

Eye-tracking data can be useful, but not always feasible

Estimating post-editing effort 35 / 40

Page 90: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

What is the best metric to estimate PE effort?

Effort/HTER seem to lack “cognitive load”

Time varies too much across post-editors

Keystrokes seems to capture PE strategies, but do notcorrelate well with PE effort

Eye-tracking data can be useful, but not always feasible

Estimating post-editing effort 35 / 40

Page 91: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

What is the best metric to estimate PE effort?

Effort/HTER seem to lack “cognitive load”

Time varies too much across post-editors

Keystrokes seems to capture PE strategies, but do notcorrelate well with PE effort

Eye-tracking data can be useful, but not always feasible

Estimating post-editing effort 35 / 40

Page 92: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

How to use estimated PE effort scores?

Should (supposedly) bad quality translations be filteredout or shown to translators (different scores/colourcodes as in TMs)?

Wasting time to read scores and translations vs wasting“gisting” information

How to define a threshold on the estimated translationquality to decide what should be filtered out?

Translator dependentTask dependent (SDL)

Do translators prefer detailed estimates (sub-sentencelevel) or an overall estimate for the complete sentence?

Too much information vs hard-to-interpret scores

Estimating post-editing effort 36 / 40

Page 93: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

How to use estimated PE effort scores?

Should (supposedly) bad quality translations be filteredout or shown to translators (different scores/colourcodes as in TMs)?

Wasting time to read scores and translations vs wasting“gisting” information

How to define a threshold on the estimated translationquality to decide what should be filtered out?

Translator dependentTask dependent (SDL)

Do translators prefer detailed estimates (sub-sentencelevel) or an overall estimate for the complete sentence?

Too much information vs hard-to-interpret scores

Estimating post-editing effort 36 / 40

Page 94: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

How to use estimated PE effort scores?

Should (supposedly) bad quality translations be filteredout or shown to translators (different scores/colourcodes as in TMs)?

Wasting time to read scores and translations vs wasting“gisting” information

How to define a threshold on the estimated translationquality to decide what should be filtered out?

Translator dependentTask dependent (SDL)

Do translators prefer detailed estimates (sub-sentencelevel) or an overall estimate for the complete sentence?

Too much information vs hard-to-interpret scores

Estimating post-editing effort 36 / 40

Page 95: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Outline

1 Overview

2 Quality Estimation

3 Shared Task

4 Open issues

5 Conclusions

Estimating post-editing effort 37 / 40

Page 96: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Conclusions

It is possible to estimate at least certain aspects of PEeffort

PE effort estimates can be used in real applications

Ranking translations: filter out bad quality translationsSelecting translations from multiple MT systems

A number of open issues to be investigated...

My vision

Sub-sentence level QE (error detection), highlightingerrors but also given an overall estimate for the sentence

Estimating post-editing effort 38 / 40

Page 97: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Conclusions

It is possible to estimate at least certain aspects of PEeffort

PE effort estimates can be used in real applications

Ranking translations: filter out bad quality translationsSelecting translations from multiple MT systems

A number of open issues to be investigated...

My vision

Sub-sentence level QE (error detection), highlightingerrors but also given an overall estimate for the sentence

Estimating post-editing effort 38 / 40

Page 98: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Conclusions

It is possible to estimate at least certain aspects of PEeffort

PE effort estimates can be used in real applications

Ranking translations: filter out bad quality translationsSelecting translations from multiple MT systems

A number of open issues to be investigated...

My vision

Sub-sentence level QE (error detection), highlightingerrors but also given an overall estimate for the sentence

Estimating post-editing effort 38 / 40

Page 99: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Conclusions

It is possible to estimate at least certain aspects of PEeffort

PE effort estimates can be used in real applications

Ranking translations: filter out bad quality translationsSelecting translations from multiple MT systems

A number of open issues to be investigated...

My vision

Sub-sentence level QE (error detection), highlightingerrors but also given an overall estimate for the sentence

Estimating post-editing effort 38 / 40

Page 100: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Journal of MT - Special issue

15-06-12 - 1st CFP

15-08-12 - 2nd CFP

15-09-12 - submission deadline

15-10-12 - reviews due

End of December 2012 - camera-ready due (tentative)

WMT-12 QE Shared Task

All feature sets available

Estimating post-editing effort 39 / 40

Page 101: Estimating post-editing e ortstaff€¦ · Estimating post-editing e ort State-of-the-art systems and open issues Lucia Specia University of She eld l.specia@sheffield.ac.uk 17 August

Overview Quality Estimation Shared Task Open issues Conclusions

Estimating post-editing effort

State-of-the-art systems and open issues

Lucia Specia

University of [email protected]

17 August 2012

Estimating post-editing effort 40 / 40