Estimating post-editing effort
State-of-the-art systems and open issues
Lucia Specia
University of Sheffield
[email protected]
17 August 2012
Outline
1 Overview
2 Quality Estimation
3 Shared Task
4 Open issues
5 Conclusions
Overview

Why don't translators use (more) MT?
Translations are not good enough!

What about TMs? Aren't fuzzy matches useful?
Framework

Quality estimation (QE): provide an estimate of quality for new translated text

Quality = post-editing effort

No access to reference translations: machine learning techniques to predict post-editing effort scores

[Diagram: source sentence → MT system → translation sentence → QE system → PE effort; the QE system learns from examples of translations with PE scores, represented by quality indicators]
Progress so far

QE, also called confidence estimation, started in 2003
Estimate BLEU: difficult to interpret output
Hard to beat the baseline: MT is always bad
No success in any application

Semi-dormant until 2009
Better MT systems
MT used in the translation industry
Estimate more interpretable metrics: post-editing effort (human scores, time, edit distance)
Positive results
Examples of positive results

Time to post-edit a subset of sentences predicted as "good" (low effort) vs time to post-edit a random subset of sentences:

Language   no QE            QE
fr-en      0.75 words/sec   1.09 words/sec
en-es      0.32 words/sec   0.57 words/sec

Accuracy in selecting the best translation among 4 MT systems:

Best MT system   Highest QE score
54%              77%
State-of-the-art (before WMT-12)

Quality indicators:

[Diagram: complexity indicators over the source text, confidence indicators over the MT system, fluency indicators over the translation, adequacy indicators over the source-translation pair]

Learning algorithms: wide range

Datasets: few with absolute human scores (1-4 scores, PE time, edit distance), WMT data (relative scores)
Goal

WMT-12, joint work with Radu Soricut

First common ground for the development and comparison of QE systems, focusing on sentence-level estimation of PE effort:

Identify (new) effective quality indicators
Identify the most suitable machine learning techniques
Test (new) automatic evaluation metrics
Establish the state-of-the-art performance in the field
Contrast the performance of regression and ranking techniques
Datasets

English source sentences
Spanish MT outputs (PBSMT, Moses)
Effort scores by 3 human judges, scale 1-5, averaged
Post-edited output
Human Spanish translations (original references)

Training set: 1,832 sentences
Blind test set: 422 sentences
Datasets

Annotation guidelines: 3 human judges for PE effort assigning 1-5 scores to 〈source, MT output, PE output〉 triples

[1] The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited; it needs to be translated from scratch.
[2] About 50-70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.
[3] About 25-50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.
[4] About 10-25% of the MT output needs to be edited. It is generally clear and intelligible.
[5] The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.
Resources

SMT resources provided for the training and test sets:

SMT training corpus (Europarl and News Commentary)
LMs: 5-gram LM; 3-gram LM and 1-3-gram counts
IBM Model 1 table (GIZA++)
Word-alignment file as produced by grow-diag-final
Phrase table with word-alignment information
Moses configuration file used for decoding
Moses run-time log: model component values, word graph, etc.
Evaluation metrics

Scoring metrics: standard MAE and RMSE

MAE = \frac{1}{N} \sum_{i=1}^{N} |H(s_i) - V(s_i)|

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (H(s_i) - V(s_i))^2}

where N = |S|, H(s_i) is the predicted score for s_i, and V(s_i) is the human score for s_i
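For concreteness, a minimal Python sketch of both scoring metrics (the score lists below are made-up examples, not shared task data):

```python
import math

def mae(predicted, human):
    """Mean absolute error between predicted and human scores."""
    return sum(abs(h - v) for h, v in zip(predicted, human)) / len(human)

def rmse(predicted, human):
    """Root mean squared error between predicted and human scores."""
    return math.sqrt(sum((h - v) ** 2 for h, v in zip(predicted, human)) / len(human))

# Made-up example: predicted vs averaged human 1-5 effort scores
H = [3.2, 4.1, 2.0, 4.8]
V = [3.0, 4.5, 2.5, 5.0]
print(mae(H, V))   # ≈ 0.325
print(rmse(H, V))  # ≈ 0.35
```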
Evaluation metrics

Ranking metrics: Spearman's rank correlation and a new metric, DeltaAvg

For quantiles S_1, S_2, \ldots, S_n:

DeltaAvg_V[n] = \frac{1}{n-1} \sum_{k=1}^{n-1} V(S_{1,k}) - V(S)

V(S): extrinsic function measuring the "quality" of set S
Here: the average human score (1-5) of set S
Evaluation metrics

Example 1: n = 2, quantiles S_1, S_2

DeltaAvg[2] = V(S_1) - V(S)

"Quality of the top half compared to the overall quality": the average human score of the top half compared to the average human score of the complete set

Example 2: n = 3, quantiles S_1, S_2, S_3

DeltaAvg[3] = \frac{(V(S_1) - V(S)) + (V(S_{1,2}) - V(S))}{2}

The average human score of the top third compared to that of the complete set, and the average human score of the top two thirds compared to that of the complete set, averaged
Evaluation metrics

Final DeltaAvg metric:

DeltaAvg_V = \frac{1}{N-1} \sum_{n=2}^{N} DeltaAvg_V[n], where N = |S|/2

Average of DeltaAvg[n] for all n, 2 ≤ n ≤ |S|/2
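A Python sketch of DeltaAvg under the definitions above. One assumption on my part: S_{1,k} is read as the union of the top k of n quantiles when sentences are sorted by predicted score. The example scores are made-up:

```python
def delta_avg(predicted, human):
    """DeltaAvg: average of DeltaAvg[n] for n = 2 .. |S|/2 (needs >= 4 items)."""
    # Human scores sorted by decreasing predicted score
    ranked = [v for _, v in sorted(zip(predicted, human), key=lambda p: -p[0])]
    avg = lambda xs: sum(xs) / len(xs)
    v_all, size = avg(ranked), len(ranked)
    max_n = size // 2
    total = 0.0
    for n in range(2, max_n + 1):
        # DeltaAvg[n] = mean over k of V(S_{1,k}), minus V(S)
        tops = [avg(ranked[: size * k // n]) for k in range(1, n)]
        total += sum(tops) / (n - 1) - v_all
    return total / (max_n - 1)

# Made-up example with 8 sentences
pred = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
gold = [5, 4, 5, 3, 3, 2, 2, 1]
print(delta_avg(pred, gold))  # ≈ 1.09
```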
Participants

ID          Participating team
PRHLT-UPV   Universitat Politecnica de Valencia, Spain
UU          Uppsala University, Sweden
SDLLW       SDL Language Weaver, USA
Loria       LORIA Institute, France
UPC         Universitat Politecnica de Catalunya, Spain
DFKI        DFKI, Germany
WLV-SHEF    Univ of Wolverhampton & Univ of Sheffield, UK
SJTU        Shanghai Jiao Tong University, China
DCU-SYMC    Dublin City University, Ireland & Symantec, Ireland
UEdin       University of Edinburgh, UK
TCD         Trinity College Dublin, Ireland

One or two systems per team, with most teams submitting to both the ranking and scoring sub-tasks
Baseline system

Feature extraction software - 17 system-independent features:

number of tokens in the source and target sentences
average source token length
average number of occurrences of words in the target
number of punctuation marks in source and target sentences
LM probability of source and target sentences
average number of translations per source word
% of source 1-grams, 2-grams and 3-grams in frequency quartiles 1 and 4
% of seen source unigrams

SVM regression with RBF kernel, with the parameters γ, ε and C optimized using grid search and 5-fold cross-validation on the training set
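As an illustration, a scikit-learn sketch of this kind of baseline, assuming the 17 features have already been extracted into a matrix X (feature extraction itself is omitted, and the grid values are placeholders rather than the ones used in the shared task):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Placeholders standing in for the extracted features and human scores:
# one row of 17 feature values per sentence, one averaged 1-5 score each
X = np.random.rand(1832, 17)
y = np.random.uniform(1, 5, size=1832)

# RBF-kernel SVM regression; gamma, epsilon and C tuned by grid search
# with 5-fold cross-validation on the training set
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100],
                "gamma": [1e-3, 1e-2, 1e-1],
                "epsilon": [0.1, 0.2, 0.5]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
grid.fit(X, y)
print(grid.best_params_)
predictions = grid.best_estimator_.predict(X[:5])
```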
Results - ranking sub-task

System ID                   DeltaAvg   Spearman Corr
• SDLLW M5PbestDeltaAvg     0.63       0.64
• SDLLW SVM                 0.61       0.60
UU bltk                     0.58       0.61
UU best                     0.56       0.62
TCD M5P-resources-only*     0.56       0.56
Baseline (17FFs SVM)        0.55       0.58
PRHLT-UPV                   0.55       0.55
UEdin                       0.54       0.58
SJTU                        0.53       0.53
WLV-SHEF FS                 0.51       0.52
WLV-SHEF BL                 0.50       0.49
DFKI morphPOSibm1LM         0.46       0.46
DCU-SYMC unconstrained      0.44       0.41
DCU-SYMC constrained        0.43       0.41
TCD M5P-all*                0.42       0.41
UPC 1                       0.22       0.26
UPC 2                       0.15       0.19

• = winning submissions
gray area = not different from the baseline
* = bug fix applied after the submission
Results - ranking sub-task

Oracle methods: associate various metrics in an oracle manner to the test input:

Oracle Effort: the gold-label effort
Oracle HTER: the HTER metric against the post-edited translations as reference

System ID       DeltaAvg   Spearman Corr
Oracle Effort   0.95       1.00
Oracle HTER     0.77       0.70
Results - scoring sub-task

System ID                   MAE    RMSE
• SDLLW M5PbestDeltaAvg     0.61   0.75
UU best                     0.64   0.79
SDLLW SVM                   0.64   0.78
UU bltk                     0.64   0.79
Loria SVMlinear             0.68   0.82
UEdin                       0.68   0.82
TCD M5P-resources-only*     0.68   0.82
Baseline (17FFs SVM)        0.69   0.82
Loria SVMrbf                0.69   0.83
SJTU                        0.69   0.83
WLV-SHEF FS                 0.69   0.85
PRHLT-UPV                   0.70   0.85
WLV-SHEF BL                 0.72   0.86
DCU-SYMC unconstrained      0.75   0.97
DFKI grcfs-mars             0.82   0.98
DFKI cfs-plsreg             0.82   0.99
UPC 1                       0.84   1.01
DCU-SYMC constrained        0.86   1.12
UPC 2                       0.87   1.04
TCD M5P-all                 2.09   2.32
Discussion

New and effective quality indicators (features)

Most participating systems use external resources: parsers, POS taggers, NER, etc. → a wide variety of features

Many tried to exploit linguistically oriented features:
none or modest improvements (e.g. WLV-SHEF)
high performance (e.g. UU, with constituency and dependency trees)

Previously overlooked features: SMT decoder feature values (e.g. SDLLW)

A powerful single feature: agreement between two different SMT systems (e.g. SDLLW)
Discussion

Machine learning techniques

Best performing: regression trees (M5P) and SVR
M5P regression trees: compact models, less overfitting, "readable"
SVRs: easily overfit with small training data and a large feature set

Feature selection crucial in this setup

Structured learning techniques: UU submissions (tree kernels)
Discussion

Evaluation metrics

DeltaAvg → suitable for the ranking task
automatic and deterministic (and therefore consistent)
extrinsic interpretability, e.g.: average quality in [1-5] = 2.5; quality of the top 25% = 3.1; delta [1-5] = 0.6
versatile: the valuation function V can change
high correlation with Spearman, but less strict

MAE, RMSE → a difficult task, values stubbornly high

Regression vs ranking

Most submissions: regression results used to infer the ranking
The ranking approach is simpler and directly useful in many applications
Discussion

Establish state-of-the-art performance

"Baseline": hard to beat, previous state of the art

Metrics, datasets, and performance points now available

Known values for oracle-based upper bounds
Agreement between translators

Absolute value judgements: difficult to achieve consistency across annotators, even in a highly controlled setup

30% of the initial dataset discarded: annotators disagreed by more than one category
The remaining annotations had to be scaled
Agreement between translators

en-pt subtitles of TV series: 3 non-professional annotators, 1-4 scores

351 cases (41%): full agreement
445 cases (52%): partial agreement
54 cases (7%): null agreement

Agreement by score:

Score   Full   Partial/Null
4       59%    41%
3       35%    65%
2       23%    77%
1       50%    50%
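For reference, a small Python sketch of how such agreement categories can be counted (full = all three judges agree, partial = exactly two agree, null = all three differ; the score triples are made-up):

```python
from collections import Counter

def agreement_category(scores):
    """Classify one item's three annotator scores as full/partial/null agreement."""
    distinct = len(set(scores))
    if distinct == 1:
        return "full"
    if distinct == 2:
        return "partial"
    return "null"

# Made-up example: one 1-4 score triple per segment
annotations = [(4, 4, 4), (3, 4, 3), (1, 2, 3), (2, 2, 2)]
counts = Counter(agreement_category(s) for s in annotations)
print(counts)  # Counter({'full': 2, 'partial': 1, 'null': 1})
```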
More objective ways of generating absolute scores

TIME: varies considerably across translators (expected), e.g. measured in seconds per word

Can we normalise this variation?

Dedicated QE systems?
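One possible normalisation (an assumption on my part, not something prescribed in the talk) is to z-score each translator's seconds-per-word measurements separately, so that "fast" and "slow" post-editors become comparable:

```python
import statistics

def normalise_times(times_by_translator):
    """Z-score each translator's seconds-per-word values independently."""
    result = {}
    for translator, times in times_by_translator.items():
        mean, sd = statistics.mean(times), statistics.stdev(times)
        result[translator] = [(t - mean) / sd for t in times]
    return result

# Made-up seconds-per-word measurements for two post-editors
times = {"T1": [1.2, 3.4, 2.0, 5.1], "T2": [0.4, 1.1, 0.7, 1.9]}
print(normalise_times(times))
```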
More objective ways of generating absolute scores

HTER: edit distance between the MT output and its minimally post-edited version

HTER = \frac{\#\text{edits}}{\#\text{words in the post-edited version}}

Edits: substitute, delete, insert, shift

Analysis by Maarit Koponen (WMT-12) on post-edited translations with HTER and 1-5 scores:

A number of cases where translations with low HTER (few edits) were assigned low quality scores (high post-editing effort), and vice-versa
Certain edits seem to require more cognitive effort than others - not captured by HTER
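A simplified Python sketch of HTER: word-level edit distance (substitutions, insertions, deletions) divided by the length of the post-edited version. Real HTER, based on TER, also allows block shifts, which are omitted here for brevity:

```python
def simple_hter(mt_output, post_edited):
    """Approximate HTER: word-level Levenshtein edits / post-edited length (no shifts)."""
    hyp, ref = mt_output.split(), post_edited.split()
    # Dynamic-programming edit distance over words
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Made-up example: a word-order error counts as 2 edits here,
# whereas TER's shift operation would count it as 1
print(simple_hter("the house blue is big", "the blue house is big"))  # 0.4
```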
More objective ways of generating absolute scores

Keystrokes: different PE strategies - data from 8 translators (joint work with Maarit Koponen and Wilker Aziz)
Use of relative scores

Ranking of translations: suitable if the final application is to compare alternative translations of the same source sentence

N-best list re-ranking
System combination
MT system evaluation
What is the best metric to estimate PE effort?

Effort scores/HTER seem to lack "cognitive load"

Time varies too much across post-editors

Keystrokes seem to capture PE strategies, but do not correlate well with PE effort

Eye-tracking data can be useful, but is not always feasible
How to use estimated PE effort scores?

Should (supposedly) bad-quality translations be filtered out or shown to translators (with different scores/colour codes, as in TMs)?
Time wasted reading scores and poor translations vs "gisting" information wasted by filtering

How to define a threshold on the estimated translation quality to decide what should be filtered out?
Translator dependent
Task dependent (SDL)

Do translators prefer detailed estimates (sub-sentence level) or an overall estimate for the complete sentence?
Too much information vs hard-to-interpret scores
Conclusions

It is possible to estimate at least certain aspects of PE effort

PE effort estimates can be used in real applications:
Ranking translations: filtering out bad-quality translations
Selecting translations from multiple MT systems

A number of open issues still to be investigated...

My vision
Sub-sentence-level QE (error detection), highlighting errors but also giving an overall estimate for the sentence
Journal of MT - Special issue
15-06-12 - 1st CFP
15-08-12 - 2nd CFP
15-09-12 - submission deadline
15-10-12 - reviews due
End of December 2012 - camera-ready due (tentative)
WMT-12 QE Shared Task
All feature sets available