Goodness: A Method for Measuring Machine Translation
Confidence
Nguyen Bach Fei Huang and Yaser Al-Onaizan
Carnegie Mellon University IBM T.J. Watson Research Center
1
you totally different from zaid amr , and not to deprive yourself in a
basement of imitation and assimilation .
أنت مختلف تمامأ عن زيد وعمرو فال تحشر نفسك في سرداب
التقليد والمحاكاة والذوبان
MT output
Source
you totally different from zaid amr , and not to
deprive yourself in a basement of imitation and
assimilation .
We predict
and visualize
you are quite different from zaid and amr , so do not cram yourself in
the tunnel of simulation , imitation and assimilation . Human
correction
2
Why does it matter?
• Current MT systems don’t have efficient ways to predict machine translation confidence.
• Applications – Post-editing
– Commercial practice for end-users
– Down-stream processing: QA, IE
– MT engine: reranking
3
Outline
• Introduction
• Predicting
– Review
– Our models
– Feature sets
• Experiments
• N-best list reranking
• Visualizing
4
5
Confidence Measure
Back Translation (Gao et al. 2006, Bach et al. 2007)
Word Posterior Probability
(Ueffing&Ney, CL 2007)
Predicting BLEU (Soricut&Echihabi,
ACL 2010)
Binary Classification (Blazt et al., JHU 2003; Quirk, LREC
2004; Xiong, ACL 2010)
Features:
- Source & target LM
- WPP
- IBM Model 1
- Target POS
Our work belongs to this category
Can we do a better job?
• Can we use a rich feature set such as dependency structures, source-side information, and alignment context to improve error prediction performance?
• Can we predict translation error types?
• Do confidence measures help the MT system to select a better translation?
• How confidence score can be presented to improve end-user perception?
6
Confidence Measure Models
• Confidence estimation ≈ sequential labelling task
– Word Sequence = MT output
• Two classifiers
– Binary: Good/Bad
– Multi classes: Insertion/Substitution/Shift/Good.
• Supervised training
• Word -> Sentence confidence
7
Word-level Model
• A feature-rich classifier for a given word
• Discriminatively trained by MIRA
8
where f is its feature vector, y is the label, wy is
its feature weight vector
Sentence-level Model
• Given a sentence S, use word-level models to obtain labels for each words
• Use forward and backward values to compute marginal probability of Good label
• Normalize
9
Type 1: Source & Target Dependency Structures
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
He adds that this process also refers to the inability of the multinational naval forces
null
10
Type 1: Source & Target Dependency Structures
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
He adds that this process also refers to the inability of the multinational naval forces
Source Parent = an
Target Parent = that
Are parent aligned: Yes
11
Child-Parent Agreement
Type 1: Source & Target Dependency Structures
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
He adds that this process also refers to the inability of the multinational naval forces
Children Agreement: 2
12
Type 2: Source POS and Phrases
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
He adds that this process also refers to the inability of the multinational naval forces
VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ
null
13
Type 2: Source POS & Phrases
He adds that this process also refers to the inability of the multinational naval forces
VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
MT output
Source POS
Source
14
Type 2: Source POS and Phrases
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
1 if source-POS-sequence = “DT DTNN”
f125(target-word = “process”) =
0 otherwise
He adds that this process also refers to the inability of the multinational naval forces
VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ
MT output
Source POS
Source
15
Type 2: Source POS and Phrases
VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ
He adds that this process also refers to the inability of the multinational naval forces
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
MT output
Source POS
Source
16
Type 3: Alignment Context
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
He adds that this process also refers to the inability of the multinational naval forces
null
PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS
VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ
MT output
Source POS
Source
Target POS
17
Type 3: Alignment Context
PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS
VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
He adds that this process also refers to the inability of the multinational naval forces MT output
Source POS
Source
Target POS
18
Type 3: Alignment Context
PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS
VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
He adds that this process also refers to the inability of the multinational naval forces MT output
Source POS
Source
Target POS
19
Type 3: Alignment Context
PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS
VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
He adds that this process also refers to the inability of the multinational naval forces MT output
Source POS
Source
Target POS
20
Type 3: Alignment Context
PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS
VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
He adds that this process also refers to the inability of the multinational naval forces MT output
Source POS
Source
Target POS
21
Type 3: Alignment Context
PRP VBZ IN DT NN RB VBZ TO DT NN IN DT JJ JJ NNS
VBP IN DT DTNN RB VBP IN NN NN DTJJ DTJJ DTNNS DTJJ
wydyf an hdhh alamlyt ayda tshyr aly adm qdrt almtaddt aljnsyt alqwat albhryt
He adds that this process also refers to the inability of the multinational naval forces MT output
Source POS
Source
Target POS
22
Recap
• Source & Target Dependency
• Source POS and Phrases
• Alignment Context
23
Experiment Setup
you totally different from zaid amr , and not to deprive yourself in a basement of imitation
and assimilation .
أنت مختلف تمامأ عن زيد وعمرو فال تحشر نفسك في سرداب التقليد والمحاكاة والذوبان
MT output
Source
Human
correction
you are quite different from zaid and amr , so do not cram yourself in the tunnel of
simulation , imitation and assimilation .
24
Experiment Setup
• Training data – Human correction Arabic-English system: 72K sentences with 2.2M words;
mix newswire and weblog
• Dev and Unseen test set – Dev: 2,707 sentences, 80K words – Unseen: 2,707 sentences, 79K words
• WPP: implement WPP algorithm in (Ueffing&Ney, 2007)
• WPP+target-POS: similar feature set as in (Xiong, ACL 2010)
• Learner: MIRA, 100 iterations, hyper-param = 5, cut-off = 1
• Evaluation metric: F-score
25
How does the proposed method compare with previous work?
26
Performance of binary classifiers
27
59.4 59.3
69.3 68.7
69.6 69.1
72.1 71.5
72.4 72 72.6 72.2
58
60
62
64
66
68
70
72
74
dev test
F-sc
ore
Test sets
Predicting Good/Bad words
All-Good
WPP (Ueffing&Ney,CL 2007)
WPP+ target POS (Xiong,ACL 2010)
our features
WPP+ our features
WPP+ target POS + ourfeatures
Performance of 4-class classifiers
- Improve from WPP (Ueffing & Ney, CL 2007) and WPP+targetPOS (Xiong et
al., ACL 2010)
- Our features alone already outperformed previous work
28
59.4 59.3
64.4
63.7
64.4 63.9
66.2 65.6
66.6
65.9
66.8
66.1
58
59
60
61
62
63
64
65
66
67
68
dev test
F-sc
ore
Test sets
Predicting Good/Insertion/Substitution/Shift words
All-Good
WPP (Ueffing&Ney,CL 2007)
WPP+ target POS (Xiong,ACL 2010)
our features
WPP+ our features
WPP+ target POS + ourfeatures
What are the contributions of individual feature sets?
29
69.3
68.7
72.1
71.6 71.4
70.9
69.9
69.5
67
68
69
70
71
72
73
dev test
F-sc
ore
Test sets
Contributions of features for predicting G/B words
WPP (Ueffing&Ney, CL 2007) WPP + source features
WPP+ alignment features WPP+ dependency features
64.4
63.7
66.2
65.7 65.7
65.3
64.9
64.3
62
62.5
63
63.5
64
64.5
65
65.5
66
66.5
dev test
F-sc
ore
Test sets
Contributions of features for predicting G/I/S/Sh
WPP (Ueffing&Ney, CL 2007) WPP + source features
WPP+ alignment features WPP+ dependency features
Where are improvements coming from?
30
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 10 20 30 40 50 60 70 80 90 100
Go
od
ne
ss
HTER
Good Bad Linear fit
Goodness vs. HTER
Pearson correlation = 0.6
31
Outline
• Introduction
• Predicting
– Review
– Our models
– Feature sets
• Experiments
• N-best list reranking
• Visualizing
32
N-best list reranking
• Reranking n-best list by goodness
33
Dev Test TER BLEU TER BLEU
Baseline 49.9 31 50.2 30.6
2-best 49.5 31.4 49.9 30.8 5-best 49.2 31.4 49.6 30.8 50-best 49.1 30.9 49.4 30.5 100-best 49 30.9 49.3 30.5
N-best list reranking
• Reranking n-best list by goodness
34
Dev Test TER BLEU TER BLEU
Baseline 49.9 31 50.2 30.6
2-best 49.5 31.4 49.9 30.8 5-best 49.2 31.4 49.6 30.8 50-best 49.1 30.9 49.4 30.5 100-best 49 30.9 49.3 30.5
N-best list reranking
• Reranking n-best list by goodness
35
Dev Test TER BLEU TER BLEU
Baseline 49.9 31 50.2 30.6
2-best 49.5 31.4 49.9 30.8 5-best 49.2 31.4 49.6 30.8 50-best 49.1 30.9 49.4 30.5 100-best 49 30.9 49.3 30.5
Outline
• Introduction
• Predicting
– Review
– Our models
– Feature sets
• Experiments
• N-best list reranking
• Visualizing
36
Designing Objectives
• Minimize post editing efforts in word-level and sentence-level
• Simple and Intuitive
– Inspired by Wordle ©IBM
37
Designing Objectives
• Font size
–Big: bad – Small: good
– Medium: decent
• Color – Red: bad – Black: good – Orange: decent
• Threshold – Good word > 0.8; Bad word < 0.45 – Good sentence > 0.7; Bad sentence < 0.5
38
39
Bad translations: editing these sentences first
40
the poll also showed that most of the participants in the developing countries are ready
to introduce qualitative changes in the pattern of their lives for the sake of reducing the
effects of climate change.
واظهر االستطالع ايضا ان معظم المشاركين في الدول النامية مستعدون الدخال تغييرات نوعية على نمط حياتهم في
. سبيل خفض تأثيرات التغير المناخي
MT output
Source
the poll also showed that most of the participants in the developing countries are ready
to introduce qualitative changes in the pattern of their lives for the sake of
reducing the effects of climate change.
We predict
and visualize
the survey also showed that most of the participants in developing countries are ready
to introduce changes to the quality of their lifestyle in order to reduce the effects of
climate change .
Human
correction
41
perhaps racism of heads of human diseases and the most difficult treatment, survives,
only used rope,
لعل العنصرية من اوسأ امراض البشرية ومن اصعبها عالجا وال ينجو منها اال من استعان بحبل من هللا
MT output
Source
perhaps racism of heads of human diseases and the most difficult treatment,
survives, only used rope,
We predict
and visualize
racism is but the worst human disease and the hardest to recover ; just he who is
faithful to allah is able to recover . Human
correction
42
Contributions
• Source & Target Dependency Structures
• Source POS and Phrases
• Word Alignment Context
• Experimental results – Our features outperformed features described in the previous work
• Improve MT quality through n-best list reranking
• Visualizing confidence in word and sentence level
43
Thanks!
44