Machine Translation 12: (Non-neural) Statistical Machine Translation
Rico Sennrich
University of Edinburgh
R. Sennrich MT – 2018 – 12 1 / 27
Today’s Lecture
so far, the main focus of the lecture was on neural machine translation, and research since ≈ 2013

today, we look at (non-neural) Statistical Machine Translation, and research since ≈ 1990
MT – 2018 – 12

1 Statistical Machine Translation
  Basics
  Phrase-based SMT
  Hierarchical SMT
  Syntax-based SMT
Refresher: A probabilistic model of translation
Suppose that we have:
  a source sentence S of length m (x1, . . . , xm)
  a target sentence T of length n (y1, . . . , yn)

We can express translation as a probabilistic model:

T* = argmax_T P(T|S)
   = argmax_T P(S|T) P(T)    (Bayes' theorem)

We can model translation via two models:
  a language model to estimate P(T)
  a translation model to estimate P(S|T)

Without continuous space representations, how do we estimate P(S|T)?
→ break it up into smaller units
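As a toy illustration of this decision rule (all candidate sentences and probabilities below are invented), the decoder scores each candidate by log P(S|T) + log P(T) and picks the argmax:

```python
# Hypothetical candidate translations T for some source sentence S, with
# made-up log-probabilities from a translation model and a language model.
candidates = {
    "the house is small":  {"log_p_s_given_t": -2.1, "log_p_t": -4.0},
    "the house is little": {"log_p_s_given_t": -2.0, "log_p_t": -5.5},
    "small the house is":  {"log_p_s_given_t": -1.9, "log_p_t": -9.0},
}

def noisy_channel_score(c):
    # log P(S|T) + log P(T) corresponds to P(S|T) * P(T)
    return c["log_p_s_given_t"] + c["log_p_t"]

best = max(candidates, key=lambda t: noisy_channel_score(candidates[t]))
```

Note how the disfluent candidate loses despite its good translation-model score: the language model P(T) penalizes it.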
Word Alignment
chicken-and-egg problem
let's break up P(S|T) into small units (words):

we can estimate an alignment given a translation model (expectation step)
we can estimate a translation model given an alignment, using relative frequencies (maximization step)
what can we do if we have neither?

solution: Expectation Maximization (EM) Algorithm
initialize model
iterate between estimating alignment and translation model

simplest model based on lexical translation; more complex models consider position and fertility
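The EM loop can be sketched for IBM Model 1 (a minimal version without the NULL word or smoothing; the toy corpus and variable names are illustrative):

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=20):
    """EM training of IBM Model 1 lexical translation probabilities.

    corpus: list of (source_words, target_words) sentence pairs.
    Returns t with t[(s, e)] = p(s | e) for source word s, target word e.
    """
    src_vocab = {s for src, _ in corpus for s in src}
    uniform = 1.0 / len(src_vocab)
    t = defaultdict(lambda: uniform)   # initial step: all alignments equally likely

    for _ in range(iterations):
        count = defaultdict(float)     # expected counts c(s|e)   (E-step)
        total = defaultdict(float)     # normaliser per target word e
        for src, tgt in corpus:
            for s in src:
                z = sum(t[(s, e)] for e in tgt)
                for e in tgt:
                    p = t[(s, e)] / z  # expected probability that s aligns to e
                    count[(s, e)] += p
                    total[e] += p
        # M-step: re-estimate t(s|e) from relative frequencies
        t = defaultdict(lambda: uniform,
                        {pair: count[pair] / total[pair[1]] for pair in count})
    return t

corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
t = ibm_model1(corpus)
```

On this toy corpus the probability mass for la concentrates on the over the iterations, exactly the hidden structure EM is meant to discover.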
Word Alignment: IBM Models [Brown et al., 1993]
EM Algorithm
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...

• Initial step: all alignments equally likely
• Model learns that, e.g., la is often aligned with the
• After one iteration: alignments, e.g., between la and the are more likely
• After another iteration: it becomes apparent that alignments, e.g., between fleur and flower are more likely (pigeonhole principle)
• Convergence: inherent hidden structure revealed by EM

Chapter 4: Word-Based Models 14
Word Alignment: IBM Models [Brown et al., 1993]
IBM Model 1 and EM
• Probabilities
  p(the|la) = 0.7      p(house|la) = 0.05
  p(the|maison) = 0.1  p(house|maison) = 0.8

• Alignments of "la maison" → "the house" (each target word aligned to one source word):
  the←la, house←maison:     p(e,a|f) = 0.56    p(a|e,f) = 0.824
  the←la, house←la:         p(e,a|f) = 0.035   p(a|e,f) = 0.052
  the←maison, house←maison: p(e,a|f) = 0.08    p(a|e,f) = 0.118
  the←maison, house←la:     p(e,a|f) = 0.005   p(a|e,f) = 0.007

• Counts
  c(the|la) = 0.824 + 0.052       c(house|la) = 0.052 + 0.007
  c(the|maison) = 0.118 + 0.007   c(house|maison) = 0.824 + 0.118
Linear Models
T* = argmax_T P(S|T) P(T)    (Bayes' theorem)

T* ≈ argmax_T Σ_{m=1}^{M} λ_m h_m(S, T)    [Och, 2003]

linear combination of arbitrary features

Minimum Error Rate Training (MERT) to optimize feature weights
big trend in SMT research: engineering new/better features
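A minimal sketch of the linear scoring (the feature names, values, and weights are invented for illustration; in practice the weights λ are tuned with MERT on a development set):

```python
def linear_score(features, weights):
    """Linear model score: sum over m of lambda_m * h_m(S, T).
    Feature names are illustrative, not any toolkit's actual names."""
    return sum(weights[name] * h for name, h in features.items())

# weights lambda_m (made up; normally tuned with MERT)
weights = {"tm_log_prob": 1.0, "lm_log_prob": 0.8, "phrase_penalty": -0.5}

# two hypothetical translation hypotheses with their feature values h_m
hyp_a = {"tm_log_prob": -2.1, "lm_log_prob": -4.0, "phrase_penalty": 3}
hyp_b = {"tm_log_prob": -2.0, "lm_log_prob": -5.5, "phrase_penalty": 3}
```

With only the translation- and language-model log-probabilities as features and both weights set to 1, this reduces to the noisy-channel model of the previous slide.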
Word-based SMT
core idea
combine word-based translation model and n-gram language model to compute the score of a translation

consequences
+ models are easy to compute
- word translations are assumed to be independent of each other: only the LM takes context into account
- poor at modelling long-distance phenomena: n-gram context is limited
Phrase-based SMT
core idea
basic translation unit in the translation model is not a word, but a word sequence (phrase)

consequences
+ much better memorization of frequent phrase translations
- large (and noisy) phrase table
- large search space; requires sophisticated pruning
- still poor at modelling long-distance phenomena
[figure: phrase alignment of "leider ist Herr Steiger nach Köln gefahren" with "unfortunately , Mr Steiger has gone to Cologne"]
Phrase Extraction
extraction rules based on word-aligned sentence pair
phrase pair must be compatible with alignment...
...but unaligned words are ok
phrases are contiguous sequences
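The consistency criterion can be sketched as follows (a simplified version of standard phrase extraction: unaligned words inside a span are allowed, but the extension of phrase boundaries over unaligned edge words is omitted; positions are 0-indexed):

```python
def extract_phrases(n_src, alignment, max_len=7):
    """Extract phrase pairs consistent with a word alignment.

    n_src: source sentence length; alignment: set of (i, j) links
    (source position i, target position j).
    A pair of spans is consistent iff no alignment link crosses the
    phrase boundary.
    """
    phrases = []
    for i1 in range(n_src):
        for i2 in range(i1, min(n_src, i1 + max_len)):
            # target positions linked to the source span [i1, i2]
            js = [j for (i, j) in alignment if i1 <= i <= i2]
            if not js:
                continue
            j1, j2 = min(js), max(js)
            # consistency: every link touching target span [j1, j2]
            # must originate inside the source span
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                phrases.append(((i1, i2), (j1, j2)))
    return phrases

# toy fragment "I shall be" / "Ich werde": shall and be both align to werde
alignment = {(0, 0), (1, 1), (2, 1)}
pairs = extract_phrases(3, alignment)
```

Here ("shall be", "werde") is extracted, but ("shall", "werde") is not, because "be" also links into the target span.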
Extracting Phrase Translation Rules
[figure: word-aligned sentence pair "I shall be passing on to you some comments" ↔ "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen"]

shall be = werde

Syntax-based Statistical Machine Translation 36
some comments = die entsprechenden Anmerkungen
werde Ihnen die entsprechenden Anmerkungen aushändigen = shall be passing on to you some comments
Common Features in Phrase-based SMT
phrase translation probabilities (in both directions)
word translation probabilities (in both directions)
language model
reordering model
constant penalty for each phrase used
sparse features with learned cost for some (classes of) phrase pairs
multiple models of each type possible
Decoding
Translation Options
er geht ja nicht nach hause

[figure: table of translation options for each span of the source sentence, e.g. er → he / it; geht → goes / go / is; ja → yes / , of course; nicht → not / do not / does not / is not; nach → after / to / according to / in; hause → house / home / chamber; plus multi-word options such as nach hause → home / at home / return home]
• The machine translation decoder does not know the right answer:
  – picking the right translation options
  – arranging them in the right order
→ search problem, solved by heuristic beam search
Chapter 6: Decoding 9
Decoding: Hypothesis Expansion

er geht ja nicht nach hause

[figure: hypothesis expansion in the search graph, growing partial translations from options such as he, it, goes, does not, yes, go, to, home]

pick any translation option, create new hypothesis
create hypotheses for all other translation options
also create hypotheses from created partial hypotheses

Decoding: Find Best Path

backtrack from the highest-scoring complete hypothesis
Decoding
large search space (exponential number of hypotheses)

reduction of search space:
  recombination of identical hypotheses
  pruning of hypotheses

efficient decoding is a lot more complex in SMT than in neural MT
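The mechanics can be sketched in a toy stack decoder (no language model or distortion cost, so recombination keys only on the coverage set; a real decoder also compares LM histories before merging hypotheses; the phrase table is invented):

```python
def stack_decode(src, phrase_table, beam_size=10):
    """Toy phrase-based stack decoding; scores are log probabilities.

    phrase_table: dict mapping source-word tuples to lists of
    (translation, log_prob). Hypotheses covering k source words live
    in stack k; stacks are pruned to beam_size before expansion.
    """
    n = len(src)
    stacks = [dict() for _ in range(n + 1)]   # coverage -> (score, output)
    stacks[0][()] = (0.0, ())
    for k in range(n):
        # histogram pruning: expand only the beam_size best hypotheses
        best = sorted(stacks[k].items(), key=lambda kv: -kv[1][0])[:beam_size]
        for coverage, (score, out) in best:
            for i in range(n):
                for j in range(i + 1, n + 1):
                    span = tuple(range(i, j))
                    if set(span) & set(coverage):
                        continue  # span already (partially) translated
                    for trans, lp in phrase_table.get(tuple(src[i:j]), []):
                        new_cov = tuple(sorted(coverage + span))
                        cand = (score + lp, out + tuple(trans.split()))
                        stack = stacks[len(new_cov)]
                        # recombination: keep the best hypothesis
                        # per coverage state
                        if new_cov not in stack or stack[new_cov][0] < cand[0]:
                            stack[new_cov] = cand
    return stacks[n].get(tuple(range(n)))

phrase_table = {
    ("er",): [("he", -0.1)],
    ("geht",): [("goes", -0.2)],
    ("nicht",): [("not", -0.1)],
    ("geht", "nicht"): [("does not go", -0.2)],
}
result = stack_decode(["er", "geht", "nicht"], phrase_table)
```

The multi-word phrase pair wins here: "he does not go" (score −0.3) beats the word-by-word "he goes not" (−0.4), illustrating why phrase tables help.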
Hierarchical SMT
core idea
use context-free grammar (CFG) rules as basic translation units
→ allows gaps

consequences
+ better modeling of some reordering patterns

[figure: a discontiguous rule translates "leider ist X1 nach X2 gefahren" as "unfortunately , X1 has gone to X2", applied to X1 = Herr Steiger / Mr Steiger and X2 = Köln / Cologne]

- overgeneralisation is still possible

[figure: the same pattern misapplied to "leider ist Herr Steiger nicht nach Köln gefahren" yields the ungrammatical "unfortunately , Herr Steiger does not has gone to Cologne"]
Hierarchical Phrase Extraction
Extracting Hierarchical Phrase Translation Rules
[figure: word-aligned sentence pair "I shall be passing on to you some comments" ↔ "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen"]

werde X aushändigen = shall be passing on X    (subtracting subphrase)
Decoding

Decoding via (S)CFG derivation

   s1 | s1
⇒ s2 x3 | s2 x3                                                      (s → s1 x2 | s1 x2, glue rule)
⇒ s2 x4 und x5 | s2 x4 and x5                                        (x → x1 und x2 | x1 and x2)
⇒ s2 unzutreffend und x5 | s2 unfounded and x5                       (x → unzutreffend | unfounded)
⇒ s2 unzutreffend und irreführend | s2 unfounded and misleading      (x → irreführend | misleading)
⇒ x6 unzutreffend und irreführend | x6 unfounded and misleading      (s → x1 | x1, glue rule)
⇒ deshalb x7 die x8 unzutreffend und irreführend
  | therefore the x8 x7 unfounded and misleading                     (x → deshalb x1 die x2 | therefore the x2 x1, non-terminal reordering)
⇒ deshalb sei die x8 unzutreffend und irreführend
  | therefore the x8 was unfounded and misleading                    (x → sei | was)
⇒ deshalb sei die Werbung unzutreffend und irreführend
  | therefore the advertisement was unfounded and misleading         (x → Werbung | advertisement)

• The derivation starts with a pair of linked s symbols; each step applies the rule shown on the right, rewriting linked non-terminals synchronously on both sides.
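A minimal sketch of executing such a synchronous derivation (data structures are mine, and the rules are applied in leftmost order rather than the exact order of the slides; real decoders instead parse the source bottom-up with CKY and score competing derivations):

```python
import itertools

def expand(form, idx, rhs, index_map):
    """Replace the nonterminal carrying link index idx by rhs.
    Nonterminals are (symbol, index) tuples, terminals plain strings;
    rhs nonterminals get fresh indices via index_map."""
    out = []
    for tok in form:
        if isinstance(tok, tuple) and tok[1] == idx:
            for r in rhs:
                out.append((r[0], index_map[r[1]]) if isinstance(r, tuple) else r)
        else:
            out.append(tok)
    return out

def derive(steps):
    """Apply SCFG rules, each rewriting the leftmost source-side
    nonterminal and its linked partner on the target side."""
    counter = itertools.count(2)
    src, tgt = [("s", 1)], [("s", 1)]
    for src_rhs, tgt_rhs in steps:
        idx = next(tok[1] for tok in src if isinstance(tok, tuple))
        links = sorted({tok[1] for tok in src_rhs if isinstance(tok, tuple)})
        index_map = {k: next(counter) for k in links}
        src = expand(src, idx, src_rhs, index_map)
        tgt = expand(tgt, idx, tgt_rhs, index_map)
    return (" ".join(t for t in src if isinstance(t, str)),
            " ".join(t for t in tgt if isinstance(t, str)))

# the rules of the derivation, in leftmost-rewriting order
steps = [
    ([("s", 1), ("x", 2)], [("s", 1), ("x", 2)]),        # glue rule
    ([("x", 1)], [("x", 1)]),                            # glue rule
    (["deshalb", ("x", 1), "die", ("x", 2)],
     ["therefore", "the", ("x", 2), ("x", 1)]),          # nonterminal reordering
    (["sei"], ["was"]),
    (["Werbung"], ["advertisement"]),
    ([("x", 1), "und", ("x", 2)], [("x", 1), "and", ("x", 2)]),
    (["unzutreffend"], ["unfounded"]),
    (["irreführend"], ["misleading"]),
]
src_out, tgt_out = derive(steps)
```

The reordering rule swaps the two linked nonterminals on the target side, which is what lets hierarchical models capture this reordering pattern with a single rule.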
Syntax-based SMT
core idea
use syntax on source, target, or both
rule extraction constrained by syntax
potentially use syntactic structures for scoring (syntax-based LMs)

consequences
depend on the exact flavor of syntax used; here: string-to-tree SMT
+ less overgeneralisation
- sparsity in grammar requires relaxation of extraction constraints
- label-matching constraints increase the search space during decoding
[figure: string-to-tree example: parse trees for "leider ist Herr Steiger nach Köln gefahren" (S with ADV leider, VAFIN ist, NP(NN Herr, NN Steiger), PP(APPR nach, NE Köln), VVPP gefahren) and "unfortunately , Mr Steiger has gone to Cologne" (S with ADV unfortunately, NP(NNP Mr, NNP Steiger), VP(VBZ has, VBN gone, PP(TO to, NNP Cologne)))]
Syntax-based Phrase Extraction
Learning Syntactic Translation Rules
[figure: rule extraction from the parsed, word-aligned sentence pair "I shall be passing on to you some comments" ↔ "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen"; an example rule pairs the German pronoun "Ihnen" with the English pp "to you"]
Decoding
Example
Input:  jemand mußte Josef K. verleumdet haben
        (gloss: someone must Josef K. slandered have)

Grammar:
r1: np → Josef K. | Josef K.  0.90
r2: vbn → verleumdet | slandered  0.40
r3: vbn → verleumdet | defamed  0.20
r4: vp → mußte x1 x2 haben | must have vbn2 np1  0.10
r5: s → jemand x1 | someone vp1  0.60
r6: s → jemand mußte x1 x2 haben | someone must have vbn2 np1  0.80
r7: s → jemand mußte x1 x2 haben | np1 must have been vbn1 by someone  0.05

Derivation 1:
[figure: paired source and target trees translating "jemand mußte Josef K. verleumdet haben" as "someone must have slandered Josef K."]
Why Syntax-based SMT?
many variants (syntax on source/target/both...)
syntactic constraints for rule extraction and application prevent some over-generalizations
syntactic structure can be exploited by feature functions:
unification constraints [Williams, 2009]
“eine” → [ cat: ART;  infl: [case: nom, declension: mixed];  agr: [gender: f, num: sg] ]
“Welt” → [ cat: NN;   infl: [case: nom];                     agr: [gender: f, num: sg] ]
syntax-based neural language model [Sennrich, 2015]
P_SYNTAX(T, D) ≈ ∏_{i=1}^{n} P_l(i) × P_w(i)

P_l(i) = P(l_i | w_a(i), l_a(i))
P_w(i) = P(w_i | l_i, w_a(i), l_a(i))

[figure: dependency tree for "Laura hat einen kleinen Garten" (Laura has a small garden), with relations root, subj, obja, attr, det]
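The factorisation can be illustrated numerically for this toy dependency derivation (all probabilities below are invented; a(i) denotes the ancestor context of word i):

```python
import math

# Toy dependency derivation for "Laura hat einen kleinen Garten"
# ("Laura has a small garden"); probabilities are made up.
events = [
    # word,     label l_i,  P_l(i) = P(l_i | context),  P_w(i) = P(w_i | l_i, context)
    ("hat",     "root",     0.9,  0.05),
    ("Laura",   "subj",     0.6,  0.01),
    ("Garten",  "obja",     0.5,  0.02),
    ("kleinen", "attr",     0.4,  0.03),
    ("einen",   "det",      0.8,  0.30),
]

# P_SYNTAX(T, D) ~ product over i of P_l(i) * P_w(i), computed in log space
log_p_syntax = sum(math.log(pl) + math.log(pw) for _, _, pl, pw in events)
```

Each word thus contributes two factors: one for predicting its dependency label from the ancestor context, one for predicting the word itself.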
Edinburgh’s* WMT Results over the Years
[bar chart: BLEU on newstest2013 (EN→DE), 2013–2017; phrase-based SMT roughly 20.3–21.5 across the years, syntax-based SMT roughly 19.4–22.1, and neural MT at 18.9 in 2015 but 24.7 and 26.0 in 2016/2017]

phrase-based SMT · syntax-based SMT · neural MT

*NMT 2015 from U. Montréal: https://sites.google.com/site/acl16nmt/
What Phrase-based SMT (Still) Does Better than NMT
better performance in low-data conditions [Koehn and Knowles, 2017]
clear stopping criterion at decoding time: when all source words have been covered by a phrase pair

good ecosystem of methods for specialized requirements (e.g. inclusion of terminology)

ability to inspect translation decisions and models:
  alignment between source and output
  add/remove phrase table entries
Software
Moses SMT Toolkit
developed in Edinburgh
many features and extensive documentation:http://www.statmt.org/moses
documentation of baseline phrase-based systems:http://www.statmt.org/moses/?n=moses.baseline
http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2017/baseline/baselineSystemPhrase_kj.html
config files for SOTA (in 2014/5) syntax-based systems:https://github.com/rsennrich/wmt2014-scripts
Further Reading
text books
Philipp Koehn (2009). Statistical Machine Translation.
Philip Williams; Rico Sennrich; Matt Post; Philipp Koehn (2016). Syntax-based Statistical Machine Translation.

online resources
syntax-based tutorial by Philip Williams and Philipp Koehn (slide credit to them for some slides shown here):
http://homepages.inf.ed.ac.uk/s0898777/syntax-tutorial.pdf
slides on word- and phrase-based SMT by Philipp Koehn:http://www.statmt.org/book/slides/04-word-based-models.pdf
http://www.statmt.org/book/slides/05-phrase-based-models.pdf
http://www.statmt.org/book/slides/06-decoding.pdf
Bibliography I
Brown, P. F., Della Pietra, V. J., Della Pietra, S. A., and Mercer, R. L. (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311.

Koehn, P. and Knowles, R. (2017). Six Challenges for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.

Och, F. J. (2003). Minimum Error Rate Training in Statistical Machine Translation. In ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160–167, Sapporo, Japan. Association for Computational Linguistics.

Sennrich, R. (2015). Modelling and Optimizing on Syntactic N-Grams for Statistical Machine Translation. Transactions of the Association for Computational Linguistics, 3:169–182.

Williams, P. (2009). Towards Statistical Machine Translation with Unification Grammars. Master's thesis, University of Edinburgh, Edinburgh, UK.