Recurrent neural networks for statistical machine translation
(Neural Machine Translation)
Outline
● Machine translation
● Statistical machine translation
● Traditional methods
● Why Neural Machine Translation (NMT)
● General sequence learning with RNNs
● Encoder-decoder approach
● Attention-based NMT
● Using monolingual corpora in NMT
● Character-based NMT
Machine Translation
● Translate a source sentence F to a target sentence E
● Define set of rules (Rule based translation)
● We don’t even know the set of rules underlying a single language
● Statistical approaches attempt to extract rules automatically from a large corpus of text, either implicitly or explicitly
Statistical Machine Translation
Corpora
e=(economic, growth, has, slowed, down, in, recent, years, .)
f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)
Translation function
Traditional methods
Word-Based Model
● Words may not be the best candidates for the smallest units
● One word may translate to multiple words and vice versa
● A source word may have no corresponding word in the target translation, and vice versa

the house is very small
das Haus ist klitzeklein
Phrase-Based Model
● Translate small word sequences at a time
● The foreign input is segmented into phrases
● Each phrase is translated into English
● Phrases are reordered
of course john has fun with the game
natuerlich hat john spass am spiel
Phrase-Based Model (cont.)
● Fundamental equation of SMT

  e_best = argmax_e p(e | f) = argmax_e p(f_1^I | e_1^I) p_LM(e)

  – Translation model: p(f_1^I | e_1^I)
  – Language model: p_LM(e)

● Decomposition of the translation model

  p(f_1^I | e_1^I) = ∏_{i=1..I} φ(f_i | e_i) d(start_i − end_{i−1} − 1)

  – Phrase translation probability φ
  – Reordering (distortion) probability d
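To make the decomposition concrete, here is a minimal Python sketch of scoring one segmented hypothesis against a hypothetical phrase table, using an exponential distortion penalty for d; the phrases, positions, and probabilities are illustrative only, not taken from any real system.

import math

# Hypothetical phrase table: (foreign phrase, english phrase) -> probability phi(f|e)
phrase_table = {
    ("natuerlich", "of course"): 0.5,
    ("hat john", "john has"): 0.8,
    ("spass am spiel", "fun with the game"): 0.7,
}

def translation_log_score(phrase_pairs, alpha=0.75):
    """Sum of log phi(f_i|e_i) + log d(start_i - end_{i-1} - 1).

    phrase_pairs: list of (f_phrase, e_phrase, start, end), in target-language order,
    with start/end positions in the foreign sentence.
    d is modelled here as an exponential distortion penalty alpha^|distance|.
    """
    score, prev_end = 0.0, 0
    for f, e, start, end in phrase_pairs:
        score += math.log(phrase_table[(f, e)])            # log phi(f_i|e_i)
        score += abs(start - prev_end - 1) * math.log(alpha)  # log d(.)
        prev_end = end
    return score

# Monotone segmentation of the example sentence pair
score = translation_log_score([
    ("natuerlich", "of course", 1, 1),
    ("hat john", "john has", 2, 3),
    ("spass am spiel", "fun with the game", 4, 6),
])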
Phrase Translation Table
● Main knowledge source: table with phrase translations and their probabilities
● Example: phrase translations for natuerlich
Translation       Probability φ(e|f)
of course         0.5
naturally         0.3
of course ,       0.15
, of course ,     0.05
Weighted Model
● Some sub-models may be more important than others
● Add weights
e_best = argmax_e ∏_{i=1..I} φ(f_i | e_i) d(start_i − end_{i−1} − 1) · ∏_{i=1..|e|} p_LM(e_i | e_1 … e_{i−1})

With weights λ_φ, λ_d, λ_LM:

e_best = argmax_e ∏_{i=1..I} φ(f_i | e_i)^λ_φ d(start_i − end_{i−1} − 1)^λ_d · ∏_{i=1..|e|} p_LM(e_i | e_1 … e_{i−1})^λ_LM
Weighted Model as Log-Linear Model
p(e, a | f) = exp{ λ_φ ∑_{i=1..I} log φ(f_i | e_i)
                 + λ_d ∑_{i=1..I} log d(a_i − b_{i−1} − 1)
                 + λ_LM ∑_{i=1..|e|} log p_LM(e_i | e_1 … e_{i−1}) }
Log-Linear Model
● Such a weighted model is a log-linear model:
● Feature functions
  – number of feature functions: n = 3
  – random variable: x = (e, f, start, end)
  – feature function h_1 = log φ
  – feature function h_2 = log d
  – feature function h_3 = log p_LM
● More features can be added

p(x) = exp ∑_{i=1..n} λ_i h_i(x)
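As a rough illustration (not any particular decoder), a hypothesis score under such a log-linear model could be computed as below; the feature names and weight values are made up for the example.

import math

def log_linear_score(features, weights):
    """Unnormalized log-linear score: exp(sum_i lambda_i * h_i(x)).

    features: dict of feature values h_i(x), e.g. log phi, log d, log p_LM
    weights:  dict of the corresponding lambda_i
    """
    return math.exp(sum(weights[name] * value for name, value in features.items()))

# Hypothetical feature values for one translation hypothesis
h = {"log_phi": math.log(0.5), "log_d": math.log(0.75), "log_plm": math.log(0.01)}
lam = {"log_phi": 1.0, "log_d": 0.6, "log_plm": 1.2}
score = log_linear_score(h, lam)   # hypotheses are compared by this (unnormalized) score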
Problems with traditional approaches
● Highly memory intensive
● The initial alignment makes a conditional independence assumption
● Word and phrase translation models only count co-occurrences of surface forms – they do not take word similarity into account
● Decoding phrase-based translation is highly non-trivial
  – word alignments + lexical reordering model
  – language model
  – phrase translations
Evaluation
● BLEU (BiLingual Evaluation Understudy)
● N-gram overlap between machine translation output and reference translation
● Compute precision for n-grams of size 1 to 4
● Add brevity penalty (for too short translations)
BLEU = min(1, output-length / reference-length) · ( ∏_{i=1..4} precision_i )^{1/4}
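A minimal sketch of this simplified BLEU (single reference, with the brevity penalty as written on the slide rather than the exponential penalty of Papineni et al.):

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(output, reference, max_n=4):
    """Brevity penalty min(1, |out|/|ref|) times the geometric mean of
    1..4-gram precisions against a single reference."""
    out, ref = output.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        out_ngrams, ref_counts = ngrams(out, n), Counter(ngrams(ref, n))
        matches = sum(min(c, ref_counts[g]) for g, c in Counter(out_ngrams).items())
        if matches == 0:
            return 0.0            # any zero precision makes the geometric mean zero
        log_prec += math.log(matches / len(out_ngrams)) / max_n
    bp = min(1.0, len(out) / len(ref))
    return bp * math.exp(log_prec)

print(bleu("airport security Israeli officials are responsible",
           "Israeli officials are responsible for airport security"))   # ~0.52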
Example
SYSTEM A: Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
SYSTEM B: airport security Israeli officials are responsible
(System A matches one 2-gram plus isolated 1-grams; System B matches up to a 4-gram.)

Metric               System A   System B
precision (1-gram)   3/6        6/6
precision (2-gram)   1/5        4/5
precision (3-gram)   0/4        2/4
precision (4-gram)   0/3        1/3
brevity penalty      6/7        6/7
BLEU                 0%         52%
Multiple Reference Translations
● To account for variability, use multiple reference translations
  – n-grams may match in any of the references
  – the closest reference length is used
● Example

SYSTEM:     Israeli officials responsibility of airport safety

REFERENCES: Israeli officials are responsible for airport security
            Israel is in charge of the security at this airport
            The security work for this airport is the responsibility of the Israel government
            Israeli side was in charge of the security of this airport
General sequence learning using RNNs (Text analysis example)
Unfolding in time

(Diagram: an RNN with input x, hidden state s and output o is unfolded in time into inputs x_{t−1}, x_t, x_{t+1}, hidden states s_{t−1}, s_t, s_{t+1} and outputs o_{t−1}, o_t, o_{t+1}. U maps the input to the hidden state, W carries the hidden state to the next time step, and V maps the hidden state to the output. The weights U, V, W are tied across all time steps.)
How RNN works

(Diagram: each word w_i is 1-of-K coded, mapped to a continuous-space word representation s_i, and fed into the recurrent state h_i; the final state feeds a softmax over the output classes +ive / −ive.)

e=(economic, growth, has, slowed, down, in, recent, years, .)

For example: predict the sentiment (+ive / −ive) of the input document.
Step 1: Word to one-hot vector

(Step 1 of the diagram: each word is 1-of-K coded into a one-hot vector x_i.)

e=(economic, growth, has, slowed, down, in, recent, years, .)
Step 2: One-hot vector to continuous-space representation

(Step 2 of the diagram: the one-hot vector x_i is mapped to a continuous-space word representation s_i.)

e=(economic, growth, has, slowed, down, in, recent, years, .)

s_i = E_{d×K} x_i

Projects the 1-of-K coded vector with a matrix E onto a d-dimensional (typically 100–500) continuous word representation.
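In code, this projection is nothing more than selecting a column of E; the vocabulary size and dimensionality below are assumed for illustration.

import numpy as np

K, d = 50000, 300                  # vocabulary size and embedding dimension (assumed)
E = np.random.randn(d, K) * 0.01   # continuous-space word representation matrix

def one_hot(index, K):
    x = np.zeros(K)
    x[index] = 1.0
    return x

x_i = one_hot(42, K)               # 1-of-K coded word
s_i = E @ x_i                      # s_i = E_{d×K} x_i
# In practice the matrix product is just a column lookup: s_i == E[:, 42]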
Step 3: Sequence summarization by RNN

(Step 3 of the diagram: the recurrent state h_i summarizes the sequence read so far.)

e=(economic, growth, has, slowed, down, in, recent, years, .)

h_i = φ_θ(h_{i−1}, s_i)

The sequence of continuous vectors s_i is summarized by the RNN.
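A minimal sketch of this recurrence, assuming a plain tanh transition for φ_θ (in practice gated units such as LSTM or GRU are used); all sizes and inputs are placeholders.

import numpy as np

d, m = 300, 512                           # embedding and hidden sizes (assumed)
U = np.random.randn(m, d) * 0.01          # input-to-hidden weights
W = np.random.randn(m, m) * 0.01          # hidden-to-hidden weights
b = np.zeros(m)

def rnn_step(h_prev, s_i):
    """h_i = phi_theta(h_{i-1}, s_i), here with a simple tanh transition."""
    return np.tanh(U @ s_i + W @ h_prev + b)

# Stand-ins for the continuous representations s_i of a 9-word sentence
word_vectors = [np.random.randn(d) for _ in range(9)]

h = np.zeros(m)                           # h_0
for s_i in word_vectors:
    h = rnn_step(h, s_i)                  # the final h summarizes the whole sentence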
Architecture is flexible

(Diagram: the same 1-of-K coding → continuous-space word representation → recurrent state pipeline applies to input sequences of any length.)

e=(economic, growth, has, slowed, down, in, recent, years, .)

Summary sentence representation vectors

Sentence representations from [Sutskever et al., 2014]; similar sentences are close together.
Step 4: Use the summarization for prediction

(Step 4 of the diagram: the final recurrent state feeds a softmax that predicts +ive / −ive.)

e=(economic, growth, has, slowed, down, in, recent, years, .)
Range of applications
● image classification: fixed-sized input to fixed-sized output
● image captioning: sequence output
● sentiment analysis: sequence input
● machine translation: sequence input and sequence output
● video description: synced sequence input and output

Image: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Neural Machine Translation
● Neural Machine Translation: a single neural net maps the input sentence directly to the output sentence
● Different from earlier hybrids:
  – SMT + reranking by a neural net (Schwenk et al., 2006)
  – SMT with neural-network components (Devlin et al., 2014)
Neural Machine Translation
(Ñeco&Forcada, 1997)
(Kalchbrenner et al., 2013)
(Cho et al., 2014)
(Sutskever et al., 2014)
Encoder-Decoder Architecture for Machine Translation
Encoder-Decoder Approach
Input sentence → Encoder → Fixed-size representation → Decoder → Output sentence
(Ñeco&Forcada, 1997)
(Kalchbrenner et al., 2013)
(Cho et al., 2014)
(Sutskever et al., 2014)
RNN Encoder-Decoder (Cho et al. 2014)
(Diagram: the encoder RNN reads the source words x_1, x_2, …, x_L into hidden states h_1, h_2, …, h_L; the final state serves as a fixed-size representation that initializes the decoder RNN states s_0, s_1, …, s_{T−1}, which emit the target words y_1, y_2, …, y_T.)

Neural machine translation system: 1-of-K coding → continuous-space word representation s_i → encoder recurrent state h_i → decoder recurrent state z_i → word probability P_i → word sample u_i.

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)
e=(economic, growth, has, slowed, down, in, recent, years, .)
The Encoder
Step 1: Word to one-hot vector

(Step 1 of the encoder diagram: each source word is 1-of-K coded into a one-hot vector x_i.)

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)
Step 2: One-hot vector to continuous-space representation

(Step 2 of the encoder diagram: the one-hot vector x_i is mapped to a continuous-space word representation s_i.)

s_i = E_{d×K} x_i

Projects the 1-of-K coded vector with a matrix E onto a d-dimensional (typically 100–500) continuous word representation. E, and hence s_i, is learned to maximize translation performance.

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)
Step 3: Sequence summarization by RNN

(Step 3 of the encoder diagram: the recurrent state h_i summarizes the source words read so far.)

h_i = φ_θ(h_{i−1}, s_i)

The sequence of continuous vectors s_i is summarized by the RNN.

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)

Summary sentence representation vectors

Sentence representations from [Sutskever et al., 2014]; similar sentences are close together.
The Decoder
(Decoder diagram: recurrent state z_i → word probability P_i → word sample u_i.)

Step 1: Compute the internal hidden state of the decoder

(1)  z_i = φ_θ'(h_T, u_{i−1}, z_{i−1})

e=(economic, growth, has, slowed, down, in, recent, years, .)
Step 2: Next-word probability

(2)  e(k) = w_k^T z_i + b_k

     p(w_i = k | w_1, w_2, …, w_{i−1}, h_T) = exp(e(k)) / ∑_j exp(e(j))

Softmax: score and normalize the target words.

e=(economic, growth, has, slowed, down, in, recent, years, .)
Step 3: Sample the next word

(3)  The next target word u_i is sampled from the distribution P_i.

e=(economic, growth, has, slowed, down, in, recent, years, .)
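Steps 2 and 3 together, as a small numpy sketch with assumed sizes (decoder state of 512, target vocabulary of 30,000 words):

import numpy as np

def next_word_distribution(z_i, W_out, b_out):
    """e(k) = w_k^T z_i + b_k for every target word k, then softmax-normalize."""
    e = W_out @ z_i + b_out            # one score per vocabulary entry
    e -= e.max()                       # numerical stability
    p = np.exp(e)
    return p / p.sum()                 # p(w_i = k | w_<i, h_T)

z_i = np.random.randn(512)
W_out, b_out = np.random.randn(30000, 512) * 0.01, np.zeros(30000)
P_i = next_word_distribution(z_i, W_out, b_out)     # step 2: word probability
u_i = np.random.choice(len(P_i), p=P_i)             # step 3: sample the next word index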
Training
Training: Maximum Likelihood Estimation
● Given a parallel corpus D of training examples

  D = {(x^1, y^1), (x^2, y^2), …, (x^N, y^N)}

● Compute log P(y^n | x^n, θ) for each example
● Log-likelihood of the training corpus:

  L(D, θ) = (1/N) ∑_{n=1..N} log P(y^n | x^n, θ)

● Maximize the log-likelihood using stochastic gradient descent (SGD)
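A sketch of the objective itself; the per-word probabilities below are a hypothetical stand-in for whatever the decoder softmax would assign to the correct target words.

import numpy as np

def log_likelihood(per_word_probs):
    """L(D, theta) = (1/N) * sum_n log P(y^n | x^n, theta), where
    log P(y^n | x^n, theta) = sum_t log p(y^n_t | y^n_<t, x^n, theta)."""
    return np.mean([np.sum(np.log(p)) for p in per_word_probs])

# Toy example: two "sentences" with the model's probability of each correct target word
D_probs = [np.array([0.4, 0.6, 0.3]), np.array([0.5, 0.2])]
L = log_likelihood(D_probs)
# SGD maximizes L by repeatedly nudging theta along the gradient of log P(y^n | x^n, theta).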
Problem with simple encoder-decoder architectures

(The whole source sentence must be compressed into a single fixed-size vector, which becomes harder as sentences get longer.)
Soft Attention Mechanism for Neural Machine Translation
Bidirectional recurrent neural networks for encoding a source sentence
(Encoder diagram: each source word is 1-of-K coded, mapped to a continuous-space word representation s_i, and read by a bidirectional recurrent state, giving one annotation vector h_i per source position.)

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)
Attention Mechanism
● To decide the i-th target word, calculate a relevance score between the i-th target word and every source word
● The attention mechanism is implemented by a neural network
Attention Mechanism (step 1)
(Diagram: annotation vectors h_1 … h_J feed the attention mechanism, which produces attention weights a_j with ∑_j a_j = 1.)

(1)  e_j = W_a z_{i−1} + U_a h_j

e=(economic, growth, has, slowed, down, in, recent, years, .)
Attention Mechanism (step 2)

(2)  α_j = exp(e_j) / ∑_{j'} exp(e_{j'})

e=(economic, growth, has, slowed, down, in, recent, years, .)
Attention Mechanism (step 3)

(3)  c_i = ∑_{j=1..T} α_j h_j

e=(economic, growth, has, slowed, down, in, recent, years, .)
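Putting the three steps together, a sketch of one attention step using the slides' linear scoring form (Bahdanau et al. additionally pass the scores through a tanh layer); all shapes and parameter values are assumed.

import numpy as np

def attention_context(z_prev, H, W_a, U_a):
    """Soft attention as on the slides:
    e_j = W_a z_{i-1} + U_a h_j,  alpha_j = softmax(e_j),  c_i = sum_j alpha_j h_j.

    z_prev : previous decoder state z_{i-1}, shape (n,)
    H      : annotation vectors h_1..h_T, shape (T, m)
    W_a    : shape (n,), maps z_{i-1} to a scalar score contribution
    U_a    : shape (m,), maps each h_j to a scalar score contribution
    """
    e = H @ U_a + W_a @ z_prev          # e_j for every source position j
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                # attention weights, sum to 1
    c_i = alpha @ H                     # context vector c_i
    return c_i, alpha

# Usage with assumed sizes
T, m, n = 9, 512, 512
c_i, alpha = attention_context(np.random.randn(n), np.random.randn(T, m),
                               W_a=np.random.randn(n) * 0.01,
                               U_a=np.random.randn(m) * 0.01)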
Effective Approaches to Attention-based Neural Machine Translation (Minh-Thang Luong et al., 2015)
Global attentional model

(Diagram: at each step the decoder state h_t is compared with all encoder states h_s; the global align weights a_t give a context vector c_t, which is combined with h_t in the attention layer to form ~h_t and predict y_t.)
Local attentional model

(Diagram: the model first predicts an aligned position p_t, then computes local weights a_t over a window of encoder states h_s around p_t; the context vector c_t and h_t form ~h_t as above.)
Input-feeding approach

(Diagram: the attentional vector ~h_t is fed as an additional input to the decoder at the next time step, so the model keeps track of past alignment decisions. Example sequence: A B C D <eos> → X Y Z <eos>.)
Results
WMT’14 English-German results
Results (cont)
WMT’15 English-German results
Word-Character Models
Another Character-based NMT
Character-based word embedding
Marta R. Costa-jussà and José A. R. Fonollosa
Yet another (fully) Character-Level NMT
Fully Character-Level Neural Machine Translation without Explicit Segmentation (Nov 2016), Jason Lee, Kyunghyun Cho, Thomas Hofmann
Using monolingual corpora in NMT(C. Gulcehre et al., 2015)
Integrating Language Model into the decoder
● Pre-train the NMT model (on parallel corpora) and an RNNLM (Mikolov et al., 2011; on larger monolingual corpora) separately
● Shallow Fusion
● At each time step:
  – the TM proposes a set of candidate words
  – TM score: "candidate sentences" + TM scores of the candidate words
  – select the top-K candidates
  – rescore these hypotheses with the weighted sum of the scores given by the NMT and the RNNLM
  – the score of the new word (a sketch of this combination follows below):

    log p(y_t = k) = log p_TM(y_t = k) + β log p_LM(y_t = k)
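A sketch of that rescoring rule with hypothetical inputs; in a real decoder the candidate words come from beam search, and β is tuned on a development set.

import numpy as np

def shallow_fusion_rescore(candidates, log_p_tm, log_p_lm, beta=0.2, top_k=10):
    """Keep the TM's top-K candidate words, then rescore each with
    log p(y_t = k) = log p_TM(y_t = k) + beta * log p_LM(y_t = k).

    candidates         : word indices proposed by the translation model
    log_p_tm, log_p_lm : log-probability arrays over the target vocabulary
    """
    top = sorted(candidates, key=lambda k: log_p_tm[k], reverse=True)[:top_k]
    return {k: log_p_tm[k] + beta * log_p_lm[k] for k in top}

# Toy usage with a vocabulary of 1,000 words and random distributions
V = 1000
log_p_tm = np.log(np.random.dirichlet(np.ones(V)))
log_p_lm = np.log(np.random.dirichlet(np.ones(V)))
scores = shallow_fusion_rescore(range(V), log_p_tm, log_p_lm)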
Shallow fusion (diagram)
(Diagram: the NMT decoder state s_t^TM, with context c_t, proposes candidate words y_t and translation-model scores for the candidate sentences y_≤(i); the language-model state s_t^LM rescores them, weighted by β: "language model rescoring".)
Deep fusion
● Deep Fusion
  – integrate the RNNLM and the decoder of the NMT by concatenating their hidden states next to each other
  – controller mechanism (gate)

p(y_t | y_<t, x) ∝ exp( y_t^T ( W_o f_o(s_{t−1}^TM, s_{t−1}^LM, y_{t−1}, c_t) + b_o ) )

g_t = σ(v_g^T s_t^LM + b_g)
Deep fusion (diagram)
(Diagram: shallow fusion on the left, as above; on the right, deep fusion concatenates the NMT decoder state s_t^TM and the gated LM state g_t · s_t^LM at every time step t, t+1, … to predict y_t from the context c_t.)
Results
              SMS/CHAT            CTS
              Dev      Test       Dev      Test
PB            15.5     14.73      21.94    21.68
 +CSLM        16.02    15.25      23.05    22.79
HPB           15.33    14.71      21.45    21.43
 +CSLM        15.93    15.8       22.61    22.17
NMT           17.32    17.36      23.4     23.59
 Shallow      16.59    16.42      22.7     22.83
 Deep         17.58    17.64      23.78    23.5
Results on the task of Zh-En
              De-En               Cs-En
              Dev      Test       Dev      Test
NMT           25.51    23.61      21.47    21.89
Shallow       25.53    23.69      21.95    22.18
Deep          25.88    24         22.49    22.36
Results for De-En and Cs-En translation tasks on WMT’15 dataset.
Results (cont)
                              Development set   Test set
                              dev2010           tst2010   tst2011   tst2012   tst2013   Test 2014
Previous Best (Single)        15.33             17.14     18.62     18.88     18.88     -
Previous Best (Combination)   -                 17.34     18.93     18.7      18.7      -
NMT                           14.5              18.01     18.77     19.86     19.86     18.64
NMT+LM (Shallow)              14.44             17.99     18.8      19.87     19.87     18.66
NMT+LM (Deep)                 15.69             19.34     20.23     21.34     21.34     20.56
Results on Tr-En
Future work: domain adaptation of the language model may further improve translations.
The performance improvement from incorporating an external LM was highly dependent on the domain similarity between the monolingual corpus and the target task.
References
● Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
● Cho, Kyunghyun et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
● Cho, Kyunghyun, Aaron Courville, and Yoshua Bengio. "Describing Multimedia Content using Attention-based Encoder-Decoder Networks." arXiv preprint arXiv:1507.01053 (2015).
● Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural Turing Machines." arXiv preprint arXiv:1410.5401 (2014).
● Gulcehre, Caglar et al. "On Using Monolingual Corpora in Neural Machine Translation." arXiv preprint arXiv:1503.03535 (2015).
● Kalchbrenner, Nal, and Phil Blunsom. "Recurrent Continuous Translation Models." EMNLP 2013: 1700-1709.
● Koehn, Philipp. Statistical Machine Translation. Cambridge University Press, 2009.
● Pascanu, Razvan et al. "How to construct deep recurrent neural networks." arXiv preprint arXiv:1312.6026 (2013).
● Schwenk, Holger. "Continuous space language models." Computer Speech & Language 21.3 (2007): 492-518.
● Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in Neural Information Processing Systems 2014: 3104-3112.
● http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/
● Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective Approaches to Attention-based Neural Machine Translation." arXiv preprint arXiv:1508.04025 (2015).
● Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.