Neural Machine Translation
Thang Luong Lecture @ CS224N
(Thanks to Chris Manning, Abigail See, and
Russell Stewart for comments and discussions.)
"Sixi roasted husband" · "Meat Muscle Stupid Bean Sprouts" (machine-translation bloopers)
Let’s backtrack
(From Chris Manning)
• MT: learn to translate from parallel corpora.
Let’s backtrack – Approaches
Probabilistic "Dictionary" (Brown et al., 1993)
Phrase Table (Koehn et al., 2003; Och & Ney, 2004)
It’s 2015! “Sentence” Table?
Can we build a “sentence” table?
• |V|^N combinations in principle. – English: average sentence length 14 words (~70 chars). – 10^14 sentences × 70 chars × 2 bytes = 14M gigabytes.
WHAT THE BRITISH SAY → WHAT THE BRITISH MEAN
• With the greatest respect → You are an idiot
• That's not bad → That's good
• Very interesting → That is clearly nonsense
• Quite good → A bit disappointing
Neural Machine Translation to the rescue!
• Store a sentence table implicitly.
• Simple and coherent.
But we need to understand Recurrent Neural Networks first!
Outline
• Recurrent Neural Networks (RNNs)
• NMT basics (Sutskever et al., 2014)
• Attention mechanism (Bahdanau et al., 2015)
Recurrent Neural Networks (RNNs)
(Picture adapted from Andrej Karpathy)
RNN – Input Layer
(Picture adapted from Andrej Karpathy)
RNN – Hidden Layer
(Picture adapted from Andrej Karpathy)
[Figure: hidden-state update h_{t-1} → h_t given input x_t]
RNNs to represent sequences!
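To make the recurrence concrete, here is a minimal NumPy sketch of one RNN step and of running it over a sequence; the tanh nonlinearity and the names W_xh, W_hh, b are conventional choices, not something the slides fix.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b): new state from current input + old state
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden -> hidden (recurrent)
b = np.zeros(d_h)

h = np.zeros(d_h)                        # initial state
for x_t in rng.normal(size=(5, d_in)):   # a toy sequence of 5 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b)
print(h)  # the final hidden state is a fixed-size summary of the whole sequence
```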
Outline
• Recurrent Neural Networks (RNNs)
• NMT basics (Sutskever et al., 2014) – Encoder-Decoder. – Training vs. Testing. – Backpropagation. – More about RNNs.
• Attention mechanism (Bahdanau et al., 2015)
Neural Machine Translation (NMT)
• Model P(target | source) directly.
[Figure: encoder reads "I am a student _"; decoder emits "Je suis étudiant _"]
Neural Machine Translation (NMT)
• RNNs trained end-to-end (Sutskever et al., 2014).
• Encoder-decoder approach.
[Figure: encoder over "I am a student _" feeding a decoder that emits "Je suis étudiant _"]
Word Embeddings
• Randomly initialized, one for each language (source embeddings / target embeddings).
  – Learnable parameters.
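As a toy sketch of what "one embedding matrix per language" means in practice (the vocabularies and dimension below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb = 8

# One randomly initialized, learnable matrix per language.
src_vocab = {"I": 0, "am": 1, "a": 2, "student": 3, "_": 4}
tgt_vocab = {"Je": 0, "suis": 1, "étudiant": 2, "_": 3}
src_emb = rng.normal(scale=0.1, size=(len(src_vocab), d_emb))  # source embeddings
tgt_emb = rng.normal(scale=0.1, size=(len(tgt_vocab), d_emb))  # target embeddings

# Lookup is just row indexing; training updates only the selected rows.
x = src_emb[[src_vocab[w] for w in "I am a student _".split()]]
y = tgt_emb[[tgt_vocab[w] for w in "Je suis étudiant _".split()]]
print(x.shape, y.shape)  # (5, 8) (4, 8): one vector per token
```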
Recurrent Connections – Initial states
• Often set to 0.
Recurrent Connections
• Different across layers and across encoder / decoder: the encoder 1st layer, encoder 2nd layer, decoder 1st layer, and decoder 2nd layer each have their own weights (see the layout sketch below).
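A sketch of the parameter layout this implies, with toy sizes and a hypothetical init_layer helper: every (side, layer) pair owns its own, unshared weights.

```python
import numpy as np

def init_layer(d_in, d_h, rng):
    # One RNN layer's parameters: input weights, recurrent weights, bias.
    return (rng.normal(scale=0.1, size=(d_h, d_in)),
            rng.normal(scale=0.1, size=(d_h, d_h)),
            np.zeros(d_h))

rng = np.random.default_rng(0)
d = 8
params = {(side, layer): init_layer(d, d, rng)
          for side in ("encoder", "decoder") for layer in (1, 2)}
print(sorted(params))  # four independent parameter sets, nothing shared
```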
Outline
• Recurrent Neural Networks (RNNs)
• NMT basics (Sutskever et al., 2014) – Encoder-Decoder. – Training vs. Testing. – Backpropagation. – More about RNNs.
• Attention mechanism (Bahdanau et al., 2015)
Training vs. Testing
• Training – correct translations are available.
• Testing – only source sentences are given.
Training – Softmax
• Hidden states ↦ scores.
  – Softmax parameters: one row of weights per vocabulary word, |V| rows in total.
Training – Softmax
• Scores ↦ probabilities: Probs = softmax(Scores), a |V|-dimensional distribution, e.g. P(suis | Je, source).
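A minimal sketch of the two softmax slides in NumPy; the sizes and the name W_s for the softmax parameters are toy assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_h, V = 6, 10               # hidden size and (tiny) vocabulary size
W_s = rng.normal(scale=0.1, size=(V, d_h))  # softmax parameters: |V| rows

h_t = rng.normal(size=d_h)   # decoder hidden state after emitting "Je"
scores = W_s @ h_t           # one score per vocabulary word
probs = softmax(scores)      # e.g. P(suis | Je, source)
print(probs.sum())           # 1.0
```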
Training Loss
• Maximize P(target | source):
  – Decompose into individual word predictions.
• Sum of all individual losses: -log P(Je) - log P(suis) - log P(étudiant) - log P(_).
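Numerically, the loss is just the sum of per-word negative log-probabilities; the probabilities below are made-up stand-ins for the model's outputs:

```python
import numpy as np

# Hypothetical probabilities assigned to each correct word of "Je suis étudiant _".
step_probs = {"Je": 0.4, "suis": 0.5, "étudiant": 0.3, "_": 0.7}

# loss = -log P(target | source) = sum over words of -log P(word | prefix, source)
loss = sum(-np.log(p) for p in step_probs.values())
print(loss)  # minimizing this maximizes P(Je suis étudiant _ | source)
```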
Testing
• Feed the most likely word back in at each step.
NMT beam-search decoders are much simpler!
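A sketch of greedy decoding under an assumed interface decoder_step(word_id, h) -> (probs, h); nothing here is a specific library's API:

```python
import numpy as np

def greedy_decode(decoder_step, h0, bos_id, eos_id, max_len=50):
    h, word, output = h0, bos_id, []
    for _ in range(max_len):
        probs, h = decoder_step(word, h)  # one decoder RNN step
        word = int(np.argmax(probs))      # pick the most likely word...
        if word == eos_id:                # ...stop at the end-of-sentence symbol
            break
        output.append(word)               # ...and feed it back in next step
    return output
```

A beam-search decoder keeps the k best prefixes instead of the single argmax, which is still far simpler than a phrase-based MT decoder.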
Outline
• Recurrent Neural Networks (RNNs)
• NMT basics (Sutskever et al., 2014) – Encoder-Decoder. – Training vs. Testing. – Backpropagation. – More about RNNs.
• Attention mechanism (Bahdanau et al., 2015)
Backpropagation Through Time
• Propagate each word's loss (-log P(_), -log P(étudiant), -log P(suis), ...) backward through the unrolled network, starting from the last step.
• Gradient accumulators init to 0.
RNN gradients are accumulated.
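The accumulation point in miniature: the same recurrent matrix is applied at every time step, so BPTT produces one gradient contribution per step and sums them (the per-step values below are random stand-ins for the real backward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h = 5, 3
per_step = [rng.normal(size=(d_h, d_h)) for _ in range(T)]  # toy per-step grads

dW_hh = np.zeros((d_h, d_h))
for g in reversed(per_step):  # walk backward through time
    dW_hh += g                # accumulate, never overwrite
```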
Outline
• Recurrent Neural Networks (RNNs)
• NMT basics (Sutskever et al., 2014) – Encoder-Decoder. – Training vs. Testing. – Backpropagation. – More about RNNs.
• Attention mechanism (Bahdanau et al., 2015)
Recurrent types – vanilla RNN
Vanishing gradient problem!
Vanishing gradients (Pascanu et al., 2013)
• Chain rule: ∂E_t/∂h_0 = ∏_{i=1}^{t} ∂h_i/∂h_{i-1}.
• Bound: ‖∂h_i/∂h_{i-1}‖ ≤ ‖Wᵀ‖ · ‖diag(σ′(h_{i-1}))‖ ≤ λ_1 γ, where λ_1 is the largest singular value of W and γ bounds σ′.
• Sufficient condition for vanishing gradients: λ_1 < 1/γ.
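A numerical illustration of the bound: scale W so that λ_1 = 0.9 < 1/γ (γ = 1 for tanh) and watch the product of step Jacobians shrink:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
W = rng.normal(size=(d, d))
W *= 0.9 / np.linalg.svd(W, compute_uv=False)[0]  # force lambda_1 = 0.9

h, J = rng.normal(size=d), np.eye(d)
for t in range(1, 51):
    h = np.tanh(W @ h)
    J = np.diag(1 - h**2) @ W @ J        # chain rule: product of dh_i/dh_{i-1}
    if t in (1, 10, 50):
        print(t, np.linalg.norm(J, 2))   # spectral norm decays toward 0
```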
Recurrent types – LSTM
• Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997).
• LSTM cells are additively updated.
  – Makes backprop through time easier.
C’mon, it’s been around for 20 years!
Building LSTM
• A naïve version: an additive cell update. Nice gradients!
• Add input gates: control the input signal.
• Add forget gates: control memory.
• Add output gates: extract information (Zaremba et al., 2014).
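Putting the three gates together, a minimal NumPy sketch of one LSTM step (a standard formulation; the parameter names and sizes are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(P["W_i"] @ z + P["b_i"])        # input gate: control input signal
    f = sigmoid(P["W_f"] @ z + P["b_f"])        # forget gate: control memory
    o = sigmoid(P["W_o"] @ z + P["b_o"])        # output gate: extract information
    c_tilde = np.tanh(P["W_c"] @ z + P["b_c"])  # candidate update
    c = f * c_prev + i * c_tilde                # the ADDITIVE cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
P = {f"W_{g}": rng.normal(scale=0.1, size=(d_h, d_in + d_h)) for g in "ifoc"}
P.update({f"b_{g}": np.zeros(d_h) for g in "ifoc"})
P["b_f"] += 1.0  # forget-gate bias of 1, cf. Jozefowicz et al. (2015) later on

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x_t, h, c, P)
```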
Why does the LSTM work?
• The additive operation is the key!
• The backpropagation path through the cell is effective.
Forget gates are important!
Other RNN units
• (Graves, 2013): revived LSTM.
  – Direct connections between cells and gates.
• Gated Recurrent Unit (GRU) (Cho et al., 2014a): no cells, same additive idea (sketch below).
• LSTM vs. GRU: mixed results (Chung et al., 2015).
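The GRU sketch referenced above, under the usual formulation from Cho et al. (2014a); the parameter names are conventional:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, P):
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(P["W_z"] @ xh + P["b_z"])         # update gate
    r = sigmoid(P["W_r"] @ xh + P["b_r"])         # reset gate
    xrh = np.concatenate([x_t, r * h_prev])
    h_tilde = np.tanh(P["W_h"] @ xrh + P["b_h"])  # candidate state
    # No separate cell: the state itself is additively interpolated.
    return (1 - z) * h_prev + z * h_tilde
```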
English – French WMT Results
Systems – BLEU
• SOTA in WMT'14 (Durrani et al., 2014): 37.0
Standard MT + neural components:
• Schwenk (2014) – neural language model: 33.3
• Cho et al. (2014) – phrase table neural features: 34.5
NMT:
• Sutskever et al. (2014) – ensemble LSTMs: 34.8
• Luong et al. (2015a) – ensemble LSTMs + rare word: 37.5
New SOTA!
Effects of Translating Rare Words
[Plot: BLEU (25–40) vs. sentences ordered by average frequency rank; Durrani et al. (37.0), Sutskever et al. (34.8), Luong et al. (37.5)]
Summary
• Deep RNNs (Sutskever et al., 2014).
• Bidirectional RNNs (Bahdanau et al., 2015).
• Generalize well.
• Small memory.
• Simple decoder.
Outline
• Recurrent Neural Networks (RNNs)
• NMT basics (Sutskever et al., 2014)
• Attention mechanism (Bahdanau et al., 2015)
Sentence Length Problem
[Plot: translation quality vs. sentence length, with attention vs. without attention (Bahdanau et al., 2015)]
Why?
• A fixed-‐dimensional source vector.
• Problem: Markovian process.
Attention Mechanism
• Solution: random access memory.
  – Retrieve as needed.
  – cf. Neural Turing Machine (Graves et al., 2014).
[Figure: the decoder retrieves from a pool of source states]
Alignments as a by-product
• A recent innovation in deep learning:
  – Control problems (Mnih et al., 2014)
  – Speech recognition (Chorowski et al., 2015)
  – Image caption generation (Xu et al., 2015)
(Bahdanau et al., 2015)
[Figure: Simplified Attention (Bahdanau et al., 2015) = Deep LSTM (Sutskever et al., 2014) + attention]
What's next?
[Figure: an attention layer over the source states of "I am a student _" builds a context vector to predict the word after "Je" (here "suis")]
Attention Mechanism – Scoring
• Compare target and source hidden states.
[Figure: one score per source state, e.g. 1, 3, 5, 1]
Attention Mechanism – Normalization
• Convert scores into alignment weights, e.g. 0.1, 0.3, 0.5, 0.1.
Attention Mechanism – Context vector
• Build the context vector: a weighted average of the source states.
Attention Mechanism – Hidden state
• Compute the next hidden state from the context vector and the current target state.
Attention Mechanism – Predict
• Predict the next word.
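All five steps in one NumPy sketch, using dot-product scoring (one of several choices; the combination h̃_t = tanh(W_c [c_t; h_t]) follows Luong et al. (2015b), and all values below are toy stand-ins):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
src_states = rng.normal(size=(4, 6))  # pool of source states: "I am a student"
h_t = rng.normal(size=6)              # current target hidden state

scores = src_states @ h_t             # 1. scoring (dot product here)
weights = softmax(scores)             # 2. normalization -> alignment weights
context = weights @ src_states        # 3. context vector: weighted average

W_c = rng.normal(scale=0.1, size=(6, 12))               # assumed parameters
h_attn = np.tanh(W_c @ np.concatenate([context, h_t]))  # 4. attentional state
# 5. predict: feed h_attn into the usual softmax layer over the vocabulary.
```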
Attention Mechanism – Score Functions
• Additive / concat scoring (Bahdanau et al., 2015); dot, general, and concat scoring (Luong et al., 2015b).
Attention Mechanism – Score Functions
• More focused attention (Luong et al., 2015b).
  – Focus on a subset of source words each time.
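The score functions themselves, as standalone sketches (W_a and v_a are learned parameters of the appropriate shapes):

```python
import numpy as np

def score_dot(h_t, h_s):                 # dot (Luong et al., 2015b)
    return h_t @ h_s

def score_general(h_t, h_s, W_a):        # general (Luong et al., 2015b)
    return h_t @ W_a @ h_s

def score_concat(h_t, h_s, W_a, v_a):    # concat / additive (Bahdanau-style)
    return v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))
```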
English – German WMT Results
Systems – BLEU
• SOTA in WMT'14 (Buck et al., 2014): 20.7
NMT:
• Jean et al. (2015) – GRUs + attention: 21.6 (new SOTA!)
• Luong et al. (2015b) – LSTMs + improved attention: 23.0 (+2.3) (even better!)
Translate Long Sentences
[Plot: BLEU (10–25) vs. sentence length (10–70); ours, no attn (BLEU 13.9); ours, local-p attn (BLEU 20.9); ours, best system (BLEU 23.0); WMT'14 best (BLEU 20.7); Jean et al., 2015 (BLEU 21.6). The no-attention curve drops off on long sentences.]
(Luong et al., 2015b)
Sample English – German translations
• Translates names correctly.
src: Orlando Bloom and Miranda Kerr still love each other
ref: Orlando Bloom und Miranda Kerr lieben sich noch immer
best: Orlando Bloom und Miranda Kerr lieben einander noch immer .
base: Orlando Bloom und Lucas Miranda lieben einander noch immer .
(Luong et al., 2015b)
Sample English – German translations
• Translates a doubly-negated phrase correctly.
• Fails to translate "passenger experience".
src: " We 're pleased the FAA recognizes that an enjoyable passenger experience is not incompatible with safety and security , " said Roger Dow , CEO of the U.S. Travel Association .
ref: " Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Widerspruch zur Sicherheit steht " , sagte Roger Dow , CEO der U.S. Travel Association .
best: " Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und Sicherheit unvereinbar ist " , sagte Roger Dow , CEO der US - die .
base: " Wir freuen uns über die <unk> , dass ein <unk> <unk> mit Sicherheit nicht vereinbar ist mit Sicherheit und Sicherheit " , sagte Roger Cameron , CEO der US - <unk> .
(Luong et al., 2015b)
Sample German – English translations
• Translates long sentences well. (Luong et al., 2015b)
src: Wegen der von Berlin und der Europäischen Zentralbank verhängten strengen Sparpolitik in Verbindung mit der Zwangsjacke , in die die jeweilige nationale Wirtschaft durch das Festhalten an der gemeinsamen Währung genötigt wird , sind viele Menschen der Ansicht , das Projekt Europa sei zu weit gegangen
ref: The austerity imposed by Berlin and the European Central Bank , coupled with the straitjacket imposed on national economies through adherence to the common currency , has led many people to think Project Europe has gone too far .
best: Because of the strict austerity measures imposed by Berlin and the European Central Bank in connection with the straitjacket in which the respective national economy is forced to adhere to the common currency , many people believe that the European project has gone too far .
base: Because of the pressure imposed by the European Central Bank and the Federal Central Bank with the strict austerity imposed on the national economy in the face of the single currency , many people believe that the European project has gone too far .
Summary
[Figure: Simplified Attention (Bahdanau et al., 2015) = Deep LSTM (Sutskever et al., 2014) + attention, run on "I am a student _" → "Je suis étudiant _"]
Thank you!
References (1)
• [Bahdanau et al., 2015] Neural Machine Translation by Jointly Learning to Align and Translate. http://arxiv.org/pdf/1409.0473.pdf
• [Cho et al., 2014a] Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. http://aclweb.org/anthology/D/D14/D14-1179.pdf
• [Cho et al., 2014b] On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. http://www.aclweb.org/anthology/W14-4012
• [Chorowski et al., 2015] Attention-Based Models for Speech Recognition. http://arxiv.org/pdf/1506.07503v1.pdf
• [Chung et al., 2015] Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. http://arxiv.org/pdf/1412.3555.pdf
• [Graves, 2013] Generating Sequences With Recurrent Neural Networks. http://arxiv.org/pdf/1308.0850v5.pdf
• [Graves, 2014] Neural Turing Machines. http://arxiv.org/pdf/1410.5401v2.pdf
• [Hochreiter & Schmidhuber, 1997] Long Short-Term Memory. http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
References (2)
• [Kalchbrenner & Blunsom, 2013] Recurrent Continuous Translation Models. http://nal.co/papers/KalchbrennerBlunsom_EMNLP13
• [Luong et al., 2015a] Addressing the Rare Word Problem in Neural Machine Translation. http://www.aclweb.org/anthology/P15-1002
• [Luong et al., 2015b] Effective Approaches to Attention-based Neural Machine Translation. https://aclweb.org/anthology/D/D15/D15-1166.pdf
• [Mnih et al., 2014] Recurrent Models of Visual Attention. http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
• [Pascanu et al., 2013] On the difficulty of training Recurrent Neural Networks. http://arxiv.org/pdf/1211.5063v2.pdf
• [Xu et al., 2015] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. http://jmlr.org/proceedings/papers/v37/xuc15.pdf
• [Sutskever et al., 2014] Sequence to Sequence Learning with Neural Networks. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
• [Zaremba et al., 2015] Recurrent Neural Network Regularization. http://arxiv.org/pdf/1409.2329.pdf
Encoder-decoder Summary
Work – Encoder – Decoder
• (Sutskever et al., 2014), (Luong et al., 2015a), (Luong et al., 2015b): Deep LSTM → Deep LSTM
• (Cho et al., 2014a), (Bahdanau et al., 2015), (Jean et al., 2015): (Bidirectional) GRU → GRU
• (Kalchbrenner & Blunsom, 2013): CNN → RNN (Inverse CNN)
• (Cho et al., 2014b): Gated Recursive CNN → GRU
Important LSTM components?
• (Jozefowicz et al., 2015): forget gate bias of 1.
• (Greff et al., 2015): forget gates & output activations.
LSTM Backpropagation
• Deltas sent back from the top layers.
LSTM Backpropagation – Context
• Complete the context (cell) vector gradient.
• First, use it to compute the gate gradients.
• Then, update the cell-state gradient.
LSTM Backprop
• Compute gradients for the input, the previous hidden state, and the previous cell state.
• Add gradients from the loss / upper layers.
Summary
• LSTM backpropagation is nasty.
• But it becomes much easier if you:
  – Know your matrix calculus!
  – Keep track of the hidden-state and cell-state gradients.
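One practical way to keep the matrix calculus honest is a finite-difference check; a scalar sketch (f here stands in for the loss as a function of a single cell-state entry, with the output gate fixed at 0.7):

```python
import numpy as np

def numgrad(f, x, eps=1e-6):
    # Central finite difference: (f(x+eps) - f(x-eps)) / (2 eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

f = lambda c: 0.7 * np.tanh(c)           # h = o * tanh(c), with o = 0.7 fixed
analytic = 0.7 * (1 - np.tanh(2.0)**2)   # dh/dc worked out by hand
print(analytic, numgrad(f, 2.0))         # the two should agree to ~1e-9
```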
Other Attention Functions
• Content-based.
• Location-based:
  – (Graves, 2013): handwriting synthesis model.
• Hybrid:
  – (Chorowski et al., 2015) for speech recognition.
Local Attention (Luong et al., 2015b)
• More focused attention.
  – Potentially useful for longer text sequences.
Predict aligned positions
• p_t = S · sigmoid(v_pᵀ tanh(W_p h_t)): the sigmoid lies in [0, 1], so p_t is a real value in [0, S] (S = source sentence length).
• How do we learn the position parameters?
Alignment Weights
[Plot: alignment weights (e.g. 0.4, 0.1, 0.4, 0.1) over source positions s, and a truncated Gaussian centered at the predicted position (e.g. p_t = 5.5)]
Scaled Alignment Weights
• Favor points close to the center.
[Plot: the content-based weights multiplied by the truncated Gaussian over positions s]
[Plot: the scaled weights form a new peak near the predicted position]
Differentiable almost everywhere!
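A sketch of the whole local-attention computation following this recipe (v_p, W_p, the window half-width D, and all values are toy assumptions; σ = D/2 as in Luong et al. (2015b)):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
S, D, d_h = 8, 2, 6
h_t = rng.normal(size=d_h)
src_states = rng.normal(size=(S, d_h))
v_p = rng.normal(size=d_h)
W_p = rng.normal(scale=0.1, size=(d_h, d_h))

p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))  # predicted position in [0, S]
weights = softmax(src_states @ h_t)          # content-based alignment weights
s = np.arange(S)                             # source positions
gauss = np.exp(-((s - p_t) ** 2) / (2 * (D / 2) ** 2))  # truncated Gaussian
scaled = weights * gauss                     # new peak near p_t
```

Since p_t is produced by smooth functions, the computation is differentiable almost everywhere, so the position parameters can be learned by backprop.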