Neural Machine Translation
Thang Luong Lecture @ CS224N
(Thanks to Chris Manning, Abigail See, and
Russell Stewart for comments and discussions.)
"Sixi roasted husband" · "Meat Muscle Stupid Bean Sprouts" (machine-translation bloopers)
Let’s backtrack
(From Chris Manning)
• MT: learn to translate from parallel corpora.
Let’s backtrack – Approaches
Probabilistic "Dictionary" (Brown et al., 1993)
Phrase Table (Koehn et al., 2003; Och & Ney, 2004)
It’s 2015! “Sentence” Table?
Can we build a “sentence” table?
• |V|^N combinations in principle. – English: average sentence length 14 words (~70 chars). – 10^14 sentences × 70 chars × 2 bytes = 14M gigabytes.
WHAT THE BRITISH SAY → WHAT THE BRITISH MEAN
• With the greatest respect → You are an idiot
• That's not bad → That's good
• Very interesting → That is clearly nonsense
• Quite good → A bit disappointing
Neural Machine Translation to the rescue!
• Store a sentence table implicitly.
• Simple and coherent.
But we need to understand Recurrent Neural Networks first!
Outline
• Recurrent Neural Networks (RNNs)
• NMT basics (Sutskever et al., 2014)
• Attention mechanism (Bahdanau et al., 2015)
Recurrent Neural Networks (RNNs)
(Picture adapted from Andrej Karpathy)
RNN – Input Layer
(Picture adapted from Andrej Karpathy)
RNN – Hidden Layer
(Picture adapted from Andrej Karpathy)
[Figure: hidden-state update h_{t-1} → h_t given input x_t]
RNNs to represent sequences!
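To make the recurrence concrete, here is a minimal NumPy sketch of one RNN step and of running it over a sequence; the tanh nonlinearity and the names W_xh, W_hh, b are conventional choices, not something the slides fix.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b): new state from current input + old state
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden -> hidden (recurrent)
b = np.zeros(d_h)

h = np.zeros(d_h)                        # initial state
for x_t in rng.normal(size=(5, d_in)):   # a toy sequence of 5 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b)
print(h)  # the final hidden state is a fixed-size summary of the whole sequence
```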
Outline
• Recurrent Neural Networks (RNNs)
• NMT basics (Sutskever et al., 2014) – Encoder-Decoder. – Training vs. Testing. – Backpropagation. – More about RNNs.
• Attention mechanism (Bahdanau et al., 2015)
Neural Machine Translation (NMT)
• Model P(target | source) directly.
[Figure: encoder reads "I am a student _"; decoder emits "Je suis étudiant _"]
Neural Machine Translation (NMT)
• RNNs trained end-to-end (Sutskever et al., 2014).
• Encoder-decoder approach.
[Figure: encoder over "I am a student _" feeding a decoder that emits "Je suis étudiant _"]
Word Embeddings
• Randomly initialized, one for each language (source embeddings / target embeddings).
  – Learnable parameters.
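As a toy sketch of what "one embedding matrix per language" means in practice (the vocabularies and dimension below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb = 8

# One randomly initialized, learnable matrix per language.
src_vocab = {"I": 0, "am": 1, "a": 2, "student": 3, "_": 4}
tgt_vocab = {"Je": 0, "suis": 1, "étudiant": 2, "_": 3}
src_emb = rng.normal(scale=0.1, size=(len(src_vocab), d_emb))  # source embeddings
tgt_emb = rng.normal(scale=0.1, size=(len(tgt_vocab), d_emb))  # target embeddings

# Lookup is just row indexing; training updates only the selected rows.
x = src_emb[[src_vocab[w] for w in "I am a student _".split()]]
y = tgt_emb[[tgt_vocab[w] for w in "Je suis étudiant _".split()]]
print(x.shape, y.shape)  # (5, 8) (4, 8): one vector per token
```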
Recurrent Connections – Initial states
• Often set to 0.
Recurrent Connections
• Different across layers and across encoder / decoder: the encoder 1st layer, encoder 2nd layer, decoder 1st layer, and decoder 2nd layer each have their own weights (see the layout sketch below).
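A sketch of the parameter layout this implies, with toy sizes and a hypothetical init_layer helper: every (side, layer) pair owns its own, unshared weights.

```python
import numpy as np

def init_layer(d_in, d_h, rng):
    # One RNN layer's parameters: input weights, recurrent weights, bias.
    return (rng.normal(scale=0.1, size=(d_h, d_in)),
            rng.normal(scale=0.1, size=(d_h, d_h)),
            np.zeros(d_h))

rng = np.random.default_rng(0)
d = 8
params = {(side, layer): init_layer(d, d, rng)
          for side in ("encoder", "decoder") for layer in (1, 2)}
print(sorted(params))  # four independent parameter sets, nothing shared
```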
Outline
• Recurrent Neural Networks (RNNs)
• NMT basics (Sutskever et al., 2014) – Encoder-Decoder. – Training vs. Testing. – Backpropagation. – More about RNNs.
• Attention mechanism (Bahdanau et al., 2015)
Training vs. Testing
• Training – correct translations are available.
• Testing – only source sentences are given.
Training – Softmax
• Hidden states ↦ scores.
  – Softmax parameters: one row of weights per vocabulary word, |V| rows in total.
Training – Softmax
• Scores ↦ probabilities: Probs = softmax(Scores), a |V|-dimensional distribution, e.g. P(suis | Je, source).
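A minimal sketch of the two softmax slides in NumPy; the sizes and the name W_s for the softmax parameters are toy assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_h, V = 6, 10               # hidden size and (tiny) vocabulary size
W_s = rng.normal(scale=0.1, size=(V, d_h))  # softmax parameters: |V| rows

h_t = rng.normal(size=d_h)   # decoder hidden state after emitting "Je"
scores = W_s @ h_t           # one score per vocabulary word
probs = softmax(scores)      # e.g. P(suis | Je, source)
print(probs.sum())           # 1.0
```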
Training Loss
• Maximize P(target | source):
  – Decompose into individual word predictions.
• Sum of all individual losses: -log P(Je) - log P(suis) - log P(étudiant) - log P(_).
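Numerically, the loss is just the sum of per-word negative log-probabilities; the probabilities below are made-up stand-ins for the model's outputs:

```python
import numpy as np

# Hypothetical probabilities assigned to each correct word of "Je suis étudiant _".
step_probs = {"Je": 0.4, "suis": 0.5, "étudiant": 0.3, "_": 0.7}

# loss = -log P(target | source) = sum over words of -log P(word | prefix, source)
loss = sum(-np.log(p) for p in step_probs.values())
print(loss)  # minimizing this maximizes P(Je suis étudiant _ | source)
```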
Testing
• Feed the most likely word back in at each step.
NMT beam-search decoders are much simpler!
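A sketch of greedy decoding under an assumed interface decoder_step(word_id, h) -> (probs, h); nothing here is a specific library's API:

```python
import numpy as np

def greedy_decode(decoder_step, h0, bos_id, eos_id, max_len=50):
    h, word, output = h0, bos_id, []
    for _ in range(max_len):
        probs, h = decoder_step(word, h)  # one decoder RNN step
        word = int(np.argmax(probs))      # pick the most likely word...
        if word == eos_id:                # ...stop at the end-of-sentence symbol
            break
        output.append(word)               # ...and feed it back in next step
    return output
```

A beam-search decoder keeps the k best prefixes instead of the single argmax, which is still far simpler than a phrase-based MT decoder.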
Outline
• Recurrent Neural Networks (RNNs)
• NMT basics (Sutskever et al., 2014) – Encoder-Decoder. – Training vs. Testing. – Backpropagation. – More about RNNs.
• Attention mechanism (Bahdanau et al., 2015)
Backpropagation Through Time
• Propagate each word's loss (-log P(_), -log P(étudiant), -log P(suis), ...) backward through the unrolled network, starting from the last step.
• Gradient accumulators init to 0.
RNN gradients are accumulated.
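The accumulation point in miniature: the same recurrent matrix is applied at every time step, so BPTT produces one gradient contribution per step and sums them (the per-step values below are random stand-ins for the real backward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h = 5, 3
per_step = [rng.normal(size=(d_h, d_h)) for _ in range(T)]  # toy per-step grads

dW_hh = np.zeros((d_h, d_h))
for g in reversed(per_step):  # walk backward through time
    dW_hh += g                # accumulate, never overwrite
```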
Outline
• Recurrent Neural Networks (RNNs)
• NMT basics (Sutskever et al., 2014) – Encoder-Decoder. – Training vs. Testing. – Backpropagation. – More about RNNs.
• Attention mechanism (Bahdanau et al., 2015)
Recurrent types – vanilla RNN
Vanishing gradient problem!
Vanishing gradients (Pascanu et al., 2013)
• Chain rule: ∂E_t/∂h_0 = ∏_{i=1}^{t} ∂h_i/∂h_{i-1}.
• Bound: ‖∂h_i/∂h_{i-1}‖ ≤ ‖Wᵀ‖ · ‖diag(σ′(h_{i-1}))‖ ≤ λ_1 γ, where λ_1 is the largest singular value of W and γ bounds σ′.
• Sufficient condition for vanishing gradients: λ_1 < 1/γ.
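A numerical illustration of the bound: scale W so that λ_1 = 0.9 < 1/γ (γ = 1 for tanh) and watch the product of step Jacobians shrink:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
W = rng.normal(size=(d, d))
W *= 0.9 / np.linalg.svd(W, compute_uv=False)[0]  # force lambda_1 = 0.9

h, J = rng.normal(size=d), np.eye(d)
for t in range(1, 51):
    h = np.tanh(W @ h)
    J = np.diag(1 - h**2) @ W @ J        # chain rule: product of dh_i/dh_{i-1}
    if t in (1, 10, 50):
        print(t, np.linalg.norm(J, 2))   # spectral norm decays toward 0
```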
Recurrent types – LSTM
• Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997).
• LSTM cells are additively updated.
  – Makes backprop through time easier.
C’mon, it’s been around for 20 years!
Building LSTM
• A naïve version: an additive cell update. Nice gradients!
• Add input gates: control the input signal.
• Add forget gates: control memory.
• Add output gates: extract information (Zaremba et al., 2014).
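Putting the three gates together, a minimal NumPy sketch of one LSTM step (a standard formulation; the parameter names and sizes are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(P["W_i"] @ z + P["b_i"])        # input gate: control input signal
    f = sigmoid(P["W_f"] @ z + P["b_f"])        # forget gate: control memory
    o = sigmoid(P["W_o"] @ z + P["b_o"])        # output gate: extract information
    c_tilde = np.tanh(P["W_c"] @ z + P["b_c"])  # candidate update
    c = f * c_prev + i * c_tilde                # the ADDITIVE cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
P = {f"W_{g}": rng.normal(scale=0.1, size=(d_h, d_in + d_h)) for g in "ifoc"}
P.update({f"b_{g}": np.zeros(d_h) for g in "ifoc"})
P["b_f"] += 1.0  # forget-gate bias of 1, cf. Jozefowicz et al. (2015) later on

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x_t, h, c, P)
```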
Why does the LSTM work?
• The additive operation is the key!
• The backpropagation path through the cell is effective.
Forget gates are important!
Other RNN units
• (Graves, 2013): revived LSTM.
  – Direct connections between cells and gates.
• Gated Recurrent Unit (GRU) (Cho et al., 2014a): no cells, same additive idea (sketch below).
• LSTM vs. GRU: mixed results (Chung et al., 2015).
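The GRU sketch referenced above, under the usual formulation from Cho et al. (2014a); the parameter names are conventional:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, P):
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(P["W_z"] @ xh + P["b_z"])         # update gate
    r = sigmoid(P["W_r"] @ xh + P["b_r"])         # reset gate
    xrh = np.concatenate([x_t, r * h_prev])
    h_tilde = np.tanh(P["W_h"] @ xrh + P["b_h"])  # candidate state
    # No separate cell: the state itself is additively interpolated.
    return (1 - z) * h_prev + z * h_tilde
```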
English – French WMT Results
Systems – BLEU
• SOTA in WMT'14 (Durrani et al., 2014): 37.0
Standard MT + neural components:
• Schwenk (2014) – neural language model: 33.3
• Cho et al. (2014) – phrase table neural features: 34.5
NMT:
• Sutskever et al. (2014) – ensemble LSTMs: 34.8
• Luong et al. (2015a) – ensemble LSTMs + rare word: 37.5
New SOTA!
Effects of Translating Rare Words
[Plot: BLEU (25–40) vs. sentences ordered by average frequency rank; Durrani et al. (37.0), Sutskever et al. (34.8), Luong et al. (37.5)]
Summary
• Deep RNNs (Sutskever et al., 2014).
• Bidirectional RNNs (Bahdanau et al., 2015).
• Generalize well.
• Small memory.
• Simple decoder.
Outline
• Recurrent Neural Networks (RNNs)
• NMT basics (Sutskever et al., 2014)
• Attention mechanism (Bahdanau et al., 2015)
Sentence Length Problem
[Plot: translation quality vs. sentence length, with attention vs. without attention (Bahdanau et al., 2015)]
Why?
• A fixed-‐dimensional source vector.
• Problem: Markovian process.
Attention Mechanism
• Solution: random access memory.
  – Retrieve as needed.
  – cf. Neural Turing Machine (Graves et al., 2014).
[Figure: the decoder retrieves from a pool of source states]
Alignments as a by-product
• A recent innovation in deep learning:
  – Control problems (Mnih et al., 2014)
  – Speech recognition (Chorowski et al., 2015)
  – Image caption generation (Xu et al., 2015)
(Bahdanau et al., 2015)
[Figure: Simplified Attention (Bahdanau et al., 2015) = Deep LSTM (Sutskever et al., 2014) + attention]
What's next?
[Figure: an attention layer over the source states of "I am a student _" builds a context vector to predict the word after "Je" (here "suis")]
Attention Mechanism – Scoring
• Compare target and source hidden states.
[Figure: one score per source state, e.g. 1, 3, 5, 1]
Attention Mechanism – Normalization
• Convert scores into alignment weights, e.g. 0.1, 0.3, 0.5, 0.1.
Attention Mechanism – Context vector
• Build the context vector: a weighted average of the source states.
Attention Mechanism – Hidden state
• Compute the next hidden state from the context vector and the current target state.
Attention Mechanism – Predict
• Predict the next word.
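All five steps in one NumPy sketch, using dot-product scoring (one of several choices; the combination h̃_t = tanh(W_c [c_t; h_t]) follows Luong et al. (2015b), and all values below are toy stand-ins):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
src_states = rng.normal(size=(4, 6))  # pool of source states: "I am a student"
h_t = rng.normal(size=6)              # current target hidden state

scores = src_states @ h_t             # 1. scoring (dot product here)
weights = softmax(scores)             # 2. normalization -> alignment weights
context = weights @ src_states        # 3. context vector: weighted average

W_c = rng.normal(scale=0.1, size=(6, 12))               # assumed parameters
h_attn = np.tanh(W_c @ np.concatenate([context, h_t]))  # 4. attentional state
# 5. predict: feed h_attn into the usual softmax layer over the vocabulary.
```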
Attention Mechanism – Score Functions
• Additive / concat scoring (Bahdanau et al., 2015); dot, general, and concat scoring (Luong et al., 2015b).
Attention Mechanism – Score Functions
• More focused attention (Luong et al., 2015b).
  – Focus on a subset of source words each time.
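The score functions themselves, as standalone sketches (W_a and v_a are learned parameters of the appropriate shapes):

```python
import numpy as np

def score_dot(h_t, h_s):                 # dot (Luong et al., 2015b)
    return h_t @ h_s

def score_general(h_t, h_s, W_a):        # general (Luong et al., 2015b)
    return h_t @ W_a @ h_s

def score_concat(h_t, h_s, W_a, v_a):    # concat / additive (Bahdanau-style)
    return v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))
```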
English – German WMT Results
Systems – BLEU
• SOTA in WMT'14 (Buck et al., 2014): 20.7
NMT:
• Jean et al. (2015) – GRUs + attention: 21.6 (new SOTA!)
• Luong et al. (2015b) – LSTMs + improved attention: 23.0 (+2.3) (even better!)
Translate Long Sentences
[Plot: BLEU (10–25) vs. sentence length (10–70); ours, no attn (BLEU 13.9); ours, local-p attn (BLEU 20.9); ours, best system (BLEU 23.0); WMT'14 best (BLEU 20.7); Jean et al., 2015 (BLEU 21.6). The no-attention curve drops off on long sentences.]
(Luong et al., 2015b)
Sample English – German translations
• Translates names correctly.
src: Orlando Bloom and Miranda Kerr still love each other
ref: Orlando Bloom und Miranda Kerr lieben sich noch immer
best: Orlando Bloom und Miranda Kerr lieben einander noch immer .
base: Orlando Bloom und Lucas Miranda lieben einander noch immer .
(Luong et al., 2015b)
Sample English – German translations
• Translates a doubly-negated phrase correctly.
• Fails to translate "passenger experience".
src: " We 're pleased the FAA recognizes that an enjoyable passenger experience is not incompatible with safety and security , " said Roger Dow , CEO of the U.S. Travel Association .
ref: " Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Widerspruch zur Sicherheit steht " , sagte Roger Dow , CEO der U.S. Travel Association .
best: " Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und Sicherheit unvereinbar ist " , sagte Roger Dow , CEO der US - die .
base: " Wir freuen uns über die <unk> , dass ein <unk> <unk> mit Sicherheit nicht vereinbar ist mit Sicherheit und Sicherheit " , sagte Roger Cameron , CEO der US - <unk> .
(Luong et al., 2015b)
Sample German – English translations
• Translates long sentences well. (Luong et al., 2015b)
src: Wegen der von Berlin und der Europäischen Zentralbank verhängten strengen Sparpolitik in Verbindung mit der Zwangsjacke , in die die jeweilige nationale Wirtschaft durch das Festhalten an der gemeinsamen Währung genötigt wird , sind viele Menschen der Ansicht , das Projekt Europa sei zu weit gegangen
ref: The austerity imposed by Berlin and the European Central Bank , coupled with the straitjacket imposed on national economies through adherence to the common currency , has led many people to think Project Europe has gone too far .
best: Because of the strict austerity measures imposed by Berlin and the European Central Bank in connection with the straitjacket in which the respective national economy is forced to adhere to the common currency , many people believe that the European project has gone too far .
base: Because of the pressure imposed by the European Central Bank and the Federal Central Bank with the strict austerity imposed on the national economy in the face of the single currency , many people believe that the European project has gone too far .
Summary
[Figure: Simplified Attention (Bahdanau et al., 2015) = Deep LSTM (Sutskever et al., 2014) + attention, run on "I am a student _" → "Je suis étudiant _"]
Thank you!
References (1)
• [Bahdanau et al., 2015] Neural Machine Translation by Jointly Learning to Align and Translate. http://arxiv.org/pdf/1409.0473.pdf
• [Cho et al., 2014a] Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. http://aclweb.org/anthology/D/D14/D14-1179.pdf
• [Cho et al., 2014b] On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. http://www.aclweb.org/anthology/W14-4012
• [Chorowski et al., 2015] Attention-Based Models for Speech Recognition. http://arxiv.org/pdf/1506.07503v1.pdf
• [Chung et al., 2015] Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. http://arxiv.org/pdf/1412.3555.pdf
• [Graves, 2013] Generating Sequences With Recurrent Neural Networks. http://arxiv.org/pdf/1308.0850v5.pdf
• [Graves, 2014] Neural Turing Machines. http://arxiv.org/pdf/1410.5401v2.pdf
• [Hochreiter & Schmidhuber, 1997] Long Short-Term Memory. http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
References (2)
• [Kalchbrenner & Blunsom, 2013] Recurrent Continuous Translation Models. http://nal.co/papers/KalchbrennerBlunsom_EMNLP13
• [Luong et al., 2015a] Addressing the Rare Word Problem in Neural Machine Translation. http://www.aclweb.org/anthology/P15-1002
• [Luong et al., 2015b] Effective Approaches to Attention-based Neural Machine Translation. https://aclweb.org/anthology/D/D15/D15-1166.pdf
• [Mnih et al., 2014] Recurrent Models of Visual Attention. http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
• [Pascanu et al., 2013] On the difficulty of training Recurrent Neural Networks. http://arxiv.org/pdf/1211.5063v2.pdf
• [Xu et al., 2015] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. http://jmlr.org/proceedings/papers/v37/xuc15.pdf
• [Sutskever et al., 2014] Sequence to Sequence Learning with Neural Networks. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
• [Zaremba et al., 2015] Recurrent Neural Network Regularization. http://arxiv.org/pdf/1409.2329.pdf
Encoder-decoder Summary
Work – Encoder – Decoder
• (Sutskever et al., 2014), (Luong et al., 2015a), (Luong et al., 2015b): Deep LSTM → Deep LSTM
• (Cho et al., 2014a), (Bahdanau et al., 2015), (Jean et al., 2015): (Bidirectional) GRU → GRU
• (Kalchbrenner & Blunsom, 2013): CNN → RNN (Inverse CNN)
• (Cho et al., 2014b): Gated Recursive CNN → GRU
Important LSTM components?
• (Jozefowicz et al., 2015): forget gate bias of 1.
• (Greff et al., 2015): forget gates & output activations.
LSTM Backpropagation
• Deltas sent back from the top layers.
LSTM Backpropagation – Context
• Complete the context (cell) vector gradient.
• First, use it to compute the gate gradients.
• Then, update the cell-state gradient.
LSTM Backprop
• Compute gradients for the input, the previous hidden state, and the previous cell state.
• Add gradients from the loss / upper layers.
Summary
• LSTM backpropagation is nasty.
• But it becomes much easier if you:
  – Know your matrix calculus!
  – Keep track of the hidden-state and cell-state gradients.
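One practical way to keep the matrix calculus honest is a finite-difference check; a scalar sketch (f here stands in for the loss as a function of a single cell-state entry, with the output gate fixed at 0.7):

```python
import numpy as np

def numgrad(f, x, eps=1e-6):
    # Central finite difference: (f(x+eps) - f(x-eps)) / (2 eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

f = lambda c: 0.7 * np.tanh(c)           # h = o * tanh(c), with o = 0.7 fixed
analytic = 0.7 * (1 - np.tanh(2.0)**2)   # dh/dc worked out by hand
print(analytic, numgrad(f, 2.0))         # the two should agree to ~1e-9
```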
Other Attention Functions
• Content-based.
• Location-based:
  – (Graves, 2013): handwriting synthesis model.
• Hybrid:
  – (Chorowski et al., 2015) for speech recognition.
Local Attention (Luong et al., 2015b)
• More focused attention.
  – Potentially useful for longer text sequences.
Predict aligned positions
• p_t = S · sigmoid(v_pᵀ tanh(W_p h_t)): the sigmoid lies in [0, 1], so p_t is a real value in [0, S] (S = source sentence length).
• How do we learn the position parameters?
Alignment Weights
[Plot: alignment weights (e.g. 0.4, 0.1, 0.4, 0.1) over source positions s, and a truncated Gaussian centered at the predicted position (e.g. p_t = 5.5)]
Scaled Alignment Weights
• Favor points close to the center.
[Plot: the content-based weights multiplied by the truncated Gaussian over positions s]
[Plot: the scaled weights form a new peak near the predicted position]
Differentiable almost everywhere!
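A sketch of the whole local-attention computation following this recipe (v_p, W_p, the window half-width D, and all values are toy assumptions; σ = D/2 as in Luong et al. (2015b)):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
S, D, d_h = 8, 2, 6
h_t = rng.normal(size=d_h)
src_states = rng.normal(size=(S, d_h))
v_p = rng.normal(size=d_h)
W_p = rng.normal(scale=0.1, size=(d_h, d_h))

p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))  # predicted position in [0, S]
weights = softmax(src_states @ h_t)          # content-based alignment weights
s = np.arange(S)                             # source positions
gauss = np.exp(-((s - p_t) ** 2) / (2 * (D / 2) ** 2))  # truncated Gaussian
scaled = weights * gauss                     # new peak near p_t
```

Since p_t is produced by smooth functions, the computation is differentiable almost everywhere, so the position parameters can be learned by backprop.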