Neural Machine Translation

Thang Luong Lecture @ CS224N

(Thanks to Chris Manning, Abigail See, and Russell Stewart for comments and discussions.)

Sixi roasted husband Meat Muscle Stupid Bean Sprouts

Let’s backtrack

(From Chris Manning)

•  MT: learn to translate from parallel corpora.

Let’s backtrack – Approaches

Probabilistic “Dictionary” (Brown et al., 1993)

Phrase Table (Koehn et al., 2003, Och & Ney, 2004)

It’s 2015! “Sentence” Table?

Can we build a “sentence” table?

•  |V|^N combinations in principle.
   – English: average sentence length 14 (~ 70 chars).
   – 10^14 * 70 * 2 bytes = 14M gigabytes.
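The storage estimate above can be checked with a few lines of arithmetic. This is a minimal sketch: the function and parameter names are illustrative, and the factor of 2 is copied from the slide without assuming what it stands for.

```python
def sentence_table_bytes(num_sentences, chars_per_sentence=70, slide_factor=2):
    """Storage estimate following the slide: sentences * chars * 2 bytes."""
    return num_sentences * chars_per_sentence * slide_factor

# 10^14 sentences, as on the slide.
total_bytes = sentence_table_bytes(10 ** 14)
total_gigabytes = total_bytes / 10 ** 9   # decimal gigabytes

print(f"{total_gigabytes:,.0f} GB")       # prints "14,000,000 GB"
```

Even under these modest assumptions the table is far too large to store explicitly, which is the motivation for storing it implicitly in a model.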

WHAT THE BRITISH SAY          WHAT THE BRITISH MEAN
With the greatest respect     You are an idiot
That’s not bad                That’s good
Very interesting              That is clearly nonsense
Quite good                    A bit disappointing

Neural Machine Translation to the rescue!

•  Store a sentence table implicitly.

•  Simple and coherent.

But we need to understand Recurrent Neural Networks first!

Outline

•  Recurrent Neural Networks (RNNs)

•  NMT basics (Sutskever et al., 2014)

•  Attention mechanism (Bahdanau et al., 2015)

Recurrent Neural Networks (RNNs)

(Picture adapted from Andrej Karpathy)

RNN – Input Layer

RNN – Hidden Layer

h_{t-1} → h_t

RNNs to represent sequences!
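A minimal sketch of how an RNN turns a sequence into a single hidden state. The update rule h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) and the names W_xh, W_hh, b_h are assumptions for illustration (the slide's figure is not reproduced here); the initial state is set to 0.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        pre = b_h[i]
        pre += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(pre))
    return h_t

def encode(inputs, W_xh, W_hh, b_h):
    """Run the RNN over a list of input vectors and return the final state."""
    h = [0.0] * len(b_h)          # initial hidden state set to 0
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    return h
```

The same weights are reused at every time step, so the final state depends on the whole input sequence.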

Outline

•  NMT basics (Sutskever et al., 2014)
   – Encoder-Decoder.
   – Training vs. Testing.
   – Backpropagation.
   – More about RNNs.

Neural Machine Translation (NMT)

•  Model P(target | source) directly.

•  RNNs trained end-to-end (Sutskever et al., 2014).

•  Encoder-decoder approach.

(Figure: input sequence "am a student _ Je suis étudiant"; output sequence "Je suis étudiant _"; labels: Encoder, Decoder.)

Word Embeddings

•  Randomly initialized, one for each language.
   – Learnable parameters.

Source embeddings

Target embeddings

Recurrent Connections

Initial states

•  Often set to 0.

Recurrent Connections

•  Different across layers and encoder / decoder.

   – Encoder 1st layer
   – Encoder 2nd layer
   – Decoder 1st layer
   – Decoder 2nd layer

Outline

Training vs. Testing

•  Training
   – Correct translations are available.

•  Testing
   – Only source sentences are given.

Training – Softmax

•  Hidden states ↦ scores.

Scores

Softmax parameters

Training – Softmax

•  Scores ↦ probabilities.

Scores → softmax function → P(suis | Je, source)

(Figure labels: |V|, Softmax parameters.)
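The two steps (hidden state ↦ scores, scores ↦ probabilities) can be sketched as follows. The vocabulary, the hidden state `h`, and the softmax parameter matrix `W_out` are made-up toy values, not numbers from the lecture.

```python
import math

def scores_from_hidden(h, W_out):
    """Hidden state -> one score per vocabulary word (one row of W_out each)."""
    return [sum(w * x for w, x in zip(row, h)) for row in W_out]

def softmax(scores):
    """Scores -> probabilities; subtracting the max keeps exp() stable."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["Je", "suis", "étudiant", "_"]          # |V| = 4 in this toy example
h = [0.5, -1.0]                                  # toy decoder hidden state
W_out = [[0.1, 0.2], [1.0, -1.0], [0.0, 0.3], [-0.5, 0.5]]

probs = softmax(scores_from_hidden(h, W_out))
p_suis = probs[vocab.index("suis")]              # plays the role of P(suis | Je, source)
```

The probabilities are positive and sum to 1 over the vocabulary, whatever the raw scores are.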

Training Loss

•  Maximize P(target | source):
   – Decompose into individual word predictions.

Training Loss

•  Sum of all individual losses:

-log P(Je) - log P(suis) - log P(étudiant) - log P(_)
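A minimal sketch of the training loss as a sum of per-word negative log-probabilities. The per-step distributions below are toy values chosen for illustration; in a real system they would come from the decoder's softmax.

```python
import math

def sequence_loss(step_probs, target_words):
    """Sum of -log P(word) over the target sequence."""
    return sum(-math.log(probs[word])
               for probs, word in zip(step_probs, target_words))

target = ["Je", "suis", "étudiant", "_"]
step_probs = [
    {"Je": 0.5, "suis": 0.2, "étudiant": 0.2, "_": 0.1},
    {"Je": 0.1, "suis": 0.5, "étudiant": 0.3, "_": 0.1},
    {"Je": 0.1, "suis": 0.2, "étudiant": 0.5, "_": 0.2},
    {"Je": 0.2, "suis": 0.2, "étudiant": 0.1, "_": 0.5},
]

loss = sequence_loss(step_probs, target)
```

Minimizing this sum is the same as maximizing the product of the per-word probabilities, i.e. P(target | source).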

Testing

•  Feed the most likely word
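Feeding back the most likely word at each step is greedy decoding. The sketch below assumes a hypothetical `step_fn(prev_word, state)` that returns a word-probability dict and the next state; the toy table stands in for a trained decoder.

```python
def greedy_decode(step_fn, state, start_token="_", end_token="_", max_len=20):
    """At each step, pick the most likely word and feed it back in."""
    output = []
    prev = start_token
    for _ in range(max_len):
        probs, state = step_fn(prev, state)
        prev = max(probs, key=probs.get)   # most likely next word
        output.append(prev)
        if prev == end_token:
            break
    return output

# Hypothetical toy "decoder": next-word probabilities depend only on the previous word.
TOY_TABLE = {
    "_":        {"Je": 0.7, "suis": 0.1, "étudiant": 0.1, "_": 0.1},
    "Je":       {"Je": 0.1, "suis": 0.7, "étudiant": 0.1, "_": 0.1},
    "suis":     {"Je": 0.1, "suis": 0.1, "étudiant": 0.7, "_": 0.1},
    "étudiant": {"Je": 0.1, "suis": 0.1, "étudiant": 0.1, "_": 0.7},
}

def toy_step(prev_word, state):
    return TOY_TABLE[prev_word], state

decoded = greedy_decode(toy_step, state=None)
```

Greedy decoding keeps a single hypothesis; beam search keeps several candidates per step instead of one.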

Testing

NMT beam-search decoders are much simpler!

Outline

Backpropagation Through Time

-log P(_)

Init to 0

-log P(étudiant)

-log P(suis)

RNN gradients are accumulated.
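A minimal sketch of gradient accumulation in backpropagation through time, using a scalar RNN h_t = tanh(w h_{t-1} + x_t) with a squared-error loss at every step. Both the model and the loss are assumptions chosen to keep the example small; the lecture's model uses the word-prediction loss.

```python
import math

def forward(w, xs):
    """Hidden states h_0..h_T of the scalar RNN, with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs

def loss(w, xs, ys):
    hs = forward(w, xs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

def bptt_grad_w(w, xs, ys):
    """dLoss/dw, accumulated over all time steps that share w."""
    hs = forward(w, xs)
    grad_w = 0.0
    dh_next = 0.0                                # gradient from later steps, init to 0
    for t in range(len(xs), 0, -1):
        dh = (hs[t] - ys[t - 1]) + dh_next       # loss at step t plus future steps
        dpre = dh * (1.0 - hs[t] ** 2)           # back through tanh
        grad_w += dpre * hs[t - 1]               # accumulate into the shared weight
        dh_next = dpre * w                       # pass back to h_{t-1}
    return grad_w
```

Because the same weight is used at every step, its gradient is the sum of the contributions from all steps.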

Outline

Recurrent types – vanilla RNN

Vanishing gradient problem!

Vanishing gradients

(Pascanu et al., 2013)

Chain Rule

Bound Rules

Bound: largest singular value

Sufficient condition
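The slide's equations are figures, so the steps named above are sketched here following Pascanu et al. (2013). The recurrence form and the symbols W, U, b, γ, and λ_1 are assumptions taken from that paper's setup, not from the slide text; γ bounds |σ'| (1 for tanh, 1/4 for the sigmoid).

```latex
% Assumption: recurrence in the form used by Pascanu et al. (2013),
%   h_t = W \sigma(h_{t-1}) + U x_t + b,
% with \gamma an upper bound on |\sigma'| and \lambda_1 the largest
% singular value of W. Transposes follow the paper's convention.
\begin{align*}
\text{Chain rule:} \quad
  \frac{\partial h_t}{\partial h_k}
  &= \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}
   = \prod_{i=k+1}^{t} W^{\top} \operatorname{diag}\!\big(\sigma'(h_{i-1})\big) \\
\text{Bound rules:} \quad
  \left\| \frac{\partial h_i}{\partial h_{i-1}} \right\|
  &\le \left\| W^{\top} \right\| \, \left\| \operatorname{diag}\!\big(\sigma'(h_{i-1})\big) \right\|
   \le \lambda_1 \gamma \\
\text{Sufficient condition:} \quad
  \lambda_1 < \frac{1}{\gamma}
  &\;\Longrightarrow\;
  \left\| \frac{\partial h_t}{\partial h_k} \right\|
  \le (\lambda_1 \gamma)^{t-k} \to 0 \text{ as } t-k \to \infty
\end{align*}
```

The bound shrinks exponentially in the distance t - k, which is why long-range dependencies contribute almost nothing to the gradient.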

Recurrent types – LSTM

•  Long Short-Term Memory (LSTM)
   – (Hochreiter & Schmidhuber, 1997)

•  LSTM cells are additively updated
   – Makes backprop through time easier.

C’mon, it’s been around for 20 years!

LSTM cells

Building LSTM

•  A naïve version.

Nice gradients!

Building LSTM

•  Add input gates: control input signal.

Input gates

Building LSTM

•  Add forget gates: control memory.

Forget gates

Building LSTM

Output gates

•  Add output gates: extract information.

•  (Zaremba et al., 2014).
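Putting the input, forget, and output gates together gives one LSTM step. This is a minimal sketch of the commonly used formulation (as in Zaremba et al.); the parameter layout, a dict of (W, b) pairs keyed by "i", "f", "o", "g", is an assumption made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps "i", "f", "o", "g" to (W, b) pairs."""
    v = x_t + h_prev                                        # concatenate [x_t; h_{t-1}]
    i = [sigmoid(z) for z in affine(*params["i"], v)]       # input gate
    f = [sigmoid(z) for z in affine(*params["f"], v)]       # forget gate
    o = [sigmoid(z) for z in affine(*params["o"], v)]       # output gate
    g = [math.tanh(z) for z in affine(*params["g"], v)]     # candidate input
    # Additive cell update: keep part of the old memory, add gated new input.
    c_t = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t
```

The cell state is updated by addition rather than by repeated squashing, which is the property the following slides credit for the easier gradient flow.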

Why does LSTM work?

•  The additive operation is the key!

•  The backpropagation path through the cell is effective.

(Figure: adjacent cells LSTM_{t-1} and LSTM_t.)

Forget gates are important!

•  (Graves, 2013): revived LSTM.
   – Direct connections between cells and gates.

•  Gated Recurrent Unit (GRU)
   – (Cho et al., 2014a)
   – No cells, same additive idea.

•  LSTM vs. GRU: mixed results (Chung et al., 2015).

Other RNN units

English – French WMT Results

Systems                                                 BLEU
SOTA in WMT’14 (Durrani et al., 2014)                   37.0

Standard MT + neural components
Schwenk (2014) – neural language model                  33.3
Cho et al. (2014) – phrase table neural features        34.5

NMT
Sutskever et al. (2014) – ensemble LSTMs                34.8
Luong et al. (2015a) – ensemble LSTMs + rare word       37.5

New SOTA!

Effects of Translating Rare Words

(Figure: x-axis: sentences ordered by average frequency rank; curves: Durrani et al. (37.0), Sutskever et al. (34.8), Luong et al. (37.5).)

Summary

Deep RNNs (Sutskever et al., 2014)

Bidirectional RNNs (Bahdanau et al., 2015)

•  Generalize well.

•  Small memory.

•  Simple decoder.

Outline

•  NMT basics (Sutskever et al., 2014)

Sentence Length Problem

(Figure curves: with attention, without attention; from Bahdanau et al., 2015.)

•  A fixed-dimensional source vector.

•  Problem: Markovian process.

Attention Mechanism

•  Solution: random access memory
   – Retrieve as needed.
   – cf. Neural Turing Machine (Graves et al., 2014).

Pool of source states

Alignments as a by-product

•  Recent innovation in deep learning:
   – Control problem (Mnih et al., 14)
   – Speech recognition (Chorowski et al., 15)
   – Image caption generation (Xu et al., 15)

Simplified Attention (Bahdanau et al., 2015)

Deep LSTM (Sutskever et al., 2014)

(Figure labels: Attention Layer, Context vector; inputs "am a student _ Je".)

What’s next?

Attention Mechanism – Scoring

•  Compare target and source hidden states.

(Figure labels: Attention Layer, Context vector; inputs "am a student _ Je"; example scores: 1 3 5 1.)

Attention Mechanism – Normalization

•  Convert into alignment weights.

(Figure: example alignment weights: 0.1 0.3 0.5 0.1.)

Attention Mechanism – Context vector

•  Build context vector: weighted average.
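The scoring, normalization, and context-vector steps can be sketched together. The dot-product score is one possible choice and is an assumption here; the weights are normalized with a softmax, as in Bahdanau et al. and Luong et al. The slide's example numbers (1 3 5 1 and 0.1 0.3 0.5 0.1) read as illustrative proportions, so a softmax over those scores would give different values.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(target_state, source_states):
    """Score each source state, normalize, and average the source states."""
    scores = [dot(target_state, h_s) for h_s in source_states]   # scoring
    weights = softmax(scores)                                    # alignment weights
    dim = len(source_states[0])
    context = [sum(w * h_s[k] for w, h_s in zip(weights, source_states))
               for k in range(dim)]                              # weighted average
    return weights, context
```

The context vector has the same dimension as a source state regardless of the source length, and the alignment weights come out as a by-product.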

Attention Mechanism – Hidden state

•  Compute the next hidden state.

Attention Mechanism – Predict

•  Predict the next word.

Attention Mechanism – Score Functions

(Luong et al., 2015b)
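The slide shows the score functions as a figure. A minimal sketch of the three forms described in Luong et al. (2015b), dot, general, and concat, is given below; the names `W_a` and `v_a` follow that paper, while the list-based helpers are assumptions made for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def score_dot(h_t, h_s):
    """dot: h_t^T h_s"""
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    """general: h_t^T W_a h_s"""
    return dot(h_t, matvec(W_a, h_s))

def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a^T tanh(W_a [h_t; h_s])"""
    hidden = [math.tanh(z) for z in matvec(W_a, h_t + h_s)]
    return dot(v_a, hidden)
```

With W_a set to the identity matrix, the general score reduces to the dot score.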

•  More focused attention (Luong et al., 2015b)
   – Focus on a subset of words each time.

English-German WMT Results

Systems                                                  BLEU
SOTA in WMT’14 (Buck et al., 2014)                       20.7

NMT
Jean et al. (2015) – GRUs + attention                    21.6
Luong et al. (2015b) – LSTMs + improved attention        23.0 (+2.3)

New SOTA! Even better!

Translate Long Sentences

(Figure: x-axis: Sent Lengths, 10 to 70; curves: ours, no attn (BLEU 13.9); ours, local-p attn (BLEU 20.9); ours, best system (BLEU 23.0); WMT’14 best (BLEU 20.7); Jean et al., 2015 (BLEU 21.6); annotations: No Attention, Attention.)

Sample English-German translations

•  Translate names correctly.

src Orlando Bloom and Miranda Kerr still love each other

ref Orlando Bloom und Miranda Kerr lieben sich noch immer

best Orlando Bloom und Miranda Kerr lieben einander noch immer .

base Orlando Bloom und Lucas Miranda lieben einander noch immer .

•  Translate a doubly-negated phrase correctly.

•  Fail to translate “passenger experience”.

src " We ' re pleased the FAA recognizes that an enjoyable passenger experience is not incompatible with safety and security , " said Roger Dow , CEO of the U.S. Travel Association .

ref “ Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Widerspruch zur Sicherheit steht ” , sagte Roger Dow , CEO der U.S. Travel Association .

best " Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und Sicherheit unvereinbar ist " , sagte Roger Dow , CEO der US - die .

base " Wir freuen uns über die <unk> , dass ein <unk> <unk> mit Sicherheit nicht vereinbar ist mit Sicherheit und Sicherheit " , sagte Roger Cameron , CEO der US - <unk> .

Sample German-English translations

•  Translate long sentences well. (Luong et al., 2015b)

src Wegen der von Berlin und der Europäischen Zentralbank verhängten strengen Sparpolitik in Verbindung mit der Zwangsjacke , in die die jeweilige nationale Wirtschaft durch das Festhalten an der gemeinsamen Währung genötigt wird , sind viele Menschen der Ansicht , das Projekt Europa sei zu weit gegangen

ref The austerity imposed by Berlin and the European Central Bank , coupled with the straitjacket imposed on national economies through adherence to the common currency , has led many people to think Project Europe has gone too far .

best Because of the strict austerity measures imposed by Berlin and the European Central Bank in connection with the straitjacket in which the respective national economy is forced to adhere to the common currency , many people believe that the European project has gone too far .

base Because of the pressure imposed by the European Central Bank and the Federal Central Bank with the strict austerity imposed on the national economy in the face of the single currency , many people believe that the European project has gone too far .

Summary

Simplified Attention (Bahdanau et al., 2015)

Deep LSTM (Sutskever et al., 2014)

Thank you!

References (1)

•  [Bahdanau et al., 2015] Neural Machine Translation by Jointly Learning to Align and Translate. http://arxiv.org/pdf/1409.0473.pdf

•  [Cho et al., 2014a] Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. http://aclweb.org/anthology/D/D14/D14-1179.pdf

•  [Cho et al., 2014b] On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. http://www.aclweb.org/anthology/W14-4012

•  [Chorowski et al., 2015] Attention-Based Models for Speech Recognition. http://arxiv.org/pdf/1506.07503v1.pdf

•  [Chung et al., 2015] Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. http://arxiv.org/pdf/1412.3555.pdf

•  [Graves, 2013] Generating Sequences With Recurrent Neural Networks. http://arxiv.org/pdf/1308.0850v5.pdf

•  [Graves et al., 2014] Neural Turing Machines. http://arxiv.org/pdf/1410.5401v2.pdf

•  [Hochreiter & Schmidhuber, 1997] Long Short-term Memory. http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf

References (2)

•  [Kalchbrenner & Blunsom, 2013] Recurrent Continuous Translation Models. http://nal.co/papers/KalchbrennerBlunsom_EMNLP13

•  [Luong et al., 2015a] Addressing the Rare Word Problem in Neural Machine Translation. http://www.aclweb.org/anthology/P15-1002

•  [Luong et al., 2015b] Effective Approaches to Attention-based Neural Machine Translation. https://aclweb.org/anthology/D/D15/D15-1166.pdf

•  [Mnih et al., 2014] Recurrent Models of Visual Attention. http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf

•  [Pascanu et al., 2013] On the difficulty of training Recurrent Neural Networks. http://arxiv.org/pdf/1211.5063v2.pdf

•  [Xu et al., 2015] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. http://jmlr.org/proceedings/papers/v37/xuc15.pdf

•  [Sutskever et al., 2014] Sequence to Sequence Learning with Neural Networks. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

•  [Zaremba et al., 2015] Recurrent Neural Network Regularization. http://arxiv.org/pdf/1409.2329.pdf

Encoder-decoder Summary

                                                                        Encoder               Decoder
(Sutskever et al., 2014) (Luong et al., 2015a) (Luong et al., 2015b)    Deep LSTM             Deep LSTM
(Cho et al., 2014a) (Bahdanau et al., 2015) (Jean et al., 2015)         (Bidirectional) GRU   GRU
(Kalchbrenner & Blunsom, 2013)                                          CNN                   (Inverse CNN)
(Cho et al., 2014b)                                                     Gated Recursive CNN   GRU

Important LSTM components?

•  (Jozefowicz et al., 2015): forget gate bias of 1.

•  (Greff et al., 2015): forget gates & output acts.

LSTM Backpropagation

(Figures: adjacent cells LSTM_{t-1} and LSTM_t.)

•  Deltas sent back from the top layers.

LSTM Backpropagation – Context

•  Complete context vector gradient.

•  First, use to compute gradients for .

•  Then, update .

LSTM Backprop

•  Compute gradients for , , .

•  Add gradients from the loss / upper layers.

Summary

•  LSTM backpropagation is nasty.

•  But it will be much easier if you:
   – Know your matrix calculus!
   – Pay attention to and .

Other Attention Functions

•  Content-based:

•  Location-based:
   – (Graves, 2013): handwriting synthesis model.

•  Hybrid:
   – (Chorowski et al., 2015) for speech recognition.

Local Attention (Luong et al., 2015b)

•  More focused attention.
   – Potentially useful for longer text sequences.

aligned positions?

Predict aligned positions

Real value in [0, S]

Source sentence

How do we learn the position parameters?

Alignment Weights

(Figure: alignment weights over source positions 3.5 to 7.5.)

Truncated Gaussian

•  Favor points close to the center.

Scaled Alignment Weights

(Figure: scaled alignment weights over source positions 3.5 to 7.5, with a new peak.)

Differentiable almost everywhere!
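A minimal sketch of predictive local attention as described in Luong et al. (2015b): predict a real-valued position p_t in [0, S], then scale the alignment weights by a Gaussian centered at p_t. The names `W_p`, `v_p`, and the window half-width `D` (with sigma = D / 2) follow that paper. Taking already-computed alignment weights over all positions and zeroing those outside the window is a simplification made for this example.

```python
import math

def predict_position(h_t, W_p, v_p, S):
    """p_t = S * sigmoid(v_p^T tanh(W_p h_t)), a real value in [0, S]."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h_t))) for row in W_p]
    z = sum(v * u for v, u in zip(v_p, hidden))
    return S / (1.0 + math.exp(-z))

def local_weights(align_weights, p_t, D):
    """Scale alignment weights by a truncated Gaussian centered at p_t."""
    sigma = D / 2.0
    scaled = []
    for s, a in enumerate(align_weights):
        if abs(s - p_t) > D:
            scaled.append(0.0)      # outside the window [p_t - D, p_t + D]
        else:
            scaled.append(a * math.exp(-((s - p_t) ** 2) / (2.0 * sigma ** 2)))
    return scaled
```

Because p_t is produced by smooth functions of the hidden state, the position parameters can be trained by backpropagation together with the rest of the model.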
