Recurrent neural networks for statistical machine translation
(Neural Machine Translation)
Outline
● Machine translation
● Statistical machine translation
● Traditional methods
● Why Neural Machine Translation (NMT)
● General sequence learning with RNNs
● Encoder-decoder approach
● Attention-based NMT
● Using monolingual corpora in NMT
● Character-based NMT
Machine Translation
● Translate a source sentence F to a target sentence E
● Define set of rules (Rule based translation)
● We don’t even know the set of rules underlying a single language
● Statistical approaches attempt to extract rules automatically from a large corpus of text, either implicitly or explicitly
Statistical Machine Translation
Corpora
e=(economic, growth, has, slowed, down, in, recent, years, .)
f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)
Translation function
Traditional methods
Word-Based Model
● Words may not be the best candidates for the smallest units
● One word may translate to multiple words and vice versa
● A source word may have no corresponding word in the target translation, and vice versa

the house is very small
das Haus ist klitzeklein
Phrase-Based Model
● Translate small word sequences at a time
● The foreign input is segmented into phrases
● Each phrase is translated into English
● Phrases are reordered
of course john has fun with the game
natuerlich hat john spass am spiel
Phrase-Based Model (cont.)
● Fundamental equation of SMT

  e_best = argmax_e p(e | f) = argmax_e p(f_1^I | e_1^I) p_LM(e)

  – Translation model: p(f_1^I | e_1^I)
  – Language model: p_LM(e)

● Decomposition of the translation model

  p(f_1^I | e_1^I) = ∏_{i=1..I} φ(f_i | e_i) d(start_i − end_{i−1} − 1)

  – Phrase translation probability φ
  – Reordering (distortion) probability d
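To make the decomposition concrete, here is a minimal Python sketch of scoring one segmented hypothesis against a hypothetical phrase table, using an exponential distortion penalty for d; the phrases, positions, and probabilities are illustrative only, not taken from any real system.

import math

# Hypothetical phrase table: (foreign phrase, english phrase) -> probability phi(f|e)
phrase_table = {
    ("natuerlich", "of course"): 0.5,
    ("hat john", "john has"): 0.8,
    ("spass am spiel", "fun with the game"): 0.7,
}

def translation_log_score(phrase_pairs, alpha=0.75):
    """Sum of log phi(f_i|e_i) + log d(start_i - end_{i-1} - 1).

    phrase_pairs: list of (f_phrase, e_phrase, start, end), in target-language order,
    with start/end positions in the foreign sentence.
    d is modelled here as an exponential distortion penalty alpha^|distance|.
    """
    score, prev_end = 0.0, 0
    for f, e, start, end in phrase_pairs:
        score += math.log(phrase_table[(f, e)])            # log phi(f_i|e_i)
        score += abs(start - prev_end - 1) * math.log(alpha)  # log d(.)
        prev_end = end
    return score

# Monotone segmentation of the example sentence pair
score = translation_log_score([
    ("natuerlich", "of course", 1, 1),
    ("hat john", "john has", 2, 3),
    ("spass am spiel", "fun with the game", 4, 6),
])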
Phrase Translation Table
● Main knowledge source: table with phrase translations and their probabilities
● Example: phrase translations for natuerlich
Translation       Probability φ(e|f)
of course         0.5
naturally         0.3
of course ,       0.15
, of course ,     0.05
Weighted Model
● Some sub-models may be more important than others
● Add weights
e_best = argmax_e ∏_{i=1..I} φ(f_i | e_i) d(start_i − end_{i−1} − 1) · ∏_{i=1..|e|} p_LM(e_i | e_1 … e_{i−1})

With weights λ_φ, λ_d, λ_LM:

e_best = argmax_e ∏_{i=1..I} φ(f_i | e_i)^λ_φ d(start_i − end_{i−1} − 1)^λ_d · ∏_{i=1..|e|} p_LM(e_i | e_1 … e_{i−1})^λ_LM
Weighted Model as Log-Linear Model
p(e, a | f) = exp{ λ_φ ∑_{i=1..I} log φ(f_i | e_i)
                 + λ_d ∑_{i=1..I} log d(a_i − b_{i−1} − 1)
                 + λ_LM ∑_{i=1..|e|} log p_LM(e_i | e_1 … e_{i−1}) }
Log-Linear Model
● Such a weighted model is a log-linear model:
● Feature functions
  – number of feature functions: n = 3
  – random variable: x = (e, f, start, end)
  – feature function h_1 = log φ
  – feature function h_2 = log d
  – feature function h_3 = log p_LM
● More features can be added

p(x) = exp ∑_{i=1..n} λ_i h_i(x)
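As a rough illustration (not any particular decoder), a hypothesis score under such a log-linear model could be computed as below; the feature names and weight values are made up for the example.

import math

def log_linear_score(features, weights):
    """Unnormalized log-linear score: exp(sum_i lambda_i * h_i(x)).

    features: dict of feature values h_i(x), e.g. log phi, log d, log p_LM
    weights:  dict of the corresponding lambda_i
    """
    return math.exp(sum(weights[name] * value for name, value in features.items()))

# Hypothetical feature values for one translation hypothesis
h = {"log_phi": math.log(0.5), "log_d": math.log(0.75), "log_plm": math.log(0.01)}
lam = {"log_phi": 1.0, "log_d": 0.6, "log_plm": 1.2}
score = log_linear_score(h, lam)   # hypotheses are compared by this (unnormalized) score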
Problems with traditional approaches
● Highly memory intensive
● The initial alignment makes a conditional independence assumption
● Word and phrase translation models only count co-occurrences of surface forms – they do not take word similarity into account
● Decoding phrase-based translation is highly non-trivial
  – word alignments + lexical reordering model
  – language model
  – phrase translations
Evaluation
● BLEU (BiLingual Evaluation Understudy)
● N-gram overlap between machine translation output and reference translation
● Compute precision for n-grams of size 1 to 4
● Add brevity penalty (for too short translations)
BLEU = min(1, output-length / reference-length) · ( ∏_{i=1..4} precision_i )^{1/4}
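A minimal sketch of this simplified BLEU (single reference, with the brevity penalty as written on the slide rather than the exponential penalty of Papineni et al.):

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(output, reference, max_n=4):
    """Brevity penalty min(1, |out|/|ref|) times the geometric mean of
    1..4-gram precisions against a single reference."""
    out, ref = output.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        out_ngrams, ref_counts = ngrams(out, n), Counter(ngrams(ref, n))
        matches = sum(min(c, ref_counts[g]) for g, c in Counter(out_ngrams).items())
        if matches == 0:
            return 0.0            # any zero precision makes the geometric mean zero
        log_prec += math.log(matches / len(out_ngrams)) / max_n
    bp = min(1.0, len(out) / len(ref))
    return bp * math.exp(log_prec)

print(bleu("airport security Israeli officials are responsible",
           "Israeli officials are responsible for airport security"))   # ~0.52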
Example
SYSTEM A: Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
SYSTEM B: airport security Israeli officials are responsible
(System A matches one 2-gram plus isolated 1-grams; System B matches up to a 4-gram.)

Metric               System A   System B
precision (1-gram)   3/6        6/6
precision (2-gram)   1/5        4/5
precision (3-gram)   0/4        2/4
precision (4-gram)   0/3        1/3
brevity penalty      6/7        6/7
BLEU                 0%         52%
Multiple Reference Translations
● To account for variability, use multiple reference translations
  – n-grams may match in any of the references
  – the closest reference length is used
● Example

SYSTEM:     Israeli officials responsibility of airport safety

REFERENCES: Israeli officials are responsible for airport security
            Israel is in charge of the security at this airport
            The security work for this airport is the responsibility of the Israel government
            Israeli side was in charge of the security of this airport
General sequence learning using RNNs (Text analysis example)
Unfolding in time

(Diagram: an RNN with input x, hidden state s and output o is unfolded in time into inputs x_{t−1}, x_t, x_{t+1}, hidden states s_{t−1}, s_t, s_{t+1} and outputs o_{t−1}, o_t, o_{t+1}. U maps the input to the hidden state, W carries the hidden state to the next time step, and V maps the hidden state to the output. The weights U, V, W are tied across all time steps.)
How RNN works

(Diagram: each word w_i is 1-of-K coded, mapped to a continuous-space word representation s_i, and fed into the recurrent state h_i; the final state feeds a softmax over the output classes +ive / −ive.)

e=(economic, growth, has, slowed, down, in, recent, years, .)

For example: predict the sentiment (+ive / −ive) of the input document.
Step 1: Word to one-hot vector

(Step 1 of the diagram: each word is 1-of-K coded into a one-hot vector x_i.)

e=(economic, growth, has, slowed, down, in, recent, years, .)
Step 2: One-hot vector to continuous-space representation

(Step 2 of the diagram: the one-hot vector x_i is mapped to a continuous-space word representation s_i.)

e=(economic, growth, has, slowed, down, in, recent, years, .)

s_i = E_{d×K} x_i

Projects the 1-of-K coded vector with a matrix E onto a d-dimensional (typically 100–500) continuous word representation.
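In code, this projection is nothing more than selecting a column of E; the vocabulary size and dimensionality below are assumed for illustration.

import numpy as np

K, d = 50000, 300                  # vocabulary size and embedding dimension (assumed)
E = np.random.randn(d, K) * 0.01   # continuous-space word representation matrix

def one_hot(index, K):
    x = np.zeros(K)
    x[index] = 1.0
    return x

x_i = one_hot(42, K)               # 1-of-K coded word
s_i = E @ x_i                      # s_i = E_{d×K} x_i
# In practice the matrix product is just a column lookup: s_i == E[:, 42]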
Step 3: Sequence summarization by RNN

(Step 3 of the diagram: the recurrent state h_i summarizes the sequence read so far.)

e=(economic, growth, has, slowed, down, in, recent, years, .)

h_i = φ_θ(h_{i−1}, s_i)

The sequence of continuous vectors s_i is summarized by the RNN.
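A minimal sketch of this recurrence, assuming a plain tanh transition for φ_θ (in practice gated units such as LSTM or GRU are used); all sizes and inputs are placeholders.

import numpy as np

d, m = 300, 512                           # embedding and hidden sizes (assumed)
U = np.random.randn(m, d) * 0.01          # input-to-hidden weights
W = np.random.randn(m, m) * 0.01          # hidden-to-hidden weights
b = np.zeros(m)

def rnn_step(h_prev, s_i):
    """h_i = phi_theta(h_{i-1}, s_i), here with a simple tanh transition."""
    return np.tanh(U @ s_i + W @ h_prev + b)

# Stand-ins for the continuous representations s_i of a 9-word sentence
word_vectors = [np.random.randn(d) for _ in range(9)]

h = np.zeros(m)                           # h_0
for s_i in word_vectors:
    h = rnn_step(h, s_i)                  # the final h summarizes the whole sentence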
Architecture is flexible

(Diagram: the same 1-of-K coding → continuous-space word representation → recurrent state pipeline applies to input sequences of any length.)

e=(economic, growth, has, slowed, down, in, recent, years, .)

Summary sentence representation vectors

Sentence representations from [Sutskever et al., 2014]; similar sentences are close together.
Step 4: Use the summarization for prediction

(Step 4 of the diagram: the final recurrent state feeds a softmax that predicts +ive / −ive.)

e=(economic, growth, has, slowed, down, in, recent, years, .)
Range of applications
● image classification: fixed-sized input to fixed-sized output
● image captioning: sequence output
● sentiment analysis: sequence input
● machine translation: sequence input and sequence output
● video description: synced sequence input and output

Image: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Neural Machine Translation
● Neural Machine Translation: a single neural net maps the input sentence directly to the output sentence
● Different from earlier hybrids:
  – SMT + reranking by a neural net (Schwenk et al., 2006)
  – SMT with neural-network components (Devlin et al., 2014)
Neural Machine Translation
(Ñeco&Forcada, 1997)
(Kalchbrenner et al., 2013)
(Cho et al., 2014)
(Sutskever et al., 2014)
Encoder-Decoder Architecture for Machine Translation
Encoder-Decoder Approach
Input sentence → Encoder → Fixed-size representation → Decoder → Output sentence
(Ñeco&Forcada, 1997)
(Kalchbrenner et al., 2013)
(Cho et al., 2014)
(Sutskever et al., 2014)
RNN Encoder-Decoder (Cho et al. 2014)
(Diagram: the encoder RNN reads the source words x_1, x_2, …, x_L into hidden states h_1, h_2, …, h_L; the final state serves as a fixed-size representation that initializes the decoder RNN states s_0, s_1, …, s_{T−1}, which emit the target words y_1, y_2, …, y_T.)

Neural machine translation system: 1-of-K coding → continuous-space word representation s_i → encoder recurrent state h_i → decoder recurrent state z_i → word probability P_i → word sample u_i.

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)
e=(economic, growth, has, slowed, down, in, recent, years, .)
The Encoder
Step 1: Word to one-hot vector

(Step 1 of the encoder diagram: each source word is 1-of-K coded into a one-hot vector x_i.)

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)
Step 2: One-hot vector to continuous-space representation

(Step 2 of the encoder diagram: the one-hot vector x_i is mapped to a continuous-space word representation s_i.)

s_i = E_{d×K} x_i

Projects the 1-of-K coded vector with a matrix E onto a d-dimensional (typically 100–500) continuous word representation. E, and hence s_i, is learned to maximize translation performance.

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)
Step 3: Sequence summarization by RNN

(Step 3 of the encoder diagram: the recurrent state h_i summarizes the source words read so far.)

h_i = φ_θ(h_{i−1}, s_i)

The sequence of continuous vectors s_i is summarized by the RNN.

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)

Summary sentence representation vectors

Sentence representations from [Sutskever et al., 2014]; similar sentences are close together.
The Decoder
(Decoder diagram: recurrent state z_i → word probability P_i → word sample u_i.)

Step 1: Compute the internal hidden state of the decoder

(1)  z_i = φ_θ'(h_T, u_{i−1}, z_{i−1})

e=(economic, growth, has, slowed, down, in, recent, years, .)
Step 2: Next-word probability

(2)  e(k) = w_k^T z_i + b_k

     p(w_i = k | w_1, w_2, …, w_{i−1}, h_T) = exp(e(k)) / ∑_j exp(e(j))

Softmax: score and normalize the target words.

e=(economic, growth, has, slowed, down, in, recent, years, .)
Step 3: Sample the next word

(3)  The next target word u_i is sampled from the distribution P_i.

e=(economic, growth, has, slowed, down, in, recent, years, .)
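Steps 2 and 3 together, as a small numpy sketch with assumed sizes (decoder state of 512, target vocabulary of 30,000 words):

import numpy as np

def next_word_distribution(z_i, W_out, b_out):
    """e(k) = w_k^T z_i + b_k for every target word k, then softmax-normalize."""
    e = W_out @ z_i + b_out            # one score per vocabulary entry
    e -= e.max()                       # numerical stability
    p = np.exp(e)
    return p / p.sum()                 # p(w_i = k | w_<i, h_T)

z_i = np.random.randn(512)
W_out, b_out = np.random.randn(30000, 512) * 0.01, np.zeros(30000)
P_i = next_word_distribution(z_i, W_out, b_out)     # step 2: word probability
u_i = np.random.choice(len(P_i), p=P_i)             # step 3: sample the next word index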
Training
Training: Maximum Likelihood Estimation
● Given a parallel corpus D of training examples

  D = {(x^1, y^1), (x^2, y^2), …, (x^N, y^N)}

● Compute log P(y^n | x^n, θ) for each example
● Log-likelihood of the training corpus:

  L(D, θ) = (1/N) ∑_{n=1..N} log P(y^n | x^n, θ)

● Maximize the log-likelihood using stochastic gradient descent (SGD)
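A sketch of the objective itself; the per-word probabilities below are a hypothetical stand-in for whatever the decoder softmax would assign to the correct target words.

import numpy as np

def log_likelihood(per_word_probs):
    """L(D, theta) = (1/N) * sum_n log P(y^n | x^n, theta), where
    log P(y^n | x^n, theta) = sum_t log p(y^n_t | y^n_<t, x^n, theta)."""
    return np.mean([np.sum(np.log(p)) for p in per_word_probs])

# Toy example: two "sentences" with the model's probability of each correct target word
D_probs = [np.array([0.4, 0.6, 0.3]), np.array([0.5, 0.2])]
L = log_likelihood(D_probs)
# SGD maximizes L by repeatedly nudging theta along the gradient of log P(y^n | x^n, theta).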
Problem with simple encoder-decoder architectures

(The whole source sentence must be compressed into a single fixed-size vector, which becomes harder as sentences get longer.)
Soft Attention Mechanism for Neural Machine Translation
Bidirectional recurrent neural networks for encoding a source sentence
(Encoder diagram: each source word is 1-of-K coded, mapped to a continuous-space word representation s_i, and read by a bidirectional recurrent state, giving one annotation vector h_i per source position.)

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)
Attention Mechanism
● To decide the i-th target word, calculate a relevance score between the i-th target word and every source word
● The attention mechanism is implemented by a neural network
Attention Mechanism (step 1)
(Diagram: annotation vectors h_1 … h_J feed the attention mechanism, which produces attention weights a_j with ∑_j a_j = 1.)

(1)  e_j = W_a z_{i−1} + U_a h_j

e=(economic, growth, has, slowed, down, in, recent, years, .)
Attention Mechanism (step 2)

(2)  α_j = exp(e_j) / ∑_{j'} exp(e_{j'})

e=(economic, growth, has, slowed, down, in, recent, years, .)
Attention Mechanism (step 3)

(3)  c_i = ∑_{j=1..T} α_j h_j

e=(economic, growth, has, slowed, down, in, recent, years, .)
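Putting the three steps together, a sketch of one attention step using the slides' linear scoring form (Bahdanau et al. additionally pass the scores through a tanh layer); all shapes and parameter values are assumed.

import numpy as np

def attention_context(z_prev, H, W_a, U_a):
    """Soft attention as on the slides:
    e_j = W_a z_{i-1} + U_a h_j,  alpha_j = softmax(e_j),  c_i = sum_j alpha_j h_j.

    z_prev : previous decoder state z_{i-1}, shape (n,)
    H      : annotation vectors h_1..h_T, shape (T, m)
    W_a    : shape (n,), maps z_{i-1} to a scalar score contribution
    U_a    : shape (m,), maps each h_j to a scalar score contribution
    """
    e = H @ U_a + W_a @ z_prev          # e_j for every source position j
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                # attention weights, sum to 1
    c_i = alpha @ H                     # context vector c_i
    return c_i, alpha

# Usage with assumed sizes
T, m, n = 9, 512, 512
c_i, alpha = attention_context(np.random.randn(n), np.random.randn(T, m),
                               W_a=np.random.randn(n) * 0.01,
                               U_a=np.random.randn(m) * 0.01)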
Effective Approaches to Attention-based Neural Machine Translation (Minh-Thang Luong et al., 2015)
Global attentional model

(Diagram: at each step the decoder state h_t is compared with all encoder states h_s; the global align weights a_t give a context vector c_t, which is combined with h_t in the attention layer to form ~h_t and predict y_t.)
Local attentional model

(Diagram: the model first predicts an aligned position p_t, then computes local weights a_t over a window of encoder states h_s around p_t; the context vector c_t and h_t form ~h_t as above.)
Input-feeding approach

(Diagram: the attentional vector ~h_t is fed as an additional input to the decoder at the next time step, so the model keeps track of past alignment decisions. Example sequence: A B C D <eos> → X Y Z <eos>.)
Results
WMT’14 English-German results
Results (cont)
WMT’15 English-German results
Word-Character Models
Another Character-based NMT
Character-based word embedding
Marta R. Costa-jussà and José A. R. Fonollosa
Yet another (fully) Character-Level NMT
Fully Character-Level Neural Machine Translation without Explicit Segmentation (Nov 2016), Jason Lee, Kyunghyun Cho, Thomas Hofmann
Using monolingual corpora in NMT(C. Gulcehre et al., 2015)
Integrating Language Model into the decoder
● Pre-train the NMT model (on parallel corpora) and an RNNLM (Mikolov et al., 2011; on larger monolingual corpora) separately
● Shallow Fusion
● At each time step:
  – the TM proposes a set of candidate words
  – TM score: "candidate sentences" + TM scores of the candidate words
  – select the top-K candidates
  – rescore these hypotheses with the weighted sum of the scores given by the NMT and the RNNLM
  – the score of the new word (a sketch of this combination follows below):

    log p(y_t = k) = log p_TM(y_t = k) + β log p_LM(y_t = k)
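A sketch of that rescoring rule with hypothetical inputs; in a real decoder the candidate words come from beam search, and β is tuned on a development set.

import numpy as np

def shallow_fusion_rescore(candidates, log_p_tm, log_p_lm, beta=0.2, top_k=10):
    """Keep the TM's top-K candidate words, then rescore each with
    log p(y_t = k) = log p_TM(y_t = k) + beta * log p_LM(y_t = k).

    candidates         : word indices proposed by the translation model
    log_p_tm, log_p_lm : log-probability arrays over the target vocabulary
    """
    top = sorted(candidates, key=lambda k: log_p_tm[k], reverse=True)[:top_k]
    return {k: log_p_tm[k] + beta * log_p_lm[k] for k in top}

# Toy usage with a vocabulary of 1,000 words and random distributions
V = 1000
log_p_tm = np.log(np.random.dirichlet(np.ones(V)))
log_p_lm = np.log(np.random.dirichlet(np.ones(V)))
scores = shallow_fusion_rescore(range(V), log_p_tm, log_p_lm)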
Shallow fusion (diagram)
(Diagram: the NMT decoder state s_t^TM, with context c_t, proposes candidate words y_t and translation-model scores for the candidate sentences y_≤(i); the language-model state s_t^LM rescores them, weighted by β: "language model rescoring".)
Deep fusion
● Deep Fusion
  – integrate the RNNLM and the decoder of the NMT by concatenating their hidden states next to each other
  – controller mechanism (gate)

p(y_t | y_<t, x) ∝ exp( y_t^T ( W_o f_o(s_{t−1}^TM, s_{t−1}^LM, y_{t−1}, c_t) + b_o ) )

g_t = σ(v_g^T s_t^LM + b_g)
Deep fusion (diagram)
(Diagram: shallow fusion on the left, as above; on the right, deep fusion concatenates the NMT decoder state s_t^TM and the gated LM state g_t · s_t^LM at every time step t, t+1, … to predict y_t from the context c_t.)
Results
              SMS/CHAT            CTS
              Dev      Test       Dev      Test
PB            15.5     14.73      21.94    21.68
 +CSLM        16.02    15.25      23.05    22.79
HPB           15.33    14.71      21.45    21.43
 +CSLM        15.93    15.8       22.61    22.17
NMT           17.32    17.36      23.4     23.59
 Shallow      16.59    16.42      22.7     22.83
 Deep         17.58    17.64      23.78    23.5
Results on the task of Zh-En
              De-En               Cs-En
              Dev      Test       Dev      Test
NMT           25.51    23.61      21.47    21.89
Shallow       25.53    23.69      21.95    22.18
Deep          25.88    24         22.49    22.36
Results for De-En and Cs-En translation tasks on WMT’15 dataset.
Results (cont)
                              Development set   Test set
                              dev2010           tst2010   tst2011   tst2012   tst2013   Test 2014
Previous Best (Single)        15.33             17.14     18.62     18.88     18.88     -
Previous Best (Combination)   -                 17.34     18.93     18.7      18.7      -
NMT                           14.5              18.01     18.77     19.86     19.86     18.64
NMT+LM (Shallow)              14.44             17.99     18.8      19.87     19.87     18.66
NMT+LM (Deep)                 15.69             19.34     20.23     21.34     21.34     20.56
Results on Tr-En
Future work: domain adaptation of the language model may further improve translations.
The performance improvement from incorporating an external LM was highly dependent on the domain similarity between the monolingual corpus and the target task.
References
● Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
● Cho, Kyunghyun et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
● Cho, Kyunghyun, Aaron Courville, and Yoshua Bengio. "Describing Multimedia Content using Attention-based Encoder-Decoder Networks." arXiv preprint arXiv:1507.01053 (2015).
● Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural Turing Machines." arXiv preprint arXiv:1410.5401 (2014).
● Gulcehre, Caglar et al. "On Using Monolingual Corpora in Neural Machine Translation." arXiv preprint arXiv:1503.03535 (2015).
● Kalchbrenner, Nal, and Phil Blunsom. "Recurrent Continuous Translation Models." EMNLP 2013: 1700-1709.
● Koehn, Philipp. Statistical Machine Translation. Cambridge University Press, 2009.
● Pascanu, Razvan et al. "How to construct deep recurrent neural networks." arXiv preprint arXiv:1312.6026 (2013).
● Schwenk, Holger. "Continuous space language models." Computer Speech & Language 21.3 (2007): 492-518.
● Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in Neural Information Processing Systems 2014: 3104-3112.
● http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/
● Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective Approaches to Attention-based Neural Machine Translation." arXiv preprint arXiv:1508.04025 (2015).
● Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.