Translation with recurrent and attention models

Edinburgh MT lecture 12: Neural encoder-decoder and attention models


Page 1: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

Translation with recurrent and attention models

Page 2: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

"Neural" Networks

[Figure: an input vector x = (x_1, ..., x_5) is mapped through a weight matrix W to an output vector y = (y_1, y_2, y_3).]

y = x^T W

y = g(x^T W)

"Soft max": g(u)_i = exp(u_i) / Σ_{i'} exp(u_{i'})

What about probabilities? Let each y_i be an outcome.
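A minimal sketch (assuming NumPy; the sizes and values are illustrative) of the softmax turning a score vector into a probability distribution over outcomes:

```python
import numpy as np

def softmax(u):
    # Subtract the max for numerical stability; the result is unchanged.
    z = np.exp(u - u.max())
    return z / z.sum()

x = np.array([0.5, -1.0, 2.0, 0.0, 1.5])   # input vector (5 features)
W = np.random.randn(5, 3)                  # weight matrix: 5 inputs -> 3 outcomes
y = softmax(x @ W)                         # y_i = exp(u_i) / sum_{i'} exp(u_{i'})
print(y, y.sum())                          # the outputs sum to 1
```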

Page 3: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

"Deep"

[Figure: the input x = (x_1, ..., x_5) is mapped through W to a hidden layer y = (y_1, y_2, y_3), and then through V to an output layer z = (z_1, ..., z_4).]

z = g(y^T V)

z = g(h(x^T W)^T V)

or, in column notation, z = g(V h(W x)).

Note: if g(x) = h(x) = x, then z = (V W) x = U x, i.e. the two layers collapse into a single linear map U = V W.

"Soft max": s(u)_i = exp(u_i) / Σ_{i'} exp(u_{i'})
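A small sketch (NumPy assumed; shapes are illustrative) of the two-layer computation, plus a check of the note above that identity activations collapse the network into one linear map:

```python
import numpy as np

def softmax(u):
    z = np.exp(u - u.max())
    return z / z.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=5)          # input
W = rng.normal(size=(3, 5))     # first layer (column convention: h = tanh(W x))
V = rng.normal(size=(4, 3))     # second layer

z = softmax(V @ np.tanh(W @ x))       # z = g(V h(W x)) with h = tanh, g = softmax

# With identity activations the two layers collapse to one linear map U = V W:
U = V @ W
assert np.allclose(V @ (W @ x), U @ x)
```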

Page 4: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

Bengio et al. (2003)

p(e) = ∏_{i=1}^{|e|} p(e_i | e_{i-n+1}, ..., e_{i-1})

[Figure: a feed-forward n-gram language model. The context words e_{i-3}, e_{i-2}, e_{i-1} are each embedded with a shared matrix C; the concatenated embeddings pass through a tanh layer (weights W) and a softmax output layer (weights V) to give p(e_i | e_{i-n+1}, ..., e_{i-1}).]
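A rough sketch (NumPy; the vocabulary, embedding, and hidden sizes are made up, and the direct embedding-to-output connections of the original paper are omitted) of the forward pass of such a feed-forward n-gram LM:

```python
import numpy as np

def softmax(u):
    z = np.exp(u - u.max())
    return z / z.sum()

rng = np.random.default_rng(0)
vocab, d_emb, d_hid, n = 1000, 30, 50, 4         # illustrative sizes; n-gram order 4

C = rng.normal(size=(vocab, d_emb))              # shared word embeddings
W = rng.normal(size=(d_hid, (n - 1) * d_emb))    # tanh hidden layer
V = rng.normal(size=(vocab, d_hid))              # softmax output layer

context = [17, 42, 256]                          # indices of e_{i-3}, e_{i-2}, e_{i-1}
h = np.tanh(W @ np.concatenate([C[w] for w in context]))
p_next = softmax(V @ h)                          # p(e_i | e_{i-n+1}, ..., e_{i-1})
print(p_next.shape, p_next.sum())
```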

Page 5: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

[Sidebar outline: 1 Basics (Neural Network, Computation Graph, Why Deep is Hard, 2006 Breakthrough) · 2 Building Blocks (RBM, Auto-Encoders, Recurrent Units, Convolution) · 3 Tricks (Optimization, Regularization, Infrastructure) · 4 New Stuff (Encoder-Decoder, Attention/Memory, Deep Reinforcement, Variational Auto-Encoder)]

Training Neural Nets: Back-propagation

[Figure: a network with inputs x_1, ..., x_4, hidden units h_1, h_2, h_3, and output y; weights w_ij connect x_i to h_j and weights w_j connect h_j to y. The forward pass predicts f(x^(m)); the backward pass adjusts the weights.]

1. For each sample, compute f(x^(m)) = σ(Σ_j w_j · σ(Σ_i w_ij x_i^(m)))
2. If f(x^(m)) ≠ y^(m), back-propagate the error and adjust the weights {w_ij, w_j}.

(p.14)

Page 6: Edinburgh MT lecture 12: Neural encoder-decoder and attention models


Derivatives of the weights

Assume two outputs (y_1, y_2) per input x, and loss per sample:

Loss = Σ_k ½ [σ(in_k) − y_k]²

[Figure: inputs x_1, ..., x_4, hidden units h_1, h_2, h_3, and outputs y_1, y_2; weights w_ij connect x_i to h_j and weights w_jk connect h_j to y_k.]

∂Loss/∂w_jk = (∂Loss/∂in_k)(∂in_k/∂w_jk) = δ_k ∂(Σ_j w_jk h_j)/∂w_jk = δ_k h_j

∂Loss/∂w_ij = (∂Loss/∂in_j)(∂in_j/∂w_ij) = δ_j ∂(Σ_i w_ij x_i)/∂w_ij = δ_j x_i

where

δ_k = ∂/∂in_k ( Σ_k ½ [σ(in_k) − y_k]² ) = [σ(in_k) − y_k] σ′(in_k)

δ_j = Σ_k (∂Loss/∂in_k)(∂in_k/∂in_j) = Σ_k δ_k · ∂/∂in_j ( Σ_j w_jk σ(in_j) ) = [Σ_k δ_k w_jk] σ′(in_j)

(p.15)
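A compact sketch (NumPy; the sizes and the choice of the logistic function for σ are illustrative) of one gradient computation using exactly the δ rules above:

```python
import numpy as np

def sigma(u):                              # σ: logistic activation, σ'(u) = σ(u)(1 − σ(u))
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                     # one input sample, x_1..x_4
y = np.array([0.0, 1.0])                   # targets y_1, y_2
W_ij = rng.normal(size=(4, 3))             # input -> hidden weights w_ij
W_jk = rng.normal(size=(3, 2))             # hidden -> output weights w_jk

# Forward pass
in_j = x @ W_ij;  h = sigma(in_j)          # hidden activations h_j = σ(in_j)
in_k = h @ W_jk;  out = sigma(in_k)        # outputs σ(in_k)

# Backward pass, following the derivation above
delta_k = (out - y) * out * (1 - out)      # δ_k = [σ(in_k) − y_k] σ'(in_k)
delta_j = (W_jk @ delta_k) * h * (1 - h)   # δ_j = [Σ_k δ_k w_jk] σ'(in_j)

dLoss_dWjk = np.outer(h, delta_k)          # ∂Loss/∂w_jk = δ_k h_j
dLoss_dWij = np.outer(x, delta_j)          # ∂Loss/∂w_ij = δ_j x_i
```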

Page 7: Edinburgh MT lecture 12: Neural encoder-decoder and attention models
Page 8: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

Since the derivatives and partials of each function are known, this is fully automatable (Torch, Theano, TensorFlow, etc.)
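To illustrate why this is automatable, here is a toy reverse-mode autodiff sketch (plain Python, scalars only; the Var class and its methods are invented for illustration — real frameworks do the same bookkeeping over tensors and computation graphs):

```python
import math

class Var:
    """A scalar that remembers how it was computed, so gradients can be back-propagated."""
    def __init__(self, value, parents=(), grad_fns=()):
        self.value, self.parents, self.grad_fns, self.grad = value, parents, grad_fns, 0.0

    def __mul__(self, other):
        return Var(self.value * other.value, (self, other),
                   (lambda g: g * other.value, lambda g: g * self.value))

    def __add__(self, other):
        return Var(self.value + other.value, (self, other),
                   (lambda g: g, lambda g: g))

    def tanh(self):
        t = math.tanh(self.value)
        return Var(t, (self,), (lambda g: g * (1 - t * t),))

    def backward(self, grad=1.0):
        self.grad += grad
        for parent, grad_fn in zip(self.parents, self.grad_fns):
            parent.backward(grad_fn(grad))

# y = tanh(w * x + b); dy/dw and dy/db fall out of the recorded local derivatives
w, x, b = Var(0.5), Var(2.0), Var(-1.0)
y = (w * x + b).tanh()
y.backward()
print(w.grad, b.grad)   # chain rule applied without writing any derivative by hand
```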

Page 9: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

LMs in MT

p(e, f) = p_LM(e) × p_TM(f | e)

Page 10: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

LMs in MT

p(e, f) = p_LM(e) × p_TM(f | e)
        = ∏_{i=1}^{|e|} p(e_i | e_{i-1}, ..., e_1) × ∏_{j=1}^{|f|} p(f_j | f_{j-1}, ..., f_1, e)

Page 11: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

LMs in MT

p(e, f) = p_LM(e) × p_TM(f | e)
        = ∏_{i=1}^{|e|} p(e_i | e_{i-1}, ..., e_1) × ∏_{j=1}^{|f|} p(f_j | f_{j-1}, ..., f_1, e)

Conditional language model
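As a concrete reading of this factorization, a tiny sketch (plain Python; the per-token probabilities are invented toy numbers — in a real system they come from the two language models) of scoring a sentence pair by chaining conditional probabilities:

```python
import math

# Toy conditional probabilities p(e_i | e_<i) and p(f_j | f_<j, e), one per token.
p_e_tokens = [0.2, 0.5, 0.4, 0.3, 0.9]
p_f_tokens = [0.3, 0.4, 0.5, 0.2, 0.8]

# log p(e, f) = Σ_i log p(e_i | e_<i) + Σ_j log p(f_j | f_<j, e)
log_p = sum(math.log(p) for p in p_e_tokens) + sum(math.log(p) for p in p_f_tokens)
print(log_p)
```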

Page 12: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

[Figure (repeated as animation frames on pages 13–20): an attention-based decoder translating the source "Ich möchte ein Bier" into "I'd like a beer" followed by STOP, starting from an initial state 0. The attention history a_1^T, ..., a_5^T records the attention weights used for each output word.]


Page 21: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

[Same figure as page 12.]

p(e) = ∏_{i=1}^{|e|} p(e_i | e_{i-1}, ..., e_1)

Page 22: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

e = (Economic, growth, has, slowed, down, in, recent, years, .)

f = (La, croissance, économique, s'est, ralentie, ces, dernières, années, .)

[Figure: an RNN encoder-decoder. The encoder maps each source word w_i (1-of-K coding) to a continuous-space word representation s_i and updates a recurrent state h_i; the decoder updates its own recurrent state z_i, computes a word probability p_i at each position, and samples the next target word u_i.]

(Forcada & Ñeco, 1997; Kalchbrenner & Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014)

p(f | e) = ∏_{i=1}^{|f|} p(f_i | f_{i-1}, ..., f_1, e)
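A bare-bones sketch of this kind of encoder-decoder (NumPy, vanilla tanh RNNs, random untrained weights, tiny made-up vocabularies), just to show the information flow; real systems use GRU/LSTM units and trained parameters:

```python
import numpy as np

def softmax(u):
    z = np.exp(u - u.max())
    return z / z.sum()

rng = np.random.default_rng(0)
src_vocab, tgt_vocab, d = 6, 7, 8          # toy sizes; target id 0 = STOP (also the start symbol here)
E_src = rng.normal(size=(src_vocab, d))    # source embeddings (continuous-space reps)
E_tgt = rng.normal(size=(tgt_vocab, d))
W_enc = rng.normal(size=(d, d)); U_enc = rng.normal(size=(d, d))
W_dec = rng.normal(size=(d, d)); U_dec = rng.normal(size=(d, d))
W_out = rng.normal(size=(tgt_vocab, d))

def encode(src_ids):
    h = np.zeros(d)
    for w in src_ids:                      # h_i = tanh(W_enc s_i + U_enc h_{i-1})
        h = np.tanh(W_enc @ E_src[w] + U_enc @ h)
    return h                               # fixed-size summary of the source sentence

def decode_greedy(h, max_len=10):
    z, prev, out = h, 0, []                # condition the decoder state on the source
    for _ in range(max_len):
        z = np.tanh(W_dec @ E_tgt[prev] + U_dec @ z)
        p = softmax(W_out @ z)             # p(f_i | f_{<i}, e)
        prev = int(p.argmax())
        if prev == 0:                      # STOP symbol
            break
        out.append(prev)
    return out

print(decode_greedy(encode([1, 2, 3, 4])))   # e.g. "Ich möchte ein Bier" as toy ids 1..4
```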

Page 23: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

Ich möchte ein Bier

[Figure, built up over pages 23–29: the source words (inputs x_1, ..., x_4) are read by a forward RNN, giving states →h_1, ..., →h_4, and by a backward RNN, giving states ←h_1, ..., ←h_4. The annotation for each source position concatenates the two states.]

f_i = [←h_i; →h_i]

Page 30: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

Ich möchte ein Bier

[Figure: the bidirectional encoder states over the source sentence, with the annotation vectors f_i stacked as the columns of a source matrix.]

F ∈ R^{2n×|f|},  f_i = [←h_i; →h_i]
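A short sketch (NumPy, random untrained weights; n is the per-direction state size) of building the source matrix F column by column from a forward and a backward RNN pass:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, src_len = 8, 8, 4                    # state size, embedding size, |f|
E = rng.normal(size=(10, d))               # toy source embeddings
W_f = rng.normal(size=(n, d)); U_f = rng.normal(size=(n, n))   # forward RNN
W_b = rng.normal(size=(n, d)); U_b = rng.normal(size=(n, n))   # backward RNN

src = [1, 2, 3, 4]                         # "Ich möchte ein Bier" as toy ids
x = [E[w] for w in src]

h_fwd, h = [], np.zeros(n)
for t in range(src_len):                   # left-to-right pass: →h_t
    h = np.tanh(W_f @ x[t] + U_f @ h)
    h_fwd.append(h)

h_bwd, h = [None] * src_len, np.zeros(n)
for t in reversed(range(src_len)):         # right-to-left pass: ←h_t
    h = np.tanh(W_b @ x[t] + U_b @ h)
    h_bwd[t] = h

# Column i is f_i = [←h_i; →h_i]; F has shape (2n, |f|)
F = np.stack([np.concatenate([h_bwd[i], h_fwd[i]]) for i in range(src_len)], axis=1)
print(F.shape)                             # (16, 4)
```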

Page 31: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

[Figure, built up step by step over pages 31–38: starting from the initial decoder state 0, the decoder attends over the source "Ich möchte ein Bier" and emits "I'd", "like", "a", "beer", then STOP. The attention history a_1^T, ..., a_5^T is accumulated row by row as decoding proceeds.]

Page 39: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

Computing Attention

• At each time step (one time step = one output word), we want to be able to "attend" to different words in the source sentence
• We need a weight for every word: this is an |f|-length vector a_t
• Here is a simplified version of Bahdanau et al.'s solution
• Use an RNN to predict the model output; call its hidden states s_t (s_t has a fixed dimensionality, call it m)
• At time t, compute the expected input embedding r_t = V s_{t−1} (V is a learned parameter)
• Take the dot product with every column in the source matrix to compute the attention energy u_t = F^T r_t (called e_t in the paper; since F has |f| columns, u_t has |f| rows)
• Exponentiate and normalize to 1: a_t = softmax(u_t) (called α_t in the paper)
• Finally, the input source vector for time t is c_t = F a_t (a sketch of these steps follows below)
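A direct transcription of those steps (NumPy; F stands in for the 2n×|f| source matrix from page 30, s_prev for a decoder state, and V for the learned matrix — all random placeholders here):

```python
import numpy as np

def softmax(u):
    z = np.exp(u - u.max())
    return z / z.sum()

rng = np.random.default_rng(0)
two_n, src_len, m = 16, 4, 12              # 2n, |f|, decoder state size
F = rng.normal(size=(two_n, src_len))      # source matrix, one column f_i per source word
s_prev = rng.normal(size=m)                # decoder hidden state s_{t-1}
V = rng.normal(size=(two_n, m))            # learned parameter

r_t = V @ s_prev                           # expected input embedding
u_t = F.T @ r_t                            # attention energies, one per source word
a_t = softmax(u_t)                         # attention weights, sum to 1
c_t = F @ a_t                              # input source vector: weighted sum of source columns
print(a_t, c_t.shape)
```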

Page 40: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

[Figures: English-French and English-German attention examples.]

I confess that I speak neither French nor German. Sorry!

Page 41: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

Summary

• Attention is closely related to "pooling" operations in convnets (and other architectures)
• Bahdanau's attention model seems to care only about "content"
  • No obvious bias in favor of diagonals, short jumps, fertility, etc.
  • Some work has begun to add other "structural" biases (Luong et al., 2015; Cohn et al., 2016), but there are lots more opportunities
• Attention is similar to alignment, but there are important differences
  • Alignment makes stochastic but hard decisions: even if the alignment probability distribution is "flat", the model picks one word or phrase at a time
  • Attention is "soft" (you add together all the words); there is a big difference between "flat" and "peaked" attention weights

Page 42: Edinburgh MT lecture 12: Neural encoder-decoder and attention models

Summary II

• Nonlinear classification, automatic feature induction, and automatic differentiation are good.
• But just how good are they?