
Recurrent Neural Networks 2

CS 287

(Based on Yoav Goldberg’s notes)


Review: Representation of Sequence

I Many tasks in NLP involve sequences

w1, . . . ,wn

I Representation as a matrix of dense vectors X

(Following YG, slight abuse of notation)

x_1 = x^0_1 W^0, . . . , x_n = x^0_n W^0

I Would like fixed-dimensional representation.


Review: Sequence Recurrence

I Can map from dense sequence to dense representation.

I x1, . . . , xn ↦ s1, . . . , sn

I For all i ∈ {1, . . . , n}

si = R(si−1, xi ; θ)

I θ is shared by all R

Example:

s4 = R(s3, x4)

= R(R(s2, x3), x4)

= R(R(R(R(s0, x1), x2), x3), x4)
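
Below is a minimal numpy sketch of this recurrence with a tanh Elman cell; the sizes, the initialization, and the name elman_step are illustrative, not taken from the slides.

import numpy as np

d_in, d_hid = 4, 5                 # illustrative sizes
rng = np.random.default_rng(0)

# Shared parameters theta = (W_x, W_s, b), reused at every step.
W_x = rng.normal(scale=0.1, size=(d_in, d_hid))
W_s = rng.normal(scale=0.1, size=(d_hid, d_hid))
b = np.zeros(d_hid)

def elman_step(s_prev, x):
    """One application of R(s_{i-1}, x_i; theta) with a tanh nonlinearity."""
    return np.tanh(s_prev @ W_s + x @ W_x + b)

xs = rng.normal(size=(4, d_in))    # x_1, ..., x_4
s = np.zeros(d_hid)                # s_0
states = []
for x in xs:                       # s_i = R(s_{i-1}, x_i)
    s = elman_step(s, x)
    states.append(s)
# states[-1] is s_4 = R(R(R(R(s_0, x_1), x_2), x_3), x_4)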


Review: BPTT (Acceptor)

I Run forward propagation.

I Run backward propagation.

I Update all weights (shared)

[Diagram: unrolled acceptor RNN — s0 → s1 → s2 → s3 → · · · → sn, each state si fed by its input xi, with the output y computed from sn]
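
Below, a minimal sketch of BPTT for an acceptor, assuming a tanh Elman cell, a linear output ŷ = sn · w_out, and a squared loss; the names, sizes, toy target, and plain-SGD update are illustrative, not the slides' setup.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, n = 3, 4, 5
W_x = rng.normal(scale=0.3, size=(d_in, d_hid))
W_s = rng.normal(scale=0.3, size=(d_hid, d_hid))
b = np.zeros(d_hid)
w_out = rng.normal(size=d_hid)

xs = rng.normal(size=(n, d_in))
y = 1.0                                   # toy target for the whole sequence

# Forward: unroll the shared cell and keep every state for the backward pass.
states = [np.zeros(d_hid)]                # s_0
for x in xs:
    states.append(np.tanh(states[-1] @ W_s + x @ W_x + b))
y_hat = states[-1] @ w_out
loss = 0.5 * (y_hat - y) ** 2

# Backward (BPTT): the gradient flows from y back through every unrolled step;
# the shared parameters accumulate a contribution from each step.
dW_x, dW_s, db = np.zeros_like(W_x), np.zeros_like(W_s), np.zeros_like(b)
dw_out = (y_hat - y) * states[-1]
d_s = (y_hat - y) * w_out                 # dL/ds_n
for i in range(n, 0, -1):
    dz = d_s * (1.0 - states[i] ** 2)     # back through tanh
    dW_s += np.outer(states[i - 1], dz)
    dW_x += np.outer(xs[i - 1], dz)
    db += dz
    d_s = dz @ W_s.T                      # dL/ds_{i-1}

# One plain SGD update of the shared weights would then be, e.g.:
lr = 0.1
W_x -= lr * dW_x; W_s -= lr * dW_s; b -= lr * db; w_out -= lr * dw_out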


Review: Issues

I Can be inefficient, but batch/GPUs help.

I Model is much deeper than previous approaches.

I This matters a lot, focus of next class.

I Variable-size model for each sentence.

I Have to be a bit more clever in Torch.


Quiz

Consider a ReLU version of the Elman RNN with function R defined as

NN(x, s) = ReLU(sWs + xWx + b).

We use this RNN with an acceptor architecture over the sequence x1, . . . , x5. Assume we have computed the gradient for the final layer, ∂L/∂s5.

What is the symbolic gradient of the previous state, ∂L/∂s4?

What is the symbolic gradient of the first state, ∂L/∂s1?


Answer

∂L/∂s_{4,i} = ∑_j (∂s_{5,j}/∂s_{4,i}) (∂L/∂s_{5,j})

            = ∑_j { W^s_{i,j} ∂L/∂s_{5,j}   if s_{5,j} > 0
                  { 0                        otherwise

            = ∑_j 1(s_{5,j} > 0) W^s_{i,j} ∂L/∂s_{5,j}

I Chain-Rule

I ReLU Gradient rule

I Indicator notation


Answer

∂L/∂s_{1,j_1} = ∑_{j_2} (∂s_{2,j_2}/∂s_{1,j_1}) · · · ∑_{j_5} (∂s_{5,j_5}/∂s_{4,j_4}) (∂L/∂s_{5,j_5})

              = ∑_{j_2} · · · ∑_{j_5} 1(s_{2,j_2} > 0) · · · 1(s_{5,j_5} > 0) W^s_{j_1,j_2} · · · W^s_{j_4,j_5} (∂L/∂s_{5,j_5})

              = ∑_{j_2,...,j_5} ∏_{k=2}^{5} 1(s_{k,j_k} > 0) W^s_{j_{k−1},j_k} (∂L/∂s_{5,j_5})

I Multiple applications of Chain rule

I Combine and multiply.

I Product of weights.
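
A small numeric sanity check of these rules, assuming the same ReLU Elman cell as in the quiz and the simple loss L = v · s5 with a fixed v; the helper names and the finite-difference comparison are illustrative. The backward loop is exactly the repeated application of the single-step rule from the previous slide.

import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid = 3, 4
W_x = rng.normal(scale=0.5, size=(d_in, d_hid))
W_s = rng.normal(scale=0.5, size=(d_hid, d_hid))
b = rng.normal(scale=0.1, size=d_hid)
xs = rng.normal(size=(5, d_in))
v = rng.normal(size=d_hid)                 # so that L = v · s_5

def relu(z):
    return np.maximum(z, 0.0)

def roll(s1):
    """Run the ReLU Elman cell from a given s_1 through x_2, ..., x_5."""
    states = [s1]
    for x in xs[1:]:
        states.append(relu(states[-1] @ W_s + x @ W_x + b))
    return states                          # [s_1, ..., s_5]

s1 = relu(np.zeros(d_hid) @ W_s + xs[0] @ W_x + b)
states = roll(s1)

# Backprop: repeatedly apply dL/ds_{i-1} = (1(s_i > 0) * dL/ds_i) W_s^T.
d = v.copy()                               # dL/ds_5
for s_next in reversed(states[1:]):        # s_5, s_4, s_3, s_2
    d = ((s_next > 0) * d) @ W_s.T
grad_s1 = d                                # dL/ds_1

# Finite-difference check of dL/ds_1.
eps = 1e-5
num = np.zeros(d_hid)
for k in range(d_hid):
    e = np.zeros(d_hid); e[k] = eps
    num[k] = (v @ roll(s1 + e)[-1] - v @ roll(s1 - e)[-1]) / (2 * eps)
print(np.allclose(grad_s1, num))           # expect True unless a pre-activation sits at a ReLU kink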


The Promise of RNNs

I Hope: Learn the long-range interactions of language from data.

I For acceptors this means maintaining early state:

I Example: How can you not see this movie?

You should not see this movie.

I Memory interaction here is at s1, but gradient signal is at sn


Long-Term Gradients

I Gradients go through multiplicative layers.

I Fine at end layers, but issues with early layers.

I For instance, consider the quiz with hardtanh:

∑_{j_2,...,j_5} ∏_{k=2}^{5} 1(1 > s_{k,j_k} > 0) W^s_{j_{k−1},j_k} (∂L/∂s_{5,j_5})

I The multiplicative effect tends to produce very large (exploding) or very small (vanishing) gradients.
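
A small numeric sketch of this multiplicative effect, looking only at the product of weight matrices (the indicator/derivative terms can only shrink it further); the scales and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(2)
d = 20
for scale in (0.5, 1.0, 2.0):
    # Random recurrence matrix whose spectral radius is roughly `scale`.
    W_s = rng.normal(scale=scale / np.sqrt(d), size=(d, d))
    norms = [np.linalg.norm(np.linalg.matrix_power(W_s.T, n), 2)
             for n in (5, 20, 50)]
    print(scale, [f"{v:.2e}" for v in norms])
# scale < 1: the long-range factor decays exponentially (vanishing);
# scale > 1: it blows up (exploding). Saturation terms make vanishing worse.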


Exploding Gradients

The easier case; it can be handled heuristically.

I Occurs if there is no saturation but exponential blowup.

∑_{j_2,...,j_n} ∏_{k=2}^{n} W^s_{j_{k−1},j_k}

I In these cases, there are reasonable short-term gradients, but bad

long-term gradients.

I Two practical heuristics (sketched in code below):

I gradient clipping, i.e. bounding any gradient coordinate by a maximum value

I gradient normalization, i.e. renormalizing the RNN gradients if they are above a fixed norm value.
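
Minimal sketches of the two heuristics on a plain numpy gradient vector; the threshold values are illustrative.

import numpy as np

def clip_elementwise(grad, max_val=5.0):
    """Gradient clipping: bound every coordinate to [-max_val, max_val]."""
    return np.clip(grad, -max_val, max_val)

def renormalize(grad, max_norm=5.0):
    """Gradient normalization: rescale the whole gradient if its norm
    exceeds max_norm, keeping its direction."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.array([0.1, -12.0, 300.0])
print(clip_elementwise(g))   # [ 0.1 -5.   5. ]
print(renormalize(g))        # same direction as g, norm 5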


Vanishing Gradients

I Occurs when combining small weights (or from saturation)

∑_{j_2,...,j_n} ∏_{k=2}^{n} W^s_{j_{k−1},j_k}

I Again, affects mainly long-term gradients.

I However, not as simple as boosting gradients.


LSTM (Hochreiter and Schmidhuber, 1997)


LSTM Formally

R(si−1, xi ) = [ci ,hi ]

ci = j ⊙ i + f ⊙ ci−1

hi = tanh(ci) ⊙ o

i = tanh(xWxi + hi−1Whi + bi )

j = σ(xWxj + hi−1Whj + bj )

f = σ(xWxf + hi−1Whf + bf )

o = tanh(xWxo + hi−1Who + bo)

I (eeks.)


Unreasonable Effectiveness of RNNs

http://karpathy.github.io/2015/05/21/rnn-effectiveness/


PANDARUS:

Alas, I think he shall be come approached and the day

When little srain would be attain’d into being never fed,

And who is but a chain and subjects of his death,

I should not sleep.

Second Senator:

They are away this miseries, produced upon my soul,

Breaking and strongly should be buried, when I perish

The earth and thoughts of many states.

DUKE VINCENTIO:

Well, your wit is in the care of side and that.


Music Composition

http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/


Today’s Lecture

I Build up to LSTMs

I Note: The development is ahistorical; we cover later papers first.

I Main idea: Getting around the vanishing gradient issue.


Contents

Highway Networks

Gated Recurrent Unit (GRU)

LSTM


Review: Deep ReLU Networks

NNlayer (x) = ReLU(xW1 + b1)

I Given an input x, we can build arbitrarily deep fully-connected networks.

I Deep Neural Network (DNN):

NNlayer(NNlayer(NNlayer(. . . NNlayer(x))))
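
A minimal sketch of this composition, with a separate (W, b) per layer; the name nn_layer and the sizes are illustrative.

import numpy as np

rng = np.random.default_rng(3)
d = 8

def nn_layer(x, W, b):
    """NN_layer(x) = ReLU(x W + b)."""
    return np.maximum(x @ W + b, 0.0)

# A DNN is just the composition NN_layer(NN_layer(... NN_layer(x)))
# with its own parameters at each layer.
params = [(rng.normal(scale=0.3, size=(d, d)), np.zeros(d)) for _ in range(6)]
h = rng.normal(size=d)      # input x
for W, b in params:
    h = nn_layer(h, W, b)   # h_1, h_2, ..., h_6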


Deep Networks

NNlayer (x) = ReLU(xW1 + b1)

[Diagram: a stack of fully-connected layers x → h1 → h2 → · · · → hn−1 → hn]

I Can have similar issues with vanishing gradients.

∂L/∂h_{n−1,j_{n−1}} = ∑_{j_n} 1(h_{n,j_n} > 0) W_{j_{n−1},j_n} (∂L/∂h_{n,j_n})


Review: NLM Skip Connections


Thought Experiment: Additive Skip-Connections

NNsl1(x) = (1/2) ReLU(xW1 + b1) + (1/2) x

[Diagram: a stack of layers x → h1 → h2 → h3 → · · · → hn−1 → hn with additive skip-connections]


Exercise: Gradients

Exercise: What is the gradient ∂L/∂h_{n−1} with skip-connections?

Inductively, what happens?

[Diagram: a stack of layers x → h1 → h2 → h3 → · · · → hn−1 → hn with skip-connections]


Exercise

We now have the average of two terms, one with no saturation condition or multiplicative weight term.

∂L/∂h_{n−1,j_{n−1}} = (1/2) (∑_{j_n} 1(h_{n,j_n} > 0) W_{j_{n−1},j_n} (∂L/∂h_{n,j_n})) + (1/2) (∂L/∂h_{n,j_{n−1}})


Thought Experiment 2: Dynamic Skip-Connections

NNsl2(x) = (1− t)ReLU(xW1 + b1) + tx

t = σ(xWt + bt)

W1 ∈ Rdhid×dhid

Wt ∈ Rdhid×1

[Diagram: a stack of layers x → h1 → h2 → h3 → · · · → hn−1 → hn with dynamic skip-connections]


Dynamic Skip-Connections: intuition

NNsl2(x) = (1− t)ReLU(xW1 + b1) + tx

t = σ(xWt + bt)

I No longer directly computing the next layer hi .

I Instead computing ∆h and dynamic update proportion t.

I Seems like a small change from DNN. Why does this matter?
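
A minimal sketch of NNsl2 with its scalar gate t; the sizes, initialization, and names are illustrative.

import numpy as np

rng = np.random.default_rng(4)
d_hid = 8
W1 = rng.normal(scale=0.3, size=(d_hid, d_hid)); b1 = np.zeros(d_hid)
Wt = rng.normal(scale=0.3, size=(d_hid, 1));     bt = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_sl2(x):
    """(1 - t) * ReLU(x W1 + b1) + t * x, with a scalar gate t = sigma(x Wt + bt)."""
    t = sigmoid(x @ Wt + bt)            # shape (1,): one update proportion per layer
    return (1.0 - t) * np.maximum(x @ W1 + b1, 0.0) + t * x

x = rng.normal(size=d_hid)
h = nn_sl2(x)   # the t computed here is what later routes gradients around the ReLU path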


Dynamic Skip-Connections Gradient

NNsl2(x) = (1− t)ReLU(xW1 + b1) + tx

t = σ(xWt + bt)

I The t values are saved on the forward pass.

I t allows for a direct flow of gradients to earlier layers.

∂L/∂h_{n−1,j_{n−1}} = (1 − t) (∑_{j_n} 1(h_{n,j_n} > 0) W_{j_{n−1},j_n} (∂L/∂h_{n,j_n})) + t (∂L/∂h_{n,j_{n−1}}) + . . .

I (What is the . . .?)


Dynamic Skip-Connections

NNsl2(x) = (1− t)ReLU(xW1 + b1) + tx

t = σ(xWt + bt)

∂L/∂h_{n−1,j_{n−1}} = (1 − t) (∑_{j_n} 1(h_{n,j_n} > 0) W_{j_{n−1},j_n} (∂L/∂h_{n,j_n})) + t (∂L/∂h_{n,j_{n−1}}) + . . .

I Note: h and Wt are also receiving gradients through the sigmoid!

I (Backprop is fun.)


Highway Network (Srivastava et al., 2015)

I Now add a combination at each dimension.

I ⊙ is point-wise multiplication.

NNhighway(x) = (1 − t) ⊙ h + t ⊙ x

h = ReLU(xW1 + b1)

t = σ(xWt + bt)

Wt ,W1 ∈ Rdhid×dhid

bt ,b1 ∈ R1×dhid

I h; transform (e.g. standard MLP layer)

I t; carry (dimension-specific dynamic skipping)
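
A minimal numpy sketch of the highway layer above, where the carry gate t is now a vector with one proportion per dimension; sizes and names are illustrative.

import numpy as np

rng = np.random.default_rng(5)
d_hid = 8
W1 = rng.normal(scale=0.3, size=(d_hid, d_hid)); b1 = np.zeros(d_hid)
Wt = rng.normal(scale=0.3, size=(d_hid, d_hid)); bt = np.zeros(d_hid)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_highway(x):
    """(1 - t) ⊙ h + t ⊙ x: per-dimension mix of transform h and carry x."""
    h = np.maximum(x @ W1 + b1, 0.0)     # transform (standard ReLU MLP layer)
    t = sigmoid(x @ Wt + bt)             # carry gate, one value per dimension
    return (1.0 - t) * h + t * x

x = rng.normal(size=d_hid)
print(nn_highway(x).shape)               # (8,)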


Highway Gradients

I The t values are saved on the forward pass.

I tj determines the update of dimension j .

∂L/∂h_{n−1,j_{n−1}} = (∑_{j_n} (1 − t_{j_n}) 1(h_{n,j_n} > 0) W_{j_{n−1},j_n} (∂L/∂h_{n,j_n})) + t_{j_{n−1}} (∂L/∂h_{n,j_{n−1}}) + . . .


Intuition: Gating

I This is known as the gating operation

t ⊙ x

I Allows vector t to mask or gate x.

I True gating would have t ∈ {0, 1}dhid

I Approximate with the sigmoid,

t = σ(Wtx+ b)


Contents

Highway Networks

Gated Recurrent Unit (GRU)

LSTM


Back To Acceptor RNNs

I Acceptor RNNs are deep networks with shared weights.

I Elman tanh layers

NN(x, s) = tanh(sWs + xWx + b).

[Diagram: unrolled acceptor RNN — s0 → s1 → s2 → s3 → s4 → · · · → sn, each state si fed by its input xi]


RNN with Skip-Connections

I Can replace layer with modified highway layer.

R(si−1, xi) = (1 − t) ⊙ h + t ⊙ si−1

h = tanh(xWx + si−1Ws + b)

t = σ(xWxt + si−1Wst + bt)

Wxt ,Wx ∈ Rdin×dhid

Wst ,Ws ∈ Rdhid×dhid

bt ,b ∈ R1×dhid


RNN with Skip-Connections

[Diagram: unrolled RNN with skip-connections — s0 → s1 → s2 → s3 → s4 → · · · → sn, each state si fed by its input xi]


Final Idea: Stopping flow

I For many tasks, it is useful to halt propagation.

I Can do this by applying a reset gate r.

h = tanh(xWx + (r ⊙ si−1)Ws + b)

r = σ(xWxr + si−1Wsr + br )


Gated Recurrent Unit (GRU) (Cho et al., 2014)

R(si−1, xi) = (1 − t) ⊙ h + t ⊙ si−1

h = tanh(xWx + (r ⊙ si−1)Ws + b)

r = σ(xWxr + si−1Wsr + br )

t = σ(xWxt + si−1Wst + bt)

Wxt ,Wxr ,Wx ∈ Rdin×dhid

Wst ,Wsr ,Ws ∈ Rdhid×dhid

bt ,br ,b ∈ R1×dhid

I t; dynamic skip-connections

I r; reset gating

I s; hidden state

I h; gated MLP
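
A minimal numpy sketch of one GRU step R(si−1, xi) as defined above; the sizes are illustrative and the biases start at zero.

import numpy as np

rng = np.random.default_rng(6)
d_in, d_hid = 5, 8

def init(shape):
    return rng.normal(scale=0.3, size=shape)

Wx, Wxr, Wxt = init((d_in, d_hid)), init((d_in, d_hid)), init((d_in, d_hid))
Ws, Wsr, Wst = init((d_hid, d_hid)), init((d_hid, d_hid)), init((d_hid, d_hid))
b, br, bt = np.zeros(d_hid), np.zeros(d_hid), np.zeros(d_hid)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(s_prev, x):
    r = sigmoid(x @ Wxr + s_prev @ Wsr + br)        # reset gate
    t = sigmoid(x @ Wxt + s_prev @ Wst + bt)        # carry / skip gate
    h = np.tanh(x @ Wx + (r * s_prev) @ Ws + b)     # candidate (gated MLP)
    return (1.0 - t) * h + t * s_prev               # R(s_{i-1}, x_i)

s = np.zeros(d_hid)
for x in rng.normal(size=(4, d_in)):                # x_1, ..., x_4
    s = gru_step(s, x)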


Example: Language Modeling

I In non-Markovian language modeling, treat corpus as a sequence

I r allows resetting the state after a sequence.

consumers may want to move their telephones a little closer to

the tv set </s> <unk> <unk> watching abc ’s monday

night football can now vote during <unk> for the greatest

play in N years from among four or five <unk> <unk>

</s> two weeks ago viewers of several nbc <unk> consumer

segments started calling a N number for advice on various

<unk> issues </s> and the new syndicated reality show

hard copy records viewers ’ opinions for possible airing on the

next day ’s show </s>


Contents

Highway Networks

Gated Recurrent Unit (GRU)

LSTM


LSTM

I Long Short-Term Memory network uses the same idea.

I The model has three gates: input, output, and forget.

I Developed first, but a bit harder to follow.

I Seems to be better at LM, comparable to GRU on other tasks.


LSTM State

The state si is made of two components:

I ci ; cell

I hi ; hidden

R(si−1, xi ) = [ci ,hi ]


LSTM Development: Input and Forget Gates

The cell is updated with ∆c, not a convex combination (two gates).

R(si−1, xi ) = [ci ,hi ]

ci = j ⊙ h + f ⊙ ci−1

h = tanh(xWxi + hi−1Whi + bi )

j = σ(xWxj + hi−1Whj + bj )

f = σ(xWxf + hi−1Whf + bf )

I ci ; cell

I j; input gate

I f; forget gate


LSTM Development: Hidden

Hidden hi squashes ci

R(si−1, xi) = [ci ,hi ]

hi = tanh(ci )

ci = j ⊙ h + f ⊙ ci−1

h = tanh(xWxi + hi−1Whi + bi )

j = σ(xWxj + hi−1Whj + bj )

f = σ(xWxf + hi−1Whf + bf )

I ci ; cell

I hi ; hidden

I j; input gate

I f; forget gate


Long Short-Term Memory

Finally, another output gate is applied to h

R(si−1, xi ) = [ci ,hi ]

ci = j ⊙ i + f ⊙ ci−1

hi = tanh(ci) ⊙ o

i = tanh(xWxi + hi−1Whi + bi )

j = σ(xWxj + hi−1Whj + bj )

f = σ(xWxf + hi−1Whf + bf )

o = tanh(xWxo + hi−1Who + bo)
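
A minimal numpy sketch of one LSTM step following these equations (including the slide's tanh output gate); the sizes and names are illustrative.

import numpy as np

rng = np.random.default_rng(7)
d_in, d_hid = 5, 8

def init(shape):
    return rng.normal(scale=0.3, size=shape)

Wxi, Wxj, Wxf, Wxo = (init((d_in, d_hid)) for _ in range(4))
Whi, Whj, Whf, Who = (init((d_hid, d_hid)) for _ in range(4))
bi, bj, bf, bo = (np.zeros(d_hid) for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, h_prev, x):
    i = np.tanh(x @ Wxi + h_prev @ Whi + bi)      # candidate ("input") values
    j = sigmoid(x @ Wxj + h_prev @ Whj + bj)      # input gate
    f = sigmoid(x @ Wxf + h_prev @ Whf + bf)      # forget gate
    o = np.tanh(x @ Wxo + h_prev @ Who + bo)      # output gate (tanh, as on the slide)
    c = j * i + f * c_prev                        # cell
    h = np.tanh(c) * o                            # hidden
    return c, h                                   # state s_i = [c_i, h_i]

c, h = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(4, d_in)):              # x_1, ..., x_4
    c, h = lstm_step(c, h, x)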


Output Gate?

I The output gate feels the most ad-hoc (why tanh?)

I Luckily, Jozefowicz et al. (2015) find the output gate is not too important.


Accuracy Results (Jozefowicz et al., 2015)

Accuracy of RNN models on three different tasks. LSTM models

include versions without the forget, input, and output gates.


English LM Results (Jozefowicz et al., 2015)

[Table: NLL (perplexity) results for the RNN variants]