
Recurrent Neural Networks (RNNs)

Dr. Benjamin Roth, Nina Poerner

CIS LMU München


Today

9:15 - 10:45: RNN Basics + CNN

11:00 - 11:45: Exercise: PyTorch

Instead of an exercise sheet, work through the following by next week:
  - http://www.deeplearningbook.org/contents/rnn.html (Sections 10.0 - 10.2.1, 10.7, 10.10)
  - http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Next week:
  - 9:15 - 10:00: "Journal Club" on LSTM
  - 10:00 - 10:45: Keras (Part 2)
  - 11:00 - 11:45: Exercise: Word2Vec


Recurrent Neural Networks (RNNs)

Family of neural networks for processing sequential data x(1) ... x(T).

Sequences of words, characters, ...

Simplest case: for each time step t, compute representation h(t) from current input x(t) and previous representation h(t−1).

h(t) = f (h(t−1), x(t); θ)

x(t) can be embeddings, one-hot vectors, the output of some previous layer, ...

Question: By recursion, what does h(t) depend on?
  - all previous inputs x(1) ... x(t)
  - the initial state h(0) (typically all-zero, but not necessarily; cf. encoder-decoder)
  - the parameters θ
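A minimal NumPy sketch of the recursion above, with made-up dimensions and using the tanh cell that appears later in these slides; all names here are illustrative:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 5                      # made-up input size, hidden size, sequence length
xs = rng.normal(size=(T, d_in))             # input sequence x(1) ... x(T)
U = rng.normal(size=(d_in, d_h))            # input-to-hidden weights (part of θ)
W = rng.normal(size=(d_h, d_h))             # hidden-to-hidden weights (part of θ)
b = np.zeros(d_h)                           # bias (part of θ)

h = np.zeros(d_h)                           # h(0): typically all-zero
for x in xs:                                # the same θ is applied at every time step
    h = np.tanh(x @ U + h @ W + b)          # h(t) = f(h(t-1), x(t); θ)
# h is now h(T): by recursion, a function of x(1) ... x(T), h(0) and θ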


Parameter Sharing

Going from time step t − 1 to t is parameterized by the same parameters θ for all t!

h(t) = f (h(t−1), x(t); θ)

Question: Why is parameter sharing a good idea?
  - Fewer parameters
  - Can learn to detect features regardless of their position
  - Can generalize to longer sequences than were seen in training


RNN: Output

The output at time t is computed from the hidden representation at time t:

o(t) = f(h(t); V_o)

Typically a linear transformation: o(t) = V_o^T h(t)

Some RNNs compute o(t) at every time step, others only at the last time step o(T).
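Continuing in the same NumPy style (made-up sizes), the linear output is a single matrix product:

import numpy as np

rng = np.random.default_rng(0)
d_h, d_out = 4, 3                    # made-up hidden and output sizes
h_t = rng.normal(size=d_h)           # stands in for h(t) from the recursion
V_o = rng.normal(size=(d_h, d_out))  # output weights
o_t = h_t @ V_o                      # o(t) = V_o^T h(t)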


RNN: Output

[Figure: three ways of producing outputs from an RNN]
  - n-to-1: only the last hidden state h(T) produces an output o(T). Examples: sentiment classification, topic classification, ...
  - tagging: an output o(t) is produced at every time step. Examples: POS tagging, NER tagging, Language Model, ...
  - tagging (selective) / Sequence2Sequence: an encoder reads the input, a decoder produces the outputs. Examples: Machine Translation, Summarization, Inflection, ...
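In Keras, the n-to-1 and tagging configurations differ only in the return_sequences flag of the recurrent layer; a small sketch with made-up shapes, assuming a TensorFlow-Keras install:

import numpy as np
from tensorflow.keras.layers import SimpleRNN

x = np.random.randn(2, 7, 30).astype("float32")        # (batch=2, time=7, features=30)
last_only = SimpleRNN(10)(x)                            # n-to-1: one h(T)-based output per sequence
per_step = SimpleRNN(10, return_sequences=True)(x)      # tagging: one output per time step
print(last_only.shape, per_step.shape)                  # (2, 10) and (2, 7, 10)

The Sequence2Sequence case additionally needs a decoder; see the machine translation example later in these slides.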


Any questions so far?


RNN: Loss Function

Loss function:
  - Several time steps: L(y(1), ..., y(T); o(1), ..., o(T))
  - Last time step: L(y; o(T))

Example: POS Tagging
  - Output o(t) is the predicted distribution over POS tags
    - o(t) = P(tag = ? | h(t))
    - Typically: o(t) = softmax(V_o^T h(t))
  - Loss at time t: negative log-likelihood (NLL) of the true label y(t)

    L(t) = − log P(tag = y(t) | h(t); V_o)

  - Overall loss for all time steps:

    L = Σ_{t=1}^{T} L(t)
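A minimal NumPy sketch of this overall loss for the POS-tagging case, with made-up hidden states and tags:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d_h, n_tags = 4, 5, 3
hs = rng.normal(size=(T, d_h))          # hidden states h(1) ... h(T) from the RNN
V_o = rng.normal(size=(d_h, n_tags))    # output weights
y = [0, 2, 1, 2]                        # true tags y(1) ... y(T)

L = 0.0
for t in range(T):
    o_t = softmax(hs[t] @ V_o)          # o(t) = softmax(V_o^T h(t))
    L += -np.log(o_t[y[t]])             # L(t) = -log P(tag = y(t) | h(t); V_o)
# L is the sum of the per-time-step NLL losses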


Graphical Notation

Nodes indicate input data (x) or function outputs (otherwise).

Arrows indicate function arguments.

Compact notation (left):
  - All time steps conflated.
  - A black square indicates a "delay" of 1 time unit.

Source: Goodfellow et al.: Deep Learning.


Graphical Notation: Including Output and Loss Function

Source: Goodfellow et al.: Deep Learning.


Any questions so far?


Backpropagation through time

We have calculated the loss L at time step T and want to update θ with gradient descent.

For now, imagine that we have time-step-specific "dummy" parameters θ(t), which are identical copies of θ

→ the unrolled RNN looks like a feed-forward neural network!

→ we can calculate ∂L/∂θ(t) using standard backpropagation

Question: How do we calculate ∂L/∂θ?

Add up the "dummy" gradients:

∂L/∂θ = Σ_{t=1}^{T} ∂L/∂θ(t)
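This can be checked directly with autograd; a small PyTorch sketch (all dimensions made up) that builds explicit per-time-step copies W(t) of the hidden-to-hidden matrix and compares the summed "dummy" gradients with the gradient of the shared parameter:

import torch

torch.manual_seed(0)
d_in, d_h, T = 3, 4, 5
xs = torch.randn(T, d_in)
U = torch.randn(d_in, d_h)
b = torch.zeros(d_h)
W = torch.randn(d_h, d_h, requires_grad=True)   # the shared parameter (here just W)

h = torch.zeros(d_h)
copies = []
for x in xs:
    W_t = W * 1.0            # "dummy" copy W(t), identical in value to W
    W_t.retain_grad()        # keep its gradient for inspection
    copies.append(W_t)
    h = torch.tanh(x @ U + h @ W_t + b)

loss = h.pow(2).sum()        # some loss on the last hidden state
loss.backward()

# ∂L/∂W equals the sum of the per-time-step "dummy" gradients ∂L/∂W(t)
print(torch.allclose(W.grad, sum(c.grad for c in copies)))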


Truncated backpropagation through time

Simple idea: Stop backpropagation through time after k time steps

∂L/∂θ = Σ_{t=T−k}^{T} ∂L/∂θ(t)

Question: What are the advantages and disadvantages?
  - Advantage: Faster and parallelizable
  - Disadvantage: If k is too small, long-range dependencies are hard to learn
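In practice (e.g. in PyTorch, which is also used in the exercise), truncation is often implemented by detaching the hidden state between chunks of k steps; a minimal sketch with made-up data:

import torch

torch.manual_seed(0)
T, k, d_in, d_h = 12, 4, 3, 5
xs = torch.randn(T, d_in)
U = torch.randn(d_in, d_h, requires_grad=True)
W = torch.randn(d_h, d_h, requires_grad=True)
b = torch.zeros(d_h, requires_grad=True)

h = torch.zeros(d_h)
for start in range(0, T, k):
    h = h.detach()                       # cut the graph: no gradient flows into earlier chunks
    for x in xs[start:start + k]:
        h = torch.tanh(x @ U + h @ W + b)
    loss = h.pow(2).sum()                # dummy loss on the chunk's last state
    loss.backward()                      # backpropagates through at most k time steps
    # (an optimizer step and zero_grad would go here)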


Any questions so far?


Vanilla RNN

h(t) = f(h(t−1), x(t); θ) = tanh(U x(t) + W h(t−1) + b)

θ = {W,U,b}

W: Hidden-to-hidden

U: Input-to-hidden

b: Bias term

Vanilla RNN in keras:

from tensorflow.keras.layers import SimpleRNN   # assuming a TensorFlow-Keras install

vanilla = SimpleRNN(units=10, use_bias=True)
vanilla.build(input_shape=(None, None, 30))     # (batch, time, features=30)
print([weight.shape for weight in vanilla.get_weights()])

[(30, 10), (10, 10), (10,)]

Question: Which shape belongs to which weight?


Bidirectional RNNs

Conceptually: Two RNNs that run in opposite directions over the same input

Typically, each RNN has its own set of parameters

Two sequences of hidden vectors: →h(1) ... →h(T) and ←h(1) ... ←h(T)

Typically, →h and ←h are concatenated along their hidden dimension

Question: Which hidden vectors should we concatenate if our output layer needs a single hidden vector h?
  - h = →h(T) || ←h(1)
  - Because these are the vectors that have "read" the entire sequence

Question: Which hidden vectors should we concatenate if we need one hidden vector per time step t?
  - h(t) = →h(t) || ←h(t)
  - Full left context, full right context
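In Keras this is what the Bidirectional wrapper does (its default merge mode is concatenation); a small sketch with made-up shapes, assuming a TensorFlow-Keras install:

import numpy as np
from tensorflow.keras.layers import SimpleRNN, Bidirectional

x = np.random.randn(2, 7, 30).astype("float32")                     # (batch, time, features)
per_step = Bidirectional(SimpleRNN(10, return_sequences=True))(x)   # →h(t) || ←h(t), shape (2, 7, 20)
single = Bidirectional(SimpleRNN(10))(x)                            # →h(T) || ←h(1), shape (2, 20)
print(per_step.shape, single.shape)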


Multi-Layer RNNs

Conceptually: A stack of L RNNs, such that x_l(t) = h_{l−1}(t).

[Figure: a two-layer RNN unrolled over five time steps; each input x(t) feeds the first layer's hidden state h_1(t), which in turn feeds the second layer's hidden state h_2(t)]
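A small Keras sketch of a two-layer RNN (made-up sizes, assuming a TensorFlow-Keras install); the lower layer must return its full sequence of hidden states so that the upper layer sees one input per time step:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 30)),              # (time, features), variable sequence length
    layers.SimpleRNN(16, return_sequences=True),   # layer 1: emits h_1(t) for every t
    layers.SimpleRNN(16),                          # layer 2: consumes h_1(t) as its x(t); here n-to-1
])
model.summary()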


Feeding outputs back

What do we do if the input sequence x(1) ... x(T) is only given at training time, but not at test time?

Examples: Machine Translation decoder, (generative) language model


Example: Machine Translation

[Figure: an unrolled decoder for machine translation. The initial state h(0) is the last encoder state (it has "read" the source). At each step t, the hidden state h(t) produces an output o(t), from which a word w(t) is predicted; the loss L(t) compares o(t) with the true word y(t). The embedding of either the predicted word w(t) or the oracle (true) word y(t) becomes the next input x(t+1).]
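A minimal NumPy sketch of this decoder loop (all weights, dimensions, and the start symbol are made up for illustration); the teacher_forcing flag switches between feeding the oracle word (often called "teacher forcing") and feeding back the model's own prediction:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab, d_e, d_h = 8, 4, 5
E = rng.normal(size=(vocab, d_e))     # target-side word embeddings
U = rng.normal(size=(d_e, d_h))
W = rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)
V_o = rng.normal(size=(d_h, vocab))

h = rng.normal(size=d_h)              # stands in for the last encoder state (has "read" the source)
y = [3, 1, 5]                         # true target words y(1) ... y(3)
teacher_forcing = True                # True: feed oracle words; False: feed back predictions

x = E[0]                              # embedding of a (hypothetical) start symbol
total_loss = 0.0
for t in range(len(y)):
    h = np.tanh(x @ U + h @ W + b)    # decoder step h(t)
    o = softmax(h @ V_o)              # o(t): distribution over the target vocabulary
    w = int(o.argmax())               # predicted word w(t)
    total_loss += -np.log(o[y[t]])    # L(t): NLL of the true word y(t)
    x = E[y[t]] if teacher_forcing else E[w]   # next input x(t+1)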


Oracle signal

Give the neural network a signal that it will not have at test time

Can be useful during training (e.g., mix oracle and predicted signal)

Can establish upper bounds on the performance of individual modules


Gated RNNs: Teaser

Vanilla RNNs are not frequently used, as they tend to forget past information quickly

Instead: LSTM, GRU, ... (next week!)
