IN5400 Machine learning for image classification
Lecture 11: Recurrent neural networks
March 18, 2020
Tollef Jahren
Outline
• Overview (Feed-forward and convolutional neural networks)
• Vanilla Recurrent Neural Network (RNN)
• Input-output structures of RNNs
• Training Recurrent Neural Networks
• Simple examples
• Advanced models
• Advanced examples
About today
Goals:
• Understand how to build a recurrent neural network
• Understand why long-range dependencies are a challenge for recurrent neural networks
Readings
• Video:
– cs231n: Lecture 10 | Recurrent Neural Networks
• Read:
– The Unreasonable Effectiveness of Recurrent Neural Networks
• Optional:
– Video: cs224d L8: Recurrent Neural Networks and Language Models
– Video: cs224d L9: Machine Translation and Advanced Recurrent LSTMs
and GRUs
Progress
• Overview (Feed-forward and convolutional neural networks)
• Vanilla Recurrent Neural Network (RNN)
• Input-output structures of RNNs
• Training Recurrent Neural Networks
• Simple examples
• Advanced models
• Advanced examples
Overview
• What have we learnt so far?
– Fully connected neural network
– Convolutional neural network
• What worked well with these?
– Fully connected neural networks work well when the input data is structured
– Convolutional neural networks work well on images
• What does not work well?
– Processing data of unknown length
Progress
• Overview (Feed-forward and convolutional neural networks)
• Vanilla Recurrent Neural Network (RNN)
• Input-output structures of RNNs
• Training Recurrent Neural Networks
• Simple examples
• Advanced models
• Advanced examples
Recurrent Neural Network (RNN)
• Takes a new input
• Updates the hidden state
• Reuses the same weights
• Produces a new output
(Figure: an RNN cell with input x_t, a recurrent hidden state, and output y_t.)
Recurrent Neural Network (RNN)
• Unrolled view
• x_t: the input vector
• h_t: the hidden state of the RNN
• y_t: the output vector
Recurrent Neural Network (RNN)
• Recurrence formula: h_t = f_W(h_{t-1}, x_t)
– Input vector: x_t
– Hidden state vector: h_t
Note: We use the same function and parameters at every “time” step.
(Vanilla) Recurrent Neural Network
• Input vector: x_t
• Hidden state vector: h_t
• Output vector: y_t
• Weight matrices: W_hh, W_xh, W_hy
• Bias vectors: b_h, b_y

General form:
h_t = f_W(h_{t-1}, x_t)

Vanilla RNN (simplest form):
h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
y_t = W_hy h_t + b_y
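As a concrete illustration, here is a minimal NumPy sketch of the equations above. The class name, initialization scale, and dimensions are illustrative assumptions, not code from the course.

import numpy as np

class VanillaRNN:
    """Minimal vanilla RNN cell: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""

    def __init__(self, input_dim, hidden_dim, output_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_hh = rng.normal(0.0, 0.01, (hidden_dim, hidden_dim))
        self.W_xh = rng.normal(0.0, 0.01, (hidden_dim, input_dim))
        self.W_hy = rng.normal(0.0, 0.01, (output_dim, hidden_dim))
        self.b_h = np.zeros(hidden_dim)
        self.b_y = np.zeros(output_dim)

    def step(self, x_t, h_prev):
        # The same weights are reused at every time step.
        h_t = np.tanh(self.W_hh @ h_prev + self.W_xh @ x_t + self.b_h)
        y_t = self.W_hy @ h_t + self.b_y
        return h_t, y_t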
RNN: Computational Graph
(Figure: the computational graph unrolled over time, built up one step at a time; the same weights W are reused at every time step.)
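The unrolling in the graph corresponds to a simple loop that applies the same cell at every step. A sketch, reusing the (illustrative) VanillaRNN cell above:

def forward_sequence(cell, xs, h0):
    """Unrolled forward pass: the same cell (same weights) at every step."""
    h, outputs = h0, []
    for x_t in xs:           # xs: sequence of input vectors
        h, y_t = cell.step(x_t, h)
        outputs.append(y_t)
    return outputs, h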
Example: Language model
• Task: Predicting the next character
• Training sequence: “hello”
• Vocabulary:
– [h, e, l, o]
• Encoding: One-hot
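A small sketch of the one-hot encoding for this vocabulary (variable names are illustrative):

import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(char):
    v = np.zeros(len(vocab))
    v[char_to_idx[char]] = 1.0
    return v

# Inputs are "hell"; the targets are the next characters, "ello".
xs = [one_hot(c) for c in 'hell']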
RNN: Predicting the next character
• Task: Predicting the next character
• Training sequence: “hello”
• Vocabulary:
– [h, e, l, o]
• Encoding: One-hot
• Model:
h_t = tanh(W_hh h_{t-1} + W_xh x_t)
y_t = W_hy h_t
RNN: Predicting the next character
• Vocabulary:
– [h, e, l, o]
• At test time, we can sample from the model to synthesize new text.
(Figure: at each step, a character is sampled from the output distribution and fed back in as the next input.)
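A hedged sketch of the sampling loop, assuming the VanillaRNN cell and one_hot helper defined earlier; softmax turns the output scores into a distribution to sample from.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_text(cell, seed_char, length):
    """Sample one character at a time, feeding each back as the next input."""
    h = np.zeros_like(cell.b_h)
    x = one_hot(seed_char)
    chars = [seed_char]
    for _ in range(length):
        h, y = cell.step(x, h)
        p = softmax(y)                            # scores -> probabilities
        idx = np.random.choice(len(vocab), p=p)   # sample the next character
        chars.append(vocab[idx])
        x = one_hot(vocab[idx])
    return ''.join(chars)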
Progress
• Overview (Feed-forward and convolutional neural networks)
• Vanilla Recurrent Neural Network (RNN)
• Input-output structures of RNNs
• Training Recurrent Neural Networks
• Simple examples
• Advanced models
• Advanced examples
Input-output structures of RNNs
• One-to-one
• One-to-many
• Many-to-one
• Many-to-many
• Many-to-many (encoder-decoder)
RNN: One-to-one
A normal feed-forward neural network.
• Task:
– Predict housing prices
(Figure: input layer → hidden layer → output layer.)
RNN: One-to-many
Task:
• Image captioning
– (Image → sequence of words)
(Figure: a single input produces a sequence of outputs, one per time step.)
RNN: Many-to-one
Tasks:
• Sentiment classification
• Video classification
(Figure: a sequence of inputs produces a single output.)
RNN: Many-to-many
Tasks:
• Video classification at the frame level
• Text analysis (nouns, verbs)
(Figure: one output per input time step.)
RNN: Many-to-many (encoder-decoder)
Task:
• Machine translation
– (English → French)
(Figure: an encoder consumes the input sequence into a state z; a decoder generates the output sequence from z.)
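In a framework like PyTorch, the different structures mostly differ in which hidden states you read out. A sketch (dimensions are illustrative assumptions):

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)     # batch of 4 sequences, 10 time steps, 8 features

out, h_n = rnn(x)             # out: (4, 10, 16), one hidden state per step
head = nn.Linear(16, 3)
many_to_many = head(out)          # predict at every time step
many_to_one = head(out[:, -1])    # predict from the last step only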
Progress
• Overview (Feed-forward and convolutional neural networks)
• Vanilla Recurrent Neural Network (RNN)
• Input-output structures of RNNs
• Training Recurrent Neural Networks
• Simple examples
• Advanced models
• Advanced examples
Training Recurrent Neural Networks
• The challenge in training a recurrent neural network is to preserve long-range dependencies.
• Vanilla recurrent neural network:
– h_t = f(W_hh h_{t-1} + W_hx x_t + b)
– f = tanh() can easily cause vanishing gradients.
– f = relu() can easily cause exploding values and/or gradients.
• Finite memory available
Exploding or vanishing gradients
• tanh() solves the exploding value problem.
• tanh() does NOT solve the exploding gradient problem; think of a scalar input and a scalar hidden state:

h_t = tanh(W_hh h_{t-1} + W_hx x_t + b)

∂h_t/∂h_{t-1} = (1 − tanh²(W_hh h_{t-1} + W_hx x_t + b)) · W_hh

The gradient can explode/vanish exponentially in time (steps):
• If |W_hh| < 1: vanishing gradients
• If |W_hh| > 1: exploding gradients
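The geometric growth/decay is easy to verify numerically. A sketch for the scalar case, with zero input so the state stays in the near-linear regime of tanh (values are illustrative):

import numpy as np

def grad_factor(w_hh, steps, h0=1e-3):
    """Product of the per-step derivatives dh_t/dh_{t-1} = (1 - tanh^2) * W_hh
    for a scalar tanh RNN with zero input."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = np.tanh(w_hh * h)
        grad *= (1.0 - h ** 2) * w_hh
    return grad

print(grad_factor(0.9, 50))   # ~5e-3: vanishing
print(grad_factor(1.1, 10))   # ~2.6 and growing while tanh stays near-linear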
Exploding or vanishing gradients (for reference only)
• Conditions for vanishing and exploding gradients for the weight matrix W_hh:
– If the largest singular value > 1: exploding gradients
– If the largest singular value < 1: vanishing gradients
“On the difficulty of training Recurrent Neural Networks”, Pascanu et al., 2013
Gradient clipping
• To avoid exploding gradients, a simple trick is to set a threshold on the norm of the gradient.
(Figure: error surface of a single-hidden-unit RNN, with high-curvature walls.)
• Solid lines: standard gradient descent trajectories
• Dashed lines: gradients rescaled to a fixed size
“On the difficulty of training Recurrent Neural Networks”, Pascanu et al., 2013
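A sketch of gradient-norm clipping in PyTorch; the model and loss are stand-ins for illustration, but clip_grad_norm_ is the standard utility.

import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10, 8)
out, _ = model(x)
loss = out.pow(2).mean()      # stand-in loss for illustration

optimizer.zero_grad()
loss.backward()
# Rescale the total gradient norm to at most max_norm before the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()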
Vanishing gradients - “less of a problem”
• In contrast to feed-forward networks, RNNs will not stop learning despite vanishing gradients.
• The network gets “fresh” inputs at each step, so the weights will still be updated.
• The challenge is learning long-range dependencies. This can be improved using more advanced architectures.
Backpropagation through time (BPTT)
• Training of a recurrent neural network is done using backpropagation and gradient descent algorithms.
– To calculate gradients, you have to keep the internal RNN values in memory until you receive the backpropagated gradient.
– Then what if you are reading a book of 100 million characters?
Truncated Backpropagation Through Time
• The solution is to stop backpropagation at some point and update the weights as if you were done.
(Figure: the sequence is processed in chunks; gradients are backpropagated only within each chunk, while the hidden state is carried forward.)
Truncated Backpropagation Through Time
• Advantages:
– Reduced memory requirements
– Faster parameter updates
• Disadvantage:
– Not able to capture dependencies longer than the truncation length.
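A hedged sketch of truncated BPTT in PyTorch: the sequence is processed in chunks, and the hidden state is detached between chunks so gradients stop at the chunk boundary. Chunk size and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 8)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()))

long_seq = torch.randn(1, 1000, 8)   # one long input sequence
targets = torch.randn(1, 1000, 8)
chunk = 50                           # truncation length

h = None
for t in range(0, long_seq.size(1), chunk):
    x, y = long_seq[:, t:t + chunk], targets[:, t:t + chunk]
    out, h = model(x, h)
    loss = ((head(out) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()                   # carry the state forward, cut the graph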
Progress
• Overview (Feed-forward and convolutional neural networks)
• Vanilla Recurrent Neural Network (RNN)
• Input-output structures of RNNs
• Training Recurrent Neural Networks
• Simple examples
• Advanced models
• Advanced examples
Watch cs231n
• Watch from 31:35 to 39:00
• cs231n: Lecture 10 | Recurrent Neural Networks
Inputting Shakespeare
• Input a lot of text
• Try to predict the next character
• The model learns spelling, quotes, and punctuation.
https://gist.github.com/karpathy/d4dee566867f8291f086
(Figure: sample text generated by the model after training on Shakespeare.)
Inputting LaTeX
(Figure: samples of generated LaTeX source.)
Inputting C code (Linux kernel)
(Figure: samples of generated C code.)
Interpreting cell activations
• Semantic meaning of the hidden state vector
(Figure: individual hidden units that activate on interpretable patterns, e.g. being inside quotes, line length, and code nesting depth.)
Visualizing and Understanding Recurrent Networks
Progress
• Overview (Feed-forward and convolutional neural networks)
• Vanilla Recurrent Neural Network (RNN)
• Input-output structures of RNNs
• Training Recurrent Neural Networks
• Simple examples
• Advanced models
• Advanced examples
Gated Recurrent Unit (GRU)
• A more advanced recurrent unit
• Uses gates to control information flow
• Used to improve learning of long-term dependencies
Gated Recurrent Unit (GRU)
Vanilla RNN
• Cell: h_t = tanh(W_hx x_t + W_hh h_{t-1} + b)

GRU
• Update gate: Γ_u = σ(W_u x_t + U_u h_{t-1} + b_u)
• Reset gate: Γ_r = σ(W_r x_t + U_r h_{t-1} + b_r)
• Candidate cell state: h̃_t = tanh(W x_t + U(Γ_r ∘ h_{t-1}) + b)
• Final cell state: h_t = Γ_u ∘ h_{t-1} + (1 − Γ_u) ∘ h̃_t

The GRU has the ability to add to and remove from the state, not only to “transform” it.
With Γ_r as ones and Γ_u as zeros, GRU → Vanilla RNN
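A minimal NumPy sketch of one GRU step following the equations above (including the slide's convention that Γ_u gates the previous state); parameter names are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; p is a dict of weight matrices W_*, U_* and biases b_*."""
    u = sigmoid(p['W_u'] @ x_t + p['U_u'] @ h_prev + p['b_u'])  # update gate
    r = sigmoid(p['W_r'] @ x_t + p['U_r'] @ h_prev + p['b_r'])  # reset gate
    h_tilde = np.tanh(p['W'] @ x_t + p['U'] @ (r * h_prev) + p['b'])
    return u * h_prev + (1.0 - u) * h_tilde                     # final state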
Gated Recurrent Unit (GRU)
• Update gate: Γ_u = σ(W_u x_t + U_u h_{t-1} + b_u)
• Reset gate: Γ_r = σ(W_r x_t + U_r h_{t-1} + b_r)
• Candidate cell state: h̃_t = tanh(W x_t + U(Γ_r ∘ h_{t-1}) + b)
• Final cell state: h_t = Γ_u ∘ h_{t-1} + (1 − Γ_u) ∘ h̃_t

• The update gate, Γ_u, can easily be close to 1 due to the sigmoid activation function. Copying the previous hidden state prevents vanishing gradients.
• With Γ_u = 0 and Γ_r = 0, the GRU unit can forget the past.
• In a standard RNN, we transform the hidden state regardless of how useful the input is.
Long Short-Term Memory (LSTM) (for reference only)
GRU:
• Update gate: Γ_u = σ(W_u x_t + U_u h_{t-1} + b_u)
• Reset gate: Γ_r = σ(W_r x_t + U_r h_{t-1} + b_r)
• Candidate cell state: h̃_t = tanh(W x_t + U(Γ_r ∘ h_{t-1}) + b)
• Final cell state: h_t = Γ_u ∘ h_{t-1} + (1 − Γ_u) ∘ h̃_t

LSTM:
• Forget gate: Γ_f = σ(W_f x_t + U_f h_{t-1} + b_f)
• Input gate: Γ_i = σ(W_i x_t + U_i h_{t-1} + b_i)
• Output gate: Γ_o = σ(W_o x_t + U_o h_{t-1} + b_o)
• Potential cell memory: c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
• Final cell memory: c_t = Γ_f ∘ c_{t-1} + Γ_i ∘ c̃_t
• Final cell state: h_t = Γ_o ∘ tanh(c_t)
LSTM training:
• One can initialize the biases of the forget gate such that the forget-gate values start close to one.
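A sketch of this initialization for PyTorch's nn.LSTM, whose bias vectors pack the gates in the order (input, forget, cell, output):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

H = lstm.hidden_size
for name, param in lstm.named_parameters():
    if name.startswith('bias'):
        with torch.no_grad():
            # The forget-gate chunk is the second block of size H.
            param[H:2 * H].fill_(1.0)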
Multi-layer Recurrent Neural Networks
• Multi-layer RNNs can be used to increase model capacity
• As for feed-forward neural networks, stacking layers creates higher-level feature representations
• Normally 2 or 3 layers deep, not as deep as conv nets
• Captures more complex relationships in time
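In PyTorch, stacking is just the num_layers argument; the hidden-state sequence of one layer is the input sequence of the next. A small sketch:

import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
x = torch.randn(4, 10, 8)
out, (h_n, c_n) = rnn(x)   # out: (4, 10, 16); h_n, c_n: (2, 4, 16), one per layer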
Bidirectional recurrent neural network
• The blocks can be vanilla RNN, LSTM, or GRU recurrent units
• Real-time vs. post-processing (the backward pass requires the whole sequence)
Forward state: h_t^f = f(W_hx x_t + W_hh h_{t-1}^f)
Backward state: h_t^b = f(W_hx x_t + W_hh h_{t+1}^b)
Concatenated state: h_t = [h_t^f, h_t^b]
Output: y_t = g(W_hy h_t + b)
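A sketch with PyTorch's bidirectional flag; the per-step output concatenates the forward and backward states, so the feature dimension doubles:

import torch
import torch.nn as nn

rnn = nn.GRU(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)
x = torch.randn(4, 10, 8)
out, h_n = rnn(x)           # out: (4, 10, 32) = [forward 16, backward 16]
y = nn.Linear(32, 2)(out)   # per-step prediction using both directions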
Bidirectional recurrent neural network
• Example:
– Task: Is the word part of a person’s name?
– Text 1: He said, “Teddy bears are on sale!”
– Text 2: He said, “Teddy Roosevelt was a great President!”
Progress
• Overview (Feed-forward and convolutional neural networks)
• Vanilla Recurrent Neural Network (RNN)
• Input-output structures of RNNs
• Training Recurrent Neural Networks
• Simple examples
• Advanced models
• Advanced examples
Image Captioning
• Combining a CNN with an RNN to generate descriptive text about an image.
• Training an RNN to interpret top-level features from a CNN
Deep Visual-Semantic Alignments for Generating Image Descriptions
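A hedged sketch of the overall scheme: CNN features initialize the RNN state, and the RNN then predicts one word per step. All names and dimensions are illustrative assumptions, not the architecture from the paper.

import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=128, hidden_dim=256, vocab=1000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image features -> h0
        self.embed = nn.Embedding(vocab, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab)

    def forward(self, cnn_feats, word_ids):
        h0 = torch.tanh(self.init_h(cnn_feats)).unsqueeze(0)  # (1, B, H)
        out, _ = self.rnn(self.embed(word_ids), h0)
        return self.head(out)   # per-step scores over the vocabulary

model = CaptionModel()
scores = model(torch.randn(4, 512), torch.randint(0, 1000, (4, 12)))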
Image Captioning
(Figure: step-by-step caption generation; the CNN features condition the RNN, and words are sampled one at a time until an END token.)
Image Captioning: Results
(Figure: example images with generated captions.)
NeuralTalk2
Image Captioning: Failures
(Figure: example images where the generated caption is wrong.)
NeuralTalk2
CNN + RNN
• With an RNN on top of CNN features, you capture a larger time horizon
Long-term Recurrent Convolutional Networks for Visual Recognition and Description
Alternatives to RNNs (for reference only)
• Temporal convolutional networks (TCN)
– An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
• Attention-based models
– Attention Is All You Need