
Page 1: Introduction to Fundamentals of RNNs

RNNs

Under the Hood / On the Surface

Elvis Saravia

Page 2: Introduction to Fundamentals of RNNs

Icebreaker

Can we predict the future based on our current decisions?

“Which research direction should I take?”

Page 3: Introduction to Fundamentals of RNNs

Outline

● Part 1: Review of Neural Network Essentials

● Part 2: Sequential Modeling

● Part 3: Introduction to Recurrent Neural Networks

● Part 4: RNNs with TensorFlow

Page 4: Introduction to Fundamentals of RNNs

Part 1

The Neural Network

Page 5: Introduction to Fundamentals of RNNs

Perceptron Forward Pass

[Figure: a single perceptron. Inputs x_0, x_1, ..., x_n and a bias b are multiplied by weights w_0, w_1, ..., w_n, summed, and passed through a non-linearity f to produce the output y.]

Computing the output:

$y = f\left(\sum_{i=0}^{N} x_i w_i + b\right)$

Vector form:

$y = f(XW + b)$, where $X = [x_0, x_1, \dots, x_n]$, $W = [w_0, w_1, \dots, w_n]$, and $f$ is a non-linearity such as tanh, ReLU, or sigmoid.
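To make the forward pass concrete, here is a minimal NumPy sketch of the computation above (the input values, weights, and bias are made up for illustration):

```python
import numpy as np

def perceptron_forward(x, w, b, f=np.tanh):
    """Compute y = f(x . w + b) for a single perceptron."""
    return f(np.dot(x, w) + b)

# Illustrative values (hypothetical, for demonstration only)
x = np.array([0.5, -1.0, 2.0])   # inputs x_0 ... x_n
w = np.array([0.1, 0.4, -0.3])   # weights w_0 ... w_n
b = 0.2                          # bias

y = perceptron_forward(x, w, b)  # tanh non-linearity
print(y)
```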

Page 6: Introduction to Fundamentals of RNNs

Multi-Layer Perceptron (MLP)

[Figure: a multi-layer perceptron with an input layer (x_0, x_1, x_2, x_3), one hidden layer of neurons, and an output layer.]

Page 7: Introduction to Fundamentals of RNNs

Deep Neural Networks (DNN)

[Figure: a deep neural network with an input layer (x_0, x_1, x_2, x_3), several hidden layers of neurons, and an output layer.]

Page 8: Introduction to Fundamentals of RNNs

Neural Network (Summary)

• Neural networks learn features from the input as it passes through the hidden layers, updating the weights via backpropagation (SGD)*.

• Data and activations flow in one direction through the hidden layers; neurons never interact with each other.

[Figure: the same deep feedforward network as before, with an input layer (x_0 ... x_3), hidden layers, and an output layer.]

* Stochastic Gradient Descent (SGD)
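As a rough sketch of the weight-update idea mentioned above (a single unit trained with SGD on one made-up example, using a squared-error loss and an arbitrary learning rate):

```python
import numpy as np

# One perceptron-style unit trained on a single example with SGD
x = np.array([0.5, -1.0, 2.0])
target = 1.0
w = np.zeros(3)
b = 0.0
lr = 0.1                                  # learning rate (illustrative)

for step in range(5):
    y = np.tanh(x @ w + b)                # forward pass
    error = y - target                    # squared-error loss: 0.5 * error**2
    grad_pre = error * (1 - y**2)         # backprop through the tanh
    w -= lr * grad_pre * x                # SGD update for the weights
    b -= lr * grad_pre                    # SGD update for the bias
    print(step, float(y))
```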

Page 9: Introduction to Fundamentals of RNNs

Drawbacks of NNs

● Lack sequence-modeling capabilities: they don't keep track of past information (i.e., they have no memory), which is essential for modeling data with a sequential nature.

● Input is of fixed size (more on this later)

[Figure: the same feedforward network as before; the number of inputs x_0, x_1, x_2, ..., x_n is fixed.]

Page 10: Introduction to Fundamentals of RNNs

Part 2

Sequential Modeling

Page 11: Introduction to Fundamentals of RNNs

Sequences

“I took the bus this morning because it was very cold.”

Examples: stock prices, speech waveforms, sentences (like the one above).

● Current values depend on previous values (e.g., melody notes, language rules)

● Order needs to be maintained to preserve meaning

● Sequences usually vary in length

Page 12: Introduction to Fundamentals of RNNs

Modeling Sequences

How to represent a sequence?

One option: Bag of Words (BOW)

"I love the coldness" → [ 1 1 0 1 0 0 1 ]

[Figure: the fixed-length BOW vector is fed into a feedforward network.]

What's the problem with the BOW representation?

Page 13: Introduction to Fundamentals of RNNs

Problem with BOW

Bag of Words does not preserve order, so the semantics that depend on word order cannot be captured.

"The food was good, not bad at all"
vs
"The food was bad, not good at all"

Both sentences map to the same BOW vector: [ 1 1 0 1 1 0 1 0 1 0 0 1 1 ]

How can we differentiate the meaning of the two sentences?
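A quick check in plain Python makes the problem visible: both sentences produce exactly the same bag-of-words counts (the tokenization here is deliberately naive):

```python
from collections import Counter

def bow(sentence):
    """Bag-of-words: word counts, ignoring order."""
    return Counter(sentence.lower().replace(",", "").split())

a = "The food was good, not bad at all"
b = "The food was bad, not good at all"

print(bow(a) == bow(b))  # True: order is lost, the representations are identical
```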

Page 14: Introduction to Fundamentals of RNNs

One-Hot Encoding

● Preserve order by maintaining the word order within the feature vector (one one-hot block per word position)

On Monday it was raining

[ 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 ]

We preserved order but what is the problem here?
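As a rough sketch of this encoding (the toy vocabulary and its ordering are assumptions, so the exact bit pattern will differ from the slide), each word position contributes one one-hot block:

```python
import numpy as np

vocab = ["it", "monday", "on", "raining", "was"]   # assumed toy vocabulary

def one_hot_sequence(sentence, vocab):
    """Concatenate a one-hot vector per word, preserving word order."""
    blocks = []
    for word in sentence.lower().split():
        v = np.zeros(len(vocab), dtype=int)
        v[vocab.index(word)] = 1
        blocks.append(v)
    return np.concatenate(blocks)

print(one_hot_sequence("On Monday it was raining", vocab))
print(one_hot_sequence("It was raining on Monday", vocab))  # same words, different vector
```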

Page 15: Introduction to Fundamentals of RNNs

Problem with One-Hot Encoding

● One-hot encoding cannot deal with variations of the same sequence.


On Monday it was raining

It was raining on Monday

[ 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 ]

[ 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 ]


Page 16: Introduction to Fundamentals of RNNs

Solution

A naive solution: relearn the rules of language at each point in the sentence to preserve meaning.

On Monday it was raining
It was raining on Monday

But then there is no idea of state or of what comes next, and what was learned at the beginning of the sentence will have to be relearned at the end of the sentence.

Page 17: Introduction to Fundamentals of RNNs

Markov Models

States and transitions can be modeled, so no matter where we are in the sentence, we have an idea of what comes next based on the transition probabilities.

Problem: Each state depends only on the last state. We can’t model long-term dependencies!
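For intuition, a toy first-order Markov model over words might look like the sketch below; the transition probabilities are made up, and the next word depends only on the current word:

```python
import random

# Toy transition probabilities P(next word | current word) -- illustrative only
transitions = {
    "on":     {"monday": 1.0},
    "monday": {"it": 1.0},
    "it":     {"was": 1.0},
    "was":    {"raining": 0.7, "cold": 0.3},
}

def next_word(current):
    """Sample the next word given only the current word (no longer history)."""
    words, probs = zip(*transitions[current].items())
    return random.choices(words, weights=probs, k=1)[0]

print(next_word("was"))  # depends only on "was", not on anything earlier
```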


Page 18: Introduction to Fundamentals of RNNs

Long-term dependencies

We need information from the far past and future to accurately model sequences.


In Italy, I had a great time and I learnt some of the _____ language


Page 19: Introduction to Fundamentals of RNNs


It’s time for Recurrent Neural Networks (RNNs)!!!


Page 20: Introduction to Fundamentals of RNNs

Part 3

Recurrent Neural Networks (RNNs)

Page 21: Introduction to Fundamentals of RNNs

Recurrent Neural Networks (RNNs)

● RNNs model sequential information by capturing (potentially long-term) dependencies between the elements of a sequence.

● RNNs maintain word order and share parameters across the sequence (i.e., no need to relearn the rules at each position).

● RNNs are called recurrent because they perform the same task for every element of a sequence, with each output depending on the previous computations.

● RNNs keep a memory of what has been computed so far, which in principle lets them deal with long-term dependencies.

Page 22: Introduction to Fundamentals of RNNs

Applications of RNNs

● Analyzing time-series data (e.g., to predict the stock market)

● Speech recognition (e.g., emotion recognition from acoustic features)

● Autonomous driving

● Natural Language Processing (e.g., machine translation, question answering)

Page 23: Introduction to Fundamentals of RNNs

Some examples

Google Magenta Project (Melody composer) – (https://magenta.tensorflow.org)

Sentence Generator - (http://goo.gl/onkPNd)

Image Captioning – (http://goo.gl/Nwx7Kh)


Page 24: Introduction to Fundamentals of RNNs

Main Components of RNNs

- Recurrent neurons

- Unrolling recurrent neurons

- Layer of recurrent neurons

- Memory cell containing hidden state


Page 25: Introduction to Fundamentals of RNNs

Recurrent Neurons

A simple recurrent neuron:
• receives an input
• produces an output
• sends its output back to itself

[Figure: a recurrent neuron. The input x is weighted by W_x, the previous output is weighted by W_y, the weighted sum plus a bias is passed through an activation function f, and the result y is both emitted and fed back.]

Legend:
- x: input
- y: output (usually a vector of probabilities)
- ∑: sum(W · x) + bias
- f: activation function (e.g., ReLU, tanh, sigmoid)
- W_x: weights for the inputs
- W_y: weights for the outputs of the previous time step
- y: a function of the current inputs and the previous time step

Page 26: Introduction to Fundamentals of RNNs

Unrolling/Unfolding recurrent neuron

[Figure: unrolling a recurrent neuron through time. The single neuron (input x, output y, function f) is unrolled into one copy per time step (or frame): inputs x_{t-3}, x_{t-2}, x_{t-1}, x_t produce outputs y_{t-3}, y_{t-2}, y_{t-1}, y_t, with hidden states h_{t-4} ... h_t passed from one step to the next, and the same weights W_x and W_y reused at every time step.]

Page 27: Introduction to Fundamentals of RNNs

RNNs remember previous state

[Figure: at t = 0, the recurrent cell receives the first word x_0 = "it" and the initial hidden state h_{-1}, and produces the output y_0 and the new hidden state h_0.]

x_0: vector representing the first word
W_x, W_y: weight matrices
y_0: output at t = 0

At t = 0: $h_0 = \tanh(W_x x_0 + W_y h_{-1})$

The network can remember things from the past.

Page 28: Introduction to Fundamentals of RNNs

RNNs remember previous state

[Figure: at t = 1, the recurrent cell receives the second word x_1 = "was" and the previous hidden state h_0, and produces the output y_1 and the new hidden state h_1.]

x_1: vector representing the second word
W_x, W_y: the weight matrices stay the same, so they are shared across the sequence
y_1: output at t = 1

At t = 1: $h_1 = \tanh(W_x x_1 + W_y h_0)$, with $h_0 = \tanh(W_x x_0 + W_y h_{-1})$ from the previous step

Through h_0, the network can remember things from t = 0.
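A minimal NumPy sketch of this recurrence (dimensions, weights, and word vectors are made up purely to illustrate that the same W_x and W_y are reused at every step):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 3

Wx = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input-to-hidden weights
Wy = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights

def step(x, h_prev):
    """One RNN step: h_t = tanh(Wx·x_t + Wy·h_{t-1})."""
    return np.tanh(Wx @ x + Wy @ h_prev)

# "it was raining" as one-hot word vectors (toy example)
words = [np.eye(vocab_size)[i] for i in (0, 1, 2)]

h = np.zeros(hidden_size)        # h_{-1}
for t, x in enumerate(words):
    h = step(x, h)               # same Wx, Wy reused at every time step
    print(f"h_{t} =", h)
```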

Page 29: Introduction to Fundamentals of RNNs

Overview

[Figure: the unrolled recurrent neuron fed with a mini-batch at each time step. Example inputs per step: [0,1,2], [3,4,5], [6,7,8], [9,0,1] for one instance and [9,8,7], [0,0,0], [6,5,4], [3,2,1] for another; each scalar output y_t is fed into the next time step.]

$y_t = f(W_x x_t + W_y y_{t-1} + b)$

Page 30: Introduction to Fundamentals of RNNs

Code Example

Where all the magic happens!

$y_t = f(W_x x_t + W_y y_{t-1} + b)$
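The original slide showed this as a code screenshot; below is a minimal NumPy sketch of the same recurrence over a small mini-batch (random weights, toy input values loosely based on the previous slide):

```python
import numpy as np

n_inputs, n_neurons = 3, 5
rng = np.random.default_rng(42)

Wx = rng.normal(size=(n_inputs, n_neurons))   # weights for the inputs
Wy = rng.normal(size=(n_neurons, n_neurons))  # weights for the previous outputs
b  = np.zeros(n_neurons)

# Mini-batch of 2 instances, 2 time steps, 3 input features each (toy values)
X = np.array([
    [[0., 1., 2.], [9., 8., 7.]],   # t = 0
    [[3., 4., 5.], [0., 0., 0.]],   # t = 1
])

Y_prev = np.zeros((2, n_neurons))
for t, X_t in enumerate(X):
    # y_t = f(Wx·x_t + Wy·y_{t-1} + b), with f = tanh
    Y_prev = np.tanh(X_t @ Wx + Y_prev @ Wy + b)
    print(f"Y at t={t}:\n", Y_prev)
```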

Page 31: Introduction to Fundamentals of RNNs

Layer of Recurrent Neurons

[Figure: on the left, a single recurrent neuron with scalar input x and scalar output y; on the right, a layer of recurrent neurons, where the input x and the output y are vectors.]

Page 32: Introduction to Fundamentals of RNNs

Unrolling Layer

[Figure: unrolling a layer of recurrent neurons through time over three time steps, with hidden states h_0, h_1, h_2 passed between steps and the same W_x and W_y reused at every step. Example mini-batch inputs: [1,2,3] and [4,5,6] at t = 0, [1,2,3] and [7,8,9] at t = 1, [1,2,3] and [0,0,0] at t = 2.]

Vector form:

$Y_t = f(X_t W_x + Y_{t-1} W_y + b) = f([X_t \; Y_{t-1}] \cdot W + b)$, where $W = \begin{bmatrix} W_x \\ W_y \end{bmatrix}$

Page 33: Introduction to Fundamentals of RNNs

Variations of RNNs: Input / Output

Sequence to sequence (inputs X_0 ... X_3, outputs Y_0 ... Y_3):
- Stock price and other time series

Vector to sequence (a single input X_0, outputs Y_0 ... Y_3):
- Image captioning

Encoder-Decoder (a sequence-to-vector encoder followed by a vector-to-sequence decoder):
- Translation

Sequence to vector (inputs X_0 ... X_3, only the final output is kept):
- Sentiment analysis (e.g., a score in [-1, +1]) and other classification tasks
- The output is a vector of probabilities over classes (a.k.a. softmax)
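As a rough sketch of how two of these variants map onto Keras (layer type and sizes are illustrative choices, not taken from the slides): returning the full output sequence gives a sequence-to-sequence model, while keeping only the last output gives a sequence-to-vector model.

```python
import tensorflow as tf

# Sequence to sequence: one output per time step
seq2seq = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, return_sequences=True, input_shape=(None, 1)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
])

# Sequence to vector: only the last output is kept (e.g., for classification)
seq2vec = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=(None, 1)),
    tf.keras.layers.Dense(3, activation="softmax"),  # probabilities over classes
])

print(seq2seq.output_shape, seq2vec.output_shape)
```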

Page 34: Introduction to Fundamentals of RNNs

Training

[Figure: an unrolled RNN with inputs X_0 ... X_4, outputs Y_0 ... Y_4, hidden states h_0 ... h_4, and the same parameters W, b at every time step. A cost function C(Y_2, Y_3, Y_4) is computed from the outputs, giving losses J_2, J_3, J_4. The forward pass runs left to right; backpropagation runs right to left.]

- Forward pass
- Compute the loss via the cost function C
- Minimize the loss by backpropagation through time (BPTT)

Summing the loss over time steps:

$\frac{\partial J}{\partial W} = \sum_t \frac{\partial J_t}{\partial W}$

For one single time step t (here t = 4), the chain rule counts the contributions of W at previous time steps to the error at time step t:

$\frac{\partial J_4}{\partial W} = \sum_{k=0}^{4} \frac{\partial J_4}{\partial y_4} \frac{\partial y_4}{\partial h_4} \frac{\partial h_4}{\partial h_k} \frac{\partial h_k}{\partial W}$
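In practice a framework performs BPTT for you when training an RNN. A hedged Keras sketch of the forward pass / loss / backpropagation loop described above, on toy random data with arbitrary sizes:

```python
import numpy as np
import tensorflow as tf

# Toy data: 100 sequences of 10 time steps with 1 feature, one target per sequence
X = np.random.rand(100, 10, 1).astype("float32")
y = np.random.rand(100, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(16, input_shape=(10, 1)),
    tf.keras.layers.Dense(1),
])

# compile + fit: the forward pass, loss computation, and backpropagation
# through time are all handled internally
model.compile(optimizer="sgd", loss="mse")
model.fit(X, y, epochs=2, verbose=0)
print(model.evaluate(X, y, verbose=0))
```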

Page 35: Introduction to Fundamentals of RNNs

Drawbacks

Main problem: vanishing gradients (the gradients get too small).

Intuition: as sequences get longer, gradients tend to shrink during the backpropagation process.

$\frac{\partial J_n}{\partial W} = \sum_{k=0}^{n} \frac{\partial J_n}{\partial y_n} \frac{\partial y_n}{\partial h_n} \frac{\partial h_n}{\partial h_k} \frac{\partial h_k}{\partial W}$, where, e.g. for k = 0, $\frac{\partial h_n}{\partial h_0} = \frac{\partial h_n}{\partial h_{n-1}} \frac{\partial h_{n-1}}{\partial h_{n-2}} \cdots \frac{\partial h_2}{\partial h_1} \frac{\partial h_1}{\partial h_0}$

We are just multiplying a lot of small numbers together
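A tiny numerical illustration of "multiplying a lot of small numbers together" (the 0.1 factor is an arbitrary stand-in for per-step derivatives smaller than 1):

```python
import math

# Product of n per-step derivative factors, each assumed to be ~0.1
for n in (5, 10, 50):
    print(n, math.prod([0.1] * n))   # shrinks towards zero as n grows
```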


Page 36: Introduction to Fundamentals of RNNs

Solutions

Long Short-Term Memory (LSTM) networks deal with the vanishing gradient problem and are therefore more reliable for modeling long-term dependencies, especially over very long sequences.
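In Keras, for example, this often amounts to swapping the simple recurrent layer for an LSTM layer (the sizes below are arbitrary):

```python
import tensorflow as tf

# An LSTM layer is a drop-in replacement for a SimpleRNN layer
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(None, 1)),  # better at long-term dependencies
    tf.keras.layers.Dense(1),
])
model.summary()
```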

Page 37: Introduction to Fundamentals of RNNs

RNN Extensions

Extended Readings:

○ Bidirectional RNNs – passing states in both directions

○ Deep (Bidirectional) RNNs – stacking RNNs

○ LSTM networks – Adaptation of RNNs


Page 38: Introduction to Fundamentals of RNNs

In general

● RNNs are great for analyzing sequences of arbitrary length.

● RNNs are considered "anticipatory" models.

● RNNs are also considered creative models: they can, for example, predict a set of musical notes that could come next in a melody and select an appropriate one to play.

Page 39: Introduction to Fundamentals of RNNs

Part 4

RNNs in TensorFlow

Page 40: Introduction to Fundamentals of RNNs

Demo

● Building RNNs in TensorFlow

● Training RNNs in TensorFlow

● Image Classification

● Text Classification
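The demo code itself is not included in these notes; as a hedged sketch of what an image-classification demo along these lines might look like, one common toy setup (not necessarily the original demo) treats each 28×28 MNIST image as a sequence of 28 rows:

```python
import tensorflow as tf

# Each MNIST image is treated as a sequence of 28 rows with 28 features each
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(128, input_shape=(28, 28)),   # sequence to vector
    tf.keras.layers.Dense(10, activation="softmax"),        # class probabilities
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, validation_data=(x_test, y_test))
```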

Page 41: Introduction to Fundamentals of RNNs

References

● Introduction to RNNs - http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

● NTHU Machine Learning Course - https://goo.gl/B4EqMi

● Hands-On Machine Learning with Scikit-Learn and TensorFlow (book)