Icebreaker
Can we predict the future based on our current decisions?
“Which research direction should I take?”
Outline
●Part 1: Review Neural Network Essentials
●Part 2: Sequential Modeling
●Part 3: Introduction to Recurrent Neural Networks
●Part 4: RNNs with TensorFlow
Perceptron Forward Pass
[Figure: a perceptron. Inputs x_0, x_1, …, x_n are multiplied by weights w_0, w_1, …, w_n, summed together with a bias b, and passed through a non-linearity f to produce the output y.]
Computing the output:

y = f(∑_{i=0}^{N} x_i w_i + b)

Vector form:

y = f(XW + b), where X = [x_0, x_1, …, x_n], W = [w_0, w_1, …, w_n], and f is a non-linearity such as tanh, ReLU, or sigmoid.
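A minimal NumPy sketch of this forward pass (the input values and weights below are made up for illustration):

```python
import numpy as np

def perceptron_forward(x, w, b):
    """Single perceptron: y = f(x · w + b), using tanh as the non-linearity."""
    return np.tanh(np.dot(x, w) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x_0 ... x_n
w = np.array([0.1, 0.4, -0.2])   # weights w_0 ... w_n
b = 0.3                          # bias
print(perceptron_forward(x, w, b))  # a single scalar output y
```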
Multi-Layer Perceptron (MLP)
[Figure: an MLP with an input layer (x_0 … x_3), one hidden layer of neurons, and an output layer; each neuron computes a weighted sum followed by a non-linearity.]
Deep Neural Networks (DNN)
[Figure: a deep neural network with an input layer (x_0 … x_3), several hidden layers, and an output layer.]
Neural Network (Summary)
• Neural networks learn features in their hidden layers after the input is fed in, updating the weights through backpropagation (SGD)*.
• Data and activations flow in one direction through the hidden layers; neurons within a layer never interact with each other.
* Stochastic Gradient Descent (SGD)
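A minimal sketch of such a feed-forward network, assuming tf.keras (the layer sizes are illustrative, not from the slides):

```python
import tensorflow as tf

# Data and activations flow in one direction: input -> hidden layers -> output.
# Weights are updated through backpropagation with stochastic gradient descent.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                      # inputs x_0 ... x_3
    tf.keras.layers.Dense(8, activation="relu"),     # hidden layer 1
    tf.keras.layers.Dense(8, activation="relu"),     # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy")
```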
Drawbacks of NNs
●Lack sequence modeling capabilities: they don't keep track of past information (i.e., no memory), which is essential for modeling data with a sequential nature.
● Input is of fixed size (more on this later).
Sequences
“I took the bus this morning because it was very cold.”
Examples of sequences: a stock price series, a speech waveform, a sentence.
●Current values depend on previous values (e.g., melody notes, language rules)
●Order needs to be maintained to preserve meaning
●Sequences usually vary in length
Modeling Sequences
How to represent a sequence?
Bag of Words (BOW):

"I love the coldness" → [ 1 1 0 1 0 0 1 ]

[Figure: the fixed-size BOW vector is fed as input to a standard feed-forward network.]
What’s the problem with the BOW representation?
Problem with BOW
Bag of words does not preserve word order, so the order-dependent meaning of a sentence cannot be captured.

“The food was good, not bad at all”
vs
“The food was bad, not good at all”

Both sentences contain the same words, so they map to the same BOW vector:
[ 1 1 0 1 1 0 1 0 1 0 0 1 1 ]

How can we differentiate the meaning of the two sentences?
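A small sketch of the issue, assuming a binary bag-of-words over an illustrative vocabulary (not necessarily the one behind the vector above):

```python
from collections import Counter

def bow_vector(sentence, vocab):
    """Binary bag-of-words: 1 if the word occurs in the sentence, else 0."""
    counts = Counter(sentence.lower().replace(",", "").split())
    return [min(counts[word], 1) for word in vocab]

vocab = sorted({"the", "food", "was", "good", "bad", "not", "at", "all"})
s1 = "The food was good, not bad at all"
s2 = "The food was bad, not good at all"
print(bow_vector(s1, vocab) == bow_vector(s2, vocab))  # True: word order is lost
```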
One-Hot Encoding
● Preserves order: each word is encoded as a one-hot vector, and the per-word vectors are concatenated in sentence order within the feature vector.
On Monday it was raining
[ 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 ]
We preserved order but what is the problem here?
Problem with One-Hot Encoding
● One-hot encoding cannot deal with variations of the same sequence: reordering the words produces a completely different vector, even though the meaning is largely the same.
On Monday it was raining
It was raining on Monday
[ 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 ]
[ 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 ]
Solution
Solution: we need a model that does not have to relearn the rules of language at each point in the sentence in order to preserve meaning.
On Monday it was raining
It was raining on Monday
With these representations there is no notion of state and no idea of what comes next!
What was learned at the beginning of the sentence will have to be relearned at the end of the sentence.
Markov Models
State and transitions can be modeled, therefore it doesn’t matter where in the sentence we are: we have an idea of what comes next based on the probabilities.
Problem: Each state depends only on the last state. We can’t model long-term dependencies!
Long-term dependencies
We need information from the far past and future to accurately model sequences.
In Italy, I had a great time and I learnt some of the _____ language
Part 3
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs)
●RNNs model sequential information by capturing dependencies between the elements of a sequence.
●RNNs maintain word order and share parameters across the sequence (i.e., there is no need to relearn the rules at each position).
●RNNs are recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations.
●RNNs keep a memory of the information that has been computed so far, which in theory lets them handle long-term dependencies (in practice this is limited, as we will see).
Applications of RNN
●Analyze time-series data to predict the stock market
●Speech recognition (e.g., Emotion Recognition from Acoustic features)
●Autonomous driving
●Natural Language Processing (e.g., Machine Translation, Question Answering)
Some examples
Google Magenta Project (Melody composer) – (https://magenta.tensorflow.org)
Sentence Generator - (http://goo.gl/onkPNd)
Image Captioning – (http://goo.gl/Nwx7Kh)
RNNs Main Components
- Recurrent neurons
- Unrolling recurrent neurons
- Layer of recurrent neurons
- Memory cell containing hidden state
Recurrent Neurons
A simple recurrent neuron:
• receives input
• produces output
• sends its output back to itself

[Figure: a single recurrent neuron, with the legend]
x   - Input
y   - Output (usually a vector of probabilities)
∑   - sum(W · x) + bias
f   - Activation function (e.g., ReLU, tanh, sigmoid)
W_x - Weights for the inputs
W_y - Weights for the outputs of the previous time step
h   - Function of the current inputs and the previous time step
Unrolling/Unfolding recurrent neuron
[Figure: the recurrent neuron unrolled (unfolded) through time. At each time step (frame) t the neuron receives the input x_t and the previous state h_{t-1}, and produces the output y_t and the new state h_t; the same weights W_x and W_y are reused at every step.]
RNNs remember previous state
[Figure: the recurrent neuron at t = 0, reading the first word x_0 = "it".]

x_0      : vector representing the first word
W_x, W_y : weight matrices
y_0      : output at t = 0

h_0 = tanh(W_x x_0 + W_y h_{-1})

The state h_0 can remember things from the past.
RNNs remember previous state
[Figure: the recurrent neuron at t = 1, reading the second word x_1 = "was".]

x_1      : vector representing the second word
W_x, W_y : the weight matrices stay the same, so they are shared across the sequence
y_1      : output at t = 1

h_1 = tanh(W_x x_1 + W_y h_0),  where h_0 = tanh(W_x x_0 + W_y h_{-1})

The state h_1 can remember things from t = 0.
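A minimal NumPy sketch of this recurrence (dimensions and values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))   # input-to-hidden weights (shared across time steps)
W_y = rng.normal(size=(4, 4))   # hidden-to-hidden weights (shared across time steps)

x_0 = rng.normal(size=3)        # vector representing the first word ("it")
x_1 = rng.normal(size=3)        # vector representing the second word ("was")
h_prev = np.zeros(4)            # h_{-1}, initialized to zeros

h_0 = np.tanh(W_x @ x_0 + W_y @ h_prev)  # state after reading "it"
h_1 = np.tanh(W_x @ x_1 + W_y @ h_0)     # state after reading "was"
print(h_1)
```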
Overview
[Figure: the recurrent neuron unrolled through time, with the previous output fed back as input to the next step. The example mini-batch inputs shown are the 3-dimensional vectors [0,1,2], [3,4,5], [6,7,8], [9,0,1] and [9,8,7], [0,0,0], [6,5,4], [3,2,1]; each output y_t is a scalar per neuron.]

y_t = f(W_x x_t + W_y y_{t-1} + b)
Code Example
Where all the magic happens:

y_t = f(W_x x_t + W_y y_{t-1} + b)
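The slide's original TensorFlow code is not reproduced in this extraction; a minimal sketch of the same computation with tf.keras (assuming, as one reading of the example values above, a mini-batch of four instances observed at two time steps) could look like this:

```python
import numpy as np
import tensorflow as tf

X = np.array([
    [[0, 1, 2], [9, 8, 7]],   # instance 1: x at t=0, x at t=1
    [[3, 4, 5], [0, 0, 0]],   # instance 2
    [[6, 7, 8], [6, 5, 4]],   # instance 3
    [[9, 0, 1], [3, 2, 1]],   # instance 4
], dtype=np.float32)          # shape: (batch, time steps, features)

rnn = tf.keras.layers.SimpleRNN(units=5, activation="tanh", return_sequences=True)
Y = rnn(X)                    # computes y_t = tanh(W_x x_t + W_y y_{t-1} + b) at each step
print(Y.shape)                # (4, 2, 5): one 5-dimensional output per instance per time step
```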
Layer of Recurrent Neurons
[Figure: a layer of recurrent neurons. The input x and output y are now vectors (one component per neuron) rather than scalars, and the layer feeds its state h back to itself.]
Unrolling Layer
[Figure: the recurrent layer unrolled through time over inputs x_0, x_1, x_2 and outputs y_0, y_1, y_2, with example mini-batch inputs [1,2,3], [1,2,3], [1,2,3] and [4,5,6], [7,8,9], [0,0,0]; the weights W_x and W_y are shared across time steps, producing the states h_0, h_1, h_2.]

Vector form:

Y_t = f([X_t  Y_{t-1}] · W + b),  where W = [W_x ; W_y] (the two weight matrices stacked vertically)

Y_t = f(X_t · W_x + Y_{t-1} · W_y + b)
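A small NumPy sketch checking that the two formulations above agree (all sizes and values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n_inputs, n_neurons, batch = 3, 5, 2

W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = rng.normal(size=n_neurons)
X_t = rng.normal(size=(batch, n_inputs))
Y_prev = rng.normal(size=(batch, n_neurons))

W = np.vstack([W_x, W_y])                             # stack the weight matrices vertically
Y_concat = np.tanh(np.hstack([X_t, Y_prev]) @ W + b)  # Y_t = f([X_t Y_{t-1}] . W + b)
Y_split = np.tanh(X_t @ W_x + Y_prev @ W_y + b)       # Y_t = f(X_t . W_x + Y_{t-1} . W_y + b)
print(np.allclose(Y_concat, Y_split))                 # True
```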
Variations of RNNs: Input / Output
[Figure: four unrolled RNN configurations]

● Sequence to sequence: stock price prediction, other time series
● Sequence to vector: sentiment analysis ([-1, +1]), other classification tasks; the output is a vector of probabilities over classes, a.k.a. a softmax
● Vector to sequence: image captioning
● Encoder-Decoder: translation; the encoder maps a sequence to a vector, the decoder maps that vector back to a sequence
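A minimal sketch of the "sequence to vector" variant, assuming tf.keras (input dimensions and layer sizes are illustrative):

```python
import tensorflow as tf

# The recurrent layer reads the whole sequence; only its final output is kept
# and passed to a softmax over classes (e.g., sentiment labels).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 50)),                 # variable-length sequence of 50-dim word vectors
    tf.keras.layers.SimpleRNN(32),                    # return_sequences=False: keep only the last output
    tf.keras.layers.Dense(2, activation="softmax"),   # vector of probabilities over classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```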
Training
[Figure: the unrolled network over inputs X_0 … X_4 with the parameters W, b shared at every time step; the forward pass produces Y_0 … Y_4, the cost C(Y_2, Y_3, Y_4) is computed from the last three outputs, and backpropagation flows backwards through the unrolled steps.]

- Forward pass
- Compute the loss via the cost function C (per-step losses J_2, J_3, J_4)
- Minimize the loss by backpropagation through time (BPTT)

Total gradient, summed over time steps:

∂J/∂W = Σ_t ∂J_t/∂W

For one single time step t (here t = 4), counting the contributions of W at previous time steps to the error at time step t (using the chain rule):

∂J_4/∂W = Σ_{k=0}^{4} (∂J_4/∂y_4)(∂y_4/∂h_4)(∂h_4/∂h_k)(∂h_k/∂W)
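A minimal sketch of one such training step, assuming tf.keras (model size, data, and loss are made up for illustration):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(5, 3)),                          # 5 time steps, 3 features
    tf.keras.layers.SimpleRNN(5, return_sequences=True),   # shared W, b at every step
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

X = tf.random.normal((8, 5, 3))       # a mini-batch of 8 sequences
Y_true = tf.random.normal((8, 5, 5))

with tf.GradientTape() as tape:
    Y_pred = model(X)                                   # forward pass through time
    loss = loss_fn(Y_true[:, 2:], Y_pred[:, 2:])        # cost C(Y_2, Y_3, Y_4) from the last three outputs
grads = tape.gradient(loss, model.trainable_variables)  # backpropagation through time (BPTT)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```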
Drawbacks
Main problem: vanishing gradients (the gradients get too small).
Intuition: as sequences get longer, gradients tend to become vanishingly small during the backpropagation process.
∂J_n/∂W = Σ_{k=0}^{n} (∂J_n/∂y_n)(∂y_n/∂h_n)(∂h_n/∂h_k)(∂h_k/∂W)

The factor ∂h_n/∂h_k expands (for k = 0) into the product

(∂h_n/∂h_{n-1})(∂h_{n-1}/∂h_{n-2}) ⋯ (∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂h_0)
We are just multiplying a lot of small numbers together
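A tiny numeric illustration of that intuition (the per-step derivative value is made up):

```python
# If each factor dh_t/dh_{t-1} is roughly 0.5 (tanh derivatives, small weights),
# the product over many time steps quickly approaches zero.
factor = 0.5
for n in [5, 20, 50]:
    print(n, factor ** n)   # 0.03125, ~9.5e-07, ~8.9e-16
```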
Solutions
Long Short-Term Memory (LSTM) networks deal with the vanishing gradient problem, and are therefore more reliable for modeling long-term dependencies, especially for very long sequences.
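A minimal sketch, assuming tf.keras: swapping the simple recurrent layer for an LSTM layer is often enough to capture longer-term dependencies (shapes are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 50)),                 # variable-length sequences of 50-dim vectors
    tf.keras.layers.LSTM(64),                         # instead of SimpleRNN(64)
    tf.keras.layers.Dense(2, activation="softmax"),
])
```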
RNN Extensions
Extended Readings:
○ Bidirectional RNNs – passing states in both directions
○ Deep (Bidirectional) RNNs – stacking RNNs
○ LSTM networks – Adaptation of RNNs
In general
●RNNs are great for analyzing sequences of arbitrary length.
●RNNs are considered “anticipatory” models
●RNNs are also considered creative learning models: they can, for example, predict a set of musical notes that could come next in a melody and select an appropriate one.
Demo
●Building RNNs in TensorFlow
●Training RNNs in TensorFlow
● Image Classification
●Text Classification
References
● Introduction to RNNs - http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
●NTHU Machine Learning Course - https://goo.gl/B4EqMi
●Hands-On Machine Learning with Scikit-Learn and TensorFlow (book)