Deep-Learning:
Recurrent Neural Networks (RNN)

Pr. Fabien MOUTARDE
Center for Robotics
MINES ParisTech
PSL Université Paris

Fabien.Moutarde@mines-paristech.fr
http://people.mines-paristech.fr/fabien.moutarde
Acknowledgements
While preparing these slides, I drew inspiration and borrowed some slide content from several sources, in particular:
• Fei-Fei Li + J. Johnson + S. Yeung: slides on "Recurrent Neural Networks" from the "Convolutional Neural Networks for Visual Recognition" course at Stanford
http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture10.pdf
• Yingyu Liang: slides on "Recurrent Neural Networks" from the "Deep Learning Basics" course at Princeton
https://www.cs.princeton.edu/courses/archive/spring16/cos495/slides/DL_lecture9_RNN.pdf
• Arun Mallya: slides "Introduction to RNNs" from the "Trends in Deep Learning and Recognition" course of Svetlana LAZEBNIK at the University of Illinois at Urbana-Champaign
http://slazebni.cs.illinois.edu/spring17/lec02_rnn.pdf
• Tingwu Wang: slides on "Recurrent Neural Network" for a course at the University of Toronto
https://www.cs.toronto.edu/%7Etingwuwang/rnn_tutorial.pdf
• Christopher Olah: online tutorial "Understanding LSTM Networks"
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Outline
• Standard Recurrent Neural Networks
• Training RNN: BackPropagation Through Time
• LSTM and GRU
• Applications of RNNs
Recurrent Neural Networks (RNN)
[Figure: a recurrent network with a time delay (0, 1 or 2 steps) on each connection between the inputs x1, x2, x3 and the output, and its equivalent form in which the delayed signals x2(t-1), x3(t-1), x2(t-2) appear as additional inputs to a feedforward network]
Canonical form of RNN
[Figure: canonical form of an RNN: a non-recurrent network takes the external input U(t) together with the state variables X(t-1), fed back through unit delays, and produces the output Y(t) and the new state variables X(t)]
Time unfolding of RNN
[Figure: the non-recurrent network is replicated at times t-2, t-1 and t; each copy receives the external input U(t-k) and the state variables X(t-k-1) from the previous copy, and produces the output Y(t-k) and the state variables X(t-k)]
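To make the unfolding concrete, here is a minimal NumPy sketch (all names and dimensions are illustrative, not from the slides) of the same non-recurrent block being reused at every time step:

```python
import numpy as np

def step(x_prev, u, W, V):
    # One application of the non-recurrent block:
    # new state X(t) from previous state X(t-1) and external input U(t)
    return np.tanh(W @ x_prev + V @ u)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 4))   # state-to-state weights
V = rng.normal(scale=0.5, size=(4, 3))   # input-to-state weights
U = rng.normal(size=(10, 3))             # input sequence U(1)..U(10)

x = np.zeros(4)                          # initial state X(0)
states = []
for u in U:                              # time unfolding: same weights at every step
    x = step(x, u, W, V)
    states.append(x)
```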
Dynamic systems & RNN
The state equation of a discrete-time dynamic system is X(t+1) = f(X(t), U(t+1)).
If using a Neural Net for f, this is EXACTLY a RNN!
[Figures from Deep Learning, Goodfellow, Bengio and Courville]
Standard (“vanilla”) RNN
State vector s ↔ vector h of hidden neurons:
ht = tanh(Whh ht-1 + Wxh xt)
yt = Why ht, or yt = softMax(Why ht)
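A minimal NumPy sketch of one vanilla RNN step, consistent with the equations above (the weight names Wxh, Whh, Why follow the convention of the cited Stanford course):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def vanilla_rnn_step(x_t, h_prev, Wxh, Whh, Why):
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev)   # new hidden state (the "memory")
    y_t = softmax(Why @ h_t)                  # output, e.g. class probabilities
    return h_t, y_t
```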
Advantages of RNN
The hidden state s of the RNN builds a kind of lossy summary of the past.

RNNs are perfectly suited to processing SEQUENTIAL data: the same computation formula is applied at each time step, but modulated by the evolving "memory" contained in state s.

Universality of RNNs: any function computable by a Turing Machine can be computed by a finite-size RNN (Siegelmann and Sontag, 1995).
RNN hyper-parameters
• As for MLPs, the main hyperparameter is the size of the hidden layer (= the dimension of vector h)
Outline
• Standard Recurrent Neural Networks
• Training RNN: BackPropagation Through Time
• LSTM and GRU
• Applications of RNNs
RNN training
• BackPropagation Through Time (BPTT):
one gradient update for a whole sequence
• or Real-Time Recurrent Learning (RTRL):
a gradient update at each frame of a sequence
[Figure: a temporal sequence unrolled over horizon Nt = 4; the shared weights W(t) are applied at steps t to t+3, errors e3 and e4 are computed, and the weights are updated to W(t+4) at the end of the horizon]
BackPropagation THROUGH TIME (BPTT)
• Forward through entire sequence to compute SUM of losses at ALL (or part of) time steps
• Then backprop through ENTIRE sequence to compute gradients
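In an autodiff framework, BPTT amounts to exactly this: accumulate the per-step losses over the sequence, then call backward once on the total. A hedged PyTorch sketch (model, sizes and data are placeholders):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=3, hidden_size=16, batch_first=True)
head = nn.Linear(16, 5)                  # per-step classifier on top of h(t)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 20, 3)                # batch of 8 sequences of 20 steps
targets = torch.randint(0, 5, (8, 20))   # a target class at EVERY time step

out, _ = rnn(x)                          # forward through the ENTIRE sequence
loss = criterion(head(out).reshape(-1, 5), targets.reshape(-1))  # loss averaged over all steps
loss.backward()                          # backprop through the ENTIRE sequence
```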
BPTT computation principle
[Figure: the network is replicated into blocks 1, 2 and 3 sharing the same weights W(t); from the state Xd(t-1) and the inputs U(t), U(t+1), U(t+2), the blocks produce the states X(t), X(t+1), X(t+2), compared to the targets D(t+1), D(t+2); the error gradients dE/dXn, dE/dXn+1 flow backwards from block to block, and each block contributes dW1, dW2, dW3 to the total weight gradient:]

dW = dW1 + dW2 + dW3
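This additive structure can be checked directly on a toy linear recurrence; a minimal NumPy sketch (illustrative, with an error at the final step only):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(scale=0.3, size=(3, 3))
U = rng.normal(size=(4, 3))               # inputs u(1)..u(4)
d = rng.normal(size=3)                    # desired final state

# Forward pass: linear recurrence x(t) = W x(t-1) + u(t)
xs = [np.zeros(3)]
for u in U:
    xs.append(W @ xs[-1] + u)

# Backward pass, with E = 0.5 * ||x(T) - d||^2 at the final step
delta = xs[-1] - d                        # dE/dx(T)
dW = np.zeros_like(W)
for t in range(len(U), 0, -1):            # walk back through the blocks
    dW += np.outer(delta, xs[t - 1])      # this block's contribution dW_t
    delta = W.T @ delta                   # propagate dE/dx(t-1) to the previous block
# dW is now the sum of the per-block contributions: dW = dW1 + dW2 + ...
```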
BPTT algorithm
Weight update over a horizon of $N_t$ steps:
$$W(t+N_t) = W(t) - \lambda\,\nabla_W E, \quad \text{with } E = \sum_t (Y_t - D_t)^2$$

Chain rule at each time step, and sum over the horizon:
$$\forall t,\quad \frac{\partial E_t}{\partial W} = \frac{\partial E_t}{\partial Y_t}\,\frac{\partial Y_t}{\partial X_{t-1}}\,\frac{\partial X_{t-1}}{\partial W} \qquad\qquad \frac{\partial E}{\partial W} = \sum_{t=1}^{N_t} \frac{\partial E_t}{\partial W}$$

[Figure: canonical form: a Feedforward Network receives U(t) and, through a delay, the state X(t-1), and outputs Y(t) and the new state X(t)]

The gradient of the state with respect to the weights unfolds recursively:
$$\frac{\partial X_t}{\partial W} = \sum_{k=1}^{t-1} \frac{\partial X_t}{\partial X_{t-k}}\,\frac{\partial X_{t-k}}{\partial W} \qquad \text{and (chain rule)} \qquad \frac{\partial X_t}{\partial X_{t-k}} = \prod_{j=t-k+1}^{t} \frac{\partial X_j}{\partial X_{j-1}}$$

where each $\partial X_j/\partial X_{j-1}$ is the Jacobian matrix of the Feedforward net.
Vanishing/exploding gradient problem
• If the eigenvalues of the Jacobian matrix are > 1, then gradients tend to EXPLODE
→ learning will never converge.
• Conversely, if the eigenvalues of the Jacobian matrix are < 1, then gradients tend to VANISH
→ error signals can only affect small time lags → short-term memory only.

→ Possible solution for exploding gradients: the CLIPPING trick (see the sketch below)
→ Possible solutions for vanishing gradients:
– use ReLU instead of tanh
– change what is inside the RNN!
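A sketch of the clipping trick in PyTorch, using the standard clip_grad_norm_ utility (model, loss and threshold are placeholders):

```python
import torch
from torch import nn

model = nn.RNN(input_size=3, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 50, 3)                 # a placeholder training batch
out, _ = model(x)
loss = out.pow(2).mean()                  # stand-in loss, for illustration only
loss.backward()

# CLIPPING trick: rescale the gradient vector if its norm exceeds the threshold
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```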
Outline
• Standard Recurrent Neural Networks
• Training RNN: BackPropagation Through Time
• LSTM and GRU
• Applications of RNNs
Long Short-Term Memory (LSTM)
Problem of standard RNNs = no actual LONG-TERM memory

LSTM = RNN variant designed to solve this issue
(proposed by Hochreiter & Schmidhuber in 1997)

• Key idea = use "gates" that modulate the respective influences of the input and of the memory

[Figures from https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
LSTM gates
Gate = pointwise multiplication by a sigmoid output σ ∈ ]0;1[
→ modulates between "let nothing through" and "let everything through"

• FORGET gate: decides what to keep of the old cell memory
• INPUT gate: decides how much new content to write
→ next state = mix between pure memory and pure new input

[Figures from https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
LSTM summary
• OUTPUT gate: decides which part of the cell state is exposed as the hidden state h

ALL weights Wf, Wi, Wc and Wo (and biases) are LEARNT

[Figure from the Deep Learning book by I. Goodfellow, Y. Bengio & A. Courville]
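Putting the three gates together, a minimal NumPy sketch of one LSTM step (the weight names follow the slide's Wf, Wi, Wc, Wo convention; shapes and the toy usage are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    z = np.concatenate([h_prev, x_t])    # [h(t-1), x(t)]
    f = sigmoid(Wf @ z + bf)             # FORGET gate: what to keep of the old memory
    i = sigmoid(Wi @ z + bi)             # INPUT gate: how much new content to write
    c_tilde = np.tanh(Wc @ z + bc)       # candidate new content
    c_t = f * c_prev + i * c_tilde       # next cell state = mix of memory and new
    o = sigmoid(Wo @ z + bo)             # OUTPUT gate: what to expose as h(t)
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Toy usage with hidden size 4 and input size 3
H, X = 4, 3
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.3, size=(H, H + X)) for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, c = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), *Ws, *bs)
```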
Why does LSTM avoid vanishing gradients?
Because the cell state is updated ADDITIVELY (Ct = ft ⊙ Ct-1 + it ⊙ C̃t): along the cell-state path, the backward gradient is multiplied only by the forget-gate activations (which can stay close to 1), not repeatedly by the same squashing Jacobian as in a standard RNN.
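A back-of-the-envelope illustration of the difference (the per-step factors 0.4 and 0.95 are typical assumed values, not measured ones):

```python
T = 50
# Vanilla RNN: the backward pass multiplies by W^T diag(tanh') at every step;
# with ||W|| ~ 1 and a typical tanh' ~ 0.4, the per-step factor is ~ 0.4
vanilla = 0.4 ** T
# LSTM cell-state path: the backward pass multiplies only by the forget
# activations, which can stay close to 1 when the gate is open
lstm = 0.95 ** T
print(f"vanilla gradient factor after {T} steps: {vanilla:.2e}")  # ~ 1e-20
print(f"LSTM cell-path factor after {T} steps:   {lstm:.2e}")     # ~ 8e-02
```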
Gated Recurrent Unit (GRU)
Simplified variant of LSTM, with only 2 gates: a RESET gate & an UPDATE gate
(proposed by Cho et al. in 2014)
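A minimal NumPy sketch of one GRU step (the interpolation convention h(t) = (1-z)·h(t-1) + z·h̃(t) follows Colah's tutorial cited above; names and shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx)                 # UPDATE gate: keep old state vs take new
    r = sigmoid(Wr @ hx)                 # RESET gate: how much of the past to use
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde  # interpolated new state
```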
Outline
• Standard Recurrent Neural Networks
• Training RNN: BackPropagation Through Time
• LSTM and GRU
• Applications of RNNs
Typical usages of RNNs
[Figure: typical RNN input/output configurations, ranging from a single input/output up to Sequence and Sequence-to-Sequence processing]
Combining RNN with CNN
Feed into the RNN the features from the last convolutional layer of a CNN; for example, for image captioning (see the sketch below).
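A hedged PyTorch sketch of this wiring (the backbone, sizes and module names are all illustrative stand-ins, not the architecture used in the slides):

```python
import torch
from torch import nn

class CaptionerSketch(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=512, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(                  # stand-in for a pretrained CNN backbone
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.init_h = nn.Linear(feat_dim, hidden)  # image features -> initial hidden state
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.RNN(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image, caption_tokens):
        feats = self.cnn(image).flatten(1)                # (B, feat_dim) conv features
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)  # condition the RNN on the image
        seq, _ = self.rnn(self.embed(caption_tokens), h0)
        return self.out(seq)                              # next-word scores at each step
```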
Deep RNNs
Several RNN layers stacked (like the hidden layers of an MLP), as in the snippet below.
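In PyTorch, for instance, stacking is a single flag (illustrative sizes):

```python
import torch.nn as nn

# Two stacked recurrent layers: the output sequence of layer 1
# is the input sequence of layer 2, like hidden layers in an MLP
deep_rnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)
```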
Bi-directional RNNs
(e.g. for offline classification of a sequence of words)
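A short sketch of the corresponding flag in PyTorch (illustrative sizes): one RNN reads the sequence forwards, a second backwards, and their hidden states are concatenated at each step, which is why this only applies offline, when the whole sequence is available.

```python
import torch
import torch.nn as nn

birnn = nn.LSTM(input_size=32, hidden_size=64, bidirectional=True, batch_first=True)
out, _ = birnn(torch.randn(4, 10, 32))  # out.shape == (4, 10, 128): forward + backward states
```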
Encoder-decoder RNN
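A hedged sketch of the encoder-decoder pattern (all names and sizes are illustrative): the encoder compresses the input sequence into its final state, which then conditions the decoder that generates the output sequence step by step.

```python
import torch
from torch import nn

encoder = nn.GRU(input_size=16, hidden_size=64, batch_first=True)
decoder = nn.GRU(input_size=16, hidden_size=64, batch_first=True)
readout = nn.Linear(64, 16)

src = torch.randn(2, 12, 16)              # source sequence (batch of 2, 12 steps)
_, h = encoder(src)                       # final state = fixed-size summary of the input

step = torch.zeros(2, 1, 16)              # decoder "start" input
outputs = []
for _ in range(8):                        # generate 8 output steps
    o, h = decoder(step, h)               # decoder conditioned on the running state
    step = readout(o)                     # feed the prediction back as next input
    outputs.append(step)
```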
Applications of RNN/LSTM
Wherever data is intrinsically SEQUENTIAL:
• Speech recognition
• Natural Language Processing (NLP)
– Machine Translation
– Image caption generation
• Gesture recognition
• Music generation
• Potentially any kind of time series!
Summary and perspectives on Recurrent Neural Networks
• For SEQUENTIAL data (speech, text, …, gestures, …)
• Impressive results in Natural Language Processing (in particular automated real-time translation)
• Training of standard RNNs can be tricky (vanishing gradients…)
• LSTM / GRU are now used more than standard RNNs