Deep Learning Basics Lecture 9: Recurrent Neural Networks
Princeton University COS 495
Instructor: Yingyu Liang
Introduction
Recurrent neural networks
• Dates back to (Rumelhart et al., 1986)
• A family of neural networks for handling sequential data, which involves variable-length inputs or outputs
• Especially useful for natural language processing (NLP)
Sequential data
• Each data point: a sequence of vectors $x^{(t)}$, for $1 \le t \le \tau$
• Batch data: many sequences with different lengths $\tau$
• Label: can be a scalar, a vector, or even a sequence
• Examples:
• Sentiment analysis
• Machine translation
Example: machine translation
Figure from: devblogs.nvidia.com
More complicated sequential data
• Data point: a two-dimensional sequence, such as an image
• Label: a different type of sequence, such as a text sentence
• Example: image captioning
Image captioning
Figure from the paper “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, by Justin Johnson, Andrej Karpathy, Li Fei-Fei
Computational graphs
A typical dynamical system
$s^{(t+1)} = f(s^{(t)}; \theta)$
Figure from Deep Learning, Goodfellow, Bengio and Courville
A system driven by external data
$s^{(t+1)} = f(s^{(t)}, x^{(t+1)}; \theta)$
Figure from Deep Learning, Goodfellow, Bengio and Courville
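As a concrete illustration (a minimal numpy sketch, not from the slides; the tanh transition function and the parameter shapes are arbitrary choices), unfolding such a system applies the same $f$ with the same $\theta$ at every step:

```python
import numpy as np

# One step of the dynamical system: new state from old state and external input.
# The tanh transition and the parameter shapes are arbitrary illustrative choices.
def f(s, x, theta):
    W, U = theta
    return np.tanh(W @ s + U @ x)

rng = np.random.default_rng(0)
theta = (rng.normal(size=(3, 3)), rng.normal(size=(3, 2)))  # theta = (W, U)

s = np.zeros(3)                               # initial state s^(0)
xs = [rng.normal(size=2) for _ in range(4)]   # external inputs x^(1), ..., x^(4)
for x in xs:                                  # unfolding: same f, same theta at every step
    s = f(s, x, theta)
```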
Compact view
$s^{(t+1)} = f(s^{(t)}, x^{(t+1)}; \theta)$
Figure from Deep Learning, Goodfellow, Bengio and Courville
Key: the same $f$ and $\theta$ are used for all time steps
Square: one-step time delay
Recurrent neural networks (RNN)
Recurrent neural networks
• Use the same computational function and parameters across different time steps of the sequence
• Each time step: takes the input entry and the previous hidden state to compute the output entry
• Loss: typically computed every time step
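To make the per-step computation concrete, here is a minimal numpy sketch of the forward pass (a hedged reconstruction using the standard tanh-state/softmax-output parameterization; the parameter names $U, W, V, b, c$ follow Goodfellow et al., and all shapes are up to the caller):

```python
import numpy as np

def rnn_forward(xs, ys, U, W, V, b, c):
    """Forward pass of a vanilla RNN: at each step, combine the input entry x_t
    with the previous hidden state, emit an output distribution, and accumulate
    a per-step cross-entropy loss (ys holds integer class labels)."""
    s = np.zeros(W.shape[0])                  # initial hidden state
    total_loss, states, probs = 0.0, [], []
    for x_t, y_t in zip(xs, ys):
        s = np.tanh(U @ x_t + W @ s + b)      # new hidden state from input + old state
        o = V @ s + c                         # output logits
        p = np.exp(o - o.max()); p /= p.sum() # softmax
        total_loss += -np.log(p[y_t])         # loss computed at every time step
        states.append(s); probs.append(p)
    return total_loss, states, probs
```

Note that the loss is accumulated at every time step, matching the last bullet above.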
Recurrent neural networks
Figure from Deep Learning, by Goodfellow, Bengio and Courville
(Figure annotations, per time step: input, state, output, loss, label.)
Recurrent neural networks
Figure from Deep Learning, Goodfellow, Bengio and Courville
Math formula:
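The formula appeared as an image in the original slide; presumably it is the standard RNN parameterization from Goodfellow et al., written here in this lecture's notation (state $s$, weights $U, V, W$, biases $b, c$):

$$a^{(t)} = b + W s^{(t-1)} + U x^{(t)}, \quad s^{(t)} = \tanh(a^{(t)}), \quad o^{(t)} = c + V s^{(t)}, \quad \hat{y}^{(t)} = \mathrm{softmax}(o^{(t)}).$$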
Advantage
• Hidden state: a lossy summary of the past
• Shared functions and parameters: greatly reduce model capacity, which helps generalization
• Explicitly use the prior knowledge that sequential data can be processed in the same way at different time steps (e.g., in NLP)
• Yet still powerful (actually universal): any function computable by a Turing machine can be computed by such a recurrent network of finite size (see, e.g., Siegelmann and Sontag (1995))
Training RNN
• Principle: unfold the computational graph, and use backpropagation
• Called back-propagation through time (BPTT) algorithm
• Can then apply any general-purpose gradient-based techniques
• Conceptually: first compute the gradients of the internal nodes, then compute the gradients of the parameters
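Continuing the forward-pass sketch from earlier (again a hedged illustration, not the lecture's own code), BPTT walks the unrolled graph in reverse, computing the gradients at the internal nodes $o^{(t)}$ and $s^{(t)}$ first and accumulating the parameter gradients along the way:

```python
import numpy as np

def rnn_backward(xs, ys, states, probs, U, W, V):
    """BPTT for the forward pass sketched earlier: walk the unrolled graph in
    reverse, computing gradients at internal nodes (o_t, s_t) first, then
    accumulating the parameter gradients."""
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros(W.shape[0]), np.zeros(V.shape[0])
    ds_next = np.zeros(W.shape[0])                # gradient flowing back from step t+1
    for t in reversed(range(len(xs))):
        do = probs[t].copy(); do[ys[t]] -= 1      # dL/do_t for softmax + cross-entropy
        dV += np.outer(do, states[t]); dc += do
        ds = V.T @ do + ds_next                   # dL/ds_t: output path + future path
        da = (1.0 - states[t] ** 2) * ds          # back through tanh
        s_prev = states[t - 1] if t > 0 else np.zeros_like(states[0])
        dU += np.outer(da, xs[t]); dW += np.outer(da, s_prev); db += da
        ds_next = W.T @ da                        # hand the gradient to step t-1
    return dU, dW, dV, db, dc
```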
Recurrent neural networks
Figure from Deep Learning, Goodfellow, Bengio and Courville
Math formula: (same recurrence as above)
Recurrent neural networks
Figure from Deep Learning, Goodfellow, Bengio and Courville
Gradient at $L^{(t)}$: (the total loss is the sum of the losses at different time steps)
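The equation itself is not preserved in this transcription; following the derivation in Goodfellow et al., with total loss $L = \sum_t L^{(t)}$:

$$\frac{\partial L}{\partial L^{(t)}} = 1.$$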
Recurrent neural networks
Figure from Deep Learning, Goodfellow, Bengio and Courville
Gradient at $o^{(t)}$:
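Reconstructed from Goodfellow et al., assuming softmax outputs $\hat{y}^{(t)} = \mathrm{softmax}(o^{(t)})$ and cross-entropy loss with true label $y^{(t)}$:

$$(\nabla_{o^{(t)}} L)_i = \frac{\partial L}{\partial L^{(t)}} \frac{\partial L^{(t)}}{\partial o_i^{(t)}} = \hat{y}_i^{(t)} - \mathbf{1}_{\{i = y^{(t)}\}}.$$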
Recurrent neural networks
Figure from Deep Learning, Goodfellow, Bengio and Courville
Gradient at $s^{(\tau)}$:
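At the final step $\tau$, the state affects the loss only through the output $o^{(\tau)}$; reconstructed from Goodfellow et al.:

$$\nabla_{s^{(\tau)}} L = V^{\top} \nabla_{o^{(\tau)}} L.$$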
Recurrent neural networks
Figure from Deep Learning, Goodfellow, Bengio and Courville
Gradient at $s^{(t)}$:
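For $t < \tau$, the state $s^{(t)}$ affects the loss both through $o^{(t)}$ and through the next state $s^{(t+1)}$, so gradients are back-propagated from both paths (reconstructed from Goodfellow et al., for tanh units):

$$\nabla_{s^{(t)}} L = \Big(\frac{\partial s^{(t+1)}}{\partial s^{(t)}}\Big)^{\top} \nabla_{s^{(t+1)}} L + V^{\top} \nabla_{o^{(t)}} L = W^{\top} \mathrm{diag}\big(1 - (s^{(t+1)})^2\big)\, \nabla_{s^{(t+1)}} L + V^{\top} \nabla_{o^{(t)}} L.$$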
Recurrent neural networks
Figure from Deep Learning, Goodfellow, Bengio and Courville
Gradient at parameter $V$:
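Once the gradients at the internal nodes are available, the parameter gradients follow; reconstructed from Goodfellow et al.:

$$\nabla_{V} L = \sum_t \big(\nabla_{o^{(t)}} L\big)\, \big(s^{(t)}\big)^{\top}.$$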
Variants of RNN
RNN
• Use the same computational function and parameters across different time steps of the sequence
• Each time step: takes the input entry and the previous hidden state to compute the output entry
• Loss: typically computed every time step
• Many variants:
• Information about the past can be in many other forms
• Only output at the end of the sequence
Figure from Deep Learning, Goodfellow, Bengio and Courville
Example: use the output at the previous step
Figure from Deep Learning, Goodfellow, Bengio and Courville
Example: only output at the end
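For the output-only-at-the-end variant, the change to the earlier forward-pass sketch is small (again a hedged illustration, not the lecture's code): keep updating the state at every step, but emit a single output once the whole sequence has been read.

```python
import numpy as np

def rnn_encode(xs, U, W, V, b, c):
    """Variant: read the whole sequence, emit a single output only at the end."""
    s = np.zeros(W.shape[0])
    for x_t in xs:
        s = np.tanh(U @ x_t + W @ s + b)  # keep updating the state at every step
    return V @ s + c                      # one output for the entire sequence
```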
Bidirectional RNNs
• Many applications: output at time $t$ may depend on the whole input sequence
• Example in speech recognition: correct interpretation of the current sound may depend on the next few phonemes, potentially even the next few words
• Bidirectional RNNs are introduced to address this
BiRNNs
Figure from Deep Learning, Goodfellow, Bengio and Courville
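A minimal sketch of the idea in the figure (hypothetical parameter names; a real implementation would typically share more code between directions): run one RNN left-to-right and another right-to-left, then concatenate their states, so the representation at step $t$ depends on the entire input:

```python
import numpy as np

def birnn_forward(xs, fwd_params, bwd_params):
    """Bidirectional RNN: one pass left-to-right and one right-to-left; the
    concatenated state at step t then summarizes the whole input sequence."""
    def run(seq, params):
        U, W, b = params
        s, states = np.zeros(W.shape[0]), []
        for x_t in seq:
            s = np.tanh(U @ x_t + W @ s + b)
            states.append(s)
        return states
    f_states = run(xs, fwd_params)              # summaries of x_1 ... x_t
    b_states = run(xs[::-1], bwd_params)[::-1]  # summaries of x_t ... x_tau
    return [np.concatenate([f, g]) for f, g in zip(f_states, b_states)]
```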
Encoder-decoder RNNs
• RNNs can map a sequence to one vector, or to a sequence of the same length
• What about mapping a sequence to a sequence of a different length?
• Examples: speech recognition, machine translation, question answering, etc.
Figure from Deep Learning, Goodfellow, Bengio and Courville
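To close, a minimal sketch of the encoder-decoder idea (hedged: real sequence-to-sequence decoders typically also feed the previous output token back in as input, which is omitted here; all names are illustrative): the encoder compresses the input sequence into a fixed-size context vector, and the decoder unrolls for a possibly different number of steps conditioned on that context.

```python
import numpy as np

def encoder_decoder(xs, n_steps, enc_params, dec_params):
    """Sketch of a sequence-to-sequence RNN: the encoder maps the input sequence
    to one fixed-size context vector; the decoder unrolls for a possibly
    different number of steps conditioned on that context."""
    Ue, We, be = enc_params
    s = np.zeros(We.shape[0])
    for x_t in xs:                        # encoder: read the whole input sequence
        s = np.tanh(Ue @ x_t + We @ s + be)
    context = s                           # fixed-size summary of the input

    Wd, Cd, Vd, bd = dec_params
    h, outputs = np.zeros(Wd.shape[0]), []
    for _ in range(n_steps):              # decoder: generate the output sequence
        h = np.tanh(Wd @ h + Cd @ context + bd)
        outputs.append(Vd @ h)            # output logits at each decoding step
    return outputs
```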