Page 1

DEEP LEARNING
PART TWO - CONVOLUTIONAL & RECURRENT NETWORKS

CS/CNS/EE 155 - MACHINE LEARNING & DATA MINING

Page 2

REVIEW

Page 3

[Figure: 2D points x = (x1, x2), labeled y = 0 or y = 1; the two classes require a non-linear decision boundary]

we want to learn non-linear decision boundaries

we can do this by composing linear decision boundaries

Page 4

[Figure: a feed-forward network: input features x1, x2, …, xM (plus bias 1) → weights → sums → non-linearities → hidden features (plus bias 1) → weights → sums → non-linearities]

neural networks formalize a method for building these composed functions

deep networks are universal function approximators
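
a minimal numpy sketch of this composition: each layer measures distances to a set of hyperplanes and transforms them non-linearly (the weights here are random placeholders, not learned values)

```python
import numpy as np

def layer(x, W, b):
    """One layer: distances to a set of hyperplanes, then a non-linearity."""
    return np.tanh(x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))                    # 4 points with features (x1, x2)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)  # first set of linear boundaries
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)  # composes the first layer's outputs
y = layer(layer(x, W1, b1), W2, b2)            # a non-linear decision function
```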

Page 5

each artificial neuron defines a (hyper)plane:

$0 = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_M x_M$

summation: distance from plane to input

non-linearity: convert distance into non-linear field

[Figure: a plane, the signed distance field around it, and an example of that distance non-linearly transformed]

the dot product with the (unit-normalized) weight vector gives the signed shortest distance between a point and the plane

a geometric interpretation
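
a tiny numeric sketch of this interpretation (the values are illustrative)

```python
import numpy as np

# a neuron's weights define the plane 0 = w0 + w . x; with a unit-norm w,
# the pre-activation is the signed shortest distance from x to that plane
w0, w = -1.0, np.array([0.6, 0.8])   # ||w|| = 1 (illustrative values)
x = np.array([2.0, 1.0])

distance = w0 + w @ x                # signed distance: 1.0
field = np.tanh(distance)            # non-linearity turns it into a smooth field
```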

Page 6

1. cut the space up with hyperplanes
2. evaluate distances of points to hyperplanes
3. non-linearly transform these distances to get new points

repeat until data have been linearized

Page 7

[Figure: three networks, each with inputs x1, x2, …, xM plus biases: one answering “Alexa, what is the weather going to be like today?”, one outputting torque commands, one recognizing “cat”]

Page 8

[Figure: example domains: images, audio & text (“hello”), virtual/physical control tasks]

to scale deep networks to these domains, we often need to use inductive biases

today

Page 9

INDUCTIVE BIASES

Page 10

ultimately, we care about solving tasks

object recognition (Krizhevsky et al., 2012) · object detection (Ren et al., 2016) · object segmentation (He et al., 2017) · text translation (Wu et al., 2016) · text question answering (Weston et al., 2015)

Page 11

ultimately, we care about solving tasks

atari (Mnih et al., 2013) · object manipulation (Levine, Finn, et al., 2016) · go (Silver, Huang, et al., 2016) · autonomous driving (Waymo)

Page 12

ultimately, we care about solving tasks

survival & reproduction

cellular signaling, maintenance

organ function

navigation

hunting, foraging

muscle actuation

vision

social/mating behavior

Page 13

two components for solving any task

priors and learning

Page 14

priors: knowledge assumed beforehand

[Figure: priors can enter through the architecture, the activities/outputs, parameter constraints (e.g. an L2 penalty on w1, w2), and parameter values (model 1 vs. model 2)]

Page 15

learning: knowledge extracted from data

[Figure: loss/error as a function of a weight; the gradient yields an improvement, e.g. to a learned filter]

Page 16

it’s a balance!

strong priors, minimal learning
• fast/easy to learn and deploy
• may be too rigid, unadaptable

weak priors, much learning
• slow/difficult to learn and deploy
• flexible, adaptable

for a desired level of performance on a task…

choose priors and collect data to obtain a model that achieves that performance in the minimal amount of time

[Figure: data feeding a network with inputs x1, x2, …, xM plus biases]

Page 17

priors are essential: we always have to make some assumptions, since we cannot integrate over all possible models

we are all initialized from evolutionary priors

livescience.com

humans seem to have a larger capacity for learning than other organisms

Page 18

up until now, all of our machines have been purely based on priors

these machines can perform tasks that are impossible to hand-design

for the first time in history, we can now create machines that also learn

…but they are mostly still based on priors!

Kormushev et al.

Page 19

we can exploit known structure in spatial and sequential data to impose priors (i.e. inductive biases) on our models

this allows us to learn models in complex, high-dimensional domains while limiting the number of parameters and data examples

[Figure: spatial axes x, y and a time axis t]

inductive: inferring general laws from examples

Page 20

CONVOLUTIONAL NEURAL NETWORKS

Page 21

task: object recognition

Yisong

discriminative mapping from image to object identity

Page 22

images contain all of the information about the binary latent variable Yisong/Not Yisong

extract the relevant information about this latent variable to form a conditional probability

inference: form the conditional probability p(Yisong | image)

notice that images also contain other nuisance information, such as pose, lighting, background, etc.

want to be invariant to nuisance information

Page 23

data, label collection

the mapping is too difficult to define by hand; we need to learn it from data

[Figure: labeled examples: Yisong / Not Yisong]

then, we need to choose a model architecture…

[Figure: network with inputs x1, x2, …, xM plus biases]

Page 24

standard neural networks require a fixed input size…

clearer patterns, but more parameters

fewer parameters, but unclear patterns

input size → number of input values:
280 x 280 x 3 → 235,200
205 x 205 x 3 → 126,075
150 x 150 x 3 → 67,500
100 x 100 x 3 → 30,000
75 x 75 x 3 → 16,875
50 x 50 x 3 → 7,500
35 x 35 x 3 → 3,675
25 x 25 x 3 → 1,875
15 x 15 x 3 → 675

Page 25

convert to grayscale…

clearer patterns, but more parameters

fewer parameters, but unclear patterns

input size → number of input values:
280 x 280 x 1 → 78,400
205 x 205 x 1 → 42,025
150 x 150 x 1 → 22,500
100 x 100 x 1 → 10,000
75 x 75 x 1 → 5,625
50 x 50 x 1 → 2,500
35 x 35 x 1 → 1,225
25 x 25 x 1 → 625
15 x 15 x 1 → 225

Page 26

reshape: a 100 x 100 x 1 image becomes a 10,000 x 1 input vector

Page 27

how many units do we need?

[Figure: network over a 10,000-dimensional INPUT, with units x1, x2, …, xM plus bias]

# units × 10,000 inputs = # weights:
1 → 10,000
10 → 100,000
100 → 1,000,000
1,000 → 10,000,000
10,000 → 100,000,000
100,000 → 1,000,000,000
1,000,000 → 10,000,000,000

if we want to recognize even a few basic patterns at each location, the number of parameters will explode!

Page 28

to reduce the amount of learning, we can introduce inductive biases

exploit the spatial structure of image data

Page 29

locality: nearby areas tend to contain stronger patterns

nearby patches tend to share characteristics and are combined in particular ways

nearby pixels tend to be similar and vary in particular ways

nearby regions tend to be found in particular arrangements

Page 30

translation invariance: relative (rather than absolute) positions are relevant

Yisong’s identity is independent of the absolute location of his pixels

Page 31

let’s convert locality and translation invariance into inductive biases

locality: nearby areas tend to contain stronger patterns → inputs can be restricted to regions, maintaining spatial ordering

translation invariance: relative positions are relevant → the same filters (same weights) can be applied throughout the input

Page 32

these are the inductive biases of convolutional neural networks

special case of standard (fully-connected) neural networks

[Figure: a fully-connected layer (different weights everywhere) vs. a convolutional layer (same weights, restricted connectivity) → weight savings]

these inductive biases make the number of weights independent of the input size!
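
a back-of-the-envelope comparison, with illustrative sizes (not from the slides)

```python
# illustrative sizes: 100 x 100 grayscale input, 16 output channels/units
h, w, c_in, c_out, k = 100, 100, 1, 16, 3

fc_weights = (h * w * c_in) * (h * w * c_out)  # every input to every output unit
conv_weights = k * k * c_in * c_out            # one 3x3 filter per output channel

print(fc_weights)    # 1,600,000,000
print(conv_weights)  # 144 - independent of the input size
```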

Page 33

convolve a set of filters with the input

filter weights:

http://deeplearning.net/software/theano_versions/dev/tutorial/conv_arithmetic.html

take inner (dot) product of filter and each input location

measures the degree to which the filter's feature is present at each input location

feature map
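
a naive sketch of this operation in numpy (the image and filter values are illustrative)

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' convolution (cross-correlation, as is standard in deep learning):
    the dot product of the filter with every input location."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    fmap = np.zeros((oh, ow))                     # the feature map
    for i in range(oh):
        for j in range(ow):
            fmap[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return fmap

image = np.arange(25.0).reshape(5, 5)
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)  # a simple hand-made filter
print(conv2d(image, vertical_edge).shape)         # (3, 3)
```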

Page 34

use padding to preserve spatial size

http://deeplearning.net/software/theano_versions/dev/tutorial/conv_arithmetic.html

typically add zeros around the perimeter

Page 35

use stride to downsample the input

http://deeplearning.net/software/theano_versions/dev/tutorial/conv_arithmetic.html

only compute output at some integer interval

stride = 2
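
padding and stride together determine the output size; a small sketch of the standard formula

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Spatial output size per dimension: input n, filter k."""
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(5, 3))                       # 3: valid convolution
print(conv_output_size(5, 3, padding=1))            # 5: padding preserves size
print(conv_output_size(5, 3, padding=1, stride=2))  # 3: stride downsamples
```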

Page 36

filters are applied to all input channels

3 x 3 x 3 filter tensor

each filter results in a new output channel

[Figure: input channels 1, 2, 3]

Page 37

can be applied with padding and stride

[Figure: 2 x 2 max pooling of a 4 x 4 feature map, producing [[5, 8], [2, 4]]]

pooling locally aggregates values in each feature map

predefined operation: maximum, average, etc.

downsampling and invariance
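
a minimal max-pooling sketch; the example values are chosen to match the pooled output shown in the figure

```python
import numpy as np

def max_pool(fmap, size=2):
    """Aggregate each non-overlapping size x size block by its maximum."""
    h, w = fmap.shape
    blocks = fmap[:h - h % size, :w - w % size].reshape(
        h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fmap = np.array([[5, 0, 0, 0],
                 [2, 1, 1, 8],
                 [1, 2, 4, 3],
                 [2, 1, 4, 2]])
print(max_pool(fmap))   # [[5 8]
                        #  [2 4]]
```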

Page 38

convolutional pop-quiz

input feature map: 5 x 5 x 16
filters: 36 filters of size 3 x 3 x ?
output feature map: ? x ? x 36

if we use stride=1 and padding=0, then…

how many filters are there?
36 (the same as the number of output channels)

what size is each filter?
3 x 3 x 16 (filter channels match the number of input channels)

what is the output feature map size?
3 x 3 x 36 (only valid convolutions: (5 - 3)/1 + 1 = 3 per spatial dimension)

Page 39

natural image datasets

Caltech-101: 101 classes, 9,146 images
Caltech-256: 256 classes, 30,607 images
CIFAR-10: 10 classes, 60,000 images
CIFAR-100: 100 classes, 60,000 images
ImageNet (competition): 1,000 classes, 1.2 million images
ImageNet (full): 21,841 classes, 14 million images

Page 40

convolutional models for classification

LeNet · AlexNet · VGG · GoogLeNet · ResNet · Inception v4 · DenseNet · ResNeXt

Page 41

convolutional models for detection, segmentation, etc.

R-CNN · Fast R-CNN · Faster R-CNN · FCN · Hourglass · U-Net · YOLO · Mask R-CNN

Page 42

https://www.youtube.com/watch?v=OOT3UIXZztE

Page 43

https://www.youtube.com/watch?v=pW6nZXeWlGM

Page 44

convolutional models for image generation

DC-GAN · convolutional VAE · Pixel CNN

Page 45

https://www.youtube.com/watch?v=XOxxPcy5Gr4

Pages 46-54

filter visualization

Zeiler, 2013

Page 55

filter visualization

https://distill.pub/2017/feature-visualization/

Page 56

convolutions applied to sequences

WaveNet

https://deepmind.com/blog/wavenet-launches-google-assistant/

Page 57

convolutions in non-Euclidean spaces

SplineCNN (Fey et al., 2017) · Graph Convolutional Network (Kipf & Welling, 2016)

Page 58

recapitulation

we can exploit spatial structure to impose inductive biases on the model

locality

translation invariance

this limits the number of parameters required, reducing flexibility in reasonable ways

can then scale these models to complex data sets to perform difficult tasks

ImageNet

recognition · detection · segmentation · generation

Page 59

RECURRENT NEURAL NETWORKS

Page 60

task: speech recognition

mapping from input waveform to sequence of characters

Graves & Jaitly, 2014

Page 61

the input waveform contains all of the information about the corresponding transcribed text

form a discriminative mapping: p(text sequence | waveform)

again, there is nuisance information in the waveform coming from the speaker’s voice characteristics, volume, background, etc.

Page 62

the mapping is too difficult to define by hand; we need to learn it from data

data, label collection

Audio Transcriptions

“OK Google…”

“Hey Siri…”

“Yo Alexa…”

[Figure: network with inputs x1, x2, …, xM plus biases]

but how do we define the network architecture?

Page 63

problem: inputs can be of variable size

standard neural networks can only handle data of a fixed input size

[Figure: a fixed-size network cannot accept a variable-length input]

Page 64

wait, but convolutional networks can handle variable input sizes…can’t we just use them?

yes, we could

however, this relies on a fixed input window size

we may be able to exploit additional structure in sequence data to impose better inductive biases

Page 65

the structure of sequence data

[Figure: a sequence along the time axis t]

sequence data also tends to obey

locality: nearby regions tend to form stronger patterns

translation invariance: patterns are relative rather than absolute

but has a single axis on which extended patterns occur

Page 66

to mirror the sequential structure of the data, we can process the data sequentially

maintain an internal representation during processing

potentially infinite effective input window · fixed number of parameters

each set of colored arrows denotes shared weights

[Figure: unrolled RNN with INPUT, HIDDEN, and OUTPUT layers across time]

Page 67

a recurrent neural network (RNN) can be expressed as

Hidden State

$h_t = \sigma(W_h^\top [h_{t-1}, x_t])$

Output

$y_t = \sigma(W_y^\top h_t)$
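
a minimal numpy sketch of this recurrence (biases omitted, sizes illustrative); the same weights are reused at every step

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_step(h_prev, x, W_h, W_y):
    """One step of the recurrence above (tanh hidden non-linearity)."""
    h = np.tanh(W_h.T @ np.concatenate([h_prev, x]))
    y = sigmoid(W_y.T @ h)
    return h, y

rng = np.random.default_rng(0)
H, X, Y = 8, 3, 2                                # illustrative sizes
W_h, W_y = rng.normal(size=(H + X, H)), rng.normal(size=(H, Y))

h = np.zeros(H)
for x in rng.normal(size=(5, X)):                # a length-5 input sequence
    h, y = rnn_step(h, x, W_h, W_y)
```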

Page 68

basic recurrent networks are also a special case of standard neural networks with skip connections and shared weights

[Figure: unrolled network, depth = number of steps; matching colors mark the same (shared) weights]

Page 69

therefore, we can use standard backpropagation to train, resulting in backpropagation through time (BPTT)

[Figure: the gradient propagating backward through the unrolled steps]

Page 70

primary difficulty of training RNNs involves propagating information over long horizons

e.g. input at one step is predictive of output at much later step

learning extended sequential dependencies requires propagating gradients over long horizons

• vanishing / exploding gradients
• large memory/computational footprint
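
a toy illustration of the first point: gradients over T steps scale like a product of T per-step factors, so factors even slightly below or above 1 vanish or explode

```python
# per-step gradient factors of 0.9 and 1.1 over a 50-step horizon
T = 50
print(0.9 ** T)   # ~0.005  (vanishing)
print(1.1 ** T)   # ~117    (exploding)
```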

Page 71

naïve attempt to fix information propagation issue

add skip connections across steps

information, gradients can propagate more easily

• increases computation
• must set limit on window size

but…

Page 72

add trainable memory to the network: read from and write to a “cell” state

Long Short-Term Memory (LSTM)

Input Gate: $i_t = \sigma(W_i^\top [h_{t-1}, x_t])$

Forget Gate: $f_t = \sigma(W_f^\top [h_{t-1}, x_t])$

Output Gate: $o_t = \sigma(W_o^\top [h_{t-1}, x_t])$

Cell State: $c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c^\top [h_{t-1}, x_t])$

Hidden State: $h_t = o_t \odot \tanh(c_t)$

Output: $y_t = \sigma(W_y^\top h_t)$
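
a minimal sketch of one LSTM step following the equations above (biases omitted, as on the slide)

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(h_prev, c_prev, x, W_i, W_f, W_o, W_c):
    """One LSTM step: gated read/write of the cell state."""
    z = np.concatenate([h_prev, x])
    i = sigmoid(W_i.T @ z)                    # input gate: what to write
    f = sigmoid(W_f.T @ z)                    # forget gate: what to erase
    o = sigmoid(W_o.T @ z)                    # output gate: what to read out
    c = f * c_prev + i * np.tanh(W_c.T @ z)   # cell state update
    h = o * np.tanh(c)                        # hidden state
    return h, c
```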

Pages 73-80

(the same LSTM slide repeated, highlighting the forget gate, input gate, output gate, cell state, and output in turn)

Page 81

memory networks

Hopfield Network (Hopfield, 1982)
Gated Recurrent Unit (GRU) (Cho et al., 2014)
Neural Turing Machine (NTM) (Graves et al., 2014)
Memory Networks (MemNN) (Weston et al., 2015)
Differentiable Neural Computer (DNC) (Graves, Wayne, et al., 2016)

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 82

bi-directional RNNs

[Figure: unrolled RNN over time t with INPUT, HIDDEN, and OUTPUT]

up until now, we have considered the output of the network to only be a function of the preceding inputs (filtering)

but future inputs may help in determining this output (smoothing)

can we make the output a function of both the future and the past inputs?

Page 83

bi-directional RNNs

[Figure: two hidden chains, one running forward and one backward over the input, combined to produce each output]
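
a sketch of the idea: run one recurrence forward and one backward, then combine the two hidden states at each step (step_f and step_b stand in for any single-step cell, e.g. rnn_step above)

```python
import numpy as np

def bidirectional(xs, h0_f, h0_b, step_f, step_b):
    """Each combined state depends on both past and future inputs."""
    hs_f, h = [], h0_f
    for x in xs:                       # past -> future
        h = step_f(h, x)
        hs_f.append(h)
    hs_b, h = [], h0_b
    for x in reversed(xs):             # future -> past
        h = step_b(h, x)
        hs_b.append(h)
    hs_b.reverse()
    return [np.concatenate([f, b]) for f, b in zip(hs_f, hs_b)]
```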

Page 84

audio classification · handwriting classification · text classification · others

Graves et al., 2013; Eyolfsdottir et al., 2017

Page 85

tons of options!

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Page 86

deep recurrent neural networks

Page 87

auto-regressive generative modeling

output becomes next input
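
a minimal generation loop under that idea (step_fn is a hypothetical single-step model returning a hidden state and a categorical distribution over tokens)

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(h, x, steps, step_fn, vocab_size):
    """Sample a token at each step and feed it back in as the next input."""
    tokens = []
    for _ in range(steps):
        h, probs = step_fn(h, x)            # probs: a distribution over tokens
        token = rng.choice(vocab_size, p=probs)
        tokens.append(token)
        x = np.eye(vocab_size)[token]       # output becomes next input
    return tokens
```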

Page 88

auto-regressive generative language modeling

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Page 89

Pixel RNN uses recurrent networks to perform auto-regressive image generation

condition the generation of each pixel on a sequence of past pixels

[Figure: context pixels → generated samples]

van den Oord et al., 2016

Page 90

MIDI music generation

Page 91

recapitulation

we can exploit sequential structure to impose inductive biases on the model

this limits the number of parameters required, reducing flexibility in reasonable ways

[Figure: unrolled RNN over time t]

can then scale these models to complex data sets to perform difficult tasks

Page 92

RECAP

Page 93

we used additional priors (inductive biases) to scale deep networks up to handle spatial and sequential data

recapitulation

without these priors, we would need more parameters and data

Page 94

we live in a spatiotemporal world

we are constantly getting sequences of spatial sensory inputs

embodied intelligent machines need to learn from spatial and temporal patterns

Berkeley AI Research

Page 95

CNNs and RNNs are building blocks for machines that can use spatiotemporal data to solve tasks

Jaderberg, Mnih, Czarnecki et al., 2016

Page 96

Jaderberg, Mnih, Czarnecki et al., 2016

Page 97