
Deep Learning tutorial

Nicolas Le Roux

Adam Coates, Yoshua Bengio, Yann LeCun, Andrew Ng

DL/UFL workshop

16/12/11


Disclaimer

Many of the slides in this tutorial are taken from presentations by:

Honglak Lee

Yann LeCun

Geoff Hinton

Andrew Ng

Yoshua Bengio

Outline

1 Introduction
What is deep?
Training problems

2 Unsupervised pretraining
Pretraining
The denoising autoencoder
Predictive sparse coding
The deep belief network

3 Conclusion

What does deep mean?

Properties of deep models

Many composed non-linearities rather than a single non-linearity

More constrained space of transformations

Deep model = prior on the type of transformations

May need fewer computational units for the same function



Features for face recognition

Image courtesy of Honglak Lee

“First-level” features are not task-dependent

“First-level” features are easy to learn

“Higher-level” features are easier to learn given the low-level features

A deep neural network

H1 = g(W1X)

H2 = g(W2H1)

Y = W3H2

g = sigmoid or tanh

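As a concrete illustration of the equations above, here is a minimal NumPy sketch of the forward pass H1 = g(W1 X), H2 = g(W2 H1), Y = W3 H2. The layer sizes, the omission of bias terms, and the small random initialization are illustrative assumptions, not part of the tutorial.

```python
import numpy as np

def forward(X, W1, W2, W3, g=np.tanh):
    """Forward pass of the network above: H1 = g(W1 X), H2 = g(W2 H1), Y = W3 H2.
    X holds one example per column; bias terms are omitted for brevity."""
    H1 = g(W1 @ X)    # first hidden representation
    H2 = g(W2 @ H1)   # second hidden representation
    Y = W3 @ H2       # linear output layer
    return Y

# Illustrative sizes: 784-dimensional inputs, two hidden layers of 256 units, 10 outputs.
rng = np.random.default_rng(0)
W1 = 0.01 * rng.standard_normal((256, 784))
W2 = 0.01 * rng.standard_normal((256, 256))
W3 = 0.01 * rng.standard_normal((10, 256))
X = rng.standard_normal((784, 5))       # a batch of 5 inputs
print(forward(X, W1, W2, W3).shape)     # (10, 5)
```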

Training issues

W3 is a lot easier to learn than W2 and W1

Compare

Given set of features, find combination

Given combination, find set of features


Practical results

Parameters randomly initialized

They do not change much (bad local minima)

A random transformation is not very good

Bengio et al., 2007


Practical results - TIMIT

Mohamed et al., Deep Belief Networks for phone recognition



The gist of pretraining

1 We want (potentially) deep networks

2 Training the lower layers is hard.

Solution: optimize each layer without relying on the layers above.

The gist of pretraining

1 Learn W1 first

2 Given W1, learn W2

3 Given W1 and W2, learn W3


Greedy learning

1 Train each layer in a supervised way

2 Use an alternate cost, hoping that it will be useful.

Alternate cost: modeling the data well.

Denoising autoencoders

Vincent et al, 2008

Corrupt the input (e.g. set 25% of inputs to 0)

Reconstruct the uncorrupted input

Use uncorrupted encoding as input to next level

A representation is good if it works well with noisy data

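To make the recipe above concrete, here is a minimal sketch of one denoising-autoencoder update, assuming tied weights, sigmoid encoder and decoder, and a squared-error reconstruction cost (a cross-entropy cost is common for binary inputs). The parameter names (W, b, c), the learning rate, and the 25% masking-noise corruption are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(X, W, b, c, lr=0.1, corruption=0.25, rng=None):
    """One gradient step of a tied-weight denoising autoencoder.
    X: (n_examples, n_visible). 25% of the inputs are zeroed out, and the
    network is trained to reconstruct the clean X from the corrupted version."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(X.shape) > corruption            # keep roughly 75% of the entries
    X_tilde = X * mask                                 # corrupted input
    H = sigmoid(X_tilde @ W + b)                       # code computed from the corrupted input
    X_hat = sigmoid(H @ W.T + c)                       # reconstruction of the clean input
    # Gradients of the squared-error reconstruction cost 0.5 * ||X_hat - X||^2.
    delta_out = (X_hat - X) * X_hat * (1 - X_hat)      # (n_examples, n_visible)
    delta_hid = (delta_out @ W) * H * (1 - H)          # (n_examples, n_hidden)
    grad_W = (X_tilde.T @ delta_hid + delta_out.T @ H) / len(X)  # tied weights: both paths
    W -= lr * grad_W
    b -= lr * delta_hid.mean(axis=0)
    c -= lr * delta_out.mean(axis=0)
    return W, b, c
```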

DAE - A graphic explanation

Projects a “corrupted point” onto its position on the manifold

Two corrupted versions of the same point have the same representation


Stacking DAEs

1 Start with the lowest level and stack upwards

2 Train each layer on the intermediate code (features) from the layer below

3 Top layer can have a different output (e.g., softmax non-linearity) to provide an output for classification
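A sketch of the greedy stacking loop just described. It assumes a helper train_dae(data, n_hidden) that returns the encoder parameters of one trained layer (for instance built around the update step sketched earlier); the sigmoid encoding and the list of layer sizes are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def stack_daes(X, layer_sizes, train_dae):
    """Greedy layer-wise pretraining with denoising autoencoders.
    `train_dae(data, n_hidden)` is assumed to return (W, b) for one trained layer.
    X: (n_examples, n_features). Returns all encoder parameters and the top-level code."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_dae(H, n_hidden)   # train this layer on the codes from the layer below
        H = sigmoid(H @ W + b)          # uncorrupted encoding, fed to the next layer
        params.append((W, b))
    return params, H                    # H: features for a softmax classifier on top
```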

Predictive sparse coding

Objective function for sparse coding:

Z(X) = argmin_Z (1/2)‖X − DZ‖₂² + λ‖Z‖₁

Predictive sparse coding adds a term asking the code to be predictable by a feedforward model:

Z(X) = argmin_Z (1/2)‖X − DZ‖₂² + λ‖Z‖₁ + α‖Z − F(X, θ)‖₂²

F(X, θ) = G tanh(We X)

Z must:

Be sparse

Reconstruct X well

Be well predicted by a feedforward model
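The argmin above has no closed form, so in practice Z(X) is computed by an iterative solver. Below is a minimal sketch using ISTA (a proximal-gradient method) on the objective above; the fixed step size, iteration count, and warm start at the predictor's output are illustrative choices, not the tutorial's prescription.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def psd_infer(x, D, F_x, lam, alpha, step=0.01, n_iters=200):
    """Approximate Z(x) = argmin_Z 0.5 ||x - D Z||^2 + lam ||Z||_1 + alpha ||Z - F_x||^2
    with ISTA: a gradient step on the smooth terms followed by soft-thresholding.
    x: (n_input,), D: (n_input, n_code), F_x = G tanh(We x): (n_code,)."""
    Z = F_x.copy()                                          # warm start at the predictor output
    for _ in range(n_iters):
        grad = D.T @ (D @ Z - x) + 2.0 * alpha * (Z - F_x)  # gradient of the smooth part
        Z = soft_threshold(Z - step * grad, step * lam)     # proximal step for the L1 term
    return Z
```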

Stacking PSDs

First layer (PSD on the input X):

Z(X) = argmin_Z (1/2)‖X − DZ‖₂² + λ‖Z‖₁ + α‖Z − F1(X, θ1)‖₂²

F1(X, θ1) = G tanh(We^1 X)

Second layer (PSD on the features H1):

Z(H1) = argmin_Z (1/2)‖H1 − DZ‖₂² + λ‖Z‖₁ + α‖Z − F2(H1, θ2)‖₂²

F2(H1, θ2) = G tanh(We^2 H1)

1 Train the first layer using PSD on X

2 Use H1 = |F1(X, θ1)| as features

3 Train the second layer using PSD on H1

4 Use H2 = |F2(H1, θ2)| as features

5 Keep going...

6 Use H37 as input to your final classifier
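Once every layer has been trained, only the feedforward predictors are needed to extract features. Here is a minimal sketch of that pass, assuming each layer is summarized by a per-unit gain G and an encoder matrix We, with H_k = |F_k(H_{k-1})| as in the recipe above; shapes and names are illustrative.

```python
import numpy as np

def psd_features(X, layers):
    """Feed-forward feature extraction after PSD training.
    `layers` is a list of (G, We) pairs, one per layer, where G is a per-unit gain
    (column vector) and We the encoder matrix, so F_k(H) = G * tanh(We @ H).
    The features passed upward are H_k = |F_k(H_{k-1})|, as in the recipe above.
    X: (n_input, n_examples)."""
    H = X
    for G, We in layers:
        H = np.abs(G * np.tanh(We @ H))   # rectified predictor output = next layer's input
    return H                              # top-level features for the final classifier
```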

The mothership: the DBN

Unsupervised pretraining works by reconstructing the data

What if we tried to have a full generative model of the data?


Restricted Boltzmann Machine (RBM)

E(x, h) = −∑_ij Wij xi hj = −xᵀWh

P(x, h) = exp(xᵀWh) / ∑_{x0, h0} exp(x0ᵀWh0)

Type of variables in an RBM

P(x, h) = exp(xᵀWh) / ∑_{x0, h0} exp(x0ᵀWh0)

This formulation is correct when x and h are binary variables

You have no choice over the type of x (this is your data)

You can choose what type h is

There are other formulations when x is not binary

They typically do not work that well (with notable exceptions)

Training an RBM

P(x, h) = exp(xᵀWh) / ∑_{x0, h0} exp(x0ᵀWh0)

We have a joint probability of data and “features”

Maximizing P(x) will give us meaningful features

There are efficient ways to do that (check my webpage)
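One of the efficient methods alluded to above is contrastive divergence. Here is a minimal sketch of a CD-1 update for a binary RBM with the energy E(x, h) = −xᵀWh (biases omitted to match the slide); the learning rate and the single Gibbs step are illustrative, and this is an approximation to the gradient of log P(x), not an exact maximization.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(X, W, lr=0.05, rng=None):
    """One contrastive-divergence (CD-1) update for a binary RBM with E(x, h) = -x^T W h.
    X: (n_examples, n_visible), W: (n_visible, n_hidden). Biases omitted to match the slide."""
    rng = np.random.default_rng() if rng is None else rng
    ph_data = sigmoid(X @ W)                                  # P(h = 1 | x) on the data
    h_sample = (rng.random(ph_data.shape) < ph_data) * 1.0    # sample hidden units
    px_recon = sigmoid(h_sample @ W.T)                        # P(x = 1 | h)
    x_recon = (rng.random(px_recon.shape) < px_recon) * 1.0   # sampled reconstruction
    ph_recon = sigmoid(x_recon @ W)                           # P(h = 1 | reconstruction)
    grad = (X.T @ ph_data - x_recon.T @ ph_recon) / len(X)    # positive minus negative phase
    return W + lr * grad                                      # approximate ascent on log P(x)
```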

Stacking RBMs

Learn the first layer by maximizing P(x) = ∑_{h1} P(x, h1)

Once trained, the features of x are E_P[h1 | x]

Learn the second layer by maximizing P(h1) = ∑_{h2} P(h1, h2)

Keep going...
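A sketch of this greedy stacking recipe, assuming a helper train_rbm(data, n_hidden) that returns the weight matrix of one trained RBM (for instance using the CD-1 step sketched earlier). For a binary RBM without biases, the features E[h | x] are the sigmoid of the hidden-unit inputs, and they become the next layer's data.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def stack_rbms(X, layer_sizes, train_rbm):
    """Greedy layer-wise training of a stack of RBMs (a deep belief network).
    `train_rbm(data, n_hidden)` is assumed to return the weight matrix of one trained RBM.
    The features passed upward are the expected hidden activations E[h | x]."""
    weights, H = [], X
    for n_hidden in layer_sizes:
        W = train_rbm(H, n_hidden)   # maximize P(data) for this layer
        H = sigmoid(H @ W)           # E[h | x] becomes the next layer's "data"
        weights.append(W)
    return weights, H
```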

Results - 1


Results - 2



Take-home messages - Deep learning

Building a hierarchy of features can be beneficial

Training all layers at once is very hard

Pretraining alleviates the problem of bad local minima and regularizes the model

Take-home messages - Pretraining

Pretraining requires building blocks

ALL pretraining methods share the same idea: model the input while having somewhat invariant features

By no means is this necessarily the best method

And now for some applications...


This is the end...

Questions?

