
Deep Learning tutorial

Nicolas Le Roux

Adam Coates, Yoshua Bengio, Yann LeCun, Andrew Ng

DL/UFL workshop

16/12/11


Disclaimer

Many of the slides in this tutorial are taken from presentations by:

Honglak Lee

Yann LeCun

Geoff Hinton

Andrew Ng

Yoshua Bengio

Outline

1 Introduction
What is deep?
Training problems

2 Unsupervised pretraining
Pretraining
The denoising autoencoder
Predictive sparse coding
The deep belief network

3 Conclusion

What does deep mean?

Properties of deep models

Many composed non-linearities rather than a single non-linearity

More constrained space of transformations

Deep model = prior on the type of transformations

May need fewer computational units for the same function



Features for face recognition

Image courtesy of Honglak Lee

“First-level” features are not task-dependent

“First-level” features are easy to learn

“Higher-level” features are easier to learn given the low-level features

A deep neural network

H1 = g(W1X)

H2 = g(W2H1)

Y = W3H2

g = sigmoid or tanh

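As a concrete illustration of the equations above, here is a minimal NumPy sketch of the forward pass H1 = g(W1 X), H2 = g(W2 H1), Y = W3 H2. The layer sizes, the omission of bias terms, and the small random initialization are illustrative assumptions, not part of the tutorial.

```python
import numpy as np

def forward(X, W1, W2, W3, g=np.tanh):
    """Forward pass of the network above: H1 = g(W1 X), H2 = g(W2 H1), Y = W3 H2.
    X holds one example per column; bias terms are omitted for brevity."""
    H1 = g(W1 @ X)    # first hidden representation
    H2 = g(W2 @ H1)   # second hidden representation
    Y = W3 @ H2       # linear output layer
    return Y

# Illustrative sizes: 784-dimensional inputs, two hidden layers of 256 units, 10 outputs.
rng = np.random.default_rng(0)
W1 = 0.01 * rng.standard_normal((256, 784))
W2 = 0.01 * rng.standard_normal((256, 256))
W3 = 0.01 * rng.standard_normal((10, 256))
X = rng.standard_normal((784, 5))       # a batch of 5 inputs
print(forward(X, W1, W2, W3).shape)     # (10, 5)
```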

Training issues

W3 is a lot easier to learn than W2 and W1

Compare

Given set of features, find combination

Given combination, find set of features


Practical results

Parameters randomly initialized

They do not change much (bad local minima)

A random transformation is not very good

Bengio et al., 2007


Practical results - TIMIT

Mohamed et al., Deep Belief Networks for phone recognition



The gist of pretraining

1 We want (potentially) deep networks

2 Training the lower layers is hard.

Solution: optimize each layer without relying on the layers above.

The gist of pretraining

1 Learn W1 first

2 Given W1, learn W2

3 Given W1 and W2, learn W3


Greedy learning

1 Train each layer in a supervised way

2 Use an alternate cost, hoping that it will be useful.

Alternate cost: modeling the data well.

Denoising autoencoders

Vincent et al, 2008

Corrupt the input (e.g. set 25% of inputs to 0)

Reconstruct the uncorrupted input

Use uncorrupted encoding as input to next level

A representation is good if it works well with noisy data

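To make the recipe above concrete, here is a minimal sketch of one denoising-autoencoder update, assuming tied weights, sigmoid encoder and decoder, and a squared-error reconstruction cost (a cross-entropy cost is common for binary inputs). The parameter names (W, b, c), the learning rate, and the 25% masking-noise corruption are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(X, W, b, c, lr=0.1, corruption=0.25, rng=None):
    """One gradient step of a tied-weight denoising autoencoder.
    X: (n_examples, n_visible). 25% of the inputs are zeroed out, and the
    network is trained to reconstruct the clean X from the corrupted version."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(X.shape) > corruption            # keep roughly 75% of the entries
    X_tilde = X * mask                                 # corrupted input
    H = sigmoid(X_tilde @ W + b)                       # code computed from the corrupted input
    X_hat = sigmoid(H @ W.T + c)                       # reconstruction of the clean input
    # Gradients of the squared-error reconstruction cost 0.5 * ||X_hat - X||^2.
    delta_out = (X_hat - X) * X_hat * (1 - X_hat)      # (n_examples, n_visible)
    delta_hid = (delta_out @ W) * H * (1 - H)          # (n_examples, n_hidden)
    grad_W = (X_tilde.T @ delta_hid + delta_out.T @ H) / len(X)  # tied weights: both paths
    W -= lr * grad_W
    b -= lr * delta_hid.mean(axis=0)
    c -= lr * delta_out.mean(axis=0)
    return W, b, c
```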

DAE - A graphic explanation

Projects a “corrupted point” onto its position on the manifold

Two corrupted versions of the same point have the same representation


Stacking DAEs

1 Start with the lowest level and stack upwards

2 Train each layer on the intermediate code (features) from the layer below

3 Top layer can have a different output (e.g., softmax non-linearity) to provide an output for classification
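A sketch of the greedy stacking loop just described. It assumes a helper train_dae(data, n_hidden) that returns the encoder parameters of one trained layer (for instance built around the update step sketched earlier); the sigmoid encoding and the list of layer sizes are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def stack_daes(X, layer_sizes, train_dae):
    """Greedy layer-wise pretraining with denoising autoencoders.
    `train_dae(data, n_hidden)` is assumed to return (W, b) for one trained layer.
    X: (n_examples, n_features). Returns all encoder parameters and the top-level code."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_dae(H, n_hidden)   # train this layer on the codes from the layer below
        H = sigmoid(H @ W + b)          # uncorrupted encoding, fed to the next layer
        params.append((W, b))
    return params, H                    # H: features for a softmax classifier on top
```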

Predictive sparse coding

Objective function for sparse coding:

Z(X) = argmin_Z (1/2)‖X − DZ‖₂² + λ‖Z‖₁

Predictive sparse coding adds a term asking the code to be predictable by a feedforward model:

Z(X) = argmin_Z (1/2)‖X − DZ‖₂² + λ‖Z‖₁ + α‖Z − F(X, θ)‖₂²

F(X, θ) = G tanh(We X)

Z must:

Be sparse

Reconstruct X well

Be well predicted by a feedforward model
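The argmin above has no closed form, so in practice Z(X) is computed by an iterative solver. Below is a minimal sketch using ISTA (a proximal-gradient method) on the objective above; the fixed step size, iteration count, and warm start at the predictor's output are illustrative choices, not the tutorial's prescription.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def psd_infer(x, D, F_x, lam, alpha, step=0.01, n_iters=200):
    """Approximate Z(x) = argmin_Z 0.5 ||x - D Z||^2 + lam ||Z||_1 + alpha ||Z - F_x||^2
    with ISTA: a gradient step on the smooth terms followed by soft-thresholding.
    x: (n_input,), D: (n_input, n_code), F_x = G tanh(We x): (n_code,)."""
    Z = F_x.copy()                                          # warm start at the predictor output
    for _ in range(n_iters):
        grad = D.T @ (D @ Z - x) + 2.0 * alpha * (Z - F_x)  # gradient of the smooth part
        Z = soft_threshold(Z - step * grad, step * lam)     # proximal step for the L1 term
    return Z
```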

Stacking PSDs

First layer (PSD on the input X):

Z(X) = argmin_Z (1/2)‖X − DZ‖₂² + λ‖Z‖₁ + α‖Z − F1(X, θ1)‖₂²

F1(X, θ1) = G tanh(We^1 X)

Second layer (PSD on the features H1):

Z(H1) = argmin_Z (1/2)‖H1 − DZ‖₂² + λ‖Z‖₁ + α‖Z − F2(H1, θ2)‖₂²

F2(H1, θ2) = G tanh(We^2 H1)

1 Train the first layer using PSD on X

2 Use H1 = |F1(X, θ1)| as features

3 Train the second layer using PSD on H1

4 Use H2 = |F2(H1, θ2)| as features

5 Keep going...

6 Use H37 as input to your final classifier
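Once every layer has been trained, only the feedforward predictors are needed to extract features. Here is a minimal sketch of that pass, assuming each layer is summarized by a per-unit gain G and an encoder matrix We, with H_k = |F_k(H_{k-1})| as in the recipe above; shapes and names are illustrative.

```python
import numpy as np

def psd_features(X, layers):
    """Feed-forward feature extraction after PSD training.
    `layers` is a list of (G, We) pairs, one per layer, where G is a per-unit gain
    (column vector) and We the encoder matrix, so F_k(H) = G * tanh(We @ H).
    The features passed upward are H_k = |F_k(H_{k-1})|, as in the recipe above.
    X: (n_input, n_examples)."""
    H = X
    for G, We in layers:
        H = np.abs(G * np.tanh(We @ H))   # rectified predictor output = next layer's input
    return H                              # top-level features for the final classifier
```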

The mothership: the DBN

Unsupervised pretraining works by reconstructing the data

What if we tried to have a full generative model of the data?


Restricted Boltzmann Machine (RBM)

E(x, h) = −∑_ij Wij xi hj = −xᵀWh

P(x, h) = exp(xᵀWh) / ∑_{x0, h0} exp(x0ᵀWh0)

Type of variables in an RBM

P(x, h) = exp(xᵀWh) / ∑_{x0, h0} exp(x0ᵀWh0)

This formulation is correct when x and h are binary variables

You have no choice over the type of x (this is your data)

You can choose what type h is

There are other formulations when x is not binary

They typically do not work that well (with notable exceptions)

Training an RBM

P(x, h) = exp(xᵀWh) / ∑_{x0, h0} exp(x0ᵀWh0)

We have a joint probability of data and “features”

Maximizing P(x) will give us meaningful features

There are efficient ways to do that (check my webpage)
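One of the efficient methods alluded to above is contrastive divergence. Here is a minimal sketch of a CD-1 update for a binary RBM with the energy E(x, h) = −xᵀWh (biases omitted to match the slide); the learning rate and the single Gibbs step are illustrative, and this is an approximation to the gradient of log P(x), not an exact maximization.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(X, W, lr=0.05, rng=None):
    """One contrastive-divergence (CD-1) update for a binary RBM with E(x, h) = -x^T W h.
    X: (n_examples, n_visible), W: (n_visible, n_hidden). Biases omitted to match the slide."""
    rng = np.random.default_rng() if rng is None else rng
    ph_data = sigmoid(X @ W)                                  # P(h = 1 | x) on the data
    h_sample = (rng.random(ph_data.shape) < ph_data) * 1.0    # sample hidden units
    px_recon = sigmoid(h_sample @ W.T)                        # P(x = 1 | h)
    x_recon = (rng.random(px_recon.shape) < px_recon) * 1.0   # sampled reconstruction
    ph_recon = sigmoid(x_recon @ W)                           # P(h = 1 | reconstruction)
    grad = (X.T @ ph_data - x_recon.T @ ph_recon) / len(X)    # positive minus negative phase
    return W + lr * grad                                      # approximate ascent on log P(x)
```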

Stacking RBMs

Learn the first layer by maximizing P(x) = ∑_{h1} P(x, h1)

Once trained, the features of x are E_P[h1 | x]

Learn the second layer by maximizing P(h1) = ∑_{h2} P(h1, h2)

Keep going...
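A sketch of this greedy stacking recipe, assuming a helper train_rbm(data, n_hidden) that returns the weight matrix of one trained RBM (for instance using the CD-1 step sketched earlier). For a binary RBM without biases, the features E[h | x] are the sigmoid of the hidden-unit inputs, and they become the next layer's data.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def stack_rbms(X, layer_sizes, train_rbm):
    """Greedy layer-wise training of a stack of RBMs (a deep belief network).
    `train_rbm(data, n_hidden)` is assumed to return the weight matrix of one trained RBM.
    The features passed upward are the expected hidden activations E[h | x]."""
    weights, H = [], X
    for n_hidden in layer_sizes:
        W = train_rbm(H, n_hidden)   # maximize P(data) for this layer
        H = sigmoid(H @ W)           # E[h | x] becomes the next layer's "data"
        weights.append(W)
    return weights, H
```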

Results - 1


Results - 2



Take-home messages - Deep learning

Building a hierarchy of features can be beneficial

Training all layers at once is very hard

Pretraining alleviates the problem of bad local minima and regularizes the model

Take-home messages - Pretraining

Pretraining requires building blocks

ALL pretraining methods share the same idea: model the input while having somewhat invariant features

By no means is this necessarily the best method

And now for some applications...


This is the end...

Questions?

