Greedy Layer-Wise Training of Deep Networks. Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle. NIPS 2007. Presented by Ahmed Hefny.


Page 1: Greedy Layer-Wise Training of Deep Networks

Greedy Layer-Wise Training of Deep Networks

Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle. NIPS 2007.

Presented by Ahmed Hefny

Page 2: Greedy Layer-Wise Training of Deep Networks

Story so far …

• Deep neural nets are more expressive: they can learn wider classes of functions with fewer hidden units (parameters) and fewer training examples.

• Unfortunately, they are not easy to train with gradient-based methods starting from random initialization.

Page 3: Greedy Layer-Wise Training of Deep Networks

Story so far …

• Hinton et al. (2006) proposed greedy unsupervised layer-wise training:
  • Greedy layer-wise: train layers sequentially, starting from the bottom (input) layer.
  • Unsupervised: each layer learns a higher-level representation of the layer below; the training criterion does not depend on the labels.

• Each layer is trained as a Restricted Boltzmann Machine (the RBM is the building block of Deep Belief Networks).

• The trained model can be fine-tuned using a supervised method.

[Figure: a stack of three RBMs (RBM 0, RBM 1, RBM 2) trained bottom-up.]

Page 4: Greedy Layer-Wise Training of Deep Networks

This paper

• Extends the concept to:
  • Continuous variables
  • Uncooperative input distributions
  • Simultaneous layer training

• Explores variations to better understand the training method:
  • What if we use greedy supervised layer-wise training?
  • What if we replace RBMs with auto-encoders?

Page 5: Greedy Layer-Wise Training of Deep Networks

Outline
• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
  • Continuous Inputs
  • Uncooperative Input Distributions
  • Simultaneous Training
• Analysis Experiments

Page 6: Greedy Layer-Wise Training of Deep Networks

Outline
• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
  • Continuous Inputs
  • Uncooperative Input Distributions
  • Simultaneous Training
• Analysis Experiments

Page 7: Greedy Layer-Wise Training of Deep Networks

Restricted Boltzmann Machine

[Figure: visible units $v$ connected to hidden units $h$.]

Undirected bipartite graphical model with connections between visible nodes and hidden nodes.

Corresponds to the joint probability distribution

$P(v, h) = \frac{1}{Z} e^{-E(v, h)}, \qquad E(v, h) = -b^\top v - c^\top h - h^\top W v$

Page 8: Greedy Layer-Wise Training of Deep Networks

Restricted Boltzmann Machine

[Figure: visible units $v$ connected to hidden units $h$.]

Undirected bipartite graphical model with connections between visible nodes and hidden nodes.

Corresponds to the joint probability distribution

$P(v, h) = \frac{1}{Z} e^{-E(v, h)}, \qquad E(v, h) = -b^\top v - c^\top h - h^\top W v$

Factorized Conditionals

$P(h \mid v) = \prod_j P(h_j \mid v), \qquad P(h_j = 1 \mid v) = \operatorname{sigm}\big(c_j + \textstyle\sum_i W_{ij} v_i\big)$

$P(v \mid h) = \prod_i P(v_i \mid h), \qquad P(v_i = 1 \mid h) = \operatorname{sigm}\big(b_i + \textstyle\sum_j W_{ij} h_j\big)$
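As a concrete illustration, here is a minimal NumPy sketch of these factorized conditionals for binary units; the parameter names (W for the weights, b and c for the visible and hidden biases) mirror the formulas above but are otherwise just an illustrative convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    # P(h_j = 1 | v) = sigm(c_j + sum_i W_ij v_i), independently for each hidden unit j.
    # v: (batch, n_visible), W: (n_visible, n_hidden), c: (n_hidden,)
    return sigmoid(c + v @ W)

def p_v_given_h(h, W, b):
    # P(v_i = 1 | h) = sigm(b_i + sum_j W_ij h_j), independently for each visible unit i.
    return sigmoid(b + h @ W.T)

def sample_bernoulli(p, rng):
    # Draw binary samples with the given probabilities.
    return (rng.random(p.shape) < p).astype(float)
```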

Page 9: Greedy Layer-Wise Training of Deep Networks

Restricted Boltzmann Machine (Training)
• Given input vectors $v^{(1)}, \ldots, v^{(n)}$, adjust the parameters $\theta = (W, b, c)$ to increase the data log-likelihood $\sum_t \log P(v^{(t)})$.

Page 10: Greedy Layer-Wise Training of Deep Networks

Restricted Boltzmann Machine (Training)
• Given input vectors $v^{(1)}, \ldots, v^{(n)}$, adjust the parameters $\theta = (W, b, c)$ to increase the data log-likelihood $\sum_t \log P(v^{(t)})$.

Page 11: Greedy Layer-Wise Training of Deep Networks

Restricted Boltzmann Machine (Training)
• Given input vectors $v^{(1)}, \ldots, v^{(n)}$, adjust the parameters $\theta = (W, b, c)$ to increase the data log-likelihood $\sum_t \log P(v^{(t)})$.

• To estimate the gradient, sample $h$ given the data $v$ (positive phase), then sample $\tilde{v}$ and $\tilde{h}$ using Gibbs sampling (negative phase).

Page 12: Greedy Layer-Wise Training of Deep Networks

Restricted Boltzmann Machine (Training)
• Now we can perform stochastic gradient descent on the data log-likelihood.

• Stop based on some criterion (e.g., reconstruction error); a contrastive-divergence sketch follows below.
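A minimal sketch of one such update using CD-1 (a single Gibbs step), reusing the helper functions from the RBM sketch above; the learning rate and the use of one Gibbs step are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def cd1_update(v0, W, b, c, lr, rng):
    """One approximate stochastic-gradient step on a mini-batch v0 of shape (batch, n_visible).
    Updates W, b, c in place and returns the mean reconstruction error for monitoring."""
    ph0 = p_h_given_v(v0, W, c)                 # positive phase: hidden probabilities given data
    h0 = sample_bernoulli(ph0, rng)
    pv1 = p_v_given_h(h0, W, b)                 # one Gibbs step back to the visible layer
    v1 = sample_bernoulli(pv1, rng)
    ph1 = p_h_given_v(v1, W, c)                 # negative phase

    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n     # approximate log-likelihood gradient
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return float(np.mean((v0 - pv1) ** 2))      # reconstruction error (possible stopping criterion)
```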

Page 13: Greedy Layer-Wise Training of Deep Networks

Deep Belief Network
• A DBN is a model of the form

$P(x, g^1, g^2, \ldots, g^\ell) = P(x \mid g^1)\, P(g^1 \mid g^2) \cdots P(g^{\ell-2} \mid g^{\ell-1})\, P(g^{\ell-1}, g^\ell)$

$x$ denotes the input variables; $g^1, \ldots, g^\ell$ denote the hidden layers of causal variables.

Page 14: Greedy Layer-Wise Training of Deep Networks

Deep Belief Network
• A DBN is a model of the form

$P(x, g^1, g^2, \ldots, g^\ell) = P(x \mid g^1)\, P(g^1 \mid g^2) \cdots P(g^{\ell-2} \mid g^{\ell-1})\, P(g^{\ell-1}, g^\ell)$

$x$ denotes the input variables; $g^1, \ldots, g^\ell$ denote the hidden layers of causal variables.

Page 15: Greedy Layer-Wise Training of Deep Networks

Deep Belief Network
• A DBN is a model of the form

$P(x, g^1, g^2, \ldots, g^\ell) = P(x \mid g^1)\, P(g^1 \mid g^2) \cdots P(g^{\ell-2} \mid g^{\ell-1})\, P(g^{\ell-1}, g^\ell)$

$x$ denotes the input variables; $g^1, \ldots, g^\ell$ denote the hidden layers of causal variables.

• $P(g^{\ell-1}, g^\ell)$ is an RBM.

• An RBM is equivalent to an infinitely deep belief network with tied weights.
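To make the generative structure concrete, here is a hedged sketch of ancestral sampling from a DBN of the form above: Gibbs-sample the top two layers (the RBM part), then sample each lower layer from the directed conditional $P(g^{i-1} \mid g^i)$. It reuses the RBM helpers defined earlier; the parameter layout (one (W, b, c) triple per layer) is an illustrative convention, not something fixed by the slides.

```python
import numpy as np

def sample_from_dbn(Ws, bs, cs, n_gibbs, rng):
    """Ws[i], bs[i], cs[i]: weights, lower-layer bias, upper-layer bias of layer i (bottom = 0).
    Runs n_gibbs >= 1 Gibbs steps in the top-level RBM."""
    # Top RBM: alternate Gibbs sampling between the two deepest layers.
    W, b, c = Ws[-1], bs[-1], cs[-1]
    g_top = (rng.random(c.shape) < 0.5).astype(float)    # arbitrary start for the top layer
    for _ in range(n_gibbs):
        g_below = sample_bernoulli(p_v_given_h(g_top, W, b), rng)
        g_top = sample_bernoulli(p_h_given_v(g_below, W, c), rng)
    # Directed part: propagate the sample down through P(g^{i-1} | g^i).
    g = g_below
    for W, b in zip(reversed(Ws[:-1]), reversed(bs[:-1])):
        g = sample_bernoulli(sigmoid(b + g @ W.T), rng)
    return g                                             # a sample of the visible layer x
```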

Page 16: Greedy Layer-Wise Training of Deep Networks

Greedy layer-wise training
• The posterior $P(g^1 \mid x)$ is intractable.
• Approximate it with a factorized $Q(g^1 \mid x)$.
• Treat the bottom two layers ($x$ and $g^1$) as an RBM.
• Fit its parameters using contrastive divergence.

Page 17: Greedy Layer-Wise Training of Deep Networks

Greedy layer-wise training
• The posterior $P(g^1 \mid x)$ is intractable.
• Approximate it with a factorized $Q(g^1 \mid x)$.
• Treat the bottom two layers ($x$ and $g^1$) as an RBM.
• Fit its parameters using contrastive divergence.

• That gives an approximate distribution over $g^1$ (the data mapped through $Q(g^1 \mid x)$).
• We need to match it with the distribution $P(g^1)$ defined by the layers above.

Page 18: Greedy Layer-Wise Training of Deep Networks

Greedy layer-wise training
• For each higher layer, approximate $P(g^i \mid g^{i-1})$ with $Q(g^i \mid g^{i-1})$.
• Treat layers $g^{i-1}$ and $g^i$ as an RBM.
• Fit its parameters using contrastive divergence.
• Sample $g^{i-1}$ recursively using $Q(g^j \mid g^{j-1})$, starting from the data $x$ (see the sketch below).
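Putting the loop together, a hedged sketch of the greedy procedure under the conventions above: train the bottom RBM on the data, then repeatedly map the data upward through the (mean-field) posterior and train the next RBM on that representation. Layer sizes, epoch count, and learning rate are illustrative; it reuses cd1_update and p_h_given_v from the earlier sketches.

```python
import numpy as np

def greedy_pretrain(data, layer_sizes, n_epochs=10, lr=0.05, seed=0):
    """Greedy layer-wise pre-training; returns a list of (W, b, c) triples, one per layer."""
    rng = np.random.default_rng(seed)
    params, x, n_in = [], data, data.shape[1]
    for n_hid in layer_sizes:
        W = 0.01 * rng.standard_normal((n_in, n_hid))
        b, c = np.zeros(n_in), np.zeros(n_hid)
        for _ in range(n_epochs):             # train this layer as an RBM with CD-1
            cd1_update(x, W, b, c, lr, rng)
        params.append((W, b, c))
        x = p_h_given_v(x, W, c)              # mean activations become the next layer's "data"
        n_in = n_hid
    return params
```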

Page 19: Greedy Layer-Wise Training of Deep Networks

Outline
• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
  • Continuous Inputs
  • Uncooperative Input Distributions
  • Simultaneous Training
• Analysis Experiments

Page 20: Greedy Layer-Wise Training of Deep Networks

Supervised Fine Tuning (In this paper)
• Use greedy layer-wise training to initialize the weights of all layers except the output layer.

• For fine-tuning, use stochastic gradient descent on a cost function of the outputs, where the conditional expected values of the hidden nodes are approximated using mean-field.

Page 21: Greedy Layer-Wise Training of Deep Networks

Supervised Fine Tuning (In this paper)
• Use greedy layer-wise training to initialize the weights of all layers except the output layer.

• Use backpropagation (a sketch follows below).
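A hedged sketch of this fine-tuning stage: the pre-trained (W, c) pairs initialize the hidden layers of a feed-forward net, a randomly initialized output layer is added on top, and ordinary backpropagation is run on a supervised loss. A softmax output with cross-entropy is shown as one possible choice; the names and the single-step form are illustrative, not the paper's exact setup.

```python
import numpy as np

def finetune_step(x, y_onehot, params, W_out, b_out, lr=0.1):
    """One backpropagation step. params: list of (W, b, c) triples from pre-training
    (only W and the hidden bias c are used in the feed-forward net)."""
    # Forward pass: sigmoid hidden layers initialized from pre-training, softmax output.
    acts = [x]
    for W, _, c in params:
        acts.append(1.0 / (1.0 + np.exp(-(c + acts[-1] @ W))))
    logits = acts[-1] @ W_out + b_out
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)

    # Backward pass: cross-entropy gradient through the softmax, then each sigmoid layer.
    n = x.shape[0]
    delta_out = (p - y_onehot) / n
    delta = delta_out @ W_out.T * acts[-1] * (1.0 - acts[-1])
    W_out -= lr * acts[-1].T @ delta_out
    b_out -= lr * delta_out.sum(axis=0)
    for i in range(len(params) - 1, -1, -1):
        W, _, c = params[i]
        grad_W, grad_c = acts[i].T @ delta, delta.sum(axis=0)
        if i > 0:                             # propagate the error before updating this layer
            delta = delta @ W.T * acts[i] * (1.0 - acts[i])
        W -= lr * grad_W
        c -= lr * grad_c
```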

Page 22: Greedy Layer-Wise Training of Deep Networks

Outline
• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
  • Continuous Inputs
  • Uncooperative Input Distributions
  • Simultaneous Training
• Analysis Experiments

Page 23: Greedy Layer-Wise Training of Deep Networks

Continuous Inputs
• Recall that in an RBM the conditional of a unit $y$ with input $a(x) = b + w^\top x$ has the form $p(y \mid x) \propto e^{a(x)\, y}$ for $y$ in some set $I$.
• If we restrict $I = \{0, 1\}$, then normalization gives us a binomial unit with probability given by the sigmoid of $a(x)$.
• Instead, if $I = [0, \infty)$ we get an exponential density.
• If $I$ is a closed interval then we get a truncated exponential.

Page 24: Greedy Layer-Wise Training of Deep Networks

Continuous Inputs (case of the truncated exponential on [0, 1])
• Sampling: for the truncated exponential, the inverse CDF can be used,

$F^{-1}(U) = \dfrac{\log\big(1 - U\,(1 - e^{a(x)})\big)}{a(x)}$

where $U$ is sampled uniformly from [0, 1].

• Conditional expectation:

$E[y \mid x] = \dfrac{1}{1 - e^{-a(x)}} - \dfrac{1}{a(x)}$

(A sampler sketch follows below.)
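A small sketch of the inverse-CDF sampler and the conditional expectation for a truncated-exponential unit on [0, 1] with density proportional to $e^{a y}$, where $a$ stands for the unit's total input $a(x)$. The function names are illustrative, and values of $a$ very close to zero would need special handling (series expansion) in practice.

```python
import numpy as np

def sample_truncated_exp(a, rng):
    """Inverse-CDF sampling: F^{-1}(U) = log(1 - U * (1 - exp(a))) / a, U ~ Uniform[0, 1]."""
    u = rng.random(np.shape(a))
    return np.log(1.0 - u * (1.0 - np.exp(a))) / a

def expectation_truncated_exp(a):
    """Conditional expectation: E[y | x] = 1 / (1 - exp(-a)) - 1 / a."""
    return 1.0 / (1.0 - np.exp(-a)) - 1.0 / a
```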

Page 25: Greedy Layer-Wise Training of Deep Networks

Continuous Inputs
• To handle Gaussian inputs, we need to augment the energy function with a term quadratic in $y$.
• For a diagonal covariance matrix, add $\sum_i d_i^2 y_i^2$ to the energy, giving

$y_i \mid x \sim \mathcal{N}\!\left(\dfrac{a_i(x)}{2 d_i^2},\; \dfrac{1}{2 d_i^2}\right)$
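For the Gaussian case this amounts to the following hedged sketch: with the quadratic term $d_i^2 y_i^2$ in the energy, each visible unit given the hidden layer is Gaussian with mean $a_i / (2 d_i^2)$ and variance $1 / (2 d_i^2)$. Parameter shapes follow the earlier RBM convention and are otherwise illustrative.

```python
import numpy as np

def sample_gaussian_visible(h, W, b, d, rng):
    """Sample visible units given hidden activities h; d holds the per-unit quadratic coefficients."""
    a = b + h @ W.T                           # linear input to each visible unit
    mean = a / (2.0 * d ** 2)
    std = np.sqrt(1.0 / (2.0 * d ** 2))
    return mean + std * rng.standard_normal(mean.shape)
```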

Page 26: Greedy Layer-Wise Training of Deep Networks

Continuous Hidden Nodes?

Page 27: Greedy Layer-Wise Training of Deep Networks

Continuous Hidden Nodes?
• The same conditional forms can be used for the hidden units:
  • Truncated exponential
  • Gaussian

Page 28: Greedy Layer-Wise Training of Deep Networks

Uncooperative Input Distributions
• Setting: $x \sim p(x)$, $y = f(x) + \text{noise}$.

• No particular relation between $p$ and $f$ (e.g., $p$ Gaussian and $f$ a sinusoid).

Page 29: Greedy Layer-Wise Training of Deep Networks

Uncooperative Input Distributions
• Setting: $x \sim p(x)$, $y = f(x) + \text{noise}$.

• No particular relation between $p$ and $f$ (e.g., $p$ Gaussian and $f$ a sinusoid).

• Problem: unsupervised pre-training may not help prediction.

Page 30: Greedy Layer-Wise Training of Deep Networks

Outline
• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
• Analysis Experiments

Page 31: Greedy Layer-Wise Training of Deep Networks

Uncooperative Input Distributions
• Proposal: mix unsupervised and supervised training for each layer (a sketch follows below).

[Figure: the layer being trained feeds a temporary output layer; the stochastic gradient of the input log-likelihood (by contrastive divergence) and the stochastic gradient of the prediction error are combined into a single update.]
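A hedged sketch of that combined update for one layer: apply the CD-1 gradient of the input log-likelihood, then the gradient of a prediction-error loss computed through a temporary output layer attached to the hidden units. It reuses the RBM helpers above; the temporary output parameters (V, d_out), the squared-error loss, and the sequential application of the two gradients are illustrative choices.

```python
import numpy as np

def partially_supervised_step(x, y, W, b, c, V, d_out, lr, rng):
    """x: inputs, y: targets. Updates the RBM parameters (W, b, c) and the temporary
    linear output layer (V, d_out) in place."""
    # Unsupervised part: CD-1 step on the input log-likelihood.
    cd1_update(x, W, b, c, lr, rng)

    # Supervised part: gradient of 0.5 * mean squared prediction error
    # through the temporary output layer on top of the hidden units.
    h = p_h_given_v(x, W, c)                  # mean hidden activations
    pred = h @ V + d_out
    err = (pred - y) / x.shape[0]
    delta = err @ V.T * h * (1.0 - h)         # backprop through the sigmoid hidden layer
    V -= lr * h.T @ err
    d_out -= lr * err.sum(axis=0)
    W -= lr * x.T @ delta
    c -= lr * delta.sum(axis=0)
```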

Page 32: Greedy Layer-Wise Training of Deep Networks

Simultaneous Layer Training
• Greedy layer-wise training:
  • For each layer:
    • Repeat until criterion met:
      • Sample the layer's input (by recursively applying the trained layers to the data)
      • Update the parameters using contrastive divergence

Page 33: Greedy Layer-Wise Training of Deep Networks

Simultaneous Layer Training
• Simultaneous training:
  • Repeat until criterion met:
    • Sample the input to all layers
    • Update the parameters of all layers using contrastive divergence

• Simpler: one criterion for the entire network
• Takes more time

(A sketch follows below.)
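A hedged sketch of the simultaneous variant, reusing the helpers above: on every iteration the data is propagated bottom-up to sample each layer's input, and a CD-1 update is applied to every layer in the same pass, with a single stopping criterion for the whole stack (here just a fixed iteration count, for illustration).

```python
import numpy as np

def simultaneous_train(data, params, n_iters, lr, rng):
    """params: list of (W, b, c) triples, one per layer, updated in place."""
    for _ in range(n_iters):
        x = data
        for W, b, c in params:
            cd1_update(x, W, b, c, lr, rng)                    # update this layer now
            x = sample_bernoulli(p_h_given_v(x, W, c), rng)    # sampled input of the next layer
```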

Page 34: Greedy Layer-Wise Training of Deep Networks

Outline
• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
  • Continuous Inputs
  • Uncooperative Input Distributions
  • Simultaneous Training
• Analysis Experiments

Page 35: Greedy Layer-Wise Training of Deep Networks

Experiments
• Does greedy unsupervised pre-training help?
• What if we replace RBMs with auto-encoders?
• What if we do greedy supervised pre-training?
• Does continuous variable modeling help?
• Does partially supervised pre-training help?

Page 36: Greedy Layer-Wise Training of Deep Networks

Experiment 1
• Does greedy unsupervised pre-training help?
• What if we replace RBMs with auto-encoders?
• What if we do greedy supervised pre-training?
• Does continuous variable modeling help?
• Does partially supervised pre-training help?

Page 37: Greedy Layer-Wise Training of Deep Networks

Experiment 1

Page 38: Greedy Layer-Wise Training of Deep Networks

Experiment 1

Page 39: Greedy Layer-Wise Training of Deep Networks

Experiment 1 (MSE and training errors)

• Error comparison (lower is better): partially supervised < unsupervised pre-training < no pre-training.

• Gaussian (continuous) input units < binomial input units.

Page 40: Greedy Layer-Wise Training of Deep Networks

Experiment 2
• Does greedy unsupervised pre-training help?
• What if we replace RBMs with auto-encoders?
• What if we do greedy supervised pre-training?
• Does continuous variable modeling help?
• Does partially supervised pre-training help?

Page 41: Greedy Layer-Wise Training of Deep Networks

Experiment 2
• Auto-encoders
  • Learn a compact representation from which to reconstruct $x$.
  • Trained to minimize the reconstruction cross-entropy (a sketch follows below).

[Figure: input $x$ is encoded to a hidden code and decoded back to a reconstruction of $x$.]
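A hedged sketch of one training step for such an auto-encoder, minimizing the reconstruction cross-entropy with sigmoid units; tied encoder/decoder weights and the learning rate are illustrative choices rather than something fixed by the slides.

```python
import numpy as np

def autoencoder_step(x, W, c, b, lr):
    """One gradient step on the reconstruction cross-entropy.
    W: tied weights, c: hidden (code) bias, b: reconstruction bias. Updated in place."""
    h = 1.0 / (1.0 + np.exp(-(c + x @ W)))        # encoder: hidden code
    xr = 1.0 / (1.0 + np.exp(-(b + h @ W.T)))     # decoder (tied weights): reconstruction
    loss = -np.mean(np.sum(x * np.log(xr + 1e-9)
                           + (1.0 - x) * np.log(1.0 - xr + 1e-9), axis=1))

    # Gradients: with sigmoid outputs the output delta of the cross-entropy is simply xr - x.
    n = x.shape[0]
    d_out = (xr - x) / n
    d_hid = d_out @ W * h * (1.0 - h)
    W -= lr * (x.T @ d_hid + d_out.T @ h)         # tied weights: encoder part + decoder part
    b -= lr * d_out.sum(axis=0)
    c -= lr * d_hid.sum(axis=0)
    return loss
```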

Page 42: Greedy Layer-Wise Training of Deep Networks

Experiment 2 (layer widths of 500 to 1000 units, with 20 nodes in the last two layers)

Page 43: Greedy Layer-Wise Training of Deep Networks

Experiment 2• Auto-encoder pre-training outperforms supervised pre-training but is

still outperformed by RBM.

• Without pre-training, deep nets do not generalize well, but they can still fit the data if the output layers are wide enough.

Page 44: Greedy Layer-Wise Training of Deep Networks

Conclusions
• Unsupervised pre-training is important for deep networks.

• Partial supervision further enhances the results, especially when the input distribution and the function to be estimated are not closely related.

• Explicitly modeling continuous inputs (with an appropriate conditional density) is better than using binomial models.

Page 45: Greedy Layer-Wise Training of Deep Networks

Thanks