Convolutional Restricted Boltzmann Machines for Feature Learning


Mohammad Norouzi
Advisor: Dr. Greg Mori

CS @ Simon Fraser University
27 Nov 2009


Problems

• Human detection
• Handwritten digit classification

Sliding Window Approach


Sliding Window Approach (Cont’d)


[INRIA Person Dataset]

Decision Boundary

The success or failure of an object recognition algorithm hinges on the features used.

Input → Feature representation (our focus) → Classifier → Label

Labels: Human / Background (human detection); 0 / 1 / 2 / 3 / … (digit classification)

Learning

Local Feature Detector Hierarchies

Higher layers: larger, more complicated, less frequent features

Generative & Layerwise Learning

[Figure: a hierarchy of feature detectors whose filters (marked "?") are unknown; each layer is learned generatively with a CRBM]

Visual Features: Filtering

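Filter kernels (features):

$$
\begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}, \quad
\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad
\begin{bmatrix} 0 & -1 & -2 \\ 1 & 0 & -1 \\ 2 & 1 & 0 \end{bmatrix}
$$

[Figure: the image $V$ is filtered with the kernels $W_1$, $W_2$, $W_3$, giving the filter responses $\mathrm{Filter}(V, W_1)$, $\mathrm{Filter}(V, W_2)$, $\mathrm{Filter}(V, W_3)$]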
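As a rough illustration of the filtering operation, the sketch below computes a filter response with plain Python lists. It assumes that Filter(V, W) slides the kernel over the image without flipping or padding, which the slides do not state; the function name `filter_valid` and the toy image are hypothetical.

```python
def filter_valid(image, kernel):
    """Slide `kernel` over `image` (no flipping, no padding) and return the responses."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# First 3x3 kernel listed on the slide (responds to vertical edges).
W1 = [[1, 0, -1],
      [2, 0, -2],
      [1, 0, -1]]

# Toy 4x5 image (an assumption): bright left part, dark right part.
V = [[9, 9, 9, 0, 0],
     [9, 9, 9, 0, 0],
     [9, 9, 9, 0, 0],
     [9, 9, 9, 0, 0]]

response = filter_valid(V, W1)
print(response)  # [[0, 36, 36], [0, 36, 36]]
```

The response is zero over the uniform region and large where the kernel straddles the bright-to-dark edge.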

Our approach to feature learning is generative (CRBM model)

[Figure: the image $V$ is connected through unknown filters $W_1$, $W_2$, $W_3$ to the binary hidden variables $H_1$, $H_2$, $H_3$]

Related Work


• Convolutional Neural Network (CNN) [LeCun et al. 98] [Ranzato et al., CVPR'07] (discriminative)
  – Filtering layers are bundled with a classifier, and all the layers are learned together using error backpropagation.
  – Does not perform well on natural images
• Biologically plausible models [Serre et al., PAMI'07] [Mutch and Lowe, CVPR'06] (no learning)
  – Hand-crafted first layer vs. randomly selected prototypes for second layer.

Related Work (cont’d)

• Deep Belief Net [Hinton et al., NC'2006] (generative & unsupervised)
  – A two-layer partially observed MRF, called an RBM, is the building block
  – Learning is performed unsupervised and layer-by-layer, from the bottom layer upwards
• Our contributions: We incorporate spatial locality into RBMs and adapt the learning algorithm accordingly
• We add more complicated components such as pooling and sparsity into deep belief nets

Why Generative & Unsupervised

• Discriminative learning of deep and large neural networks has not been successful
  – Requires large training sets
  – Easily over-fits for large models
  – First layer gradients are relatively small
• Alternative hybrid approach
  – Learn a large set of first layer features generatively
  – Switch to a discriminative model to select the discriminative features from those that are learned
  – Discriminative fine-tuning is helpful

Details


CRBM

• Image is the visible layer and hidden layer is related to filter responses

• An energy based probabilistic model

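$$ P(V; W) = \frac{1}{Z} \sum_{H} \exp\big(-E(V, H; W)\big) $$

$$ E(V, H; W) = \sum_{k} E(V, H_k; W_k) $$

$$ E(V, H_k; W_k) = -H_k \bullet \mathrm{Filter}(V, W_k) $$

where $\bullet$ is the dot product of vectorized matrices.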
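A minimal sketch of the CRBM energy follows. It assumes the sign convention E = -Σ_k H_k • Filter(V, W_k) with no bias terms, and that Filter is a valid (unpadded, unflipped) sliding-window product. The names `filter_valid` and `crbm_energy` and all values are hypothetical.

```python
def filter_valid(image, kernel):
    """Slide `kernel` over `image` (no flipping, no padding) and return the responses."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(len(image[0]) - kw + 1)]
            for r in range(len(image) - kh + 1)]

def crbm_energy(V, H, W):
    """E(V, H; W) = -sum_k H_k . Filter(V, W_k); biases are omitted (assumption)."""
    total = 0
    for H_k, W_k in zip(H, W):
        response = filter_valid(V, W_k)
        # Dot product of the vectorized hidden map and filter response.
        total += sum(h * r
                     for h_row, r_row in zip(H_k, response)
                     for h, r in zip(h_row, r_row))
    return -total

W1 = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
W2 = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
V = [[9, 9, 9, 0, 0] for _ in range(4)]   # toy image with one vertical edge

H1 = [[0, 1, 1], [0, 1, 0]]               # hidden map for filter 1
H2 = [[0, 0, 0], [0, 0, 1]]               # hidden map for filter 2

print(crbm_energy(V, [H1, H2], [W1, W2]))  # -72
```

Hidden units that are switched on where their filter responds strongly lower the energy, so such configurations are more probable under the model.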

Training CRBMs

• Maximum likelihood learning of CRBMs is difficult
• Contrastive Divergence (CD) learning is applicable
• For CD learning we need to compute the conditionals $P(H \mid V)$ and $P(V \mid H)$.

[Figure labels: data, sample]

CRBM (Backward)

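• Nearby hidden variables cooperate in reconstruction
• Conditional probabilities take the form

$$ P(H_k \mid V) = \sigma\big(\mathrm{Filter}(V, W_k)\big) $$

$$ P(V \mid H) = \sigma\Big(\sum_{k} \mathrm{Filter}(H_k, W_k^{*})\Big) $$

$$ \sigma(x) = \frac{1}{1 + \exp(-x)} $$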
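The two conditionals can be sketched as below. This is an illustration under stated assumptions, not the author's implementation. It takes W* to be the kernel flipped in both directions and the backward filtering to be zero-padded, so that the reconstruction has the size of V. Biases are omitted, and all function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def filter_valid(image, kernel):
    """Slide `kernel` over `image` (no flipping, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(len(image[0]) - kw + 1)]
            for r in range(len(image) - kh + 1)]

def flip(kernel):
    """Assumed meaning of W*: the kernel flipped vertically and horizontally."""
    return [row[::-1] for row in kernel[::-1]]

def zero_pad(fmap, ph, pw):
    width = len(fmap[0]) + 2 * pw
    top = [[0.0] * width for _ in range(ph)]
    bottom = [[0.0] * width for _ in range(ph)]
    middle = [[0.0] * pw + list(row) + [0.0] * pw for row in fmap]
    return top + middle + bottom

def prob_h_given_v(V, W):
    """P(H_k = 1 | V) = sigmoid(Filter(V, W_k)), one map per filter."""
    return [[[sigmoid(x) for x in row] for row in filter_valid(V, W_k)]
            for W_k in W]

def prob_v_given_h(H, W):
    """P(V = 1 | H) = sigmoid(sum_k Filter(H_k, W_k*)), with zero padding."""
    kh, kw = len(W[0]), len(W[0][0])
    maps = [filter_valid(zero_pad(H_k, kh - 1, kw - 1), flip(W_k))
            for H_k, W_k in zip(H, W)]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sigmoid(sum(m[r][c] for m in maps)) for c in range(cols)]
            for r in range(rows)]

W1 = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
V = [[0.0] * 5 for _ in range(4)]              # blank 4x5 image
p_h = prob_h_given_v(V, [W1])                  # every entry is 0.5
H = [[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]       # one active hidden unit
p_v = prob_v_given_h(H, [W1])                  # 4x5 map of probabilities
```

With a single active hidden unit, the top-left 3x3 block of the top-down input is a copy of the kernel. Overlapping copies from nearby hidden units add up, which is one way to read "nearby hidden variables cooperate in reconstruction".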

Learning the Hierarchy

• The structure is trained bottom-up and layerwise
• The CRBM model is used for training the filtering layers
• Filtering layers are followed by down-sampling (pooling) layers

[Figure: Input → 1st filters (CRBM) → responses (layer 1) → pooling (layer 2) → 2nd filters (CRBM) → responses (layer 3) → pooling (layer 4) → classifier. The filtering layers apply filtering and a non-linearity; the pooling layers reduce the dimensionality.]
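The down-sampling step between filtering layers can be sketched as non-overlapping max pooling. The pool size of 2 and the function name `max_pool` are assumptions for illustration; the slides do not give the pooling size here.

```python
def max_pool(feature_map, size):
    """Keep the largest response in each non-overlapping size x size block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

responses = [[1, 2, 0, 0],
             [3, 4, 0, 5],
             [0, 0, 7, 0],
             [0, 6, 0, 8]]

pooled = max_pool(responses, 2)
print(pooled)  # [[4, 5], [6, 8]]
```

Each pooling layer shrinks the feature map by the pool size in both directions, so the next filtering layer sees a larger region of the input per unit.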

Experiments


Evaluation

MNIST digit dataset
• Training set: 60,000 images of digits of size 28x28
• Test set: 10,000 images

INRIA person dataset
• Training set: 2416 person windows of size 128 x 64 pixels and 4.5x10^6 negative windows
• Test set: 1132 positive and 2x10^6 negative windows

First layer filters

INRIA
• Gray-scale images of INRIA positive set
• 15 filters of 7x7

MNIST
• MNIST unlabeled digits
• 15 filters of 5x5

Second Layer Features (MNIST)

• Hard to visualize the filters
• We show the patches that the filters respond to most strongly

Second Layer Features (INRIA)


MNIST Results

• MNIST error rate when model is trained on the full training set


Results

[Figures: False Positive examples, 1st through 5th]

INRIA Results

• Adding our large-scale features significantly improves performance of the baseline (HOG)


Conclusion

• We extended the RBM model to Convolutional RBM, useful for domains with spatial locality

• We exploited CRBMs to train local hierarchical feature detectors one layer at a time and generatively

• This method obtained results comparable to state-of-the-art in digit classification and human detection


Thank You


Hierarchical Feature Detector

[Figure: hierarchy of feature detectors with unknown ("?") filters at each layer]

Contrastive Divergence Learning

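$$ \frac{\partial E(V, H; \theta)}{\partial W_k} = -\mathrm{Filter}(V, H_k) $$

$$ W_k = W_k + \eta \Big( \big\langle \mathrm{Filter}(V^{0}, H_k^{0}) \big\rangle_{\text{data}} - \big\langle \mathrm{Filter}(V^{1}, H_k^{1}) \big\rangle_{\text{data}} \Big) $$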
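A single contrastive divergence step for one training image might look like the sketch below. It rests on several assumptions: one Gibbs step (CD-1), no bias terms, sampled binary hidden units for the data phase, and probabilities rather than samples for the reconstruction phase. The function names, learning rate, and toy data are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def filter_valid(image, kernel):
    """Slide `kernel` over `image` (no flipping, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(len(image[0]) - kw + 1)]
            for r in range(len(image) - kh + 1)]

def hidden_probs(V, W):
    """P(H_k = 1 | V) for every filter k."""
    return [[[sigmoid(x) for x in row] for row in filter_valid(V, W_k)]
            for W_k in W]

def reconstruct(H, W):
    """Mean of P(V | H): each hidden unit adds h * W_k onto its receptive field."""
    kh, kw = len(W[0]), len(W[0][0])
    rows, cols = len(H[0]) + kh - 1, len(H[0][0]) + kw - 1
    total = [[0.0] * cols for _ in range(rows)]
    for H_k, W_k in zip(H, W):
        for r, h_row in enumerate(H_k):
            for c, h in enumerate(h_row):
                for i in range(kh):
                    for j in range(kw):
                        total[r + i][c + j] += h * W_k[i][j]
    return [[sigmoid(x) for x in row] for row in total]

def cd1_step(V0, W, eta, rng):
    """Return updated filters: W_k + eta * (Filter(V0, H0_k) - Filter(V1, H1_k))."""
    P0 = hidden_probs(V0, W)
    # Sample binary hidden units from the data-driven probabilities.
    H0 = [[[1.0 if rng.random() < p else 0.0 for p in row] for row in P_k]
          for P_k in P0]
    V1 = reconstruct(H0, W)        # one-step reconstruction (probabilities)
    H1 = hidden_probs(V1, W)       # hidden probabilities given the reconstruction
    new_W = []
    for W_k, H0_k, H1_k in zip(W, H0, H1):
        pos = filter_valid(V0, H0_k)   # data statistics, same shape as W_k
        neg = filter_valid(V1, H1_k)   # reconstruction statistics
        new_W.append([[w + eta * (p - n) for w, p, n in zip(w_row, p_row, n_row)]
                      for w_row, p_row, n_row in zip(W_k, pos, neg)])
    return new_W

rng = random.Random(0)
V0 = [[1.0 if c < 3 else 0.0 for c in range(5)] for _ in range(4)]
W = [[[rng.gauss(0.0, 0.01) for _ in range(3)] for _ in range(3)] for _ in range(2)]
W_new = cd1_step(V0, W, eta=0.1, rng=rng)
```

Filtering the image with a hidden map yields an array of the kernel's size, which is why the gradient has the same shape as W_k.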

Training CRBMs (Cont'd)

• The problem of reconstructing the border region becomes severe when the number of Gibbs sampling steps is greater than 1.
  – Partition the visible units into middle and border regions
• Instead of maximizing the likelihood, we (approximately) maximize $p(v^{m} \mid v^{b})$, the probability of the middle region given the border region

Enforcing Feature Sparsity

• The CRBM's representation is K (number of filters) times overcomplete

• After a few CD learning iterations, V is perfectly reconstructed

• Enforce sparsity to tackle this problem
  – Hidden bias terms were frozen at large negative values
• Having a single non-sparse hidden unit improves the learned features
  – Might be related to the ergodicity condition
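To see why a large negative hidden bias enforces sparsity, the sketch below evaluates the activation probability of a hidden unit for a few bias values. It assumes the bias is simply added to the filter response inside the sigmoid; the bias values are illustrative and not taken from the slides.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def activation_prob(filter_response, hidden_bias):
    """P(h = 1 | v) for one hidden unit, assuming the bias is added to the response."""
    return sigmoid(filter_response + hidden_bias)

# With a zero filter response, the bias alone sets how often the unit fires.
for bias in (0.0, -2.0, -4.0):
    print(bias, round(activation_prob(0.0, bias), 3))
```

With a bias of -4 the unit turns on less than 2% of the time unless its filter response is strongly positive, so only well-matched patches activate it.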

Probabilistic Meaning of Max

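[Figure: visible units $v_1, \dots, v_6$; hidden units $h_1, \dots, h_4$, each connected to three consecutive visible units through the shared filter $w$; max pooling maps $h_1, h_2$ to $h'_1$ and $h_3, h_4$ to $h'_2$]

$$ E(v, h) = -\big( h_1 w^T v_{1:3} + h_2 w^T v_{2:4} + h_3 w^T v_{3:5} + h_4 w^T v_{4:6} \big) $$

$$ E(v, h') = -\big( \max(h'_1 w^T v_{1:3},\; h'_1 w^T v_{2:4}) + \max(h'_2 w^T v_{3:5},\; h'_2 w^T v_{4:6}) \big) $$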
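The one-dimensional example on this slide can be checked numerically. The sketch assumes six visible units, a shared filter of width 3, pooling groups of two hidden units, and a negative sign in front of the energy. The values of v and w are made up for illustration.

```python
def window_scores(v, w):
    """w^T v_{s:s+len(w)-1} for every window position s."""
    n = len(w)
    return [sum(wi * vi for wi, vi in zip(w, v[s:s + n]))
            for s in range(len(v) - n + 1)]

def energy_hidden(v, w, h):
    """E(v, h) = -sum_i h_i * (w^T v_window_i)."""
    return -sum(hi * a for hi, a in zip(h, window_scores(v, w)))

def energy_pooled(v, w, h_pooled):
    """E(v, h') = -sum_g h'_g * max of the two window scores in group g."""
    a = window_scores(v, w)
    return -sum(hp * max(a[2 * g], a[2 * g + 1])
                for g, hp in enumerate(h_pooled))

v = [1, 0, 2, 1, 3, 0]
w = [1, -1, 2]

print(window_scores(v, w))               # [5, 0, 7, -2]
print(energy_pooled(v, w, [1, 1]))       # -12
print(energy_hidden(v, w, [1, 0, 1, 0])) # -12
```

In this example an active pooled unit has the same energy as switching on only the best-scoring hidden unit in its group, which is one way to interpret the max operation in terms of the energy function.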

The Classifier Layer

• We used an SVM as our final classifier
  – RBF kernel for MNIST
  – Linear kernel for INRIA
  – For INRIA we combined our 4th layer outputs and HOG features
• We experimentally observed that relaxing the sparsity of the CRBM's hidden units yields better results
  – This lets the discriminative model set the thresholds itself

Why are HOG features added?

• Because part-like features are very sparse

• Having a template of the human figure helps a lot


RBM

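• Two-layer pairwise MRF with a full set of hidden-visible connections
• RBM is an energy based model

$$ p(v, h; \theta) = \frac{1}{Z(\theta)} \exp\big(-E(v, h; \theta)\big) $$

$$ E(v, h; \theta) = -\sum_{i,j} v_i w_{ij} h_j - \sum_{i} b_i v_i - \sum_{j} c_j h_j + \frac{1}{2} \sum_{i} v_i^2 $$

• Hidden random variables are binary; visible variables can be binary or continuous
• Inference is straightforward: $p(h \mid v)$ and $p(v \mid h)$
• Contrastive Divergence learning for training

[Figure: hidden layer $h$ and visible layer $v$ connected by weights $w$]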
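For reference, the conditionals of an RBM with this kind of energy factorize over units. The block below writes them out, assuming the usual energy with weights $w_{ij}$, visible biases $b_i$, hidden biases $c_j$, and (for continuous visible units) the quadratic term $\frac{1}{2} v_i^2$ with unit variance.

```latex
% Hidden units are conditionally independent given v:
p(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_i v_i w_{ij}\Big),
\qquad \sigma(x) = \frac{1}{1 + \exp(-x)}

% Binary visible units (energy without the quadratic term):
p(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j w_{ij} h_j\Big)

% Continuous visible units (energy with the \tfrac{1}{2} v_i^2 term):
p(v_i \mid h) = \mathcal{N}\Big(v_i ;\; b_i + \sum_j w_{ij} h_j,\; 1\Big)
```

Because each conditional is a product of independent per-unit terms, one Gibbs sampling step costs only two matrix-vector products.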

Why Unsupervised Bottom-Up

• Discriminative learning of deep structure has not been successful
  – Requires large training sets
  – Easily over-fits for large models
  – First layer gradients are relatively small
• Alternative hybrid approach
  – Learn a large set of first layer features generatively
  – Later, switch to a discriminative model to select the discriminative features from those learned
  – Fine-tune the features using

INRIA Results (Cont'd)

• Miss rate at different FPPW (false positives per window) rates
• FPPI (false positives per image) is a better indicator of performance
• More experiments on the size of features and the number of layers are desired
