[Figure: cumulative histogram of the angle between every pair of dictionary items (x-axis: angle in degrees; y-axis: number of pairs whose cross-correlation angle exceeds the threshold, log scale), comparing Patch Based Training and Convolutional Training.]
[Figure: INRIA pedestrian detection results, miss rate vs. false positives per image (log-log). Legend: Shapelet orig (90.5%), PoseInvSvm (68.6%), VJ OpenCv (53.0%), PoseInv (51.4%), Shapelet (50.4%), VJ (47.5%), FtrMine (34.0%), Pls (23.4%), HOG (23.1%), HikSvm (21.9%), LatSvm V1 (17.5%), MultiFtr (15.6%), R+R+ (14.8%), U+U+ (11.5%), MultiFtr+CSS (10.9%), LatSvm V2 (9.3%), FPDW (9.3%), ChnFtrs (8.7%). A second panel shows the same plot with the R+R+ (14.8%) and U+U+ (11.5%) curves highlighted.]
[Figure: effect of the number of bootstrapping passes on INRIA, miss rate vs. false positives per image for the U+U+ model: bt0 (23.6%), bt1 (16.5%), bt2 (13.8%), bt6 (12.4%), bt3 (11.9%), bt5 (11.7%), bt4 (11.5%).]
Learning Convolutional Feature Hierarchies for Visual Recognition
Koray Kavukcuoglu¹, Pierre Sermanet¹, Y-Lan Boureau¹,², Karol Gregor¹, Michael Mathieu¹, Yann LeCun¹
¹Courant Institute of Mathematical Sciences - NYU, ²INRIA - Willow Project Team
Overview
Object Recognition on Caltech 101
Pedestrian Detection on INRIA
• One of the most widely used pedestrian detection benchmark datasets
• Detections are matched against ground-truth bounding boxes at 50% overlap
• 4 bootstrapping passes (a toy sketch of this hard-negative-mining loop is given below)
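The poster does not spell out the bootstrapping procedure; the sketch below is a toy, self-contained version of the standard hard-negative-mining loop on synthetic features. The feature dimension, sample counts, and the use of scikit-learn logistic regression are illustrative assumptions, not details from the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy bootstrapping (hard-negative mining) loop. Synthetic Gaussian features
# stand in for real window descriptors; nothing here is the paper's code.
rng = np.random.default_rng(0)
pos = rng.normal(+1.0, 1.0, size=(500, 64))        # positive (pedestrian) windows
neg_pool = rng.normal(-1.0, 1.5, size=(50_000, 64))  # large pool of background windows

neg = neg_pool[rng.choice(len(neg_pool), 500, replace=False)]  # initial random negatives
for _ in range(4):                                  # 4 bootstrapping passes
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    scores = clf.decision_function(neg_pool)        # scan the background pool
    hard = neg_pool[np.argsort(scores)[-500:]]      # highest-scoring false positives
    neg = np.vstack([neg, hard])                    # add hard negatives and retrain
```

Each pass scans the negative pool with the current detector, appends the highest-scoring false positives to the negative set, and retrains.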
[Figure: learned dictionary and predictor filters for the 1st and 2nd layers.]
• Sparse coding is a popular method for learning features in an unsupervised manner [Olshausenʼ97, Ranzatoʼ07, Kavukcuogluʼ08,ʼ09].
• Sparse coding is often trained on isolated image patches, which produces Gabor filters at various orientations and positions to cover the patch. This yields highly redundant representations.
• We use convolutional sparse coding, which is trained on large image regions and produces more diverse filters and less redundant representations [Zeilerʼ10, Leeʼ10].
• We use a feed-forward encoder to produce fast approximations of the sparse code [Ranzatoʼ07, Kavukcuogluʼ08,ʼ09, Jarrettʼ09].
• The method is used to pre-train the filters of convolutional networks that are subsequently fine-tuned with supervised back-prop [Hintonʼ06, Ranzatoʼ07, Bengioʼ07, Kavukcuogluʼ08,ʼ09, Jarrettʼ09].
• Competitive accuracies are achieved on object recognition and detection tasks.
Model
• Sparse Modeling (patch-based): input $x \in \mathbb{R}^{s \times s}$, dictionary $D \in \mathbb{R}^{K \times s \times s}$, representation $z \in \mathbb{R}^{K}$

$$\frac{1}{2}\,\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$$

• Sparse Modeling (convolutional): input $x \in \mathbb{R}^{w \times h}$, representation $z \in \mathbb{R}^{K \times (w-s+1) \times (h-s+1)}$

$$\frac{1}{2}\,\Big\|x - \sum_k D_k * z_k\Big\|_2^2 + \lambda \sum_{k,i,j} |z_{kij}|$$
• Each dictionary item k is a convolutional kernel connected to a feature map k
• Convolutional Training yields a more diverse set of filters
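For concreteness, both energies above can be evaluated directly with NumPy/SciPy. This is only an illustration of the objectives (the optimization over z and the learning of D are not shown), and the patch-based version assumes the dictionary has been vectorized into an (s·s, K) matrix:

```python
import numpy as np
from scipy.signal import convolve2d

def patch_energy(x, D, z, lam):
    """Patch-based sparse coding: x is an s*s patch flattened to a vector,
    D is an (s*s, K) matrix of vectorized kernels, z is (K,)."""
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def conv_energy(x, D, z, lam):
    """Convolutional sparse coding: x is (w, h), D is (K, s, s),
    z is (K, w-s+1, h-s+1); the reconstruction is the sum of full 2-D
    convolutions of each feature map with its kernel."""
    recon = sum(convolve2d(z[k], D[k], mode="full") for k in range(len(D)))
    return 0.5 * np.sum((x - recon) ** 2) + lam * np.sum(np.abs(z))
```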
Convolutional Predictive Sparse Decomposition
• Cumulative histogram of the angle between every pair of dictionary items
• The dictionary learned with convolutional training produces less redundant filters
• The minimum angle between any two convolutional dictionary items is 40°, where the angle between items $D_i$ and $D_j$ is measured as

$$\arccos\bigl(\bigl|\max\bigl(D_i * D_j^T\bigr)\bigr|\bigr)$$
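A minimal sketch of this redundancy measure, assuming unit-norm kernels so the arccos argument stays in [-1, 1] (the normalization and the explicit pairwise loop are illustrative assumptions):

```python
import numpy as np
from scipy.signal import correlate2d

def pairwise_angles(D):
    """D: (K, s, s) array of dictionary kernels. Returns the angle (degrees)
    for every pair i < j, taken as acos of the largest absolute value of the
    2-D cross-correlation over all shifts; the cumulative histogram in the
    figure can be built from these values."""
    D = D / np.linalg.norm(D.reshape(len(D), -1), axis=1)[:, None, None]
    angles = []
    for i in range(len(D)):
        for j in range(i + 1, len(D)):
            c = correlate2d(D[i], D[j], mode="full")
            angles.append(np.degrees(np.arccos(min(1.0, np.abs(c).max()))))
    return np.array(angles)
```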
• Reduce the cost of inference by training a feed-forward predictor function

$$\frac{1}{2}\,\Big\|x - \sum_k D_k * z_k\Big\|_2^2 + \lambda\,\|z\|_1 + \beta \sum_k \big\|z_k - f(W^k * x)\big\|_2^2$$
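As an illustration, the full ConvPSD objective can be written down as follows; taking f to be tanh and giving W one s×s filter per feature map are assumptions made for this sketch, not statements about the paper's exact encoder:

```python
import numpy as np
from scipy.signal import convolve2d

def convpsd_energy(x, D, W, z, lam, beta):
    """x: (w, h) image, D and W: (K, s, s) kernels, z: (K, w-s+1, h-s+1)."""
    recon = sum(convolve2d(z[k], D[k], mode="full") for k in range(len(D)))
    pred = np.stack([np.tanh(convolve2d(x, W[k], mode="valid"))   # f(W^k * x)
                     for k in range(len(W))])
    return (0.5 * np.sum((x - recon) ** 2)        # reconstruction term
            + lam * np.sum(np.abs(z))             # sparsity term
            + beta * np.sum((z - pred) ** 2))     # prediction term
```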
• Even a simple tanh non-linearity produces good accuracy for recognition
• Second-order derivative information is important (Levenberg–Marquardt)
• Better sparse predictions can be obtained by using a shrinking non-linearity:

$$\tilde{z} = g^k \times \tanh(W^k * x)$$

$$\mathrm{sh}_{\beta^k, b^k}(s) = \mathrm{sign}(s) \times \frac{1}{\beta^k}\,\log\bigl(\exp(\beta^k b^k) + \exp(\beta^k |s|) - 1\bigr) - b^k$$
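The two encoder forms above might be implemented as in the sketch below. Note that the -b offset is applied inside the sign(·) factor here so that the function is odd and flat around zero, which is how the plotted curves behave; that grouping is my reading of the formula, not something stated explicitly on the poster.

```python
import numpy as np
from scipy.signal import convolve2d

def smooth_shrink(s, beta, b):
    # Smooth shrinkage: ~0 on [-b, b], asymptotically sign(s)*(|s| - b).
    return np.sign(s) * ((1.0 / beta)
                         * np.log(np.exp(beta * b) + np.exp(beta * np.abs(s)) - 1.0) - b)

def encode_tanh(x, W, g):
    """z~_k = g^k * tanh(W^k * x): per-map gains g (K,), filters W (K, s, s)."""
    return np.stack([g[k] * np.tanh(convolve2d(x, W[k], mode="valid"))
                     for k in range(len(W))])

def encode_shrink(x, W, beta, b):
    """z~_k = sh_{beta^k, b^k}(W^k * x), with per-map parameters beta, b (K,)."""
    return np.stack([smooth_shrink(convolve2d(x, W[k], mode="valid"), beta[k], b[k])
                     for k in range(len(W))])
```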
[Figure: left, the shrinkage non-linearity sh_{β,b}(s) plotted for (β, b) = (10, 1), (3, 1), (1, 1), (10, 2); right, training loss vs. iteration for the g^k tanh(W^k ∗ x) encoder compared with the sh_{β,b}(W^k ∗ x) encoder.]
[Figure: two-stage architecture (x → z1 → z2). Each stage applies a filter bank, a non-linearity, and pooling; each stage is initialized with unsupervised pre-training, and the whole system is then given supervised refinement.]
[Figure: Caltech-101 recognition accuracy (%) for 1- and 2-stage systems, trained unsupervised (U) or unsupervised + supervised (U+), comparing Patch Based and Convolutional Training:]

                          1 Stage (U)   1 Stage (U+)   2 Stages (UU)   2 Stages (U+U+)
  Patch Based Training       52.2           54.2            63.7            65.5
  Convolutional Training     57.1           57.6            65.3            66.3
• Build a 2-stage model using the predictor function followed by absolute value rectification and local contrast normalization at each stage (a sketch of one stage is given below)
• The predictor function is initialized using Convolutional Predictive Sparse Decomposition (ConvPSD)
• The complete system is fine-tuned together with a linear logistic regression classifier
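A rough sketch of one such stage follows. The normalization window, the box-filter neighborhood, and the use of average pooling are illustrative assumptions rather than the exact settings used in the paper:

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import uniform_filter

def stage(x, W, norm_size=9, pool_size=4):
    """One stage: filter bank -> |.| rectification -> local contrast
    normalization -> average pooling. x: (w, h) map, W: (K, s, s) filters
    (e.g. initialized with ConvPSD)."""
    maps = np.stack([np.abs(convolve2d(x, W[k], mode="valid"))     # filter + rectify
                     for k in range(len(W))])
    mean = uniform_filter(maps, size=(1, norm_size, norm_size))    # local mean
    centered = maps - mean
    std = np.sqrt(uniform_filter(centered ** 2, size=(1, norm_size, norm_size)))
    normalized = centered / np.maximum(std, std.mean())            # divisive normalization
    K, h, w = normalized.shape
    h, w = h - h % pool_size, w - w % pool_size                    # crop to pooling grid
    pooled = normalized[:, :h, :w].reshape(
        K, h // pool_size, pool_size, w // pool_size, pool_size).mean(axis=(2, 4))
    return pooled
```

Stacking two such stages and feeding the second-stage output to a linear logistic regression classifier mirrors the pipeline described in the bullets above.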
Training Efficient Predictors
Unsupervised initialization using ConvPSD yields better accuracy than patch-based PSD.
Unsupervised Training
Supervised Fine-Tuning