[Figure: cumulative histogram of the angle between every pair of dictionary items (x-axis: angle in degrees; y-axis: number of pairs whose cross-correlation angle exceeds the threshold, log scale), comparing Patch Based Training and Convolutional Training.]
[Figure: INRIA pedestrian detection results, miss rate vs. false positives per image (log-log). Legend: Shapelet orig (90.5%), PoseInvSvm (68.6%), VJ OpenCv (53.0%), PoseInv (51.4%), Shapelet (50.4%), VJ (47.5%), FtrMine (34.0%), Pls (23.4%), HOG (23.1%), HikSvm (21.9%), LatSvm V1 (17.5%), MultiFtr (15.6%), R+R+ (14.8%), U+U+ (11.5%), MultiFtr+CSS (10.9%), LatSvm V2 (9.3%), FPDW (9.3%), ChnFtrs (8.7%). A second panel shows the same plot with the R+R+ (14.8%) and U+U+ (11.5%) curves highlighted.]
[Figure: effect of the number of bootstrapping passes on INRIA, miss rate vs. false positives per image for the U+U+ model: bt0 (23.6%), bt1 (16.5%), bt2 (13.8%), bt6 (12.4%), bt3 (11.9%), bt5 (11.7%), bt4 (11.5%).]
Learning Convolutional Feature Hierarchies for Visual Recognition
Koray Kavukcuoglu¹, Pierre Sermanet¹, Y-Lan Boureau¹,², Karol Gregor¹, Michael Mathieu¹, Yann LeCun¹
¹Courant Institute of Mathematical Sciences - NYU, ²INRIA - Willow Project Team
Overview
Object Recognition on Caltech 101
Pedestrian Detection on INRIA
• One of the most widely used pedestrian detection benchmark datasets
• Detections are matched against ground-truth bounding boxes at 50% overlap
• 4 bootstrapping passes (a toy sketch of this hard-negative-mining loop is given below)
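The poster does not spell out the bootstrapping procedure; the sketch below is a toy, self-contained version of the standard hard-negative-mining loop on synthetic features. The feature dimension, sample counts, and the use of scikit-learn logistic regression are illustrative assumptions, not details from the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy bootstrapping (hard-negative mining) loop. Synthetic Gaussian features
# stand in for real window descriptors; nothing here is the paper's code.
rng = np.random.default_rng(0)
pos = rng.normal(+1.0, 1.0, size=(500, 64))        # positive (pedestrian) windows
neg_pool = rng.normal(-1.0, 1.5, size=(50_000, 64))  # large pool of background windows

neg = neg_pool[rng.choice(len(neg_pool), 500, replace=False)]  # initial random negatives
for _ in range(4):                                  # 4 bootstrapping passes
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    scores = clf.decision_function(neg_pool)        # scan the background pool
    hard = neg_pool[np.argsort(scores)[-500:]]      # highest-scoring false positives
    neg = np.vstack([neg, hard])                    # add hard negatives and retrain
```

Each pass scans the negative pool with the current detector, appends the highest-scoring false positives to the negative set, and retrains.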
[Figure: learned dictionary and predictor filters for the 1st and 2nd layers.]
• Sparse coding is a popular method for learning features in an unsupervised manner [Olshausenʼ97, Ranzatoʼ07, Kavukcuogluʼ08,ʼ09].
• Sparse coding is often trained on isolated image patches, which produces Gabor filters at various orientations and positions to cover the patch. This yields highly redundant representations.
• We use convolutional sparse coding, which is trained on large image regions and produces more diverse filters and less redundant representations [Zeilerʼ10, Leeʼ10].
• We use a feed-forward encoder to produce fast approximations of the sparse code [Ranzatoʼ07, Kavukcuogluʼ08,ʼ09, Jarrettʼ09].
• The method is used to pre-train the filters of convolutional networks that are subsequently fine-tuned with supervised back-prop [Hintonʼ06, Ranzatoʼ07, Bengioʼ07, Kavukcuogluʼ08,ʼ09, Jarrettʼ09].
• Competitive accuracies are achieved on object recognition and detection tasks.
Model
• Sparse Modeling (patch-based): input $x \in \mathbb{R}^{s \times s}$, dictionary $D \in \mathbb{R}^{K \times s \times s}$, representation $z \in \mathbb{R}^{K}$

$$\frac{1}{2}\,\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$$

• Sparse Modeling (convolutional): input $x \in \mathbb{R}^{w \times h}$, representation $z \in \mathbb{R}^{K \times (w-s+1) \times (h-s+1)}$

$$\frac{1}{2}\,\Big\|x - \sum_k D_k * z_k\Big\|_2^2 + \lambda \sum_{k,i,j} |z_{kij}|$$
• Each dictionary item k is a convolutional kernel connected to a feature map k
• Convolutional Training yields a more diverse set of filters
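For concreteness, both energies above can be evaluated directly with NumPy/SciPy. This is only an illustration of the objectives (the optimization over z and the learning of D are not shown), and the patch-based version assumes the dictionary has been vectorized into an (s·s, K) matrix:

```python
import numpy as np
from scipy.signal import convolve2d

def patch_energy(x, D, z, lam):
    """Patch-based sparse coding: x is an s*s patch flattened to a vector,
    D is an (s*s, K) matrix of vectorized kernels, z is (K,)."""
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def conv_energy(x, D, z, lam):
    """Convolutional sparse coding: x is (w, h), D is (K, s, s),
    z is (K, w-s+1, h-s+1); the reconstruction is the sum of full 2-D
    convolutions of each feature map with its kernel."""
    recon = sum(convolve2d(z[k], D[k], mode="full") for k in range(len(D)))
    return 0.5 * np.sum((x - recon) ** 2) + lam * np.sum(np.abs(z))
```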
Convolutional Predictive Sparse Decomposition
• Cumulative histogram of the angle between every pair of dictionary items
• The dictionary learned with convolutional training produces less redundant filters
• The minimum angle between any two convolutional dictionary items is 40°, where the angle between items $D_i$ and $D_j$ is measured as

$$\arccos\bigl(\bigl|\max\bigl(D_i * D_j^T\bigr)\bigr|\bigr)$$
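A minimal sketch of this redundancy measure, assuming unit-norm kernels so the arccos argument stays in [-1, 1] (the normalization and the explicit pairwise loop are illustrative assumptions):

```python
import numpy as np
from scipy.signal import correlate2d

def pairwise_angles(D):
    """D: (K, s, s) array of dictionary kernels. Returns the angle (degrees)
    for every pair i < j, taken as acos of the largest absolute value of the
    2-D cross-correlation over all shifts; the cumulative histogram in the
    figure can be built from these values."""
    D = D / np.linalg.norm(D.reshape(len(D), -1), axis=1)[:, None, None]
    angles = []
    for i in range(len(D)):
        for j in range(i + 1, len(D)):
            c = correlate2d(D[i], D[j], mode="full")
            angles.append(np.degrees(np.arccos(min(1.0, np.abs(c).max()))))
    return np.array(angles)
```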
• Reduce the cost of inference by training a feed-forward predictor function

$$\frac{1}{2}\,\Big\|x - \sum_k D_k * z_k\Big\|_2^2 + \lambda\,\|z\|_1 + \beta \sum_k \big\|z_k - f(W^k * x)\big\|_2^2$$
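As an illustration, the full ConvPSD objective can be written down as follows; taking f to be tanh and giving W one s×s filter per feature map are assumptions made for this sketch, not statements about the paper's exact encoder:

```python
import numpy as np
from scipy.signal import convolve2d

def convpsd_energy(x, D, W, z, lam, beta):
    """x: (w, h) image, D and W: (K, s, s) kernels, z: (K, w-s+1, h-s+1)."""
    recon = sum(convolve2d(z[k], D[k], mode="full") for k in range(len(D)))
    pred = np.stack([np.tanh(convolve2d(x, W[k], mode="valid"))   # f(W^k * x)
                     for k in range(len(W))])
    return (0.5 * np.sum((x - recon) ** 2)        # reconstruction term
            + lam * np.sum(np.abs(z))             # sparsity term
            + beta * np.sum((z - pred) ** 2))     # prediction term
```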
• Even a simple tanh non-linearity produces good accuracy for recognition
• Second-order derivative information is important (Levenberg–Marquardt)
• Better sparse predictions can be obtained by using a shrinking non-linearity:

$$\tilde{z} = g^k \times \tanh(W^k * x)$$

$$\mathrm{sh}_{\beta^k, b^k}(s) = \mathrm{sign}(s) \times \frac{1}{\beta^k}\,\log\bigl(\exp(\beta^k b^k) + \exp(\beta^k |s|) - 1\bigr) - b^k$$
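The two encoder forms above might be implemented as in the sketch below. Note that the -b offset is applied inside the sign(·) factor here so that the function is odd and flat around zero, which is how the plotted curves behave; that grouping is my reading of the formula, not something stated explicitly on the poster.

```python
import numpy as np
from scipy.signal import convolve2d

def smooth_shrink(s, beta, b):
    # Smooth shrinkage: ~0 on [-b, b], asymptotically sign(s)*(|s| - b).
    return np.sign(s) * ((1.0 / beta)
                         * np.log(np.exp(beta * b) + np.exp(beta * np.abs(s)) - 1.0) - b)

def encode_tanh(x, W, g):
    """z~_k = g^k * tanh(W^k * x): per-map gains g (K,), filters W (K, s, s)."""
    return np.stack([g[k] * np.tanh(convolve2d(x, W[k], mode="valid"))
                     for k in range(len(W))])

def encode_shrink(x, W, beta, b):
    """z~_k = sh_{beta^k, b^k}(W^k * x), with per-map parameters beta, b (K,)."""
    return np.stack([smooth_shrink(convolve2d(x, W[k], mode="valid"), beta[k], b[k])
                     for k in range(len(W))])
```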
[Figure: left, the shrinkage non-linearity sh_{β,b}(s) plotted for (β, b) = (10, 1), (3, 1), (1, 1), (10, 2); right, training loss vs. iteration for the g^k tanh(W^k ∗ x) encoder compared with the sh_{β,b}(W^k ∗ x) encoder.]
[Figure: two-stage architecture (x → z1 → z2). Each stage applies a filter bank, a non-linearity, and pooling; each stage is initialized with unsupervised pre-training, and the whole system is then given supervised refinement.]
[Figure: Caltech-101 recognition accuracy (%) for 1- and 2-stage systems, trained unsupervised (U) or unsupervised + supervised (U+), comparing Patch Based and Convolutional Training:]

                          1 Stage (U)   1 Stage (U+)   2 Stages (UU)   2 Stages (U+U+)
  Patch Based Training       52.2           54.2            63.7            65.5
  Convolutional Training     57.1           57.6            65.3            66.3
• Build a 2-stage model using the predictor function followed by absolute value rectification and local contrast normalization at each stage (a sketch of one stage is given below)
• The predictor function is initialized using Convolutional Predictive Sparse Decomposition (ConvPSD)
• The complete system is fine-tuned together with a linear logistic regression classifier
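A rough sketch of one such stage follows. The normalization window, the box-filter neighborhood, and the use of average pooling are illustrative assumptions rather than the exact settings used in the paper:

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import uniform_filter

def stage(x, W, norm_size=9, pool_size=4):
    """One stage: filter bank -> |.| rectification -> local contrast
    normalization -> average pooling. x: (w, h) map, W: (K, s, s) filters
    (e.g. initialized with ConvPSD)."""
    maps = np.stack([np.abs(convolve2d(x, W[k], mode="valid"))     # filter + rectify
                     for k in range(len(W))])
    mean = uniform_filter(maps, size=(1, norm_size, norm_size))    # local mean
    centered = maps - mean
    std = np.sqrt(uniform_filter(centered ** 2, size=(1, norm_size, norm_size)))
    normalized = centered / np.maximum(std, std.mean())            # divisive normalization
    K, h, w = normalized.shape
    h, w = h - h % pool_size, w - w % pool_size                    # crop to pooling grid
    pooled = normalized[:, :h, :w].reshape(
        K, h // pool_size, pool_size, w // pool_size, pool_size).mean(axis=(2, 4))
    return pooled
```

Stacking two such stages and feeding the second-stage output to a linear logistic regression classifier mirrors the pipeline described in the bullets above.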
Training Efficient Predictors
Unsupervised initialization using ConvPSD yields better accuracy than patch-based PSD.
Unsupervised Training
Supervised Fine-Tuning