Feature Evaluation of Deep Convolutional Neural Networks for Object Recognition and Detection
Hirokatsu KATAOKA, Kenji Iwata, Yutaka SATOH
National Institute of Advanced Industrial Science and Technology (AIST)
http://www.hirokatsukataoka.net/
arXiv preprint arXiv:1509.07627 http://arxiv.org/abs/1509.07627
Feature Evaluation
• Feature evaluation is a significant task in computer vision
  – Following DeCAF [Donahue+, ICML2014], we evaluate several CNN features with an SVM classifier
  – Representative architectures: AlexNet [Krizhevsky+, NIPS2012] and VGGNet [Simonyan+, ICLR2015]
  – Basic idea 1: Which layer of a CNN architecture yields the better feature?
  – Basic idea 2: Mid- and high-level CNN features should be concatenated (e.g. Layer 3 + Layer 5 + Layer 7)
CNN Architecture & Feature Extraction
• AlexNet & VGGNet
  – AlexNet: 8-layer architecture
  – VGGNet: 16-layer architecture (the output of each pooling layer and of the last 2 FC layers is used as a feature vector)
[Figure: layer diagrams of AlexNet and VGGNet, with labels Layer1 to Layer7 on each network. Legend: image input, convolutional layer, max-pooling layer, fully-connected layer, softmax layer.]
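To make the idea of using a layer's output as a feature vector concrete, here is a minimal sketch in Python with NumPy. It assumes the activations have already been computed by some CNN framework. The function name `flatten_layer_outputs`, the layer names and the array shapes are hypothetical and are not the actual AlexNet or VGGNet sizes.

```python
import numpy as np

def flatten_layer_outputs(activations):
    """Turn each layer's activation array into one feature vector per image.

    `activations` maps a layer name to an array whose first axis is the
    batch: conv/pool outputs are (N, C, H, W), FC outputs are (N, D).
    """
    return {name: a.reshape(a.shape[0], -1) for name, a in activations.items()}

# Hypothetical activations for a batch of 2 images (illustrative shapes only).
rng = np.random.default_rng(0)
activations = {
    "layer5_pool": rng.standard_normal((2, 8, 3, 3)),
    "layer6_fc": rng.standard_normal((2, 16)),
    "layer7_fc": rng.standard_normal((2, 16)),
}

features = flatten_layer_outputs(activations)
for name, f in features.items():
    print(name, f.shape)
```

Pooling outputs keep their spatial layout until they are flattened, so their vectors are typically much longer than those of the FC layers.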
Experiment
• Settings
  – Layers: 3 to 7 (middle and deeper layers)
    • Conv., pooling and fully-connected layers
  – Concatenation and transformation
    • Layers 345, 456, 567, 357
    • Principal component analysis (PCA): 1,500 dims
  – Classifier
    • Support vector machine (SVM)
    • The parameters are based on DeCAF [Donahue+, ICML2014]
• Datasets
  – Daimler pedestrian benchmark dataset (pedestrian detection) [Munder+, TPAMI2006]
  – Caltech 101 dataset (object classification) [Fei-Fei+, CVPRW2004]
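The PCA-then-SVM pipeline in the settings above can be sketched as follows. This is an illustration under stated assumptions, not the configuration used in the experiments. The data are synthetic, the reduced dimension is 10 rather than 1,500, and the classifier is a simple linear SVM trained by subgradient descent on the hinge loss, swapped in for the DeCAF-based SVM settings, which the slides do not spell out. All function names are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the mean and the top principal axes of X (rows are samples)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project X onto the principal axes found by pca_fit."""
    return (X - mean) @ components.T

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss.

    y must hold labels in {-1, +1}.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1  # samples inside or beyond the margin
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -C * y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic stand-in for CNN features: two well-separated Gaussian classes.
rng = np.random.default_rng(0)
n_per_class, dim = 100, 50
X = np.vstack([rng.standard_normal((n_per_class, dim)) + 1.0,
               rng.standard_normal((n_per_class, dim)) - 1.0])
y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
order = rng.permutation(len(y))
train, test = order[:150], order[150:]

# Fit PCA on the training split only, then apply it to both splits.
mean, components = pca_fit(X[train], n_components=10)
Z_train = pca_transform(X[train], mean, components)
Z_test = pca_transform(X[test], mean, components)

w, b = train_linear_svm(Z_train, y[train])
accuracy = np.mean(np.sign(Z_test @ w + b) == y[test])
print(f"held-out accuracy: {accuracy:.2f}")
```

Fitting PCA on the training split alone keeps the held-out images from influencing the projection.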
Results on the Daimler dataset
• Daimler pedestrian benchmark dataset
  – VGGNet Layer 5 (original vector) gives the best rate (99.35%)
  – For AlexNet, Layer 3 with PCA gives the best rate (98.71%)
Mid-level layers tend to give better rates on the pedestrian detection data
Results on the Caltech 101 dataset
• Caltech 101 dataset
  – VGGNet Layer 5 (original vector) gives the best rate (91.80%)
  – For AlexNet, Layer 5 with PCA gives the best rate (78.37%)
The layer just before the FC layers gives a good rate in object classification
Feature Concatenation
• Three-layer concatenation with PCA
  – Layers 345, 456, 567, 357
  – 4,500 dimensions (1,500 dims from each layer's vector)
[Figure: results for the concatenated features. Left: Daimler. Right: Caltech 101.]
VGGNet Layer 567 is the best-performing concatenation
Pedestrian detection: mid-level feature
Object classification: high-level feature
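The dimension arithmetic of the three-layer concatenation (3 × 1,500 = 4,500) can be checked with a short sketch. It assumes each layer's features have already been reduced to 1,500 dimensions by PCA. Random arrays stand in for those reduced features, and the helper name `concatenate_layers` is hypothetical.

```python
import numpy as np

def concatenate_layers(reduced, combo):
    """Stack the PCA-reduced vectors of the layers in `combo` side by side."""
    return np.hstack([reduced[layer] for layer in combo])

n_images, dims_per_layer = 4, 1500
rng = np.random.default_rng(0)

# Placeholder for PCA-reduced features of Layers 3 to 7 (random values here).
reduced = {layer: rng.standard_normal((n_images, dims_per_layer))
           for layer in range(3, 8)}

# The four layer combinations listed on the slide.
combos = {"345": (3, 4, 5), "456": (4, 5, 6), "567": (5, 6, 7), "357": (3, 5, 7)}
concatenated = {name: concatenate_layers(reduced, combo)
                for name, combo in combos.items()}

for name, feats in concatenated.items():
    print(f"Layer {name}: {feats.shape}")  # each shape is (4, 4500)
```

Because every layer contributes the same number of dimensions after PCA, no single layer dominates the concatenated vector by length alone.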
Conclusion
• Feature evaluation with AlexNet & VGGNet
  – VGGNet performs better than AlexNet
  – Mid-level features are good for pedestrian detection, and high-level features are good for object classification
  – Concatenating the VGGNet 5th pooling layer and the last 2 FC layers is the best setting on the Daimler pedestrian benchmark and the Caltech 101 dataset
  – PCA is an effective transformation for CNN features