Very Deep ConvNets for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman, University of Oxford
Overview
• ConvNet design choices: depth, width, receptive fields
• Variety of models with different depth
• How does ConvNet depth affect accuracy on ImageNet?
ConvNet Model                                               Dataset             Conv. & FC layers
LeNet-5 [LeCun et al., 1998]                                MNIST               5
[Ciresan et al., 2012]                                      CIFAR               7
AlexNet [Krizhevsky et al., 2012], [Zeiler & Fergus, 2012]  ImageNet Challenge  8
OverFeat [Sermanet et al., 2012]                            ImageNet Challenge  9
[Goodfellow et al., 2013]                                   SVHN                11
Contributions
• Comparison of very deep ConvNets
  • simple architecture design
  • increasing depth: from 11 to 19 layers
• Evaluation of very deep features on other datasets
• The models are publicly available
[Diagram: AlexNet (8 weight layers: image → conv-64 → conv-192 → conv-384 → conv-256 → conv-256 → FC-4096 → FC-4096 → FC-1000) shown side by side with the much deeper 19-layer network]
Network Design
• Single family of networks
• Key design choices:
  • 3x3 conv. kernels – very small
  • stacks of conv. layers w/o pooling (but with ReLU)
  • conv. stride 1 – no skipping
• Other details are conventional:
  • rectification (ReLU) non-linearity
  • 5 max-pool layers (x2 downsampling)
  • no normalisation
  • 3 fully-connected layers

[Diagram: 13-layer network — image → conv-64, conv-64 → maxpool → conv-128, conv-128 → maxpool → conv-256, conv-256 → maxpool → conv-512, conv-512 → maxpool → conv-512, conv-512 → maxpool → FC-4096 → FC-4096 → FC-1000 → softmax]
Why Stacks of 3x3 Filters?
• A stack of 3x3 layers has a large receptive field
  • two 3x3 layers – 5x5 receptive field
  • three 3x3 layers – 7x7 receptive field
• More ReLU non-linearities → more discriminative
• Fewer parameters than a single large-kernel layer
• Design simplicity – no need to tune layer parameters

[Diagram: a 2nd 3x3 conv. layer stacked on a 1st 3x3 conv. layer covers a 5x5 receptive field]
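The parameter saving is easy to verify: for C input and C output channels, three stacked 3x3 layers use 27C² weights against 49C² for a single 7x7 layer with the same receptive field. A back-of-the-envelope Python sketch (biases ignored; the function names are my own):

```python
def conv_params(kernel, channels):
    """Weights of one conv. layer with a square kernel and
    equal input/output channel counts (biases ignored)."""
    return kernel * kernel * channels * channels

def receptive_field(num_3x3_layers):
    """Receptive field of a stack of stride-1 3x3 conv. layers."""
    return 2 * num_3x3_layers + 1

C = 512  # channel count, e.g. the widest VGG blocks

stacked = 3 * conv_params(3, C)  # three 3x3 layers: 27 * C^2 weights
single = conv_params(7, C)       # one 7x7 layer:    49 * C^2 weights
assert receptive_field(3) == 7   # same coverage as a single 7x7 layer
```

With C = 512 the stack saves roughly 45% of the weights while adding two extra ReLUs.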
The Networks
• Start from 11 layers, inject more conv. layers

[Diagram: the four configurations, shown as columns from image (bottom) to softmax (top). A 2x2 max-pool follows each conv. block, and every network ends in FC-4096 → FC-4096 → FC-1000 → softmax:]

11-layer: image → conv-64 → maxpool → conv-128 → maxpool → conv-256, conv-256 → maxpool → conv-512, conv-512 → maxpool → conv-512, conv-512 → maxpool → FC layers
13-layer: the 11-layer net with a second conv-64 and a second conv-128 layer
16-layer: the 13-layer net with a third conv. layer in each conv-256 and conv-512 block
19-layer: the 13-layer net with a third and fourth conv. layer in each conv-256 and conv-512 block
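The configurations differ only in how many 3x3 conv. layers sit between successive max-pools, so the whole family fits in a few lists. A Python sketch; the encoding (channel counts with 'M' marking a max-pool) is my own shorthand, not the paper's notation:

```python
# Conv. channel counts per block; 'M' marks a 2x2 max-pool layer.
CONFIGS = {
    11: [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    13: [64, 64, 'M', 128, 128, 'M', 256, 256, 'M',
         512, 512, 'M', 512, 512, 'M'],
    16: [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M'],
    19: [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
         512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}

def weight_layers(config, num_fc=3):
    """Count layers with weights: the 3x3 convs plus the FC layers."""
    return sum(1 for v in config if v != 'M') + num_fc

assert weight_layers(CONFIGS[19]) == 19
```

Adding the 3 fully-connected layers to the conv. counts reproduces exactly the 11/13/16/19 depths in the network names.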
Training
• ConvNet input during training: fixed-size 224x224 crop
• But images have various sizes…
• Solution:
  • rescale images to a certain size (preserving aspect ratio)
  • random 224x224 crop
• Image size affects the scale of image statistics seen by a ConvNet
  • single-scale: 256xN or 384xN
  • multi-scale: randomly sample the size for each image from 256xN to 512xN
• Standard augmentation: random flip and RGB shift

[Diagram: 224x224 crops sampled from rescaled images of size 256xN (N≥256) and 384xN (N≥384)]
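The scale-jittered cropping amounts to simple coordinate arithmetic: sample a training scale S, rescale so the short side equals S (preserving aspect ratio), then take a random 224x224 crop and a random horizontal flip. A minimal sketch; the function name and return format are illustrative, not from the paper:

```python
import random

def sample_train_crop(width, height, s_min=256, s_max=512, crop=224):
    """Multi-scale jitter: rescale so the short image side equals a
    random S in [s_min, s_max], then pick a random crop and flip."""
    s = random.randint(s_min, s_max)      # per-image training scale
    scale = s / min(width, height)        # preserve aspect ratio
    w, h = round(width * scale), round(height * scale)
    x = random.randint(0, w - crop)       # crop offsets within the
    y = random.randint(0, h - crop)       # rescaled image
    flip = random.random() < 0.5
    return (x, y, crop, crop), (w, h), flip
```

Since S ≥ 256 > 224, a 224x224 crop always fits inside the rescaled image.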
Training (2)
• Mini-batch gradient descent with momentum
• Regularisation: dropout and weight decay
• Fast convergence (74 training epochs)
• Initialisation:
  • deep nets are prone to vanishing/exploding gradients
  • 11-layer net: random initialisation from N(0, 0.01)
  • deeper nets:
    • top and bottom layers initialised with the 11-layer net
    • other layers – random initialisation
  • also possible to initialise all layers randomly [Glorot & Bengio, 2010]
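The optimiser on this slide is the classic momentum update combined with weight decay, written here for a single parameter; the default hyper-parameters (lr 0.01, momentum 0.9, decay 5e-4) are the commonly used settings and should be treated as an assumption of this sketch:

```python
def sgd_momentum_step(w, g, v, lr=0.01, momentum=0.9, weight_decay=5e-4):
    """One mini-batch update: the velocity v accumulates the gradient g
    plus the weight-decay term, and is then added to the weight w."""
    v = momentum * v - lr * (g + weight_decay * w)
    return w + v, v

# Toy scalar example: one update from w = 1.0 with gradient 0.5.
w, v = sgd_momentum_step(1.0, 0.5, 0.0)
```

Dropout acts separately on the FC layers; weight decay enters the update directly as the `weight_decay * w` term above.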
Testing
• ConvNet evaluation on variable-size images:
  • testing on multiple 224x224 crops [AlexNet]
  • dense application over the whole image [OverFeat]:
    • all layers are treated as convolutional
    • sum-pooling of class score maps
    • more efficient than multiple crops
  • combination of both
• Multiple image sizes: 256xN, 384xN, 512xN; class scores averaged

[Diagram: image → conv. layers → class score map → global pooling → class scores]
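In dense evaluation the FC layers are reinterpreted as convolutions, so an arbitrary-size image produces a spatial class score map; pooling that map yields one score vector, and vectors from several image sizes are averaged. A numpy sketch of just the pooling/averaging step (the score maps are stand-ins, and mean pooling equals the slide's sum-pooling up to a constant factor):

```python
import numpy as np

def pool_scores(score_map):
    """Collapse an (H, W, num_classes) class score map to one vector
    by global pooling over the spatial dimensions."""
    return score_map.mean(axis=(0, 1))

def multi_scale_scores(score_maps):
    """Average pooled scores over differently-sized versions of the
    same image (e.g. 256xN, 384xN, 512xN); maps may differ in H, W."""
    return np.mean([pool_scores(m) for m in score_maps], axis=0)
```

Because each map is pooled before averaging, the per-scale maps are free to have different spatial sizes, which is exactly what variable-size input produces.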
Implementation
• Heavily-modified Caffe toolbox
• Multiple GPU support:
  • 4 x NVIDIA Titan in an off-the-shelf workstation
  • synchronous data parallelism over the image mini-batch
  • ~3.75 times speed-up; 2-3 weeks for training
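Synchronous data parallelism keeps training mathematically identical to the single-GPU case: the mini-batch is split across the GPUs, per-GPU gradients are computed in parallel, and their average drives one shared update. A toy numpy illustration, with a linear least-squares model standing in for the ConvNet:

```python
import numpy as np

def batch_gradient(w, x, y):
    """Mean-squared-error gradient of a linear model over a batch."""
    return 2 * x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
x, y, w = rng.normal(size=(16, 4)), rng.normal(size=16), np.zeros(4)

# Split the mini-batch across 4 'GPUs' and average their gradients.
parts = np.split(np.arange(16), 4)
g_parallel = np.mean([batch_gradient(w, x[p], y[p]) for p in parts], axis=0)

# Identical (up to float error) to the full mini-batch gradient.
g_full = batch_gradient(w, x, y)
```

Because the per-GPU batches are equal-sized, the average of the four partial gradients equals the full-batch gradient, which is why the ~3.75x speed-up comes without any change in convergence behaviour.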
Evaluation on ImageNet Classification

Effect of Depth and Scale
• Error decreases with depth
• Using multiple scales is important
  • multi-scale training outperforms single-scale
  • multi-scale testing further improves the results

[Chart: top-5 classification error (val. set) for the 11-, 13-, 16-, and 19-layer networks under single/single, multi/single, and multi/multi train/test scales; error falls from 10.4% (11 layers, single-scale) to 7.5% (16 and 19 layers, multi-scale)]
Evaluation: Dense vs Multi-Crop
• Dense evaluation is on par with multi-crop
• Dense & multi-crop are complementary
• Combining predictions from 2 nets is beneficial

[Chart: top-5 classification error (val. set) for the 16-layer, 19-layer, and combined 16 & 19-layer networks, evaluated with dense, 150 crops, and dense & 150 crops; the best combination reaches 6.8%]
ImageNet Challenge 2014 (Classification)
• 1st place: GoogLeNet (6.7% with 7 nets, 7.9% with 1 net)
• 2nd place: this work
  • submission: 7.3% with 2 nets, 8.4% with 1 net
  • post-submission: 6.8% with 2 nets, 7.0% with 1 net

[Chart: top-5 classification error (test set) of the leading entries, single net vs multiple nets]
Comparison with GoogLeNet
• models developed independently
• both are very deep: 19 (VGG) vs 22 (GoogLeNet) weight layers
• mitigating gradient instability: pre-training (VGG) vs auxiliary losses (GoogLeNet)
• multi-scale training of both
• filter configuration:
  • VGG: only 3x3, stride 1 convolution
  • GoogLeNet: 5x5, 3x3, 1x1 filters combined into complex multi-branch modules to reduce computational complexity
• speed: GoogLeNet is several times faster
• single-model accuracy: VGG is ~0.5% better

[Diagram: the 19-layer VGG network]
Recent Advances
• Deep Image: Scaling up Image Recognition
  • 5.33% error
  • models based on VGG-16, but wider
• Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
  • 4.94% error
  • models based on VGG-19, but deeper and wider, with PReLU
• Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  • 4.82% error
  • Inception layers with batch normalisation
Beyond ImageNet Classification…
ImageNet Challenge 2014 (Localisation)
• Object bounding box prediction:
  • very deep location regressor (similar to OverFeat)
  • last layer predicts a box for each class
  • initialised with classification models
  • fine-tuning of all layers
• 1st place: VGG (25.3% error), 2nd place: GoogLeNet (26.4%)
• Outperforms OverFeat by using deeper representations

[Chart: top-5 localisation error (test set) – OverFeat (2013): 29.9%, GoogLeNet: 26.4%, VGG: 25.3%. Diagram: 224x224 crop → predicted object box]
Very Deep Features on Other Datasets
• ImageNet-pretrained ConvNets are excellent feature extractors on other datasets
• We compare against the state of the art, based on less deep ConvNet features, trained on ILSVRC
• Simple recognition pipeline:
  • last fully-connected (classification) layer is removed
  • dense evaluation over the whole image → global sum-pooling
  • aggregation over several scales
  • L2 normalisation → linear SVM
  • no fine-tuning
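The L2-normalisation and scale-aggregation steps of this pipeline are simple to write down. A numpy sketch; the feature shapes and the use of averaging for scale aggregation are assumptions for illustration, and the resulting rows would be fed to a linear SVM:

```python
import numpy as np

def aggregate_scales(per_scale_features):
    """Combine pooled ConvNet features computed at several image
    scales; shape (num_scales, num_images, dim) -> (num_images, dim)."""
    return np.mean(per_scale_features, axis=0)

def l2_normalise(features, eps=1e-12):
    """Scale each feature row to unit L2 norm before the linear SVM."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

# Hypothetical pooled 4096-d features at 3 scales for 2 images:
raw = np.random.default_rng(1).normal(size=(3, 2, 4096))
feats = l2_normalise(aggregate_scales(raw))
```

No fine-tuning is involved: the ConvNet weights stay fixed, and only the SVM on top is trained per target dataset.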
Results on Caltech & PASCAL VOC
• Very deep features are competitive or better than the state of the art despite a simple pipeline
• Two ConvNet extractors are complementary
• Even better results have been recently reported using our released models

Method                                      Caltech 101  Caltech 256  VOC 2007  VOC 2012 (class-n)  VOC 2012 (actions)
Best published result (shallower ConvNets)  93.4         77.6         81.5      81.7                76.3
19-layer                                    92.3         85.1         89.3      89.0                -
16 & 19-layer                               92.7         86.2         89.7      89.3                84.0
Other Tasks…
Since the public release of the models, they have been successfully applied by the community to:
• object detection
• semantic segmentation
• image caption generation
• texture classification
• etc.
Summary
• Representation depth is important
• Excellent results using very deep ConvNets
  • simple architecture, small receptive fields
  • but very deep → lots of non-linearity
• VGG-16 & VGG-19 models publicly available from:
  • http://www.robots.ox.ac.uk/~vgg/research/very_deep/
  • formats: Caffe and MatConvNet
  • can be used with any package with a cuDNN back-end, e.g. Torch and Theano

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.