Very Deep ConvNets for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman, University of Oxford
Overview
• ConvNet design choices: depth, width, receptive fields
• Variety of models with different depth
• How does ConvNet depth affect accuracy on ImageNet?
ConvNet Model                                               Dataset             Conv. & FC layers
LeNet-5 [LeCun et al., 1998]                                MNIST               5
[Ciresan et al., 2012]                                      CIFAR               7
AlexNet [Krizhevsky et al., 2012], [Zeiler & Fergus, 2012]  ImageNet Challenge  8
OverFeat [Sermanet et al., 2012]                            ImageNet Challenge  9
[Goodfellow et al., 2013]                                   SVHN                11
Contributions
• Comparison of very deep ConvNets
  • simple architecture design
  • increasing depth: from 11 to 19 layers
• Evaluation of very deep features on other datasets
• The models are publicly available
[Diagram: AlexNet (8 weight layers: image → conv-64 → conv-192 → conv-384 → conv-256 → conv-256 → FC-4096 → FC-4096 → FC-1000) shown side by side with the much deeper 19-layer network]
Network Design
• Single family of networks
• Key design choices:
  • 3x3 conv. kernels – very small
  • stacks of conv. layers w/o pooling (but with ReLU)
  • conv. stride 1 – no skipping
• Other details are conventional:
  • rectification (ReLU) non-linearity
  • 5 max-pool layers (x2 downsampling)
  • no normalisation
  • 3 fully-connected layers

[Diagram: 13-layer network — image → conv-64, conv-64 → maxpool → conv-128, conv-128 → maxpool → conv-256, conv-256 → maxpool → conv-512, conv-512 → maxpool → conv-512, conv-512 → maxpool → FC-4096 → FC-4096 → FC-1000 → softmax]
Why Stacks of 3x3 Filters?
• A stack of 3x3 layers has a large receptive field
  • two 3x3 layers – 5x5 receptive field
  • three 3x3 layers – 7x7 receptive field
• More ReLU non-linearities → more discriminative
• Fewer parameters than a single large-kernel layer
• Design simplicity – no need to tune layer parameters

[Diagram: a 2nd 3x3 conv. layer stacked on a 1st 3x3 conv. layer covers a 5x5 receptive field]
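The parameter saving is easy to verify: for C input and C output channels, three stacked 3x3 layers use 27C² weights against 49C² for a single 7x7 layer with the same receptive field. A back-of-the-envelope Python sketch (biases ignored; the function names are my own):

```python
def conv_params(kernel, channels):
    """Weights of one conv. layer with a square kernel and
    equal input/output channel counts (biases ignored)."""
    return kernel * kernel * channels * channels

def receptive_field(num_3x3_layers):
    """Receptive field of a stack of stride-1 3x3 conv. layers."""
    return 2 * num_3x3_layers + 1

C = 512  # channel count, e.g. the widest VGG blocks

stacked = 3 * conv_params(3, C)  # three 3x3 layers: 27 * C^2 weights
single = conv_params(7, C)       # one 7x7 layer:    49 * C^2 weights
assert receptive_field(3) == 7   # same coverage as a single 7x7 layer
```

With C = 512 the stack saves roughly 45% of the weights while adding two extra ReLUs.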
The Networks
• Start from 11 layers, inject more conv. layers

[Diagram: the four configurations, shown as columns from image (bottom) to softmax (top). A 2x2 max-pool follows each conv. block, and every network ends in FC-4096 → FC-4096 → FC-1000 → softmax:]

11-layer: image → conv-64 → maxpool → conv-128 → maxpool → conv-256, conv-256 → maxpool → conv-512, conv-512 → maxpool → conv-512, conv-512 → maxpool → FC layers
13-layer: the 11-layer net with a second conv-64 and a second conv-128 layer
16-layer: the 13-layer net with a third conv. layer in each conv-256 and conv-512 block
19-layer: the 13-layer net with a third and fourth conv. layer in each conv-256 and conv-512 block
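The configurations differ only in how many 3x3 conv. layers sit between successive max-pools, so the whole family fits in a few lists. A Python sketch; the encoding (channel counts with 'M' marking a max-pool) is my own shorthand, not the paper's notation:

```python
# Conv. channel counts per block; 'M' marks a 2x2 max-pool layer.
CONFIGS = {
    11: [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    13: [64, 64, 'M', 128, 128, 'M', 256, 256, 'M',
         512, 512, 'M', 512, 512, 'M'],
    16: [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M'],
    19: [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
         512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}

def weight_layers(config, num_fc=3):
    """Count layers with weights: the 3x3 convs plus the FC layers."""
    return sum(1 for v in config if v != 'M') + num_fc

assert weight_layers(CONFIGS[19]) == 19
```

Adding the 3 fully-connected layers to the conv. counts reproduces exactly the 11/13/16/19 depths in the network names.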
Training
• ConvNet input during training: fixed-size 224x224 crop
• But images have various sizes…
• Solution:
  • rescale images to a certain size (preserving aspect ratio)
  • random 224x224 crop
• Image size affects the scale of image statistics seen by a ConvNet
  • single-scale: 256xN or 384xN
  • multi-scale: randomly sample the size for each image from 256xN to 512xN
• Standard augmentation: random flip and RGB shift

[Diagram: 224x224 crops sampled from rescaled images of size 256xN (N≥256) and 384xN (N≥384)]
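The scale-jittered cropping amounts to simple coordinate arithmetic: sample a training scale S, rescale so the short side equals S (preserving aspect ratio), then take a random 224x224 crop and a random horizontal flip. A minimal sketch; the function name and return format are illustrative, not from the paper:

```python
import random

def sample_train_crop(width, height, s_min=256, s_max=512, crop=224):
    """Multi-scale jitter: rescale so the short image side equals a
    random S in [s_min, s_max], then pick a random crop and flip."""
    s = random.randint(s_min, s_max)      # per-image training scale
    scale = s / min(width, height)        # preserve aspect ratio
    w, h = round(width * scale), round(height * scale)
    x = random.randint(0, w - crop)       # crop offsets within the
    y = random.randint(0, h - crop)       # rescaled image
    flip = random.random() < 0.5
    return (x, y, crop, crop), (w, h), flip
```

Since S ≥ 256 > 224, a 224x224 crop always fits inside the rescaled image.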
Training (2)
• Mini-batch gradient descent with momentum
• Regularisation: dropout and weight decay
• Fast convergence (74 training epochs)
• Initialisation:
  • deep nets are prone to vanishing/exploding gradients
  • 11-layer net: random initialisation from N(0, 0.01)
  • deeper nets:
    • top and bottom layers initialised with the 11-layer net
    • other layers – random initialisation
  • also possible to initialise all layers randomly [Glorot & Bengio, 2010]
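The optimiser on this slide is the classic momentum update combined with weight decay, written here for a single parameter; the default hyper-parameters (lr 0.01, momentum 0.9, decay 5e-4) are the commonly used settings and should be treated as an assumption of this sketch:

```python
def sgd_momentum_step(w, g, v, lr=0.01, momentum=0.9, weight_decay=5e-4):
    """One mini-batch update: the velocity v accumulates the gradient g
    plus the weight-decay term, and is then added to the weight w."""
    v = momentum * v - lr * (g + weight_decay * w)
    return w + v, v

# Toy scalar example: one update from w = 1.0 with gradient 0.5.
w, v = sgd_momentum_step(1.0, 0.5, 0.0)
```

Dropout acts separately on the FC layers; weight decay enters the update directly as the `weight_decay * w` term above.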
Testing
• ConvNet evaluation on variable-size images:
  • testing on multiple 224x224 crops [AlexNet]
  • dense application over the whole image [OverFeat]:
    • all layers are treated as convolutional
    • sum-pooling of class score maps
    • more efficient than multiple crops
  • combination of both
• Multiple image sizes: 256xN, 384xN, 512xN; class scores averaged

[Diagram: image → conv. layers → class score map → global pooling → class scores]
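In dense evaluation the FC layers are reinterpreted as convolutions, so an arbitrary-size image produces a spatial class score map; pooling that map yields one score vector, and vectors from several image sizes are averaged. A numpy sketch of just the pooling/averaging step (the score maps are stand-ins, and mean pooling equals the slide's sum-pooling up to a constant factor):

```python
import numpy as np

def pool_scores(score_map):
    """Collapse an (H, W, num_classes) class score map to one vector
    by global pooling over the spatial dimensions."""
    return score_map.mean(axis=(0, 1))

def multi_scale_scores(score_maps):
    """Average pooled scores over differently-sized versions of the
    same image (e.g. 256xN, 384xN, 512xN); maps may differ in H, W."""
    return np.mean([pool_scores(m) for m in score_maps], axis=0)
```

Because each map is pooled before averaging, the per-scale maps are free to have different spatial sizes, which is exactly what variable-size input produces.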
Implementation
• Heavily-modified Caffe toolbox
• Multiple GPU support:
  • 4 x NVIDIA Titan in an off-the-shelf workstation
  • synchronous data parallelism over the image mini-batch
  • ~3.75 times speed-up; 2-3 weeks for training
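Synchronous data parallelism keeps training mathematically identical to the single-GPU case: the mini-batch is split across the GPUs, per-GPU gradients are computed in parallel, and their average drives one shared update. A toy numpy illustration, with a linear least-squares model standing in for the ConvNet:

```python
import numpy as np

def batch_gradient(w, x, y):
    """Mean-squared-error gradient of a linear model over a batch."""
    return 2 * x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
x, y, w = rng.normal(size=(16, 4)), rng.normal(size=16), np.zeros(4)

# Split the mini-batch across 4 'GPUs' and average their gradients.
parts = np.split(np.arange(16), 4)
g_parallel = np.mean([batch_gradient(w, x[p], y[p]) for p in parts], axis=0)

# Identical (up to float error) to the full mini-batch gradient.
g_full = batch_gradient(w, x, y)
```

Because the per-GPU batches are equal-sized, the average of the four partial gradients equals the full-batch gradient, which is why the ~3.75x speed-up comes without any change in convergence behaviour.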
Evaluation on ImageNet Classification

Effect of Depth and Scale
• Error decreases with depth
• Using multiple scales is important
  • multi-scale training outperforms single-scale
  • multi-scale testing further improves the results

[Chart: top-5 classification error (val. set) for the 11-, 13-, 16-, and 19-layer networks under single/single, multi/single, and multi/multi train/test scales; error falls from 10.4% (11 layers, single-scale) to 7.5% (16 and 19 layers, multi-scale)]
Evaluation: Dense vs Multi-Crop
• Dense evaluation is on par with multi-crop
• Dense & multi-crop are complementary
• Combining predictions from 2 nets is beneficial

[Chart: top-5 classification error (val. set) for the 16-layer, 19-layer, and combined 16 & 19-layer networks, evaluated with dense, 150 crops, and dense & 150 crops; the best combination reaches 6.8%]
ImageNet Challenge 2014 (Classification)
• 1st place: GoogLeNet (6.7% with 7 nets, 7.9% with 1 net)
• 2nd place: this work
  • submission: 7.3% with 2 nets, 8.4% with 1 net
  • post-submission: 6.8% with 2 nets, 7.0% with 1 net

[Chart: top-5 classification error (test set) of the leading entries, single net vs multiple nets]
Comparison with GoogLeNet
• models developed independently
• both are very deep: 19 (VGG) vs 22 (GoogLeNet) weight layers
• mitigating gradient instability: pre-training (VGG) vs auxiliary losses (GoogLeNet)
• multi-scale training of both
• filter configuration:
  • VGG: only 3x3, stride 1 convolution
  • GoogLeNet: 5x5, 3x3, 1x1 filters combined into complex multi-branch modules to reduce computational complexity
• speed: GoogLeNet is several times faster
• single-model accuracy: VGG is ~0.5% better

[Diagram: the 19-layer VGG network]
Recent Advances
• Deep Image: Scaling up Image Recognition
  • 5.33% error
  • models based on VGG-16, but wider
• Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
  • 4.94% error
  • models based on VGG-19, but deeper and wider, with PReLU
• Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  • 4.82% error
  • Inception layers with batch normalisation
Beyond ImageNet Classification…
ImageNet Challenge 2014 (Localisation)
• Object bounding box prediction:
  • very deep location regressor (similar to OverFeat)
  • last layer predicts a box for each class
  • initialised with classification models
  • fine-tuning of all layers
• 1st place: VGG (25.3% error), 2nd place: GoogLeNet (26.4%)
• Outperforms OverFeat by using deeper representations

[Chart: top-5 localisation error (test set) – OverFeat (2013): 29.9%, GoogLeNet: 26.4%, VGG: 25.3%. Diagram: 224x224 crop → predicted object box]
Very Deep Features on Other Datasets
• ImageNet-pretrained ConvNets are excellent feature extractors on other datasets
• We compare against the state of the art, based on less deep ConvNet features, trained on ILSVRC
• Simple recognition pipeline:
  • last fully-connected (classification) layer is removed
  • dense evaluation over the whole image → global sum-pooling
  • aggregation over several scales
  • L2 normalisation → linear SVM
  • no fine-tuning
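The L2-normalisation and scale-aggregation steps of this pipeline are simple to write down. A numpy sketch; the feature shapes and the use of averaging for scale aggregation are assumptions for illustration, and the resulting rows would be fed to a linear SVM:

```python
import numpy as np

def aggregate_scales(per_scale_features):
    """Combine pooled ConvNet features computed at several image
    scales; shape (num_scales, num_images, dim) -> (num_images, dim)."""
    return np.mean(per_scale_features, axis=0)

def l2_normalise(features, eps=1e-12):
    """Scale each feature row to unit L2 norm before the linear SVM."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

# Hypothetical pooled 4096-d features at 3 scales for 2 images:
raw = np.random.default_rng(1).normal(size=(3, 2, 4096))
feats = l2_normalise(aggregate_scales(raw))
```

No fine-tuning is involved: the ConvNet weights stay fixed, and only the SVM on top is trained per target dataset.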
Results on Caltech & PASCAL VOC
• Very deep features are competitive or better than the state of the art despite a simple pipeline
• Two ConvNet extractors are complementary
• Even better results have been recently reported using our released models

Method                                      Caltech 101  Caltech 256  VOC 2007  VOC 2012 (class-n)  VOC 2012 (actions)
Best published result (shallower ConvNets)  93.4         77.6         81.5      81.7                76.3
19-layer                                    92.3         85.1         89.3      89.0                -
16 & 19-layer                               92.7         86.2         89.7      89.3                84.0
Other Tasks…
Since the public release of the models, they have been successfully applied by the community to:
• object detection
• semantic segmentation
• image caption generation
• texture classification
• etc.
Summary
• Representation depth is important
• Excellent results using very deep ConvNets
  • simple architecture, small receptive fields
  • but very deep → lots of non-linearity
• VGG-16 & VGG-19 models publicly available from:
  • http://www.robots.ox.ac.uk/~vgg/research/very_deep/
  • formats: Caffe and MatConvNet
  • can be used with any package with a cuDNN back-end, e.g. Torch and Theano

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.