ILSVRC Tutorial @ CVPR 2015 · 7 June 2015 · Karen Simonyan · image-net.org/tutorials/cvpr2015/recent.pdf · ILSVRC Submission Essentials in the Light of Recent Developments

ILSVRC Submission Essentials in the Light of Recent Developments

ILSVRC Tutorial @ CVPR 2015

7 June 2015

Karen Simonyan

Outline
•  Architectures
  – Convolutional Networks: recap
  – The importance of depth in image representations
    •  very deep ConvNets (VGG-Net and extensions)
    •  Inception modules (GoogLeNet)
•  Training
  – Optimisation
  – Data augmentation
•  Evaluation
•  References

Convolutional Networks
•  State-of-the-art in image recognition
  – winner of ILSVRC since 2012
•  ConvNet: hierarchical image representation [LeCun et al., 89, 98]
  – stack of conv. layers, interleaved with non-linearities
  – typically followed by fully-connected layers

[Figure: ConvNet schematic]

Convolutional Networks (2)
•  Important conv. layer properties:
  – locality: objects/parts have local spatial support
  – weight sharing: translation equivariance
•  Conv. layers operate across all channels, not just one
•  Each layer is followed by a non-linearity (activation function), e.g. ReLU: max(W*x, 0)
•  Some layers are followed by spatial pooling
  – max- or sum-pooling
  – invariance to local translation

Convolutional Networks (3)
•  Supervised training by back-propagation
  – gradient descent & chain rule
•  End-to-end training
  – all layers learnt jointly, no hand-crafting
•  But some engineering is still needed to put together an architecture
  – number of layers, feature channels, etc.
  – some guidelines will be provided in this talk

AlexNet
•  Winner of ILSVRC-2012 [Krizhevsky et al., 2012]
•  ConvNet with 8 layers (5 conv. & 3 FC)

layer              output size
input image        3x224x224
conv-96x11x11/4    96x56x56
maxpool/2          96x28x28
conv-256x5x5       256x28x28
maxpool/2          256x14x14
conv-384x3x3       384x14x14
conv-384x3x3       384x14x14
conv-256x3x3       256x14x14
maxpool/2          256x7x7
full-4096          4096
full-4096          4096
full-1000          1000

With depth:
•  spatial resolution is gradually reduced
•  number of channels (feature dimension) is increased
•  higher-level representations, more spatial invariance
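The output sizes in the table can be reproduced with a short shape walk. This is a minimal sketch, assuming every layer is padded so that its output side is ceil(input side / stride), which is the simplification the table appears to use; the helper `walk_shapes` and the layer tuples are illustrative names, not part of the talk.

```python
import math

# (name, output channels, stride); channels=None keeps the input channels (pooling).
ALEXNET_LAYERS = [
    ("conv-96x11x11/4", 96, 4),
    ("maxpool/2", None, 2),
    ("conv-256x5x5", 256, 1),
    ("maxpool/2", None, 2),
    ("conv-384x3x3", 384, 1),
    ("conv-384x3x3", 384, 1),
    ("conv-256x3x3", 256, 1),
    ("maxpool/2", None, 2),
]

def walk_shapes(channels, side, layers):
    """Return [(name, channels, side)], assuming out_side = ceil(in_side / stride)."""
    shapes = []
    for name, out_channels, stride in layers:
        channels = channels if out_channels is None else out_channels
        side = math.ceil(side / stride)
        shapes.append((name, channels, side))
    return shapes

shapes = walk_shapes(3, 224, ALEXNET_LAYERS)
for name, c, s in shapes:
    print(f"{name:18s} {c}x{s}x{s}")
```

The walk makes the trend explicit: the side shrinks from 224 to 7 while the channel count grows from 3 to 256.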

Deeper is Better
•  Each weight layer performs a linear operation, followed by a non-linearity
  – a single layer can be seen as a linear classifier itself
•  More layers – more non-linearities
  – leads to a more discriminative model
•  What limits the number of layers?
  – many models use pooling after each conv. layer
    •  input image resolution sets the limit: log2(s) for an s x s input
  – computational complexity

Building Very Deep Nets (1)
•  Stack several layers between pooling
  – #conv. layers >> #pooling layers
  – #conv. layers does not affect resolution if each layer preserves spatial resolution:
    •  conv. stride = 1 & input is padded
•  More generally, interleave deep multi-layer blocks with resolution reduction layers

[Figure: conv-conv-pooling stacks; deep multi-layer processing blocks interleaved with resolution reduction layers]

Building Very Deep Nets (2)
•  Stack of small (3x3) conv. layers
  – has a large receptive field
    •  two 3x3 layers – 5x5 receptive field
    •  three 3x3 layers – 7x7 receptive field
  – faster than a stack of large conv. layers
  – fewer parameters than a single layer with large kernels

[Figure: 1st and 2nd 3x3 conv. layers together covering a 5x5 receptive field]
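The receptive-field and parameter claims can be checked with a few lines of arithmetic. A minimal sketch, assuming stride-1 layers with the same number of input and output channels and ignoring biases; the channel count `C` is a hypothetical value chosen only for illustration.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field side of `num_layers` stacked stride-1 conv layers."""
    return 1 + num_layers * (kernel - 1)

def conv_weights(kernel, channels):
    """Weights in one conv layer with `channels` inputs and outputs (biases ignored)."""
    return kernel * kernel * channels * channels

C = 256                              # hypothetical channel count
three_3x3 = 3 * conv_weights(3, C)   # 27 * C^2
one_7x7 = conv_weights(7, C)         # 49 * C^2

print(stacked_receptive_field(2), stacked_receptive_field(3))
print(three_3x3, one_7x7)
```

For equal receptive fields, the stack of three 3x3 layers needs 27/49 of the weights of a single 7x7 layer and adds two extra non-linearities.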

Very Deep Nets at ILSVRC
•  Large depth and small filters are used in two top-performing ILSVRC-2014 submissions
  – GoogLeNet (Inception) [Szegedy et al., 2014]
  – VGG-Net [Simonyan & Zisserman, 2014]
•  as well as the follow-up works
  – Delving deep into rectifiers (MSRA, [He et al., 2015])
  – Deep Image (Baidu, [Wu et al., 2015])
  – Inception v2 (Google, [Ioffe and Szegedy, 2015])

VGG-Net
•  Straightforward implementation of very deep nets:
  – stacks of conv. layers w/o pooling
  – 3x3 conv. kernels – very small
  – conv. stride 1 – no skipping
•  Other details are conventional:
  – 5 max-pool layers
  – no normalisation layers
  – 3 fully-connected layers

VGG-Net Incarnations
•  Started from 11 layers & injected more conv. layers
•  Extra layers injected into deeper stacks
  – first layers capture lower-level primitives, don’t need to be very discriminative
  – spatial resolution is higher in the first layers, adding extra layers there is computationally prohibitive
•  16- and 19-layer models are publicly available

[Figure: 11-, 13-, 16- and 19-layer VGG-Net configurations side by side: image input, stacks of conv-64, conv-128, conv-256 and conv-512 layers separated by five maxpool layers, then FC-4096, FC-4096, FC-1000 and softmax]

Effect of VGG-Net Depth
•  Error decreases with depth
•  Plateaus after 16 layers
  – could be due to training specifics

Top-5 classification error (val. set), %:
11 layers    10.4
13 layers    9.4
16 layers    8.8
19 layers    9.0

VGG-Net Layer Pattern
•  Multi-layer stacks (conv. layers, stride=1) interleaved with resolution reduction (max-pooling, stride=2)
•  Other very deep nets (incl. GoogLeNet) follow the same or a similar pattern

[Figure: VGG-19 layer pattern: image, 2-conv/1, pool/2, 2-conv/1, pool/2, 4-conv/1, pool/2, 4-conv/1, pool/2, 4-conv/1, pool/2, 3-fc, softmax]
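The layer pattern lends itself to a compact configuration. A minimal sketch, assuming the 19-layer stack sizes from the pattern (2, 2, 4, 4, 4 conv. layers) and the channel widths shown in the configuration diagrams (64, 128, 256, 512, 512); `build_layer_list` is a hypothetical helper, not code from the VGG release.

```python
# (number of stride-1 3x3 conv layers, channels) per stack; each stack ends with pool/2.
VGG19_STACKS = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]
FC_LAYERS = [4096, 4096, 1000]

def build_layer_list(stacks, fc_layers):
    """Expand the stack description into a flat list of layer names."""
    layers = []
    for num_convs, channels in stacks:
        layers += [f"conv-{channels}"] * num_convs
        layers.append("maxpool")
    layers += [f"FC-{n}" for n in fc_layers]
    layers.append("softmax")
    return layers

layers = build_layer_list(VGG19_STACKS, FC_LAYERS)
weight_layers = [name for name in layers if name.startswith(("conv", "FC"))]
final_side = 224 // 2 ** layers.count("maxpool")   # resolution after five pool/2 layers

print(len(weight_layers), final_side)
```

Only the conv. and FC layers carry weights, which is where the "19" in the model name comes from; the five pooling layers alone determine the final 7x7 resolution.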

VGG-Net Extensions
•  Deep Image (Baidu, [Wu et al., 2015])
  – VGG-16 and VGG-19 models with more channels
•  Delving Deep Into Rectifiers (MSRA, [He et al., 2015])
  – aggressive downsampling: 7x7 conv. with stride 2 (cf. GoogLeNet)
  – 6-layer stacks instead of 4-layer
  – Spatial Pyramid pooling [He et al., 2014]

VGG-19:   2-conv/1, pool/2, 2-conv/1, pool/2, 4-conv/1, pool/2, 4-conv/1, pool/2, 4-conv/1, pool/2, 3-layer
MSRA-22:  1-conv/2, pool/2, 6-conv/1, pool/2, 6-conv/1, pool/2, 6-conv/1, SP pool, 3-layer

Parametric ReLU [He et al., 2015]
•  Activation function: $f(y_i) = \max(0, y_i) + a_i \min(0, y_i)$
•  $a_i$ is learnable with back-prop
  – per-channel or per-layer
  – learnable activation function!
•  Generalises
  – ReLU ($a_i = 0$)
  – leaky ReLU ($a_i = 0.01$)
•  0.5%/0.2% top-1/top-5 error reduction

[Figure: ReLU vs PReLU]
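A forward pass of this activation is a one-liner. A minimal NumPy sketch, assuming the form max(0, y) + a * min(0, y) from [He et al., 2015]; only the forward computation is shown, the learning of `a` by back-prop is not.

```python
import numpy as np

def prelu(y, a):
    """max(0, y) + a * min(0, y); `a` broadcasts (scalar = per-layer, vector = per-channel)."""
    return np.maximum(0.0, y) + a * np.minimum(0.0, y)

y = np.array([-2.0, -0.5, 0.0, 1.5])
relu_out = prelu(y, 0.0)     # a = 0 recovers ReLU
leaky_out = prelu(y, 0.01)   # a = 0.01 recovers leaky ReLU

print(relu_out, leaky_out)
```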

GoogLeNet (Inception)
•  Developed concurrently with VGG-Net
•  Some design choices are similar:
  – very deep (22 layers)
  – small filters
    •  3x3, 5x5, 7x7 (1st layer only) in [Szegedy et al., 2014]
    •  3x3 and 7x7 (1st layer only) in [Ioffe & Szegedy, 2015]
•  But more computationally and parameter-efficient, due to the multi-branch “Inception” modules

Prerequisite: 1x1 Convolution
•  Doesn’t capture spatial context, only operates across channels
•  Performs linear projection of one pixel’s features
  – can be used for dimensionality reduction: $F_{out} = W \times F_{in}$, where $F_{in} \in \mathbb{R}^{c_{in} \times wh}$, $W \in \mathbb{R}^{c_{out} \times c_{in}}$, $F_{out} \in \mathbb{R}^{c_{out} \times wh}$
•  Also increases the depth
  – computationally- and parameter-cheap
•  used in “Network in Network” architecture [Lin et al., 2014]
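The projection view of a 1x1 convolution can be written directly as a matrix product over flattened pixels. A minimal NumPy sketch with hypothetical sizes (`c_in`, `c_out`, `h`, `w` are illustration values only).

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, h, w = 8, 3, 4, 5            # hypothetical sizes
F_in = rng.standard_normal((c_in, h, w))
W = rng.standard_normal((c_out, c_in))    # one weight per (output, input) channel pair

# 1x1 convolution: the same linear projection applied independently at every pixel.
F_out = (W @ F_in.reshape(c_in, h * w)).reshape(c_out, h, w)

# Explicit projection of a single pixel's feature vector, for comparison.
pixel = W @ F_in[:, 2, 3]

print(F_out.shape)
```

With `c_out < c_in` the layer reduces the channel dimension while leaving the spatial resolution untouched.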

Inception Module
Conv. filters of different size alongside each other
•  resulting feature maps are concatenated
•  filter sizes: 1x1, 3x3, 5x5 & max/avg-pooling
•  in Inception v2 [Ioffe & Szegedy, 2015] 5x5 replaced with two 3x3
•  most output channels are computed with fast layers, e.g.
   1024 (pool) + 352 (1x1 conv) + 320 (3x3 conv) + 224 (5x5 conv) = 1920 (out)
   (fast: pool and 1x1 conv; slow: 3x3 and 5x5 conv)

[Figure: Inception module, naïve version]

Inception Module
•  Computation time & number of parameters reduced by 1x1 convolution
  – dimensionality reduction
  – also increases depth
•  Allows for increasing #channels without large penalty
•  single Inception module depth: 3

[Figure: Inception module with dim. reduction]
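The saving from the 1x1 reduction can be illustrated by counting weights in one 5x5 branch. A minimal sketch; the channel counts are hypothetical and not taken from GoogLeNet, and biases are ignored.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a conv layer (biases ignored)."""
    return kernel * kernel * c_in * c_out

c_in, c_reduced, c_out = 512, 64, 128   # hypothetical channel counts

# 5x5 conv applied directly to all input channels.
direct = conv_weights(5, c_in, c_out)

# 1x1 conv reduces the channels first, then the 5x5 conv runs on the reduced map.
reduced = conv_weights(1, c_in, c_reduced) + conv_weights(5, c_reduced, c_out)

print(direct, reduced, round(direct / reduced, 1))
```

Under these assumed sizes the reduced branch uses roughly seven times fewer weights, and the multiply count per pixel shrinks by the same factor.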

Inception Net v2 [Ioffe & Szegedy, 2015]
•  depth: 34 (10 Inception modules, 3 conv., 1 FC)
•  aggressive spatial downsampling
  – first layers quickly decrease resolution by 8
  – lots of depth in further stacks

Architectures: Comparison

[Figure: layer patterns side by side.
VGG-19: 2-conv/1, pool/2, 2-conv/1, pool/2, 4-conv/1, pool/2, 4-conv/1, pool/2, 4-conv/1, pool/2, 3-layer.
MSRA-22: 1-conv/2, pool/2, 6-conv/1, pool/2, 6-conv/1, pool/2, 6-conv/1, SP pool, 3-layer.
Inception Net v2: column includes 1-conv/2, pool/2, 2-conv/1, pool/2, Inception stacks with 1-Inception/2 (3-conv) reduction blocks, pool/7, 1-layer]

Inception Net
•  less deep in the first blocks, but deeper in the following ones
•  Instead of pooling – Inception with stride 2 (pooling is inside)

Outline: Training
•  Optimisation
•  Regularisation
•  Initialisation
•  Batch normalisation

Optimisation
•  Learning objective
  – multinomial logistic regression (“softmax loss”)
•  A plethora of gradient-based optimisation methods
  – in common: gradients are computed with back-prop
  – then, weights can be updated in different ways:
    •  SGD, ADAGRAD, RMSPROP, etc.
•  SGD with momentum works very well in practice
  – but important to get hyper-parameters right
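The momentum update itself is short. A minimal NumPy sketch on a toy quadratic objective; the learning rate 0.01 matches the example value used in this talk, while the momentum value 0.9 and the objective are assumptions chosen for illustration.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update: accumulate a velocity, then move along it."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy objective 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(500):
    w, v = sgd_momentum_step(w, grad=w, velocity=v)

print(np.linalg.norm(w))
```

In a real training loop `grad` would come from back-prop on a mini-batch rather than from a closed-form expression.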

Learning Rate
•  Very important to set it properly
  – too low – training is slow, too high – training diverges
•  Conventional strategy
  – start with a reasonably high learning rate (e.g. 0.01)
  – divide it by constant factor (e.g. 10)
    •  when the validation error plateaus

[Figure: validation error vs. iteration]
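The conventional strategy can be sketched as a small schedule driven by validation error. This is an illustration only: the plateau test (`patience` consecutive checks without improvement) and the error values are assumptions, since the talk does not define how a plateau is detected.

```python
def step_on_plateau(val_errors, lr=0.01, factor=10.0, patience=2):
    """Yield the learning rate to use after each validation check."""
    best = float("inf")
    stale = 0                      # checks since the last improvement
    for err in val_errors:
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:  # plateau: divide the rate by a constant factor
                lr, stale = lr / factor, 0
        yield lr

errors = [30.0, 25.0, 24.0, 24.0, 24.1, 23.0, 23.0, 23.0]   # hypothetical val. errors
rates = list(step_on_plateau(errors))
print(rates)
```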

Regularisation
•  Training suffers from over-fitting, even on ILSVRC
•  Two simple and effective techniques in most submissions since AlexNet
  – weight decay (L2 norm penalty)
  – dropout
•  Batch normalisation [Ioffe & Szegedy, 2015]
  – regularises and speeds up training

Initialisation
•  Sample from zero-mean normal distribution with fixed variance, e.g. N(0; 0.01)
  – works fine for shallow nets
  – deeper nets suffer from the vanishing/exploding gradient problem
•  Adaptively choose variance for each layer
  – preserve gradient magnitude [Glorot & Bengio, 2010]: $\sigma = \sqrt{1/N_{in}}$
    •  FC layers: $N_{in}$ = #input channels
    •  conv. layers: $N_{in}$ = #input channels × size$^2$
  – compensate for ReLU [He et al., 2015] (MSRA): $\sigma = \sqrt{2/N_{in}}$
•  Supervised pre-training
  – init deep with shallow [VGG-Net]
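The two adaptive rules differ only by a factor of sqrt(2). A minimal sketch, assuming sigma = sqrt(1/N_in) for the gradient-preserving rule and sigma = sqrt(2/N_in) for the ReLU-compensated rule; the layer size is a hypothetical example and the helper names are illustrative.

```python
import math

def fan_in(in_channels, kernel_size=1):
    """N_in: #input channels for FC layers, #input channels x size^2 for conv layers."""
    return in_channels * kernel_size ** 2

def init_std(n_in, relu=False):
    """sqrt(1 / N_in), or sqrt(2 / N_in) when compensating for ReLU."""
    return math.sqrt((2.0 if relu else 1.0) / n_in)

n_in = fan_in(256, kernel_size=3)   # hypothetical 3x3 conv layer with 256 input channels
std_plain = init_std(n_in)
std_relu = init_std(n_in, relu=True)

print(n_in, std_plain, std_relu)
```

Weights would then be sampled from a zero-mean normal distribution with the computed standard deviation, separately for each layer.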

Batch Normalisation
•  The distribution of activations changes during training, making training harder
•  Whitening of neural net inputs is a standard pre-processing technique
•  Batch normalisation [Ioffe & Szegedy, 2015] performs normalisation of outputs of each layer to zero mean and unit variance
  – can be seen as diagonal whitening
  – performed after each weight layer before ReLU

Batch Normalisation (2)
•  scale and shift parameters are learnt
•  doing backprop through batchnorm is important
•  nets with batchnorm need less regularisation
  – smaller/zero dropout & weight decay

[Figure: accuracy vs. iteration]
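The training-time forward pass of batch normalisation can be sketched in a few lines of NumPy. This covers only the normalisation with learnt scale (`gamma`) and shift (`beta`); the backward pass and the statistics used at test time are omitted, and `eps` is an assumed small constant for numerical stability.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise each feature over the batch axis, then apply learnt scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((64, 4)) + 5.0   # batch of 64 activations, 4 features
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0), out.std(axis=0))
```

With `gamma = 1` and `beta = 0` every feature comes out with approximately zero mean and unit variance, which is the "diagonal whitening" view.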

Data Augmentation
•  ILSVRC is still too small for large ConvNets
  – over-fitting in spite of regularisation
•  Data augmentation (jittering) increases the amount of training data
•  Transforms original images in a way which
  – preserves their label
  – is realistic
•  Helpful for both training and evaluation

Random Crop Augmentation
•  Randomly sample a fixed-size sub-image (224x224)
  – the crop is a ConvNet input
  – essential component of most ImageNet submissions since AlexNet
•  Original image is rescaled to a certain smallest side
  – affects the scale of image statistics seen by a ConvNet
  – single-scale: 256xN or 384xN
  – multi-scale: randomly sample the size for each image from 256xN to 512xN
•  Random horizontal flips

[Figure: 224x224 crops sampled from images rescaled to 256xN (N≥256) and 384xN (N≥384)]
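Cropping and flipping need only array slicing. A minimal NumPy sketch, assuming the image has already been rescaled and is stored as an H x W x C array; the rescaling step itself is left out, and the uniform sampling of the smallest side is an assumption about how "randomly sample the size" is realised.

```python
import numpy as np

def random_crop_flip(image, crop=224, rng=None):
    """Sample a crop x crop sub-image from an H x W x C array, flipped with p = 0.5."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]   # horizontal flip
    return patch

def sample_smallest_side(rng, low=256, high=512):
    """Multi-scale training: draw the smallest image side from [low, high]."""
    return int(rng.integers(low, high + 1))

rng = np.random.default_rng(0)
image = np.zeros((256, 340, 3), dtype=np.uint8)   # stand-in for an image rescaled to 256xN
patch = random_crop_flip(image, rng=rng)
side = sample_smallest_side(rng)

print(patch.shape, side)
```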

Photometric Distortion Augmentation
•  Random RGB shift [AlexNet]
•  Randomly adjust contrast, brightness, and colour [Howard, 2013]
•  Vignetting and lens distortion [Deep Image]

Outline: Evaluation
•  Multi-crop evaluation
•  Dense evaluation
  – fully-convolutional nets
•  Model ensembles

Multi-Crop Evaluation
•  Network is trained on fixed-size (224x224) crops
•  Full image is normally larger, so
  – tile the image with crops
  – evaluate the net and average predictions
•  More crops – higher accuracy, but slower
  – Single-scale: 5 crops x 2 flips = 10 crops [AlexNet]
  – Multi-scale
    •  rescale image to several sizes, sample crops in each
    •  [Howard, 2013]: 3 scales, 90 crops; [GoogLeNet]: 4 scales, 144 crops
  – disadvantage: slow, as the ConvNet must be evaluated from scratch on each crop
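The single-scale 10-crop scheme can be sketched with array slicing. A minimal NumPy sketch, assuming the five crops are the four corners plus the centre (as in AlexNet) and that the image is an H x W x C array; the network evaluation and averaging of predictions are not shown.

```python
import numpy as np

def ten_crops(image, crop=224):
    """Four corner crops and the centre crop of an H x W x C array, plus their flips."""
    h, w = image.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    crops = [image[t:t + crop, l:l + crop] for t, l in offsets]
    crops += [c[:, ::-1] for c in crops]   # horizontal flips of the five crops
    return np.stack(crops)

image = np.arange(256 * 256 * 3, dtype=np.float32).reshape(256, 256, 3)
crops = ten_crops(image)

print(crops.shape)
```

Each of the ten crops is a separate network input, which is why the cost grows linearly with the number of crops.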

Dense Evaluation
•  ConvNets can be applied to an image of any size
•  Network should be fully-convolutional
  – fully-connected layers expect fixed-resolution input
  – so should be converted to conv.
•  Conversion (on the example of VGG-Net)
  – assume FC layer has input 512x7x7, output is 4096-D
  – can be seen as conv. layer with 7x7 receptive field, 512 input channels & 4096 output: 512x7x7 -> 4096x1x1
•  Output of full-conv. net is a class score map, should be pooled with global pooling to produce a vector of scores
•  Used in OverFeat [Sermanet et al., 2013] & VGG-Net
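The FC-to-conv conversion amounts to reshaping the weight matrix. A minimal NumPy sketch with reduced stand-in sizes (16, 7 and 32 in place of 512, 7 and 4096) so it runs quickly; `conv_valid` is a naive illustrative implementation, and average pooling is one assumed choice of global pooling.

```python
import numpy as np

rng = np.random.default_rng(0)
c, k, n_out = 16, 7, 32                    # reduced stand-ins for 512, 7 and 4096
W_fc = rng.standard_normal((n_out, c * k * k))
W_conv = W_fc.reshape(n_out, c, k, k)      # same weights viewed as n_out filters of c x k x k

def conv_valid(x, weights):
    """Stride-1 'valid' correlation of a C x H x W map with N x C x k x k filters."""
    n, _, kh, kw = weights.shape
    h, w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.empty((n, h, w))
    for i in range(h):
        for j in range(w):
            out[:, i, j] = np.tensordot(weights, x[:, i:i + kh, j:j + kw], axes=3)
    return out

x_small = rng.standard_normal((c, k, k))          # training-size feature map
x_large = rng.standard_normal((c, k + 3, k + 5))  # feature map of a larger test image

fc_scores = W_fc @ x_small.reshape(-1)
conv_scores = conv_valid(x_small, W_conv)         # shape (n_out, 1, 1)
score_map = conv_valid(x_large, W_conv)           # shape (n_out, 4, 6)
pooled = score_map.mean(axis=(1, 2))              # global pooling to a score vector

print(conv_scores.shape, score_map.shape, pooled.shape)
```

On the training-size input the converted layer reproduces the FC output exactly; on a larger input it yields a spatial score map instead of failing on the shape mismatch.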

Effect of Scale (VGG-Net)
•  Dense evaluation results
•  Using multiple scales is important
  – multi-scale training outperforms single-scale
  – multi-scale testing further improves the results

Error (%) by train/test scales:
train/test scales    13 layers    16 layers    19 layers
single/single        9.4          8.8          9.0
multi/single         8.8          8.1          8.0
multi/multi          8.2          7.5          7.5

Evaluation: Dense vs Multi-Crop
•  Dense evaluation is on par with multi-crop
•  Dense & multi-crop are complementary
•  Combining predictions from 2 nets is beneficial, but slow

Error (%) by evaluation method:
networks         dense    150 crops    dense & 150 crops
16-layer         7.5      7.5          7.2
19-layer         7.5      7.4          7.1
16 & 19-layer    7.1      7.2          6.8

Model Ensembles
•  Training multiple models and combining their predictions improves the accuracy
  – average soft-max posteriors
•  Used in all top-performing submissions to ILSVRC
•  Models don’t need to be the same
  – can simply combine your best models developed by the submission time
•  Examples of ensembles’ improvement:
  – VGG-Net: error decreases from 7.1% (1 net) to 6.8% (2 nets)
  – GoogLeNet: from 7.9% (1 net) to 6.7% (7 nets)
  – Inception Net v2 (batchnorm): from 5.8% to 4.8% (6 nets)
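Averaging soft-max posteriors takes only a few lines. A minimal NumPy sketch; the class scores are hypothetical values for one image and three classes, and equal weighting of the models is an assumption.

```python
import numpy as np

def softmax(scores):
    """Row-wise soft-max, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(per_model_scores):
    """Average the soft-max posteriors of several models, then take the arg-max class."""
    posteriors = np.mean([softmax(s) for s in per_model_scores], axis=0)
    return posteriors, posteriors.argmax(axis=-1)

# Hypothetical class scores from two models for one image and three classes.
scores_a = np.array([[2.0, 1.0, 0.0]])
scores_b = np.array([[0.0, 3.0, 0.0]])
posteriors, labels = ensemble_predict([scores_a, scores_b])

print(posteriors, labels)
```

Because only the posteriors are combined, the models may have different architectures as long as they share the same label set.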

Object Localisation (In Brief)
•  ILSVRC localisation task: classify and localise a single object (which is guaranteed to be in the image)
•  Object detection approaches would require adaptation
  – not all the objects are annotated in the training set
•  Object bounding box regression with ConvNets [OverFeat]
  – last layer predicts a bounding box
    •  class-agnostic [OverFeat]
    •  for each class [VGG]
  – Initialised with classification nets
  – Fine-tuning of all layers

[Figure: object box inside a 224x224 crop]

Object Detection (In Brief)
•  Common approach:
  – generate a large number of bounding box proposals
  – classify them using visual features
•  ConvNet features work very well!
  – R-CNN [Girshick et al., 2013]
•  Fast R-CNN [Girshick et al., 2015]
  – for each proposal, predicts its class and precise bbox location
  – re-uses conv. features, no need to re-compute
•  Proposals
  – Selective search
  – Multi-Box
  – Faster R-CNN

Infrastructure
•  Good infrastructure is just as important
•  A number of off-the-shelf deep learning packages
  – Torch, Caffe, Theano, MatConvNet
•  Using GPUs is a must
  – most packages use the same low-level back-ends, e.g. cuDNN or cuBLAS, so speed is comparable
•  Multi-GPU training helps a lot
  – available in packages above

Summary
•  Deep ConvNets – an essential component of top ILSVRC submissions since 2012
•  Depth is important
•  Other essentials:
  – extensive augmentation at multiple scales
  – dropout, batch normalisation, weight decay
•  Next talk will cover the implementation side…

References
•  Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation 1989.
•  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE 1998.
•  X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010.
•  A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
•  M. Lin, Q. Chen, and S. Yan. Network In Network. ICLR 2014.
•  P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. ICLR 2014.
•  A. G. Howard. Some improvements on deep convolutional neural network based image classification. ICLR 2014.
•  D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable Object Detection using Deep Neural Networks. CVPR 2014.
•  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. ECCV 2014.
•  K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
•  R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep Image: Scaling up Image Recognition. Arxiv 2015.
•  K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Arxiv 2015.
•  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper With Convolutions. CVPR 2015.
•  S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.
•  R. Girshick. Fast R-CNN. Arxiv 2015.
