Very Deep ConvNets for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman, University of Oxford



Page 1

Very Deep ConvNets for Large-Scale Image Recognition

Karen Simonyan, Andrew Zisserman, University of Oxford

Page 2

Overview

• ConvNet design choices: depth, width, receptive fields
• Variety of models with different depth:

ConvNet Model                        Dataset             Conv. & FC layers
LeNet-5 [LeCun et al., 1998]         MNIST               5
[Ciresan et al., 2012]               CIFAR               7
AlexNet [Krizhevsky et al., 2012]    ImageNet Challenge  8
[Zeiler & Fergus, 2012]              ImageNet Challenge  8
OverFeat [Sermanet et al., 2012]     ImageNet Challenge  9
[Goodfellow et al., 2013]            SVHN                11

• How does ConvNet depth affect accuracy on ImageNet?

Page 3

Contributions

• Comparison of very deep ConvNets
  • simple architecture design
  • increasing depth: from 11 to 19 layers
• Evaluation of very deep features on other datasets
• The models are publicly available

(Figure: side-by-side layer listings of AlexNet and the 19-layer network, each ending in FC-4096, FC-4096, FC-1000.)

Page 4

Network Design

• Single family of networks
• Key design choices:
  • 3x3 conv. kernels (very small)
  • stacks of conv. layers w/o pooling (but with ReLU)
  • conv. stride 1 (no skipping)
• Other details are conventional:
  • Rectification (ReLU) non-linearity
  • 5 max-pool layers (x2 downsampling)
  • no normalisation
  • 3 fully-connected layers

(Figure: the 13-layer network column: conv-64 x2, maxpool, conv-128 x2, maxpool, conv-256 x2, maxpool, conv-512 x2, maxpool, conv-512 x2, maxpool, FC-4096, FC-4096, FC-1000, softmax.)

Page 5

Why Stacks of 3x3 Filters?

• A stack of 3x3 layers has a large receptive field
  • two 3x3 layers: 5x5 receptive field
  • three 3x3 layers: 7x7 receptive field
• More ReLU non-linearities
  • more discriminative
• Fewer parameters than a single large-kernel layer
• Design simplicity
  • no need to tune layer parameters

(Figure: a 5x5 input patch covered by the 1st and 2nd 3x3 conv. layers.)
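The receptive-field and parameter arithmetic behind this slide can be checked with a short sketch (pure Python; the channel count 512 is only an illustrative choice, not a figure from the slide):

```python
def receptive_field(num_3x3_layers: int) -> int:
    """Effective receptive field of a stack of 3x3, stride-1 conv layers."""
    rf = 1
    for _ in range(num_3x3_layers):
        rf += 2  # each 3x3 layer grows the field by kernel_size - 1
    return rf

def stack_params(num_3x3_layers: int, channels: int) -> int:
    """Weights in the stack (channels in == channels out, biases ignored)."""
    return num_3x3_layers * 3 * 3 * channels * channels

def single_layer_params(kernel: int, channels: int) -> int:
    """Weights in one layer with the same receptive field."""
    return kernel * kernel * channels * channels

# Three 3x3 layers see a 7x7 field with 27*C^2 weights, versus 49*C^2
# for a single 7x7 layer -- roughly 45% fewer parameters.
C = 512
rf = receptive_field(3)
print(rf, stack_params(3, C), single_layer_params(rf, C))
```

This matches the slide's claim: more depth and non-linearity at a lower parameter cost than one large-kernel layer.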

Pages 6–12

The Networks

• Start from 11 layers, inject more conv. layers

(Figure, built up across these slides: layer listings of the four configurations. Each network is five conv. stages separated by maxpool, followed by FC-4096, FC-4096, FC-1000, softmax.)

11-layer: conv-64        | conv-128        | conv-256 x2 | conv-512 x2 | conv-512 x2
13-layer: conv-64 x2     | conv-128 x2     | conv-256 x2 | conv-512 x2 | conv-512 x2
16-layer: conv-64 x2     | conv-128 x2     | conv-256 x3 | conv-512 x3 | conv-512 x3
19-layer: conv-64 x2     | conv-128 x2     | conv-256 x4 | conv-512 x4 | conv-512 x4
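The named depths count conv. plus fully-connected layers. A minimal sketch verifying the counts for the configurations listed on these slides (pooling and softmax carry no weights, so they are excluded):

```python
# Conv widths per stage for each configuration (each stage is followed
# by a 2x2 max-pool); every network then has 3 fully-connected layers.
CONFIGS = {
    "11-layer": [[64], [128], [256, 256], [512, 512], [512, 512]],
    "13-layer": [[64, 64], [128, 128], [256, 256], [512, 512], [512, 512]],
    "16-layer": [[64, 64], [128, 128], [256, 256, 256],
                 [512, 512, 512], [512, 512, 512]],
    "19-layer": [[64, 64], [128, 128], [256, 256, 256, 256],
                 [512, 512, 512, 512], [512, 512, 512, 512]],
}

def weight_layers(stages, num_fc=3):
    # Depth = number of conv layers + number of FC layers.
    return sum(len(stage) for stage in stages) + num_fc

for name, stages in CONFIGS.items():
    print(name, weight_layers(stages))
```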

Page 13

Training

• ConvNet input during training: fixed-size 224x224 crop
  • But images have various sizes…
• Solution:
  • rescale images to a certain size (preserving aspect ratio)
  • random 224x224 crop
• Image size affects the scale of image statistics seen by a ConvNet
  • single-scale: 256xN or 384xN
  • multi-scale: randomly sample the size for each image from 256xN to 512xN
• Standard augmentation: random flip and RGB shift

(Figure: a 224x224 crop taken from images rescaled to 256xN (N≥256) and 384xN (N≥384).)
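The rescale-then-crop procedure above can be sketched in a few lines. This is only an illustration of the geometry (pure Python on image dimensions, no image library); the scale bounds and crop size come from the slide:

```python
import random

def sample_training_crop(width, height, s_min=256, s_max=512, crop=224):
    """Multi-scale training sketch: rescale so the shortest side equals a
    randomly sampled S in [s_min, s_max] (aspect ratio preserved), then
    take a random crop x crop window from the rescaled image."""
    s = random.randint(s_min, s_max)           # per-image training scale
    scale = s / min(width, height)
    w, h = round(width * scale), round(height * scale)
    x = random.randint(0, w - crop)            # random 224x224 crop corner
    y = random.randint(0, h - crop)
    return (w, h), (x, y, x + crop, y + crop)

random.seed(0)
(w, h), box = sample_training_crop(640, 480)
print((w, h), box)
```

Single-scale training is the special case s_min == s_max (256 or 384).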

Page 14

Training (2)

• Mini-batch gradient descent with momentum
• Regularisation: dropout and weight decay
• Fast convergence (74 training epochs)
• Initialisation
  • deep nets are prone to vanishing/exploding gradients
  • 11-layer net: random initialisation from N(0, 0.01)
  • deeper nets:
    • top and bottom layers initialised with the 11-layer net
    • other layers: random initialisation
  • also possible to initialise all layers randomly [Glorot & Bengio, 2010]
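For the last bullet, a minimal sketch of the Glorot & Bengio scheme: a zero-mean Gaussian whose variance is scaled to the layer's fan-in and fan-out, which keeps activation and gradient magnitudes roughly stable with depth. The layer sizes below are illustrative, not taken from the slide:

```python
import math
import random

def glorot_normal(fan_in, fan_out):
    """Glorot-normal initialisation: N(0, 2 / (fan_in + fan_out))."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[random.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

# e.g. a 3x3 conv with 64 input channels has fan_in = 3 * 3 * 64
fan_in = 3 * 3 * 64
w = glorot_normal(fan_in, 64)
```

This removes the need for the layer-wise pre-training used for the deeper nets on this slide.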

Page 15

Testing

• ConvNet evaluation on variable-size images:
  • Testing on multiple 224x224 crops [AlexNet]
  • Dense application over the whole image [OverFeat]:
    • all layers are treated as convolutional
    • sum-pooling of class score maps
    • more efficient than multiple crops
  • Combination of both
• Multiple image sizes:
  • 256xN, 384xN, 512xN
  • class scores averaged

(Figure: image → conv. layers → class score map → global pooling → class scores.)
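The dense-evaluation pooling step can be illustrated with NumPy. The shapes below are placeholders (a made-up non-square score map), not values from the slides; the point is that sum-pooling the spatial map gives one score per class, and scores from several test scales are then averaged:

```python
import numpy as np

# With all layers convolutional, a variable-size image yields a spatial
# class score map of shape (num_classes, H, W).
num_classes, h, w = 1000, 7, 9
score_map = np.random.rand(num_classes, h, w)

# Global sum-pooling over spatial positions, as in the figure.
class_scores = score_map.sum(axis=(1, 2))

# Averaging over several test scales (256xN, 384xN, 512xN bullets).
scores_per_scale = [np.random.rand(num_classes) for _ in (256, 384, 512)]
final = np.mean(scores_per_scale, axis=0)
```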

Page 16

Implementation

• Heavily-modified Caffe toolbox
• Multiple GPU support
  • 4 x NVIDIA Titan in an off-the-shelf workstation
  • synchronous data parallelism
  • ~3.75 times speed-up, 2–3 weeks for training

(Figure: an image mini-batch split across the GPUs.)

Page 17

Evaluation on ImageNet Classification

Page 18

Effect of Depth and Scale

• Error decreases with depth
• Using multiple scales is important
  • multi-scale training outperforms single-scale
  • multi-scale testing further improves the results

(Chart: Top-5 Classification Error (Val. Set) for 11-, 13-, 16-, and 19-layer networks under single/single, multi/single, and multi/multi train/test scales; error falls from 10.4% at 11 layers to 7.5% at 16–19 layers with multi-scale training and testing.)

Page 19

Evaluation: Dense vs Multi-Crop

• Dense evaluation is on par with multi-crop
• Dense & multi-crop are complementary
• Combining predictions from 2 nets is beneficial

(Chart: Top-5 Classification Error (Val. Set) for dense, 150-crop, and dense & 150-crop evaluation of the 16-layer, 19-layer, and combined 16 & 19-layer networks.)

Page 20

ImageNet Challenge 2014 (Classification)

• 1st place: GoogLeNet (6.7% with 7 nets, 7.9% with 1 net)
• 2nd place: this work
  • submission: 7.3% with 2 nets, 8.4% with 1 net
  • post-submission: 6.8% with 2 nets, 7.0% with 1 net

(Chart: Top-5 Classification Error (Test Set) of the leading entries, single net vs multiple nets.)

Page 21

Comparison with GoogLeNet

• models developed independently
• both are very deep:
  • 19 (VGG) vs 22 (GoogLeNet) weight layers
• mitigating gradient instability:
  • pre-training (VGG) vs auxiliary losses (GoogLeNet)
• multi-scale training of both
• filter configuration:
  • VGG: only 3x3, stride 1 convolution
  • GoogLeNet: 5x5, 3x3, 1x1 filters combined into complex multi-branch modules to reduce computational complexity
• speed: GoogLeNet is several times faster
• single-model accuracy: VGG is ~0.5% better

(Figure: the 19-layer network column.)

Page 22

Recent Advances

• Deep Image: Scaling up Image Recognition
  • 5.33% error
  • models based on VGG-16, but wider
• Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
  • 4.94% error
  • models based on VGG-19, but deeper and wider, with PReLU
• Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  • 4.82% error
  • Inception layers with batch normalisation

Page 23

Beyond ImageNet Classification…

Page 24

ImageNet Challenge 2014 (Localisation)

Object bounding box prediction:
• Very deep location regressor (similar to OverFeat)
• Last layer predicts a box for each class
• Initialised with classification models
• Fine-tuning of all layers

• 1st place: VGG (25.3% error), 2nd place: GoogLeNet (26.4%)
• Outperforms OverFeat by using deeper representations

Top-5 Localisation Error (Test Set):
OverFeat (2013)   29.9
GoogLeNet         26.4
VGG               25.3

(Figure: a 224x224 crop with the predicted object box.)

Page 25

Very Deep Features on Other Datasets

• ImageNet-pretrained ConvNets are excellent feature extractors on other datasets
• We compare against the state of the art, based on less deep ConvNet features, trained on ILSVRC
• Simple recognition pipeline:
  • last fully-connected (classification) layer is removed
  • dense evaluation over the whole image → global sum-pooling
  • aggregation over several scales
  • L2 normalisation → linear SVM
  • no fine-tuning
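The feature-extraction side of this pipeline can be sketched with NumPy. The arrays below are stand-ins for real activations (the 4096-dim size matches the last remaining FC layer, the three scales are illustrative); the linear SVM that would consume the features is omitted:

```python
import numpy as np

def extract_feature(activation_maps):
    """Pipeline sketch: dense activations from several scales are
    globally sum-pooled, aggregated across scales, and L2-normalised
    before being fed to a linear SVM."""
    pooled = [m.sum(axis=(1, 2)) for m in activation_maps]  # global sum-pool
    feat = np.mean(pooled, axis=0)                          # aggregate scales
    return feat / np.linalg.norm(feat)                      # L2 normalisation

# Stand-in dense activation maps at three scales.
maps = [np.random.rand(4096, 5, 5) for _ in range(3)]
f = extract_feature(maps)
```

Note there is no fine-tuning anywhere: the ConvNet weights stay frozen and only the SVM is trained per dataset.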

Page 26

Results on Caltech & PASCAL VOC

• Very deep features are competitive or better than the state of the art despite a simple pipeline
• Two ConvNet extractors are complementary
• Even better results have been recently reported using our released models

Method                                     Caltech 101   Caltech 256   VOC 2007   VOC 2012 (class-n)   VOC 2012 (actions)
Best published result (shallow ConvNets)   93.4          77.6          81.5       81.7                 76.3
19-layer                                   92.3          85.1          89.3       89.0                 –
16 & 19-layer                              92.7          86.2          89.7       89.3                 84.0

Page 27

Other Tasks…

Since the public release of the models, they have been successfully applied by the community to:
• object detection
• semantic segmentation
• image caption generation
• texture classification
• etc.

Page 28

Summary

• Representation depth is important
• Excellent results using very deep ConvNets
  • simple architecture, small receptive fields
  • but very deep → lots of non-linearity
• VGG-16 & VGG-19 models publicly available from:
  • http://www.robots.ox.ac.uk/~vgg/research/very_deep/
  • formats: Caffe and MatConvNet
  • can be used with any package with a cuDNN back-end, e.g. Torch and Theano

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.