ILSVRC Submission Essentials in the light of recent developments
ILSVRC Tutorial @ CVPR 2015, 7 June 2015
Karen Simonyan
image-net.org/tutorials/cvpr2015/recent.pdf


Page 1:

ILSVRC Submission Essentials in the light of recent developments

ILSVRC Tutorial @ CVPR 2015

7 June 2015

Karen Simonyan

Page 2:

Outline
• Architectures
 – Convolutional Networks: recap
 – The importance of depth in image representations
  • very deep ConvNets (VGG-Net and extensions)
  • Inception modules (GoogLeNet)
• Training
 – Optimisation
 – Data augmentation
• Evaluation
• References

Page 3:

Convolutional Networks
• State-of-the-art in image recognition
 – winner of ILSVRC since 2012
• ConvNet – hierarchical image representation [LeCun et al., 89, 98]
 – stack of conv. layers, interleaved with non-linearities
 – typically followed by fully-connected layers

[Diagram: ConvNet schematic]

Page 4:

Convolutional Networks (2)
• Important conv. layer properties:
 – locality: objects/parts have local spatial support
 – weight sharing: translation equivariance
• Conv. layers operate across all channels, not just one
• Each layer is followed by a non-linearity (activation function), e.g. ReLU: max(W*x, 0)
• Some layers are followed by spatial pooling
 – max- or sum-pooling
 – invariance to local translation

Page 5:

Convolutional Networks (3)
• Supervised training by back-propagation
 – gradient descent & chain rule
• End-to-end training
 – all layers learnt jointly, no hand-crafting
• But some engineering is still needed to put together an architecture
 – number of layers, feature channels, etc.
 – some guidelines will be provided in this talk

Page 6:

AlexNet
• Winner of ILSVRC-2012 [Krizhevsky et al.]
• ConvNet with 8 layers (5 conv. & 3 FC)

layer            | output size
input image      | 3x224x224
conv-96x11x11/4  | 96x56x56
maxpool/2        | 96x28x28
conv-256x5x5     | 256x28x28
maxpool/2        | 256x14x14
conv-384x3x3     | 384x14x14
conv-384x3x3     | 384x14x14
conv-256x3x3     | 256x14x14
maxpool/2        | 256x7x7
full-4096        | 4096
full-4096        | 4096
full-1000        | 1000

With depth:
• spatial resolution is gradually reduced
• number of channels (feature dimension) is increased
• higher-level representations, more spatial invariance
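As a quick check, the spatial sizes in the table above can be reproduced if one assumes "same"-style padding, so that a layer of stride s maps size n to ceil(n/s) (a sketch of the size arithmetic, not AlexNet's exact padding):

```python
import math

def out_size(in_size, stride):
    # with "same"-style padding, spatial size shrinks only with the stride
    return math.ceil(in_size / stride)

# strides of the layers in the slide's table:
# conv/4, pool/2, conv/1, pool/2, conv/1, conv/1, conv/1, pool/2
strides = [4, 2, 1, 2, 1, 1, 1, 2]
size = 224
sizes = []
for s in strides:
    size = out_size(size, s)
    sizes.append(size)
print(sizes)  # [56, 28, 28, 14, 14, 14, 14, 7]
```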

Page 7:

Deeper is Better
• Each weight layer performs a linear operation, followed by a non-linearity
 – a single layer can be seen as a linear classifier itself
• More layers – more non-linearities – leads to a more discriminative model
• What limits the number of layers?
 – many models use pooling after each conv. layer
  • input image resolution sets the limit: log2(s) for an sxs input
 – computational complexity

Page 8:

Building Very Deep Nets (1)
• Stack several layers between pooling
 – #conv. layers >> #pooling layers
 – #conv. layers does not affect resolution if each layer preserves spatial resolution:
  • conv. stride = 1 & input is padded
• More generally, interleave deep multi-layer blocks with resolution reduction layers

[Diagram: conv–conv–pooling–conv–conv–pooling stacks, i.e. deep multi-layer processing interleaved with resolution reduction]

Page 9:

Building Very Deep Nets (2)
• Stack of small (3x3) conv. layers
 – has a large receptive field
  • two 3x3 layers – 5x5 receptive field
  • three 3x3 layers – 7x7 receptive field
 – faster than a stack of large conv. layers
 – fewer parameters than a single layer with large kernels

[Diagram: a 5x5 receptive field covered by the 1st and 2nd 3x3 conv. layers]
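The receptive-field and parameter claims above can be checked in a few lines (the channel count 512 is an illustrative choice, not from the slide):

```python
def receptive_field(n_layers, kernel=3):
    # receptive field of a stack of stride-1 convolutions:
    # each extra layer adds (kernel - 1) pixels
    rf = 1
    for _ in range(n_layers):
        rf += kernel - 1
    return rf

def conv_params(kernel, channels):
    # weights of a conv. layer with `channels` input and output channels (no bias)
    return channels * channels * kernel * kernel

print(receptive_field(2))        # 5 -> two 3x3 layers see 5x5
print(receptive_field(3))        # 7 -> three 3x3 layers see 7x7
print(3 * conv_params(3, 512))   # 7077888 params: three 3x3 layers
print(conv_params(7, 512))       # 12845056 params: one 7x7 layer
```

With 512 channels, three 3x3 layers use 27c² weights versus 49c² for a single 7x7 layer, matching the "fewer parameters" point.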

Page 10:

Very Deep Nets at ILSVRC
• Large depth and small filters are used in the two top-performing ILSVRC-2014 submissions
 – GoogLeNet (Inception) [Szegedy et al., 2014]
 – VGG-Net [Simonyan & Zisserman, 2014]
• as well as in the follow-up works
 – Delving deep into rectifiers (MSRA, [He et al., 2015])
 – Deep Image (Baidu, [Wu et al., 2015])
 – Inception v2 (Google, [Ioffe & Szegedy, 2015])

Page 11:

VGG-Net
• Straightforward implementation of very deep nets:
 – stacks of conv. layers w/o pooling
 – 3x3 conv. kernels – very small
 – conv. stride 1 – no skipping
• Other details are conventional:
 – 5 max-pool layers
 – no normalisation layers
 – 3 fully-connected layers

13-layer configuration:
image → conv-64, conv-64, maxpool → conv-128, conv-128, maxpool → conv-256, conv-256, maxpool → conv-512, conv-512, maxpool → conv-512, conv-512, maxpool → FC-4096, FC-4096, FC-1000, softmax

Page 12:

VGG-Net Incarnations
• Started from 11 layers

11-layer configuration:
image → conv-64, maxpool → conv-128, maxpool → conv-256, conv-256, maxpool → conv-512, conv-512, maxpool → conv-512, conv-512, maxpool → FC-4096, FC-4096, FC-1000, softmax

Page 13:

VGG-Net Incarnations
• Started from 11 layers & injected more conv. layers

(a second conv-64 and a second conv-128 injected into the first two stacks of the 11-layer configuration)

Page 14:

VGG-Net Incarnations

13-layer configuration (11-layer + one more conv-64 and one more conv-128):
image → conv-64, conv-64, maxpool → conv-128, conv-128, maxpool → conv-256, conv-256, maxpool → conv-512, conv-512, maxpool → conv-512, conv-512, maxpool → FC-4096, FC-4096, FC-1000, softmax

Page 15:

VGG-Net Incarnations
Extra layers (conv-256, conv-512, conv-512) are injected into the deeper stacks:
• first layers capture lower-level primitives, don't need to be very discriminative
• spatial resolution is higher in the first layers, so adding extra layers there is computationally prohibitive

Page 16:

VGG-Net Incarnations

16-layer configuration:
image → conv-64, conv-64, maxpool → conv-128, conv-128, maxpool → conv-256, conv-256, conv-256, maxpool → conv-512, conv-512, conv-512, maxpool → conv-512, conv-512, conv-512, maxpool → FC-4096, FC-4096, FC-1000, softmax

Page 17:

VGG-Net Incarnations

(one more conv-256 and two more conv-512 layers injected into the three deepest stacks, turning the 16-layer configuration into the 19-layer one)

Page 18:

VGG-Net Incarnations
• 16- and 19-layer models are publicly available

19-layer configuration:
image → conv-64, conv-64, maxpool → conv-128, conv-128, maxpool → conv-256, conv-256, conv-256, conv-256, maxpool → conv-512, conv-512, conv-512, conv-512, maxpool → conv-512, conv-512, conv-512, conv-512, maxpool → FC-4096, FC-4096, FC-1000, softmax

Page 19:

Effect of VGG-Net Depth
• Error decreases with depth
• Plateaus after 16 layers
 – could be due to training specifics

Top-5 classification error (val. set): 11 layers: 10.4; 13 layers: 9.4; 16 layers: 8.8; 19 layers: 9.0

Page 20:

VGG-Net Layer Pattern
• Multi-layer stacks (conv. layers, stride=1) interleaved with resolution reduction (max-pooling, stride=2)
• Other very deep nets (incl. GoogLeNet) follow the same/similar pattern

Pattern (VGG-19):
2-conv/1 → pool/2 → 2-conv/1 → pool/2 → 4-conv/1 → pool/2 → 4-conv/1 → pool/2 → 4-conv/1 → pool/2 → 3-fc
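The pattern on this slide also recovers the layer count, since pooling layers carry no weights and are not counted:

```python
# VGG-19 layer pattern from the slide: n-conv/1 stacks between pool/2 layers,
# followed by 3 fully-connected layers
conv_stacks = [2, 2, 4, 4, 4]   # conv. layers per stack (all stride 1)
fc_layers = 3
weight_layers = sum(conv_stacks) + fc_layers
print(weight_layers)  # 19
```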

Page 21:

VGG-Net Extensions
• Deep Image (Baidu, [Wu et al., 2015])
 – VGG-16 and VGG-19 models with more channels
• Delving Deep Into Rectifiers (MSRA, [He et al., 2015])
 – aggressive downsampling: 7x7 conv. with stride 2 (cf. GoogLeNet)
 – 6-layer stacks instead of 4-layer
 – Spatial Pyramid pooling [He et al., 2014]

VGG-19:  2-conv/1 → pool/2 → 2-conv/1 → pool/2 → 4-conv/1 → pool/2 → 4-conv/1 → pool/2 → 4-conv/1 → pool/2 → 3-layer
MSRA-22: 1-conv/2 → pool/2 → 6-conv/1 → pool/2 → 6-conv/1 → pool/2 → 6-conv/1 → SP pool → 3-layer

Page 22:

Parametric ReLU [He et al., 2015]
• Activation function: f(x_i) = x_i if x_i > 0, a_i·x_i otherwise
• a_i is learnable with back-prop
 – per-channel or per-layer
 – a learnable activation function!
• Generalises
 – ReLU (a_i = 0)
 – leaky ReLU (a_i = 0.01)
• 0.5%/0.2% top-1/top-5 error reduction

[Diagram: ReLU vs. PReLU activation curves]
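A minimal PReLU forward pass, showing that it reduces to ReLU and leaky ReLU for the values of a_i named above (a sketch; the learnable update of a_i is omitted):

```python
import numpy as np

def prelu(x, a):
    # PReLU: identity for positive inputs, slope `a` for negative inputs
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
relu_out = prelu(x, 0.0)       # a=0    -> plain ReLU
leaky_out = prelu(x, 0.01)     # a=0.01 -> leaky ReLU
print(relu_out, leaky_out)
```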

Page 23:

GoogLeNet (Inception)
• Developed concurrently with VGG-Net
• Some design choices are similar:
 – very deep (22 layers)
 – small filters
  • 3x3, 5x5, 7x7 (1st layer only) in [Szegedy et al., 2014]
  • 3x3 and 7x7 (1st layer only) in [Ioffe & Szegedy, 2015]
• But more computationally and parameter-efficient, due to the multi-branch “Inception” modules

Page 24:

Prerequisite: 1x1 Convolution
• Doesn't capture spatial context, only operates across channels
• Performs a linear projection of one pixel's features
 – can be used for dimensionality reduction: F_out = W × F_in, with W ∈ R^(c_out × c_in), F_in ∈ R^(c_in × wh), F_out ∈ R^(c_out × wh)
• Also increases the depth
 – computationally- and parameter-cheap
 – used in the “Network in Network” architecture [Lin et al., 2014]
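The matrix view of 1x1 convolution can be sketched by flattening the w·h spatial positions into columns, so the whole layer becomes a single matrix multiply (the dimensions here are illustrative):

```python
import numpy as np

# a 1x1 convolution is a linear projection applied at every spatial position
c_in, c_out, w, h = 8, 4, 5, 5
rng = np.random.default_rng(0)
W = rng.standard_normal((c_out, c_in))     # 1x1 conv. weights
F_in = rng.standard_normal((c_in, w * h))  # input map, one column per pixel

F_out = W @ F_in                           # shape: (c_out, w*h)

# check against an explicit per-pixel projection
for p in range(w * h):
    assert np.allclose(F_out[:, p], W @ F_in[:, p])
print(F_out.shape)  # (4, 25)
```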

Page 25:

Inception Module
Conv. filters of different sizes alongside each other
• resulting feature maps are concatenated
• filter sizes: 1x1, 3x3, 5x5 & max/avg-pooling
• in Inception v2 [Ioffe & Szegedy, 2015] the 5x5 filters are replaced with two 3x3
• most output channels are computed with the fast layers (pooling, 1x1 conv.), not the slow ones (3x3, 5x5 conv.), e.g.
  1024 (pool) + 352 (1x1 conv) + 320 (3x3 conv) + 224 (5x5 conv) = 1920 (out)

[Diagram: Inception module, naïve version]
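The channel arithmetic of the naïve module is plain concatenation along the channel axis; a sketch with the branch channel counts from the slide (the spatial size is illustrative):

```python
import numpy as np

# naïve Inception module: run branches on the same input and concatenate
# the resulting feature maps along the channel axis
w, h = 7, 7
branch_channels = {"pool": 1024, "1x1 conv": 352, "3x3 conv": 320, "5x5 conv": 224}
branches = [np.zeros((c, w, h)) for c in branch_channels.values()]

out = np.concatenate(branches, axis=0)
print(out.shape)  # (1920, 7, 7): 1024 + 352 + 320 + 224 output channels
```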

Page 26:

Inception Module (2)
• Computation time & number of parameters are reduced by 1x1 convolutions
 – dimensionality reduction
 – also increases depth
• Allows for increasing #channels without a large penalty
• single Inception module depth: 3

[Diagram: Inception module with dimensionality reduction]

Page 27:

Inception Net v2 [Ioffe & Szegedy, 2015]
• depth: 34 (10 Inception modules, 3 conv., 1 FC)
• aggressive spatial downsampling
 – first layers quickly decrease resolution by 8×
 – lots of depth in the further stacks

Page 28:

Architectures: Comparison

VGG-19:          2-conv/1 → pool/2 → 2-conv/1 → pool/2 → 4-conv/1 → pool/2 → 4-conv/1 → pool/2 → 4-conv/1 → pool/2 → 3-layer
MSRA-22:         1-conv/2 → pool/2 → 6-conv/1 → pool/2 → 6-conv/1 → pool/2 → 6-conv/1 → SP pool → 3-layer
InceptionNet v2: 1-conv/2 → pool/2 → 2-conv/1 → pool/2 → 2-Inception/1 (6-conv) → 1-Inception/2 (3-conv) → 4-Inception/1 (12-conv) → 1-Inception/2 (3-conv) → 2-Inception/1 (6-conv) → pool/7 → 1-layer

InceptionNet:
• less deep in the first blocks, but deeper in the following ones
• instead of pooling – Inception with stride 2 (pooling is inside)

Page 29:

Outline: Training
• Optimisation
• Regularisation
• Initialisation
• Batch normalisation

Page 30:

Optimisation
• Learning objective – multinomial logistic regression (“softmax loss”)
• A plethora of gradient-based optimisation methods
 – in common: gradients are computed with back-prop
 – then, weights can be updated in different ways:
  • SGD, ADAGRAD, RMSPROP, etc.
• SGD with momentum works very well in practice
 – but it is important to get the hyper-parameters right

Page 31:

Learning Rate
• Very important to set it properly
 – too low – training is slow; too high – training diverges
• Conventional strategy
 – start with a reasonably high learning rate (e.g. 0.01)
 – divide it by a constant factor (e.g. 10)
  • when the validation error plateaus

[Chart: val. error vs. iteration, dropping at each learning-rate reduction]
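A sketch of this strategy as a plateau-triggered step decay (the `patience` threshold and helper names are illustrative, not from the talk):

```python
# divide the learning rate by `factor` once the validation error has not
# improved for `patience` consecutive evaluations
def make_lr_schedule(initial_lr=0.01, factor=10.0, patience=3):
    state = {"lr": initial_lr, "best": float("inf"), "bad": 0}
    def step(val_error):
        if val_error < state["best"]:
            state["best"] = val_error
            state["bad"] = 0
        else:
            state["bad"] += 1
            if state["bad"] >= patience:
                state["lr"] /= factor   # divide by a constant factor
                state["bad"] = 0
        return state["lr"]
    return step

step = make_lr_schedule()
errors = [0.9, 0.5, 0.4, 0.4, 0.4, 0.4, 0.35]  # val. error plateaus at 0.4
lrs = [step(e) for e in errors]
print(lrs)
```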

Page 32:

Regularisation
• Training suffers from over-fitting, even on ILSVRC
• Two simple and effective techniques, used in most submissions since AlexNet
 – weight decay (L2 norm penalty)
 – dropout
• Batch normalisation [Ioffe & Szegedy, 2015]
 – regularises and speeds up training

Page 33:

Initialisation
• Sample from a zero-mean normal distribution with fixed variance, e.g. N(0, 0.01)
 – works fine for shallow nets
 – deeper nets suffer from the vanishing/exploding gradient problem
• Adaptively choose the variance for each layer
 – preserve gradient magnitude [Glorot & Bengio, 2010]: σ = sqrt(1/N_in)
  • FC layers: N_in = #input channels
  • conv. layers: N_in = #input channels × size²
 – compensate for ReLU (MSRA, [He et al., 2015]): σ = sqrt(2/N_in)
• Supervised pre-training – init deep with shallow [VGG-Net]
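A sketch of the adaptive initialisation for a conv. layer (the layer sizes are illustrative):

```python
import numpy as np

def init_conv(c_in, c_out, k, relu=True, seed=0):
    # fan-in of a conv. layer: #input channels x kernel size^2
    n_in = c_in * k * k
    # He init (sigma = sqrt(2/N_in)) compensates for ReLU zeroing half
    # the activations; Glorot-style init uses sigma = sqrt(1/N_in)
    sigma = np.sqrt((2.0 if relu else 1.0) / n_in)
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, sigma, size=(c_out, c_in, k, k))

w = init_conv(64, 64, 3)
print(float(w.std()))  # close to sqrt(2/576) ≈ 0.059
```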

Page 34:

Batch Normalisation
• The distribution of activations changes during training, making training harder
• Whitening of neural net inputs is a standard pre-processing technique
• Batch normalisation [Ioffe & Szegedy, 2015] normalises the outputs of each layer to zero mean and unit variance
 – can be seen as diagonal whitening
 – performed after each weight layer, before ReLU

Page 35:

Batch Normalisation (2)
• scale and shift parameters are learnt
• doing backprop through batchnorm is important
• nets with batchnorm need less regularisation
 – smaller/zero dropout & weight decay

[Chart: accuracy vs. iteration – batchnorm converges faster]
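A minimal batchnorm forward pass over a batch of feature vectors (training-time batch statistics only; running averages for inference are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # normalise each feature over the batch to zero mean / unit variance,
    # then apply the learnt scale (gamma) and shift (beta)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(128, 16))   # batch of 128, 16 features
y = batchnorm_forward(x, gamma=np.ones(16), beta=np.zeros(16))
print(float(y.mean()), float(y.var()))     # close to 0 and 1
```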

Page 36:

Data Augmentation
• ILSVRC is still too small for large ConvNets
 – over-fitting in spite of regularisation
• Data augmentation (jittering) increases the amount of training data
• Transforms the original images in a way which
 – preserves their label
 – is realistic
• Helpful for both training and evaluation

Page 37:

Random Crop Augmentation
• Randomly sample a fixed-size sub-image (224x224)
 – the crop is a ConvNet input
 – essential component of most ImageNet submissions since AlexNet
• Original image is rescaled to a certain smallest side
 – affects the scale of image statistics seen by a ConvNet
 – single-scale: 256xN or 384xN
 – multi-scale: randomly sample the size for each image from 256xN to 512xN
• Random horizontal flips

[Diagram: 224x224 crops sampled from images rescaled to smallest side 256 (N≥256) and 384 (N≥384)]
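A sketch of the crop-and-flip sampling (the image size is illustrative):

```python
import numpy as np

def random_crop_flip(image, crop=224, seed=0):
    # sample a random crop x crop sub-image, flip it horizontally half the time
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]          # horizontal flip
    return patch

image = np.zeros((256, 340, 3))         # image rescaled to smallest side 256
print(random_crop_flip(image).shape)    # (224, 224, 3)
```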

Page 38:

Photometric Distortion Augmentation
• Random RGB shift [AlexNet]
• Randomly adjust contrast, brightness, and colour [Howard, 2013]
• Vignetting and lens distortion [Deep Image]

Page 39:

Outline: Evaluation
• Multi-crop evaluation
• Dense evaluation
 – fully-convolutional nets
• Model ensembles

Page 40:

Multi-Crop Evaluation
• Network is trained on fixed-size (224x224) crops
• Full image is normally larger, so
 – tile the image with crops
 – evaluate the net on each crop and average the predictions
• More crops – higher accuracy, but slower
 – Single-scale: 5 crops x 2 flips = 10 crops [AlexNet]
 – Multi-scale
  • rescale the image to several sizes, sample crops in each
  • [Howard, 2013]: 3 scales, 90 crops; [GoogLeNet]: 4 scales, 144 crops
 – disadvantage: slow, as the ConvNet is evaluated from scratch for each crop

Page 41:

Dense Evaluation
• ConvNets can be applied to an image of any size
• Network should be fully-convolutional
 – fully-connected layers expect fixed-resolution input
 – so they should be converted to conv. layers
• Conversion (on the example of VGG-Net)
 – assume an FC layer has a 512x7x7 input and a 4096-D output
 – it can be seen as a conv. layer with a 7x7 receptive field, 512 input channels & 4096 output channels: 512x7x7 -> 4096x1x1
• Output of a fully-conv. net is a class score map, which should be pooled globally to produce a vector of scores
• Used in OverFeat [Sermanet et al., 2013] & VGG-Net
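The FC-to-conv conversion can be checked numerically: reshaping the FC weight matrix into 4096x512x7x7 conv. weights gives the same output at a single spatial position (a sketch with random weights):

```python
import numpy as np

# an FC layer over a 512x7x7 input is the same linear map as a 7x7
# convolution with 512 input and 4096 output channels
rng = np.random.default_rng(0)
fc_w = rng.standard_normal((4096, 512 * 7 * 7))   # FC weight matrix
conv_w = fc_w.reshape(4096, 512, 7, 7)            # reshaped conv. weights

x = rng.standard_normal((512, 7, 7))              # one input position
fc_out = fc_w @ x.reshape(-1)                     # FC forward pass
conv_out = np.tensordot(conv_w, x, axes=3)        # conv. at this position
assert np.allclose(fc_out, conv_out)
print(conv_out.shape)  # (4096,)
```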

Page 42:

Effect of Scale (VGG-Net)
• Dense evaluation results
• Using multiple scales is important
 – multi-scale training outperforms single-scale
 – multi-scale testing further improves the results

Top-5 classification error (val. set), train/test scales:
               13 layers  16 layers  19 layers
single/single  9.4        8.8        9.0
multi/single   8.8        8.1        8.0
multi/multi    8.2        7.5        7.5

Page 43:

Evaluation: Dense vs Multi-Crop
• Dense evaluation is on par with multi-crop
• Dense & multi-crop are complementary
• Combining predictions from 2 nets is beneficial, but slow

Top-5 classification error (val. set), networks vs. evaluation:
               dense  150 crops  dense & 150 crops
16-layer       7.5    7.5        7.1
19-layer       7.5    7.4        7.2
16 & 19-layer  7.2    7.1        6.8

Page 44:

Model Ensembles
• Training multiple models and combining their predictions improves the accuracy
 – average the soft-max posteriors
• Used in all top-performing submissions to ILSVRC
• Models don't need to be the same
 – can simply combine your best models developed by the submission time
• Examples of ensembles' improvement:
 – VGG-Net: error decreases from 7.1% (1 net) to 6.8% (2 nets)
 – GoogLeNet: from 7.9% (1 net) to 6.7% (7 nets)
 – InceptionNet v2 (batchnorm): from 5.8% to 4.8% (6 nets)
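Averaging the soft-max posteriors, as described above (the logits are made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

# ensembling: average the per-class soft-max posteriors of several models
logits_a = np.array([2.0, 1.0, 0.1])   # model 1 class scores (illustrative)
logits_b = np.array([1.5, 1.4, 0.2])   # model 2 class scores (illustrative)
ensemble = (softmax(logits_a) + softmax(logits_b)) / 2

print(float(ensemble.sum()))   # ≈ 1.0: still a valid distribution
print(int(ensemble.argmax()))  # class predicted by the ensemble
```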

Page 45:

Object Localisation (In Brief)
• ILSVRC localisation task: classify and localise a single object (which is guaranteed to be in the image)
• Object detection approaches would require adaptation
 – not all the objects are annotated in the training set
• Object bounding box regression with ConvNets [OverFeat]
 – last layer predicts a bounding box
  • class-agnostic [OverFeat]
  • per-class [VGG]
 – initialised with classification nets
 – fine-tuning of all layers

[Diagram: 224x224 crop with the regressed object box]

Page 46:

Object Detection (In Brief)
• Common approach:
 – generate a large number of bounding box proposals
 – classify them using visual features
• ConvNet features work very well!
 – R-CNN [Girshick et al., 2013]
• Fast R-CNN [Girshick et al., 2015]
 – for each proposal, predicts its class and a precise bbox location
 – re-uses conv. features, no need to re-compute them
• Proposals
 – Selective Search
 – Multi-Box
 – Faster R-CNN

Page 47:

Infrastructure
• Good infrastructure is just as important
• A number of off-the-shelf deep learning packages
 – Torch, Caffe, Theano, MatConvNet
• Using GPUs is a must
 – most packages use the same low-level back-ends, e.g. cuDNN or cuBLAS, so speed is comparable
• Multi-GPU training helps a lot
 – available in the packages above

Page 48:

Summary
• Deep ConvNets – an essential component of top ILSVRC submissions since 2012
• Depth is important
• Other essentials:
 – extensive augmentation at multiple scales
 – dropout, batch normalisation, weight decay
• Next talk will cover the implementation side…

Page 49:

References
• Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation 1989.
• Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE 1998.
• X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010.
• A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
• M. Lin, Q. Chen, and S. Yan. Network In Network. ICLR 2014.
• P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. ICLR 2014.
• A. G. Howard. Some improvements on deep convolutional neural network based image classification. ICLR 2014.
• D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable Object Detection using Deep Neural Networks. CVPR 2014.
• M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. ECCV 2014.
• K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
• R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep Image: Scaling up Image Recognition. arXiv 2015.
• K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv 2015.
• C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper With Convolutions. CVPR 2015.
• S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.
• R. Girshick. Fast R-CNN. arXiv 2015.