ILSVRC Submission Essentials in the light of recent developments
ILSVRC Tutorial @ CVPR-2015
7 June 2015
Karen Simonyan
Outline
• Architectures
  – Convolutional Networks: recap
  – The importance of depth in image representations
    • very deep ConvNets (VGG-Net and extensions)
    • Inception modules (GoogLeNet)
• Training
  – Optimisation
  – Data augmentation
• Evaluation
• References
Convolutional Networks
• State-of-the-art in image recognition
  – winner of ILSVRC since 2012
• ConvNet: hierarchical image representation [LeCun et al., 89, 98]
  – stack of conv. layers, interleaved with non-linearities
  – typically followed by fully-connected layers
[Figure: ConvNet schematic]
Convolutional Networks (2)
• Important conv. layer properties:
  – locality: objects/parts have local spatial support
  – weight sharing: translation equivariance
• Conv. layers operate across all channels, not just one
• Each layer is followed by a non-linearity (activation function), e.g. ReLU: max(W*x, 0)
• Some layers are followed by spatial pooling
  – max- or sum-pooling
  – invariance to local translation
Convolutional Networks (3)
• Supervised training by back-propagation
  – gradient descent & chain rule
• End-to-end training
  – all layers learnt jointly, no hand-crafting
• But some engineering is still needed to put together an architecture
  – number of layers, feature channels, etc.
  – some guidelines will be provided in this talk
AlexNet
• Winner of ILSVRC-2012 [Krizhevsky et al.]
• ConvNet with 8 layers (5 conv. & 3 FC)

layer              output size
input image        3x224x224
conv-96x11x11/4    96x56x56
maxpool/2          96x28x28
conv-256x5x5       256x28x28
maxpool/2          256x14x14
conv-384x3x3       384x14x14
conv-384x3x3       384x14x14
conv-256x3x3       256x14x14
maxpool/2          256x7x7
full-4096          4096
full-4096          4096
full-1000          1000

With depth:
• spatial resolution is gradually reduced
• number of channels (feature dimension) is increased
• higher-level representations, more spatial invariance
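The output sizes in the AlexNet table can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming every layer is padded so that it simply divides the resolution by its stride (which is what the table's numbers imply); the helper names `out_size` and `trace` are illustrative and not from the talk.

```python
def out_size(size, stride):
    """Spatial size after a padded layer with the given stride (ceiling division)."""
    return -(-size // stride)

# (layer name, output channels or None to keep the input channels, stride)
LAYERS = [
    ("conv-96x11x11/4", 96, 4),
    ("maxpool/2", None, 2),
    ("conv-256x5x5", 256, 1),
    ("maxpool/2", None, 2),
    ("conv-384x3x3", 384, 1),
    ("conv-384x3x3", 384, 1),
    ("conv-256x3x3", 256, 1),
    ("maxpool/2", None, 2),
]

def trace(channels, size, layers):
    """Return (name, channels, height, width) after each layer."""
    shapes = []
    for name, out_channels, stride in layers:
        if out_channels is not None:
            channels = out_channels
        size = out_size(size, stride)
        shapes.append((name, channels, size, size))
    return shapes

shapes = trace(3, 224, LAYERS)
for name, c, h, w in shapes:
    print(f"{name:18s} {c}x{h}x{w}")
```

The last conv. block ends at 256x7x7, so the first fully-connected layer sees 256·7·7 inputs.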
Deeper is Better
• Each weight layer performs a linear operation, followed by a non-linearity
  – a single layer can be seen as a linear classifier itself
• More layers – more non-linearities
  – leads to a more discriminative model
• What limits the number of layers?
  – many models use pooling after each conv. layer
    • input image resolution sets the limit: log(s) for an s×s input (base 2 for stride-2 pooling)
  – computational complexity
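The resolution limit can be checked numerically. A small sketch, assuming stride-2 pooling that rounds the size down; `max_halvings` is a hypothetical helper name.

```python
import math

def max_halvings(side):
    """Number of stride-2 reductions before a side x side map shrinks to 1 x 1."""
    count = 0
    while side > 1:
        side //= 2  # pooling with stride 2, size rounded down
        count += 1
    return count

for s in (32, 224, 256):
    print(s, max_halvings(s), math.floor(math.log2(s)))
```

If pooling follows every conv. layer, this count is also the maximum number of conv. layers, which is why deeper nets stack several conv. layers between pooling layers.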
Building Very Deep Nets (1)
• Stack several layers between pooling
  – #conv. layers >> #pooling layers
  – #conv. layers does not affect resolution if each layer preserves spatial resolution:
    • conv. stride = 1 & input is padded
• More generally, interleave deep multi-layer blocks with resolution reduction layers
[Figure: conv. stacks (deep multi-layer processing) interleaved with pooling layers (resolution reduction)]
Building Very Deep Nets (2)
• Stack of small (3x3) conv. layers
  – has a large receptive field
    • two 3x3 layers – 5x5 receptive field
    • three 3x3 layers – 7x7 receptive field
  – faster than a stack of large conv. layers
  – fewer parameters than a single layer with large kernels
[Figure: two stacked 3x3 conv. layers (1st and 2nd) covering a 5x5 input region]
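Both claims about 3x3 stacks follow from simple counting. The sketch below assumes stride-1 layers and, for the parameter count, the same number of channels `C` at the input and output of every layer (the value 256 is only an illustration); biases are ignored.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stride-1 conv. layers with square kernels."""
    return 1 + num_layers * (kernel - 1)

def conv_weights(kernel, channels):
    """Weights of one kernel x kernel conv. layer with `channels` inputs and outputs."""
    return kernel * kernel * channels * channels

C = 256  # illustrative channel count, not taken from the slides

stack_3x3 = 3 * conv_weights(3, C)  # three 3x3 layers: 27 * C^2 weights
single_7x7 = conv_weights(7, C)     # one 7x7 layer:    49 * C^2 weights

print(stacked_receptive_field(2), stacked_receptive_field(3))
print(stack_3x3, single_7x7)
```

The stack covers the same 7x7 region with roughly 45% fewer weights, and it applies three non-linearities instead of one.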
Very Deep Nets at ILSVRC
• Large depth and small filters are used in two top-performing ILSVRC-2014 submissions
  – GoogLeNet (Inception) [Szegedy et al., 2014]
  – VGG-Net [Simonyan & Zisserman, 2014]
• as well as in the follow-up works
  – Delving deep into rectifiers (MSRA, [He et al., 2015])
  – Deep Image (Baidu, [Wu et al., 2015])
  – Inception v2 (Google, [Ioffe and Szegedy, 2015])
VGG-Net
• Straightforward implementation of very deep nets:
  – stacks of conv. layers w/o pooling
  – 3x3 conv. kernels: very small
  – conv. stride 1: no skipping
• Other details are conventional:
  – 5 max-pool layers
  – no normalisation layers
  – 3 fully-connected layers
[Figure: 13-layer VGG-Net: image → 2×conv-64 → maxpool → 2×conv-128 → maxpool → 2×conv-256 → maxpool → 2×conv-512 → maxpool → 2×conv-512 → maxpool → FC-4096 → FC-4096 → FC-1000 → softmax]
VGG-Net Incarnations
• Started from 11 layers & injected more conv. layers
• Extra layers injected into deeper stacks
  – first layers capture lower-level primitives, don't need to be very discriminative
  – spatial resolution is higher in the first layers, so adding extra layers there is computationally prohibitive
• 16- and 19-layer models are publicly available

[Figure: the four configurations side by side; number of conv. layers per stack, each stack followed by maxpool]
stack       11-layer   13-layer   16-layer   19-layer
conv-64     1          2          2          2
conv-128    1          2          2          2
conv-256    2          2          3          4
conv-512    2          2          3          4
conv-512    2          2          3          4
All configurations end with FC-4096, FC-4096, FC-1000, softmax.
Effect of VGG-Net Depth
• Error decreases with depth
• Plateaus after 16 layers
  – could be due to training specifics

[Figure: bar chart, Top-5 Classification Error (Val. Set), %]
11 layers   10.4
13 layers   9.4
16 layers   8.8
19 layers   9.0
VGG-Net Layer Pattern
[Figure: 19-layer VGG-Net annotated with its pattern: 2-conv/1 → pool/2 → 2-conv/1 → pool/2 → 4-conv/1 → pool/2 → 4-conv/1 → pool/2 → 4-conv/1 → pool/2 → 3-fc]
• Multi-layer stacks (conv. layers, stride=1) interleaved with resolution reduction (max-pooling, stride=2)
• Other very deep nets (incl. GoogLeNet) follow the same or a similar pattern
VGG-Net Extensions
• Deep Image (Baidu, [Wu et al., 2015])
  – VGG-16 and VGG-19 models with more channels
• Delving Deep Into Rectifiers (MSRA, [He et al., 2015])
  – aggressive downsampling: 7x7 conv. with stride 2 (cf. GoogLeNet)
  – 6-layer stacks instead of 4-layer
  – Spatial Pyramid pooling [He et al., 2014]

[Figure: layer patterns side by side]
VGG-19:  2-conv/1 → pool/2 → 2-conv/1 → pool/2 → 4-conv/1 → pool/2 → 4-conv/1 → pool/2 → 4-conv/1 → pool/2 → 3-layer
MSRA-22: 1-conv/2 → pool/2 → 6-conv/1 → pool/2 → 6-conv/1 → pool/2 → 6-conv/1 → SP pool → 3-layer
Parametric ReLU [He et al., 2015]
• Activation function: $f(y_i) = \max(0, y_i) + a_i \min(0, y_i)$
• $a_i$ is learnable with back-prop
  – per-channel or per-layer
  – learnable activation function!
• Generalises
  – ReLU ($a_i = 0$)
  – leaky ReLU ($a_i = 0.01$)
• 0.5%/0.2% top-1/top-5 error reduction
[Figure: ReLU vs. PReLU activation curves]
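A parametric ReLU passes positive inputs unchanged and scales negative inputs by a slope `a`. The sketch below is a scalar illustration with hypothetical function names; it also shows the derivative with respect to `a`, which is what allows the slope to be learnt by back-prop.

```python
def prelu(y, a):
    """Parametric ReLU: identity for y > 0, slope a for y <= 0."""
    return max(0.0, y) + a * min(0.0, y)

def prelu_grad_a(y):
    """Derivative of prelu(y, a) with respect to the slope a."""
    return min(0.0, y)

for a in (0.0, 0.01, 0.25):
    print(a, [prelu(y, a) for y in (-2.0, -0.5, 0.0, 1.5)])
```

With `a = 0` the function reduces to ReLU and with `a = 0.01` to leaky ReLU, matching the two special cases on the slide.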
GoogLeNet (Inception)
• Developed concurrently with VGG-Net
• Some design choices are similar:
  – very deep (22 layers)
  – small filters
    • 3x3, 5x5, 7x7 (1st layer only) in [Szegedy et al., 2014]
    • 3x3 and 7x7 (1st layer only) in [Ioffe & Szegedy, 2015]
• But more computationally and parameter-efficient, due to the multi-branch "Inception" modules
Prerequisite: 1x1 Convolution
• Doesn't capture spatial context, only operates across channels
• Performs linear projection of one pixel's features
  – can be used for dimensionality reduction:
    $F_{out} = W \times F_{in}$, where $F_{in} \in \mathbb{R}^{c_{in} \times wh}$, $W \in \mathbb{R}^{c_{out} \times c_{in}}$, $F_{out} \in \mathbb{R}^{c_{out} \times wh}$
• Also increases the depth
  – computationally- and parameter-cheap
• used in the "Network in Network" architecture [Lin et al., 2014]
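Because a 1x1 convolution mixes channels independently at every pixel, it can be written as one matrix product over a feature map flattened to channels × pixels. A minimal pure-Python sketch with toy sizes (3 input channels, 2 output channels, 2 pixels); the function name and the numbers are illustrative only.

```python
def conv1x1(W, F_in):
    """Apply a 1x1 convolution: W is c_out x c_in, F_in is c_in x (w*h)."""
    c_in = len(F_in)
    n_pixels = len(F_in[0])
    return [
        [sum(row[c] * F_in[c][p] for c in range(c_in)) for p in range(n_pixels)]
        for row in W
    ]

F_in = [[1.0, 2.0],   # channel 0 at pixels 0 and 1
        [3.0, 4.0],   # channel 1
        [5.0, 6.0]]   # channel 2
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]

F_out = conv1x1(W, F_in)
print(F_out)
```

Choosing fewer rows in `W` than input channels gives the dimensionality reduction used inside Inception modules.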
Inception Module
Conv. filters of different size alongside each other
• resulting feature maps are concatenated
• filter sizes: 1x1, 3x3, 5x5 & max/avg-pooling
• in Inception v2 [Ioffe & Szegedy, 2015] 5x5 is replaced with two 3x3
• most output channels are computed with fast layers, e.g.
  1024 (pool) + 352 (1x1 conv) + 320 (3x3 conv) + 224 (5x5 conv) = 1920 (out),
  where the pool and 1x1 branches are fast and the 3x3 and 5x5 branches are slow
[Figure: Inception module, naïve version]
Inception Module
• Computation time & number of parameters reduced by 1x1 convolution
  – dimensionality reduction
  – also increases depth
• Allows for increasing #channels without large penalty
• single Inception module depth: 3
[Figure: Inception module with dim. reduction]
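The saving from a 1x1 reduction layer can be estimated by counting multiply-accumulates. The channel counts and map size below are assumptions chosen for illustration, not figures from the slides.

```python
def conv_macs(kernel, c_in, c_out, width, height):
    """Multiply-accumulates of a stride-1, padded kernel x kernel conv. layer."""
    return kernel * kernel * c_in * c_out * width * height

# Illustrative sizes: 192 input channels, reduced to 16, then 32 output channels, 28x28 map.
c_in, c_mid, c_out, side = 192, 16, 32, 28

naive = conv_macs(5, c_in, c_out, side, side)
reduced = (conv_macs(1, c_in, c_mid, side, side)
           + conv_macs(5, c_mid, c_out, side, side))

print(naive, reduced, round(naive / reduced, 1))
```

With these numbers the 5x5 branch becomes almost ten times cheaper, while the extra 1x1 layer adds one more level of depth.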
Inception Net v2 [Ioffe & Szegedy, 2015]
• depth: 34 (10 Inception modules, 3 conv., 1 FC)
• aggressive spatial downsampling
  – first layers quickly decrease resolution by 8
  – lots of depth in further stacks
Architectures: Comparison
[Figure: layer patterns side by side]
VGG-19:  2-conv/1 → pool/2 → 2-conv/1 → pool/2 → 4-conv/1 → pool/2 → 4-conv/1 → pool/2 → 4-conv/1 → pool/2 → 3-layer
MSRA-22: 1-conv/2 → pool/2 → 6-conv/1 → pool/2 → 6-conv/1 → pool/2 → 6-conv/1 → SP pool → 3-layer
InceptionNet v2: 1-conv/2 → pool/2 → 2-conv/1 → pool/2 → 2-Inception/1 (6-conv) → 1-Inception/2 (3-conv) → 4-Inception/1 (12-conv) → 1-Inception/2 (3-conv) → 2-Inception/1 (6-conv) → pool/7 → 1-layer

InceptionNet
• less deep in the first blocks, but deeper in the following ones
• instead of pooling: Inception with stride 2 (pooling is inside)
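The depths quoted in the talk can be cross-checked by summing the weight layers of each block in the comparison figure (2-conv, 4-conv, 6-conv, 12-conv and so on). The block ordering below is a reading of the figure and should be treated as an assumption; pooling blocks contribute no weight layers and are left out.

```python
# Weight layers per block, in network order; pooling blocks are omitted.
PATTERNS = {
    "VGG-19": [2, 2, 4, 4, 4, 3],
    "MSRA-22": [1, 6, 6, 6, 3],
    "InceptionNet v2": [1, 2, 6, 3, 12, 3, 6, 1],
}

depths = {name: sum(blocks) for name, blocks in PATTERNS.items()}
for name, depth in depths.items():
    print(f"{name}: {depth} weight layers")
```

Each Inception module counts as three layers here, which matches "single Inception module depth: 3" and gives 10 × 3 + 3 + 1 = 34 for InceptionNet v2.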
Outline: Training
• Optimisation
• Regularisation
• Initialisation
• Batch normalisation
Optimisation
• Learning objective
  – multinomial logistic regression ("softmax loss")
• A plethora of gradient-based optimisation methods
  – in common: gradients are computed with back-prop
  – then, weights can be updated in different ways:
    • SGD, ADAGRAD, RMSPROP, etc.
• SGD with momentum works very well in practice
  – but important to get hyper-parameters right
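One common form of the SGD-with-momentum update keeps a velocity per weight, decays it by the momentum coefficient and adds the scaled negative gradient. The sketch below applies it to a toy quadratic; the learning rate 0.01 and momentum 0.9 are illustrative defaults, not values prescribed by the talk.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update; returns the new (weights, velocity)."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy problem: minimise f(w) = 0.5 * w^2, whose gradient is w itself.
w, v = [1.0], [0.0]
for _ in range(200):
    w, v = sgd_momentum_step(w, [w[0]], v)

print(w[0])
```

In a real ConvNet the gradients would come from back-prop on a mini-batch; only the update rule is shown here.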
Learning Rate
• Very important to set it properly
  – too low: training is slow; too high: training diverges
• Conventional strategy
  – start with a reasonably high learning rate (e.g. 0.01)
  – divide it by a constant factor (e.g. 10)
    • when the validation error plateaus
[Figure: validation error vs. iteration]
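The "divide when the validation error plateaus" strategy can be sketched as follows. The plateau test (no improvement larger than `min_delta` for `patience` consecutive checks) is an assumption made for this example; the slides do not define how a plateau is detected.

```python
def step_schedule(val_errors, lr=0.01, factor=10.0, patience=2, min_delta=1e-3):
    """Return the learning rate used at each check, dividing by `factor` on plateaus."""
    rates = []
    best = float("inf")
    stale = 0
    for err in val_errors:
        rates.append(lr)
        if err < best - min_delta:
            best = err       # validation error improved
            stale = 0
        else:
            stale += 1       # no meaningful improvement at this check
            if stale >= patience:
                lr /= factor
                stale = 0
    return rates

errors = [0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3]
print(step_schedule(errors))
```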
Regularisation
• Training suffers from over-fitting, even on ILSVRC
• Two simple and effective techniques in most submissions since AlexNet
  – weight decay (L2 norm penalty)
  – dropout
• Batch normalisation [Ioffe & Szegedy, 2015]
  – regularises and speeds up training
Initialisation
• Sample from a zero-mean normal distribution with fixed variance, e.g. N(0; 0.01)
  – works fine for shallow nets
  – deeper nets suffer from the vanishing/exploding gradient problem
• Adaptively choose the variance for each layer
  – preserve gradient magnitude [Glorot & Bengio, 2010]: $\sigma = \sqrt{1/N_{in}}$
    • FC layers: $N_{in}$ = #input channels
    • conv. layers: $N_{in}$ = #input channels × size²
  – compensate for ReLU [He et al., 2015] (MSRA): $\sigma = \sqrt{2/N_{in}}$
• Supervised pre-training
  – init deep with shallow [VGG-Net]
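The adaptive rules can be sketched in a few lines, reading the slide's two formulas as σ = sqrt(1/N_in) for preserving gradient magnitude and σ = sqrt(2/N_in) for compensating ReLU. The function names and layer sizes are illustrative assumptions.

```python
import math
import random

def fan_in(in_channels, kernel_size=1):
    """N_in: input channels for FC layers, input channels x size^2 for conv. layers."""
    return in_channels * kernel_size ** 2

def init_std(n_in, relu=False):
    """Standard deviation sqrt(1/N_in), or sqrt(2/N_in) when compensating for ReLU."""
    return math.sqrt((2.0 if relu else 1.0) / n_in)

def sample_weights(n_out, n_in, std, rng):
    """Draw an n_out x n_in weight matrix from a zero-mean normal distribution."""
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

n_in = fan_in(64, kernel_size=3)  # a 3x3 conv. layer with 64 input channels
weights = sample_weights(8, n_in, init_std(n_in, relu=True), random.Random(0))
print(n_in, init_std(n_in), init_std(n_in, relu=True))
```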
Batch Normalisation
• The distribution of activations changes during training, making training harder
• Whitening of neural net inputs is a standard pre-processing technique
• Batch normalisation [Ioffe & Szegedy, 2015] normalises the outputs of each layer to zero mean and unit variance
  – can be seen as diagonal whitening
  – performed after each weight layer, before ReLU
Batch Normalisation (2)
• scale and shift parameters are learnt
• doing backprop through batchnorm is important
• nets with batchnorm need less regularisation
  – smaller/zero dropout & weight decay
[Figure: accuracy vs. iteration]
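The training-time forward pass of batch normalisation for a single channel can be sketched as below: normalise over the mini-batch, then apply the learnt scale and shift. The names `gamma`, `beta` and the small constant `eps` follow common usage and are assumptions here; the backward pass and the running statistics used at test time are omitted.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one channel's activations over a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
print(batch_norm(activations))
print(batch_norm(activations, gamma=2.0, beta=3.0))
```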
Data Augmentation
• ILSVRC is still too small for large ConvNets
  – over-fitting in spite of regularisation
• Data augmentation (jittering) increases the amount of training data
• Transforms original images in a way which
  – preserves their label
  – is realistic
• Helpful for both training and evaluation
Random Crop Augmentation
• Randomly sample a fixed-size sub-image (224x224)
  – the crop is a ConvNet input
  – essential component of most ImageNet submissions since AlexNet
• Original image is rescaled to a certain smallest side
  – affects the scale of image statistics seen by a ConvNet
  – single-scale: 256xN or 384xN
  – multi-scale: randomly sample the size for each image from 256xN to 512xN
• Random horizontal flips
[Figure: 224x224 crops sampled from images rescaled to 256xN (N≥256) and 384xN (N≥384)]
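Multi-scale random cropping can be sketched as a sampler that only picks the augmentation parameters; the actual resizing and cropping would be done by an image library. The function name, the returned dictionary and the 50% flip probability are assumptions made for this example.

```python
import random

def sample_crop(width, height, rng, crop=224, min_side=256, max_side=512):
    """Pick a training scale, a crop box and a flip flag for one image."""
    side = rng.randint(min_side, max_side)   # target length of the smallest side
    scale = side / min(width, height)
    new_w = max(side, round(width * scale))  # rescaled image size
    new_h = max(side, round(height * scale))
    x0 = rng.randint(0, new_w - crop)        # top-left corner of the crop
    y0 = rng.randint(0, new_h - crop)
    flip = rng.random() < 0.5                # random horizontal flip
    return {"size": (new_w, new_h), "box": (x0, y0, x0 + crop, y0 + crop), "flip": flip}

rng = random.Random(0)
print(sample_crop(500, 375, rng))
```

Setting `min_side == max_side` (e.g. 256 or 384) gives the single-scale variant.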
Photometric Distortion Augmentation
• Random RGB shift [AlexNet]
• Randomly adjust contrast, brightness, and colour [Howard, 2013]
• Vignetting and lens distortion [Deep Image]
Outline: Evaluation
• Multi-crop evaluation
• Dense evaluation
  – fully-convolutional nets
• Model ensembles
Multi-Crop Evaluation
• Network is trained on fixed-size (224x224) crops
• Full image is normally larger, so
  – tile the image with crops
  – evaluate the net and average predictions
• More crops – higher accuracy, but slower
  – Single-scale: 5 crops x 2 flips = 10 crops [AlexNet]
  – Multi-scale
    • rescale image to several sizes, sample crops in each
    • [Howard, 2013]: 3 scales, 90 crops; [GoogLeNet]: 4 scales, 144 crops
  – disadvantage: slow, as the ConvNet has to be evaluated from scratch on each crop
Dense Evaluation
• ConvNets can be applied to an image of any size
• Network should be fully-convolutional
  – fully-connected layers expect fixed-resolution input
  – so should be converted to conv.
• Conversion (on the example of VGG-Net)
  – assume FC layer has input 512x7x7, output is 4096-D
  – can be seen as conv. layer with 7x7 receptive field, 512 input channels & 4096 output: 512x7x7 -> 4096x1x1
• Output of full-conv. net is a class score map, should be pooled with global pooling to produce a vector of scores
• Used in OverFeat [Sermanet et al., 2013] & VGG-Net
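The FC-to-conv. conversion can be demonstrated on a toy layer. The sketch below uses tiny sizes (2 channels, 2x2 kernel, 3 outputs) instead of 512x7x7 -> 4096 so that it runs in pure Python; the function names and the channel-major flattening order are assumptions of this example.

```python
def fc_as_conv(fc_weights, channels, k):
    """Reshape FC weights [out][channels*k*k] into conv. kernels [out][channels][k][k]."""
    return [
        [[row[(c * k + y) * k:(c * k + y) * k + k] for y in range(k)]
         for c in range(channels)]
        for row in fc_weights
    ]

def conv_valid(features, kernels):
    """Unpadded stride-1 convolution: features [c][h][w] -> maps [out][h-k+1][w-k+1]."""
    c, h, w = len(features), len(features[0]), len(features[0][0])
    k = len(kernels[0][0])
    return [
        [[sum(kern[ci][dy][dx] * features[ci][y + dy][x + dx]
              for ci in range(c) for dy in range(k) for dx in range(k))
          for x in range(w - k + 1)]
         for y in range(h - k + 1)]
        for kern in kernels
    ]

def global_average(score_maps):
    """Pool each class score map to a single score."""
    return [sum(sum(row) for row in m) / (len(m) * len(m[0])) for m in score_maps]

fc = [[1, 0, 0, 0, 0, 0, 0, 0],
      [0, 1, 2, 3, 4, 5, 6, 7],
      [1, 1, 1, 1, 1, 1, 1, 1]]
kernels = fc_as_conv(fc, channels=2, k=2)

# A feature map larger than the kernel: 2 channels, 3 rows, 4 columns.
larger = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
          [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]]
score_maps = conv_valid(larger, kernels)
print(global_average(score_maps))
```

On an input of exactly the kernel size the converted layer returns the same values as the FC layer; on a larger input it returns a score map, which global pooling turns back into one score per class.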
Effect of Scale (VGG-Net)
• Dense evaluation results
• Using multiple scales is important
  – multi-scale training outperforms single-scale
  – multi-scale testing further improves the results

[Figure: bar chart, Top-5 Classification Error (Val. Set), %]
train/test scales   13 layers   16 layers   19 layers
single/single       9.4         8.8         9.0
multi/single        8.8         8.1         8.0
multi/multi         8.2         7.5         7.5
Evaluation: Dense vs Multi-Crop
• Dense evaluation is on par with multi-crop
• Dense & multi-crop are complementary
• Combining predictions from 2 nets is beneficial, but slow

[Figure: bar chart, Top-5 Classification Error (Val. Set), %]
networks         dense   150 crops   dense & 150 crops
16-layer         7.5     7.5         7.2
19-layer         7.5     7.4         7.1
16 & 19-layer    7.1     7.2         6.8
Model Ensembles
• Training multiple models and combining their predictions improves the accuracy
  – average soft-max posteriors
• Used in all top-performing submissions to ILSVRC
• Models don't need to be the same
  – can simply combine your best models developed by the submission time
• Examples of ensembles' improvement:
  – VGG-Net: error decreases from 7.1% (1 net) to 6.8% (2 nets)
  – GoogLeNet: from 7.9% (1 net) to 6.7% (7 nets)
  – InceptionNet v2 (batchnorm): from 5.8% to 4.8% (6 nets)
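Averaging soft-max posteriors is straightforward to sketch. The per-model class scores below are made-up numbers for a 3-class toy problem, and the function names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(per_model_scores):
    """Average the soft-max posteriors of several models; return (posterior, label)."""
    posteriors = [softmax(s) for s in per_model_scores]
    n = len(posteriors)
    avg = [sum(p[k] for p in posteriors) / n for k in range(len(posteriors[0]))]
    label = max(range(len(avg)), key=avg.__getitem__)
    return avg, label

scores = [[2.0, 1.0, 0.0],   # model 1 prefers class 0
          [0.0, 3.0, 0.0]]   # model 2 strongly prefers class 1
posterior, label = ensemble_predict(scores)
print(posterior, label)
```

Because the average is taken over posteriors rather than raw scores, models with different architectures or score ranges can be combined directly.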
Object Localisation (In Brief)
• ILSVRC localisation task: classify and localise a single object (which is guaranteed to be in the image)
• Object detection approaches would require adaptation
  – not all the objects are annotated in the training set
• Object bounding box regression with ConvNets [OverFeat]
  – last layer predicts a bounding box
    • class-agnostic [OverFeat]
    • for each class [VGG]
  – initialised with classification nets
  – fine-tuning of all layers
[Figure: object box inside a 224x224 crop]
Object Detection (In Brief)
• Common approach:
  – generate a large number of bounding box proposals
  – classify them using visual features
• ConvNet features work very well!
  – R-CNN [Girshick et al., 2013]
• Fast R-CNN [Girshick et al., 2015]
  – for each proposal, predicts its class and precise bbox location
  – re-uses conv. features, no need to re-compute
• Proposals
  – Selective search
  – Multi-Box
  – Faster R-CNN
Infrastructure
• Good infrastructure is just as important
• A number of off-the-shelf deep learning packages
  – Torch, Caffe, Theano, MatConvNet
• Using GPUs is a must
  – most packages use the same low-level back-ends, e.g. cuDNN or cuBLAS, so speed is comparable
• Multi-GPU training helps a lot
  – available in packages above
Summary
• Deep ConvNets: an essential component of top ILSVRC submissions since 2012
• Depth is important
• Other essentials:
  – extensive augmentation at multiple scales
  – dropout, batch normalisation, weight decay
• Next talk will cover the implementation side…
References
• Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
• Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
• X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010.
• A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
• M. Lin, Q. Chen, and S. Yan. Network In Network. ICLR 2014.
• P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. ICLR 2014.
• A. G. Howard. Some improvements on deep convolutional neural network based image classification. ICLR 2014.
• D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable Object Detection using Deep Neural Networks. CVPR 2014.
• M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. ECCV 2014.
• K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
• R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep Image: Scaling up Image Recognition. arXiv 2015.
• K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv 2015.
• C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper With Convolutions. CVPR 2015.
• S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.
• R. Girshick. Fast R-CNN. arXiv 2015.