Overview of Convolutional Neural Network · stat.snu.ac.kr/mcp/Lecture_5_DNN.pdf · 2019-09-23

Overview of Convolutional Neural network


Seoul National University Deep Learning September-December, 2019 1 / 54

Artificial Neural Networks

Perceptron: Building block

• The perceptron was intended to be a machine, rather than a program, and the perceptron machine was designed for image recognition with an array of 400 photocells.
• The perceptron is an algorithm for a binary classifier: $f(x) = 1$ if $wx + b > 0$, and $0$ otherwise.
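
The decision rule can be sketched in a few lines of NumPy. The weights `w` and bias `b` here are hypothetical values, picked by hand so that the perceptron computes a logical AND of two binary inputs; they are not taken from the lecture.

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Binary perceptron rule: 1 if w.x + b > 0, otherwise 0."""
    return int(np.dot(w, x) + b > 0)

# Hypothetical hand-picked parameters implementing AND on two binary inputs.
w = np.array([1.0, 1.0])
b = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron_predict(np.array(x, dtype=float), w, b) for x in inputs]
print(outputs)
```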



Single-layered neural network

• The perceptron model is called a single-layer neural network.



An example of learned filters or weights for input images

• Note that the filter size is the same as the input size.



Multi-layered feedforward neural network

figure from slides of Andrej Karpathy

Feedforward neural networks take input x and predict

$P(y = 1 \mid x, \theta) = f_k(\cdots f_3(f_2(f_1(x; \theta_1); \theta_2); \theta_3) \cdots; \theta_k)$.



Layers of Artificial Neural Network (ANN)

$f_l(\cdot)$ is commonly a repeated composition of linear and nonlinear transformations.

The network tries to estimate an invariant function in a compositional manner.

A unit of layers is composed of known and unknown transformations.

Convolutional layer: at the $l$-th layer, $Z^l = W^l h^{l-1} + b^l$, where $h^0 = x$.

$W$ = filters. $Z^l$ = neurons. The $W$'s and $b$'s are unknown and are to be estimated (trained).

Pooling layer

Activation layer: $h^l = g^l(Z^l)$, a nonlinear transformation

The last layer: softmax: $h^K_i = \exp(Z_i) / \sum_{l=1}^{k} \exp(Z_l)$, where $Z_i$ is the $i$-th component of the last layer's $Z$.
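
A minimal sketch of the softmax layer in NumPy. Subtracting the maximum before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result of the formula.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D vector of last-layer values Z."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()      # same output, but avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum()

probs = softmax([2.0, 1.0, 0.1])
print(probs, probs.sum())
```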



Convolutional neural network (CNN)

A CNN is a special case of a feedforward neural network with locality and weight-sharing restrictions.

This characteristic is referred to as ‘shift invariance’.

The restriction reduces the number of parameters and helps capture local characteristics.



Convolutional layer


Resulting output is a 28 by 28 activation map.



Convolutional layer


Apply 6 filters and obtain 6 activation maps.
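
A naive sketch of such a convolutional layer, assuming a 32x32x3 input and six 5x5x3 filters with stride 1 and no padding (the sizes quoted on the next slide). The function name `conv2d_valid` and the random input are illustrative only; real frameworks use far faster implementations.

```python
import numpy as np

def conv2d_valid(x, filters, bias):
    """x: (H, W, C); filters: (K, F, F, C); bias: (K,). Stride 1, no padding."""
    H, W, C = x.shape
    K, F, _, _ = filters.shape
    out_h, out_w = H - F + 1, W - F + 1
    out = np.empty((out_h, out_w, K))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + F, j:j + F, :]   # local F x F x C region
            # The same K filters are reused at every location (weight sharing).
            out[i, j, :] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2])) + bias
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32, 3))
filters = rng.normal(size=(6, 5, 5, 3))
bias = np.zeros(6)

maps = conv2d_valid(x, filters, bias)
print(maps.shape)  # six 28x28 activation maps
```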



Role of locality and sharing of convolutional layer

How do locality and sharing reduce the number of parameters?

If a 32x32x3 volume is processed into a 28x28x6 volume, as in the figure, using a fully connected layer, the number of parameters = (32*32*3)*(28*28*6) ≈ 14.5 million.

With six 5x5x3 filters, we use only (5*5*3)*6 = 450 parameters (ignoring biases).
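
The arithmetic can be checked directly. Bias terms are left out, as on the slide; the variable names are illustrative.

```python
# Fully connected: every input value connects to every output value.
fc_params = (32 * 32 * 3) * (28 * 28 * 6)

# Convolutional: six filters, each 5x5x3, shared across all spatial positions.
conv_params = (5 * 5 * 3) * 6

print(fc_params, conv_params, fc_params / conv_params)
```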



Pooling layer


Average pooling or max pooling shrinks the representations.

Recall that averaging or integration can extract invariant features of the images.

Figure: integration over all rotations
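
A minimal sketch of pooling on a single activation map. The 2x2 window with stride 2 is an assumption chosen for illustration, and `pool2d` is a hypothetical helper name.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Pool a 2-D map x with a size x size window moved by `stride`."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(x, mode="max")   # 4x4 map shrinks to 2x2
avg_pooled = pool2d(x, mode="avg")
print(max_pooled)
print(avg_pooled)
```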



Activation layer

$\mathrm{sigm}(Z) = \frac{1}{1 + \exp(-Z)}$

$\tanh(Z)$

Rectified Linear Unit: $\mathrm{ReLU}(Z) = \max(Z, 0)$
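
The three activations applied elementwise, as a short NumPy sketch. The last line checks the standard identity tanh(Z) = 2 sigm(2Z) - 1, which shows that tanh is a rescaled sigmoid.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

z = np.array([-2.0, 0.0, 2.0])
print(sigm(z))      # values in (0, 1)
print(np.tanh(z))   # values in (-1, 1)
print(relu(z))      # negative inputs are set to 0
print(np.allclose(np.tanh(z), 2.0 * sigm(2.0 * z) - 1.0))
```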



Stacked layers

The first layer:

$Z^1 = W^1 h^0 + b^1$, where $h^0 = x$.

$h^1 = g^1(Z^1)$, where $g^1(\cdot)$ is the activation function.

The $l$-th layer:

$Z^l = W^l h^{l-1} + b^l$

$h^l = g^l(Z^l)$
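
The recursion can be sketched as a loop over layers. For brevity this sketch uses dense matrices for the $W$'s rather than convolutions, and the small weights below are made-up values for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def identity(z):
    return z

def forward(x, weights, biases, activations):
    """Return [h^0, h^1, ..., h^L] with Z^l = W^l h^{l-1} + b^l and h^l = g^l(Z^l)."""
    h = x
    hs = [h]
    for W, b, g in zip(weights, biases, activations):
        Z = W @ h + b
        h = g(Z)
        hs.append(h)
    return hs

# Hypothetical two-layer example.
x = np.array([1.0, 2.0])
weights = [np.array([[1.0, -1.0], [0.0, 2.0]]), np.array([[1.0, 1.0]])]
biases = [np.array([0.0, -1.0]), np.array([0.5])]
hs = forward(x, weights, biases, [relu, identity])
print(hs[1], hs[2])
```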



Stride

• Shrink dimensions by subsampling.

Source: http://adeshpande4.github.io/A-Beginner%


Padding

Source: https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2
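
The combined effect of stride and padding on the output size follows the standard relation (N + 2P - F)/S + 1 for input size N, filter size F, padding P and stride S. The helper `conv_output_size` is a hypothetical name; this sketch raises an error when the filter does not tile the input exactly, whereas frameworks typically round down.

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Output length along one spatial dimension of a convolution."""
    span = n + 2 * pad - f
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not fit the padded input with this stride")
    return span // stride + 1

no_pad = conv_output_size(32, 5)              # 5x5 filter on 32x32, no padding
same_pad = conv_output_size(32, 3, pad=1)     # 3x3 filter, pad 1 keeps the size
strided = conv_output_size(7, 3, stride=2)    # stride 2 subsamples the output
print(no_pad, same_pad, strided)
```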



Role of multiple layers via visualization



Different architectures

The popularity of CNNs was triggered by the debut of ‘AlexNet’ by Krizhevsky et al. (2012), which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

The ImageNet competition is an annual computer vision contest running since 2010, after Li launched ImageNet, a free database of more than 14 million labeled images.

Successful training is attributed to a large dataset, computational power from GPUs, and some aspects of the algorithm.

Every year, new architectures and optimization tips proposed through the ImageNet competition have improved classification accuracy. We cover AlexNet, VGGNet and ResNet.



AlexNet by Krizhevsky et al. (2012)

Start with 224x224x3 input. End with three fully connected layers.

layer  filter size (stride)  # filters  maxpool (stride)  output
1.1    11x11x3 (4)           48x2       -                 55x55x96
1.2    -                     -          3x3 (2)           27x27x96
2.1    5x5x96                128x2      -                 27x27x256
2.2    -                     -          3x3 (2)           13x13x256
3      3x3x256               192x2      -                 13x13x384
4      3x3x384               192x2      -                 13x13x384
5.1    3x3x384               128x2      -                 13x13x256
5.2    -                     -          3x3 (2)           6x6x256=9216



AlexNet (Krizhevsky et al. 2012)

Used ReLU

Heavy data augmentation

Dropout

SGD with batch size 128 and momentum 0.9; the learning rate starts at 0.01 and is reduced manually.

Ensemble of 7 CNNs
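
The SGD-with-momentum update listed above can be sketched as follows, using the quoted learning rate 0.01 and momentum 0.9. The toy quadratic objective and the function name `sgd_momentum_step` are assumptions for illustration; the sketch omits mini-batching, weight decay and the manual learning-rate schedule.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9):
    """One update: v <- momentum * v - lr * grad, then w <- w + v."""
    v = momentum * v - lr * grad
    return w + v, v

# Toy objective 0.5 * ||w||^2, whose gradient is simply w (illustrative assumption).
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w_first, _ = sgd_momentum_step(w, v, grad=w)   # result of the first step alone

for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=w)
print(w_first, np.linalg.norm(w))
```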



VGGNet, OxfordNet (Simonyan and Zisserman, 2014)

Deeper model. More layers (16 layers excluding maxpool and softmax, compared to 8 for AlexNet: 5 convolutional plus 3 fully connected).

Simpler structure.

Only 3x3 filters with stride 1, pad 1, and 2x2 maxpool with stride 2, are used.

Number of filters multiplied by two (64, 128, 256, 512)

Source: https://blog.heuritech.com/2016/02/29



VGGNet, OxfordNet (Simonyan and Zisserman, 2014)

Table: Structure of VGGNet

block  # conv or fully connected layers  filter size  # filters (units)     followed by
1      2 conv                            3x3          64                    maxpool
2      2 conv                            3x3          128                   maxpool
3      3 conv                            3x3          256                   maxpool
4      3 conv                            3x3          512                   maxpool
5      3 conv                            3x3          512                   maxpool
6      3 fully connected                 -            4096 (2), 1000 (1)    softmax

• maxpool after each block
• 140M parameters (heavy from FC layers)
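
The parameter total can be reproduced from the table. This sketch assumes a 224x224x3 input (so the five maxpools leave a 7x7x512 volume before the first fully connected layer) and counts one bias per filter or unit; neither detail is stated on the slide.

```python
# (number of conv layers, number of 3x3 filters) for blocks 1 to 5 of the table.
blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k convolutional layer."""
    return k * k * c_in * c_out + c_out

conv_total = 0
c_in = 3  # RGB input channels
for n_layers, c_out in blocks:
    for _ in range(n_layers):
        conv_total += conv_params(c_in, c_out)
        c_in = c_out

# Assumption: 224x224 input halved by each of the 5 maxpools gives 7x7x512.
fc_sizes = [7 * 7 * 512, 4096, 4096, 1000]
fc_total = sum(a * b + b for a, b in zip(fc_sizes[:-1], fc_sizes[1:]))

total = conv_total + fc_total
print(conv_total, fc_total, total, fc_total / total)
```

Under these assumptions the fully connected layers hold close to 90% of the parameters, which is what "heavy from FC layers" refers to.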



VGGNet: Number of parameters and memory



Role of a small filter

If we stack two 3x3 convolutional layers, a neuron in the second layer will cover a 5x5 input region.

If we stack three 3x3 convolutional layers, a neuron in the third layer will cover a 7x7 input region.

If the number of filters is C: a 7x7 layer needs Cx(7x7xC) = 49C^2 parameters; three 3x3 layers need 3xCx(3x3xC) = 27C^2. Three 3x3 layers need fewer parameters, with more nonlinearity.
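
A quick check of the receptive-field and parameter counts, assuming stride 1 throughout and C = 64 channels in and out (an arbitrary illustrative value). Biases are ignored.

```python
def stacked_receptive_field(n_layers, k=3):
    """Input region seen by one neuron after n stacked k x k stride-1 conv layers."""
    return 1 + n_layers * (k - 1)

def conv_weights(k, c):
    """Weights of one k x k conv layer with c input and c output channels."""
    return c * (k * k * c)

C = 64
rf_two = stacked_receptive_field(2)
rf_three = stacked_receptive_field(3)
params_7x7 = conv_weights(7, C)
params_three_3x3 = 3 * conv_weights(3, C)
print(rf_two, rf_three, params_7x7, params_three_3x3)
```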

How about even a smaller filter?



Role of a 1x1 filter

• For an HxWxC input, C/2 filters of size 1x1, written 1x1x(C/2), output HxWx(C/2) (with stride 1 and padding to preserve H, W).
• (1. 1x1x(C/2) 2. 3x3x(C/2) 3. 1x1xC) vs. a single 3x3xC? The former needs fewer parameters and less computation, with more nonlinearity.
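
The parameter saving of the three-step design can be counted directly. C = 256 is an arbitrary illustrative value and biases are ignored.

```python
def conv_weights(k, c_in, c_out):
    """Weights of a k x k conv layer mapping c_in channels to c_out channels."""
    return k * k * c_in * c_out

C = 256
bottleneck = (
    conv_weights(1, C, C // 2)         # 1. 1x1 conv reduces C to C/2 channels
    + conv_weights(3, C // 2, C // 2)  # 2. 3x3 conv on the reduced volume
    + conv_weights(1, C // 2, C)       # 3. 1x1 conv restores C channels
)
single = conv_weights(3, C, C)         # one 3x3 conv with C filters
print(bottleneck, single, bottleneck / single)
```

In general the three-step version needs 3.25 C^2 weights against 9 C^2 for the single 3x3 layer.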



GoogLeNet (Szegedy et al., 2014)

Design a good local network topology and stack these modules.

Use of average pooling before the classification

Computationally expensive

Auxiliary classifiers connected to intermediate layers



ResNet (He, Zhang, Ren and Sun, 2015)

Is deeper always better? He et al. (2015) showed that deeper models can have higher training error than shallower models.

Instead of $f_2(f_1(x w_1) w_2)$ as in AlexNet or VGGNet, ResNet models the residual, i.e., $f_1(x w_1) + f_2(f_1(x w_1) w_2)$, so that $w_2 = 0$ reduces it to a shallow model (when $f_2(0) = 0$, as with ReLU).
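
A small sketch of the two forms, assuming ReLU for both $f_1$ and $f_2$ and dense matrices for $w_1$, $w_2$ (the lecture's networks use convolutions). With $w_2 = 0$ the residual form returns the shallow output $f_1(x w_1)$, while the plain form collapses to zero.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_block(x, w1, w2):
    """f2(f1(x w1) w2), as in a plain stacked network."""
    return relu(relu(x @ w1) @ w2)

def residual_block(x, w1, w2):
    """f1(x w1) + f2(f1(x w1) w2): the second layer only models a residual."""
    h = relu(x @ w1)
    return h + relu(h @ w2)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w1 = rng.normal(size=(3, 5))
w2_zero = np.zeros((5, 5))

shallow = relu(x @ w1)
res_out = residual_block(x, w1, w2_zero)
plain_out = plain_block(x, w1, w2_zero)
print(np.allclose(res_out, shallow), np.allclose(plain_out, 0.0))
```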



ResNet

• 152-layer model
• Every residual block has 3x3 conv layers
• Periodically, double the number of filters and downsample spatially using stride 2
• Additional conv layer at the beginning
• No FC layers at the end
• For deeper networks (50+ layers), use a bottleneck layer to improve efficiency: 1x1 → 3x3 → 1x1
• No dropout
• Batch normalization
• No maxpooling



ResNet (He, Zhang, Ren and Sun, 2015)



Performance of various architectures

Source: Canziani, Paszke and Culurciello (2017)



Regularizations

In most cases, the number of parameters exceeds the number of training samples. To avoid overfitting, some regularization is necessary.

ReLU (non-negative thresholding operator)

Early stopping

L1, L2 penalty on weights

Dropout

Batch normalization

Data augmentation

Ensemble
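
Two of the listed regularizers, dropout and the L2 penalty, can be sketched in NumPy. The "inverted" scaling by `keep_prob` during training is one common formulation and is an assumption here; the function names and the value of `lam` are illustrative.

```python
import numpy as np

def dropout(h, keep_prob=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability 1 - keep_prob, rescale the rest."""
    if not training:
        return h                      # no change at test time
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob       # rescaling keeps the expected activation unchanged

def l2_penalty(weights, lam):
    """lam times the sum of squared weights, to be added to the training loss."""
    return lam * sum(np.sum(W ** 2) for W in weights)

h = np.ones(1000)
dropped = dropout(h, keep_prob=0.5, rng=np.random.default_rng(0))
kept_eval = dropout(h, training=False)
penalty = l2_penalty([np.array([[1.0, 2.0], [3.0, 4.0]])], lam=0.1)
print(dropped.mean(), penalty)
```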

