
Lecture 3: Introduction to Convolutional Neural Networks




Lecture 3: Introduction to Convolutional Neural Networks

Bohyung Han, Computer Vision Lab., [email protected]

CSED703R: Deep Learning for Visual Recognition (2016S)

Feature Learning

• Potential advantages
  - Better performance
  - Feature computation time
    • Dozens of features now regularly used
    • Getting prohibitive for large datasets (10's of sec/image)
  - Other domains: Kinect, video, multi‐spectral
• How to learn?


Convolutional neural networks!!!

Convolutional Neural Network (CNN)

• Feed‐forward network
  - Convolution
  - Pooling: (typically) local maximum
  - Non‐linearity: sigmoid units, rectified linear units
• Supervised learning
  - Training convolutional filters by back‐propagating error

[LeCun89] Y. LeCun et al.: Handwritten Digit Recognition with a Back‐Propagation Network. NIPS 1989

LeNet [LeCun89]

Convolutional Neural Network (CNN)

• Reasons for failure
  - Insufficient training data
  - Slow convergence
    • Vanishing gradient problem: sigmoid function
    • Too many parameters
  - Limited computing resources
  - Lack of theory: needed to rely on trial and error
• Reasons for recent success
  - Availability of larger training datasets: ImageNet
  - Powerful GPUs
  - Simple activation function: ReLU
  - Better regularization methods such as dropout and batch normalization


CNNs had not shown impressive performance for a long time.

CNNs have recently drawn a lot of attention due to their great success.


AlexNet [Krizhevsky12]

• Winner of the ILSVRC 2012 challenge
  - Same architecture as [LeCun89], but trained with larger data
  - Bigger model: 7 hidden layers, 650K neurons, 60 million parameters
  - Better regularization: dropout
  - Trained on 2 GPUs for a week


[Krizhevsky12] A. Krizhevsky, I. Sutskever, G. E. Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012

Architecture of AlexNet

[Figure: AlexNet architecture split across two GPUs (GPU1/GPU2) with inter‐GPU communication: input image → convolution → max pooling → LRN → convolution → max pooling → LRN → convolution → convolution → convolution → max pooling → fully connected → fully connected → fully connected]

Convolution

• Linear filtering


$y[m,n] = (x * h)[m,n] = \sum_{i}\sum_{j} x[i,j]\, h[m-i,\, n-j]$

where $x$ is the input image, $h$ is the kernel (filter), and $y$ is the output image.
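As a concrete illustration of the linear filtering above, here is a minimal NumPy sketch of 2D convolution over a single‐channel image (the "valid" output region only; the function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def conv2d(x, h):
    """2D convolution of image x with kernel h ('valid' region only)."""
    kh, kw = h.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    h_flipped = h[::-1, ::-1]               # true convolution flips the kernel
    y = np.zeros((oh, ow))
    for m in range(oh):
        for n in range(ow):
            y[m, n] = np.sum(x[m:m + kh, n:n + kw] * h_flipped)
    return y

x = np.random.rand(8, 8)                    # input image
h = np.array([[1., 0., -1.]] * 3) / 3.0     # a simple edge-detecting kernel
print(conv2d(x, h).shape)                   # (6, 6)
```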

Pooling

• Summarizing the outputs of neighboring neurons
  - Non‐overlapping vs. overlapping pooling
  - Max vs. average pooling
  - Overlapping pooling reduces the top‐1 and top‐5 error rates by 0.4% and 0.3%, respectively.
  - Max pooling typically shows better performance (see the sketch below).


[Figure: max pooling vs. average pooling on an example feature map]
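A minimal sketch of max and average pooling, assuming a single‐channel input; setting the stride smaller than the window size gives overlapping pooling as in AlexNet (names are illustrative):

```python
import numpy as np

def pool2d(x, size=3, stride=2, mode="max"):
    """Pool x with a size x size window; stride < size gives overlapping pooling."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

x = np.random.rand(6, 6)
print(pool2d(x, size=3, stride=2, mode="max"))       # overlapping max pooling (as in AlexNet)
print(pool2d(x, size=2, stride=2, mode="average"))   # non-overlapping average pooling
```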


ReLU Nonlinearity


$f(x) = \max(0, x)$

[Figure: sigmoid function vs. rectified linear unit, and convergence rates of ReLU vs. sigmoid]
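A small sketch contrasting the two activation functions and their gradients; it illustrates why the sigmoid saturates (the vanishing gradient problem mentioned earlier) while the ReLU keeps a constant gradient for positive inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x), relu(x))                  # activations
# Gradients: the sigmoid saturates for large |x| (vanishing gradient),
# while the ReLU has gradient 1 for every positive input.
print(sigmoid(x) * (1.0 - sigmoid(x)))      # ~[0.00005, 0.20, 0.25, 0.20, 0.00005]
print((x > 0).astype(float))                # [0, 0, 0, 1, 1]
```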

Local Response Normalization

• Sum over adjacent kernel maps at the same spatial position

Sort of “brightness normalization” , : the activity of a neuron computed by applying kernel  at  , : the total number of kernels in the layer : number of adjacent kernel maps, set to 5 Lateral inhibition: creating competition for big activities amongst 

neuron outputs computed using different kernels Arbitrary kernel map ordering: determined before training Hyper‐parameters:  2, 10 , 0.75 Top‐1 and top‐5 error rate reduction by 1.4% and 1.2%, respectively


$b^i_{x,y} = a^i_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^j_{x,y} \right)^2 \right)^{\beta}$
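A minimal sketch of the normalization above, assuming the activations of one layer are stored as an array of shape (N, H, W) with one map per kernel, and using the hyper‐parameters listed on this slide:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """a: activations of shape (N, H, W), one map per kernel (channel)."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)              # sum over n adjacent kernel maps
        hi = min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.random.rand(96, 55, 55)               # e.g., conv1-sized activations
print(local_response_norm(a).shape)          # (96, 55, 55)
```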

Training

• How to train CNNs?
  - Error backpropagation
  - Iteratively update the weight matrix (or tensor) in each layer by a gradient descent approach (a minimal sketch follows below)


[Figure: AlexNet layers conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → softmax, with errors backpropagated from the output]
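To make the idea concrete, here is a minimal sketch of error backpropagation with plain gradient descent on a tiny two‐layer fully connected network (not the actual convolutional architecture; all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 10))                     # mini-batch of 16 inputs
t = rng.integers(0, 2, size=(16, 1)).astype(float)    # binary targets

W1 = 0.01 * rng.standard_normal((10, 32))             # layer 1 weights
W2 = 0.01 * rng.standard_normal((32, 1))              # layer 2 weights
lr = 0.01

for step in range(100):
    # Forward pass
    h = np.maximum(0.0, x @ W1)                       # ReLU hidden layer
    y = 1.0 / (1.0 + np.exp(-(h @ W2)))               # sigmoid output
    # Backward pass: propagate the error layer by layer (chain rule)
    dy = (y - t) / len(x)          # gradient of mean cross-entropy w.r.t. the logit
    dW2 = h.T @ dy
    dh = (dy @ W2.T) * (h > 0)     # gradient through the ReLU
    dW1 = x.T @ dh
    # Gradient-descent update of each layer's weight matrix
    W2 -= lr * dW2
    W1 -= lr * dW1
```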

Training

• GPU specification
  - GTX 580 with 3GB memory
  - 1.2 million training images
• Training on multiple GPUs
  - Too big to fit on a single GPU
  - Putting half of the kernels (or neurons) on each GPU
  - The GPUs communicate only in certain layers: in layer 3
• Performance
  - Reduces the top‐1 and top‐5 error rates by 1.7% and 1.2%, respectively, compared with a network with half as many kernels in each convolutional layer trained on one GPU



Training

• Update rules

  $v_{i+1} = 0.9 \cdot v_i - 0.0005 \cdot \epsilon \cdot w_i - \epsilon \cdot \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}$
  $w_{i+1} = w_i + v_{i+1}$

  - $v$: momentum variable
  - $\epsilon$: learning rate
  - $\langle \cdot \rangle_{D_i}$: average over the $i$‐th batch $D_i$
• Initialization
  - Weights: zero‐mean Gaussian with standard deviation 0.01
  - Biases: 0 for the 1st and 3rd layers and 1 for the other layers
  - $\epsilon$: 0.01 initially, reduced 3 times prior to termination
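A minimal sketch of this update rule applied to a single weight tensor; `grad` stands in for the batch‐averaged gradient that backpropagation would provide, and all shapes are illustrative:

```python
import numpy as np

momentum, weight_decay, eps = 0.9, 0.0005, 0.01   # eps: learning rate

w = 0.01 * np.random.randn(256, 128)   # zero-mean Gaussian init, std 0.01
v = np.zeros_like(w)                   # momentum variable

def sgd_step(w, v, grad, eps):
    """One update: v <- 0.9*v - 0.0005*eps*w - eps*grad; w <- w + v."""
    v = momentum * v - weight_decay * eps * w - eps * grad
    return w + v, v

grad = np.random.randn(*w.shape)       # stands in for the batch-averaged gradient
w, v = sgd_step(w, v, grad, eps)
```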

GPU Performance

[Figure: GPU performance comparison, from http://www.anandtech.com/show/9059/the-nvidia-geforce-gtx-titan-x-review/15]

Data Augmentation


[Figure: a 256x256 training image and several augmented 224x224 crops, including horizontal flips]

• Standard techniques
  - Random cropping to 224x224 images and horizontal flipping
  - Altering RGB values in training images
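A minimal sketch of the cropping and flipping step, assuming training images are stored as 256x256x3 NumPy arrays (the RGB alteration is omitted here):

```python
import numpy as np

def random_crop_and_flip(img, crop=224):
    """img: (256, 256, 3) training image -> one augmented (224, 224, 3) view."""
    h, w, _ = img.shape
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:           # horizontal flip with probability 0.5
        patch = patch[:, ::-1]
    return patch

img = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print(random_crop_and_flip(img).shape)   # (224, 224, 3)
```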

Dropout

• Avoiding overfitting
  - Setting to zero the output of each hidden neuron with probability 0.5
• Employed in the first two fully‐connected layers
  - Simulates ensemble learning without additional models
  - Every time an input is presented, the neural network samples a different architecture, but all these architectures share weights.
• At test time, we use all the neurons but multiply their outputs by 0.5 (a sketch follows after the figure below).


[Figure: a hidden layer's activity on a given training image, with some hidden units turned off by dropout and others unchanged]
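A minimal sketch of dropout as described on this slide: hidden units are zeroed with probability 0.5 at training time, and all outputs are scaled by 0.5 at test time (shapes and names are illustrative):

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    """h: activations of a fully-connected layer."""
    if train:
        mask = (np.random.rand(*h.shape) >= p)   # drop each unit with probability p
        return h * mask
    return h * (1.0 - p)     # test time: keep all units, scale outputs by 0.5

h = np.random.rand(4, 4096)              # e.g., fc6-sized activations (hypothetical)
print(dropout(h, train=True)[0, :8])     # roughly half the entries zeroed
print(dropout(h, train=False)[0, :8])    # all entries kept, scaled by 0.5
```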


96 Learned Low‐Level Features

[Figure: the 96 learned low‐level filters of the first convolutional layer]

Quantitative Results

• ILSVRC‐2012 results
  - AlexNet: top‐5 error rate of 16.422%
  - Runner‐up: top‐5 error rate of 26.172%

Qualitative Results

[Figures: example classification results on test images, and query images with their retrieved nearest neighbors in the learned feature space]