Lecture 3: Introduction to Convolutional Neural Networks
Bohyung Han, Computer Vision Lab., [email protected]
CSED703R: Deep Learning for Visual Recognition (2016S)
Feature Learning
• Potential advantage
  - Better performance
• Feature computation time
  - Dozens of features now regularly used
  - Getting prohibitive for large datasets (10's of seconds per image)
• Other domains: Kinect, video, multi-spectral
• How to learn?
Convolutional neural networks!!!
Convolutional Neural Network (CNN)
• Feed-forward network
  - Convolution
  - Pooling: (typically) local maximum
  - Non-linearity: sigmoid units, rectified linear units
• Supervised learning
• Training convolutional filters by back-propagating error
[LeCun89] Y. LeCun et al.: Handwritten Digit Recognition with a Back-Propagation Network. NIPS 1989
LeNet [LeCun89]
Convolutional Neural Network (CNN)
CNN had not shown impressive performance for a long time.
• Reasons for failure
  - Insufficient training data
  - Slow convergence (vanishing gradient problem of the sigmoid function)
  - Too many parameters
  - Limited computing resources
  - Lack of theory: needed to rely on trial and error
CNN recently draws a lot of attention due to its great success.
• Reasons for recent success
  - Availability of larger training datasets: ImageNet
  - Powerful GPUs
  - Simple activation function: ReLU
  - Better regularization methods such as dropout and batch normalization
AlexNet [Krizhevsky12]
• Winner of the ILSVRC 2012 challenge
  - Same architecture as [LeCun89] but trained with larger data
  - Bigger model: 7 hidden layers, 650K neurons, 60 million parameters
  - Better regularization: dropout
  - Trained on 2 GPUs for a week
[Krizhevsky12] A. Krizhevsky, I. Sutskever, G. E. Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
Architecture of AlexNet
[Figure: AlexNet pipeline. The input image passes through convolution, max pooling, and LRN twice, then three more convolution layers, a final max pooling, and three fully connected layers. The layers are split between GPU1 and GPU2, with inter-GPU communication only at certain layers.]
Convolution
• Linear filtering
$g(i, j) = (f * h)(i, j) = \sum_{u,v} f(u, v)\, h(i - u,\, j - v)$,
where $f$ is the input image, $h$ is the kernel (filter), and $g$ is the output image.
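A minimal NumPy sketch of this linear filtering operation (illustrative; single channel, "valid" output size, no stride or padding options):

```python
# 2D convolution: g(i, j) = sum over the kernel support of f * flipped h.
import numpy as np

def conv2d(f, h):
    h = np.flipud(np.fliplr(h))            # flip the kernel (convolution vs. correlation)
    kh, kw = h.shape
    H, W = f.shape
    g = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(g.shape[0]):
        for j in range(g.shape[1]):
            g[i, j] = np.sum(f[i:i + kh, j:j + kw] * h)
    return g

image = np.random.rand(8, 8)
kernel = np.ones((3, 3)) / 9.0             # simple averaging filter
print(conv2d(image, kernel).shape)         # (6, 6)
```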
Pooling
• Summarizing the outputs of neighboring neurons
  - Non-overlapping vs. overlapping pooling
  - Max vs. average pooling
  - Overlapping pooling reduces top-1 and top-5 error rates by 0.4% and 0.3%, respectively.
  - Max pooling typically shows better performance.
[Figure: max pooling vs. average pooling over a local neighborhood]
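A minimal NumPy sketch of max pooling (illustrative; window size z and stride s are free parameters, with s < z giving the overlapping pooling used in AlexNet, e.g. z = 3, s = 2):

```python
# Max pooling: take the maximum over each z x z window, moving with stride s.
import numpy as np

def max_pool2d(x, z=3, s=2):
    H, W = x.shape
    out_h = (H - z) // s + 1
    out_w = (W - z) // s + 1
    y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = x[i * s:i * s + z, j * s:j * s + z].max()
    return y

x = np.random.rand(13, 13)
print(max_pool2d(x).shape)   # (6, 6)
```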
ReLU Nonlinearity
$f(x) = \max(0, x)$
[Figures: sigmoid function vs. rectified linear unit; convergence rates of ReLU vs. sigmoid networks]
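A small NumPy sketch of the two activation functions being compared (illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # f(x) = max(0, x); gradient is 1 for x > 0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # saturates for large |x|, causing vanishing gradients

x = np.linspace(-5, 5, 11)
print(relu(x))
print(sigmoid(x))
```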
Local Response Normalization
• Sum over adjacent kernel maps at the same spatial position
  $b^{i}_{x,y} = a^{i}_{x,y} \Big/ \left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta}$
  - Sort of "brightness normalization"
  - $a^{i}_{x,y}$: the activity of a neuron computed by applying kernel $i$ at position $(x, y)$
  - $N$: the total number of kernels in the layer
  - $n$: number of adjacent kernel maps, set to 5
  - Lateral inhibition: creating competition for big activities amongst neuron outputs computed using different kernels
  - Arbitrary kernel map ordering: determined before training
  - Hyper-parameters: $k = 2$, $\alpha = 10^{-4}$, $\beta = 0.75$
  - Top-1 and top-5 error rate reduction by 1.4% and 1.2%, respectively
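A minimal NumPy sketch of this normalization across kernel maps, using the hyper-parameters quoted above (illustrative, not the original implementation):

```python
# Local response normalization over the kernel-map dimension.
# a has shape (N, H, W): N kernel maps of spatial size H x W.
import numpy as np

def lrn(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.random.rand(96, 27, 27)   # e.g. activities after the first convolutional layer
print(lrn(a).shape)              # (96, 27, 27)
```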
Training
• How to train CNNs?
  - Error backpropagation
  - Iteratively update the weight matrix (or tensor) in each layer by a gradient descent approach (a minimal sketch follows below)
[Figure: conv1 - conv2 - conv3 - conv4 - conv5 - fc6 - fc7 - softmax, with the error back-propagated through the layers]
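A minimal PyTorch sketch of this procedure with a stand-in model and random mini-batches (illustrative only; the actual setup uses the AlexNet architecture and the momentum update rule shown on the following slides):

```python
# Forward pass, error backpropagation, and a plain gradient-descent update
# of every layer's weights.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in for a CNN
loss_fn = nn.CrossEntropyLoss()
lr = 0.01

for step in range(100):
    images = torch.randn(8, 3, 32, 32)           # placeholder mini-batch
    labels = torch.randint(0, 10, (8,))
    loss = loss_fn(model(images), labels)        # forward pass
    model.zero_grad()
    loss.backward()                              # backpropagate the error
    with torch.no_grad():
        for w in model.parameters():             # gradient-descent update per layer
            w -= lr * w.grad
```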
Training
• GPU specification
  - GTX 580 with 3GB memory
  - 1.2 million training images
• Training on multiple GPUs
  - The network is too big to fit on a single GPU.
  - Half of the kernels (or neurons) are put on each GPU.
  - The GPUs communicate only in certain layers, e.g., in layer 3.
• Performance
  - Reduces top-1 and top-5 error rates by 1.7% and 1.2%, respectively, compared with a network with half as many kernels in each convolutional layer trained on one GPU.
Training
• Update rules
  $v_{i+1} = 0.9\, v_i - 0.0005\, \epsilon\, w_i - \epsilon \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}$, $\quad w_{i+1} = w_i + v_{i+1}$
  - $v$: momentum variable
  - $\epsilon$: learning rate
  - $\langle \cdot \rangle_{D_i}$: average over the $i$-th batch $D_i$
• Initialization
  - Weights: zero-mean Gaussian with standard deviation 0.01
  - Bias in neurons: 0 for the 1st and 3rd layers and 1 for the other layers
  - $\epsilon$: 0.01 initially and reduced 3 times prior to termination
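A minimal NumPy sketch of this update rule, where `grad` stands in for the gradient averaged over the current mini-batch (illustrative):

```python
# Momentum + weight-decay update as written above.
import numpy as np

def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    v = momentum * v - weight_decay * lr * w - lr * grad   # v_{i+1}
    w = w + v                                              # w_{i+1}
    return w, v

w = np.random.normal(0.0, 0.01, size=(96, 3, 11, 11))      # zero-mean Gaussian init
v = np.zeros_like(w)
grad = np.random.randn(*w.shape)                           # placeholder mini-batch gradient
w, v = sgd_step(w, v, grad)
```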
GPU Performance
[Figure: GPU performance comparison; source: http://www.anandtech.com/show/9059/the-nvidia-geforce-gtx-titan-x-review/15]
Data Augmentation
[Figure: a 256x256 training image and the 224x224 augmented training images produced by cropping and horizontal flipping]
• Standard techniques
  - Random cropping to 224x224 images (plus horizontal flips)
  - Altering RGB values in training images
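A minimal NumPy sketch of the random cropping and flipping (illustrative; the RGB-altering augmentation is not shown):

```python
# Turn a 256x256 training image into a random 224x224 patch, possibly mirrored.
import numpy as np

def augment(image, crop_size=224, rng=np.random):
    H, W, _ = image.shape
    top = rng.randint(0, H - crop_size + 1)     # random crop position
    left = rng.randint(0, W - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size]
    if rng.rand() < 0.5:                        # horizontal flip with probability 0.5
        patch = patch[:, ::-1]
    return patch

image = np.random.rand(256, 256, 3)
print(augment(image).shape)                     # (224, 224, 3)
```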
Dropout
• Avoiding overfitting
  - Setting to zero the output of each hidden neuron with probability 0.5
• Employed in the first two fully connected layers
  - Simulating ensemble learning without additional models
  - Every time an input is presented, the neural network samples a different architecture.
  - But all these architectures share weights.
  - At test time, we use all the neurons but multiply their outputs by 0.5.
[Figure: a hidden layer's activity on a given training image; some hidden units are turned off by dropout while the rest are unchanged]
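A minimal NumPy sketch of this form of dropout (illustrative; units are zeroed with probability 0.5 at training time, and all outputs are scaled by 0.5 at test time):

```python
import numpy as np

def dropout(activations, p=0.5, train=True, rng=np.random):
    if train:
        mask = (rng.rand(*activations.shape) >= p)   # 0/1 mask: 1 keeps the unit
        return activations * mask
    return activations * (1.0 - p)                   # test time: scale outputs by 0.5

h = np.random.rand(4096)          # e.g. fc6 activations
print(dropout(h, train=True)[:5])
print(dropout(h, train=False)[:5])
```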
96 Learned Low‐Level Features
Quantitative Results
• ILSVRC‐2012 results
  - AlexNet: top-5 error rate of 16.422%
  - Runner-up: top-5 error rate of 26.172%
Qualitative Results
[Figures: qualitative results; query images]