Deep Learning for Vision
Part II-CNN and Recognition
Associate Prof. Bingbing Ni (倪冰冰)
Shanghai Jiao Tong University
Convolutional Neural Network
AlphaGo
Consider an image classification problem
Input image: 200x200, label: “face”
Fully-connected, 400000 hidden units, 16 billion parameters!
Idea: local connection
Input image: 200x200
Locally-connected, 400000 hidden units, each connected to a 10x10 region (100 weights): 40 million parameters!
1. Captures local structure
2. Weight sharing (the same weights 𝒘 are reused at every location)
3. Like “convolution”
4. Can have different local filters to generate different responses
Leads to Conv Filter!
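The parameter counts quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that counts weights only (biases are ignored); `num_filters` is a hypothetical value used purely to illustrate the effect of weight sharing and does not come from the slides.

```python
# Parameter counts for the 200x200 example (weights only, biases ignored).
image_h, image_w = 200, 200
hidden_units = 400_000
patch_h, patch_w = 10, 10

# Fully connected: every hidden unit is connected to every pixel.
fully_connected = image_h * image_w * hidden_units

# Locally connected: every hidden unit sees one 10x10 patch with its own weights.
locally_connected = patch_h * patch_w * hidden_units

# Weight sharing: one 10x10 filter is reused at every location.
# num_filters is a hypothetical value chosen only for illustration.
num_filters = 100
convolutional = patch_h * patch_w * num_filters

print(fully_connected)    # 16000000000
print(locally_connected)  # 40000000
print(convolutional)      # 10000
```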
Evidence: biological inspiration (Hubel and Wiesel, 1959)
Convolutional filter
- Convolve the filter with the image, i.e., “slide over the image spatially, computing dot products”
- Filters always extend the full depth of the input volume
Output a single number:
- The result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e., a 5*5*3 = 75-dimensional dot product + bias)
- Called “convolution” for historical reasons; strictly speaking the operation is “correlation”, because the filter is not flipped
Convolutional layer
- Convolve (slide) each filter over all spatial locations
- If we have six 5x5x3 filters, we get 6 activation maps
- Stack up these maps to get a new “image” of size 28x28x6
- The set of six 5x5x3 filters is called a “convolutional layer”
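The layer described above can be sketched with a naive NumPy loop. The 32x32x3 input size is an assumption: the slides only give the 28x28x6 output, which is what a 5x5 filter with stride 1 and no padding produces on a 32x32 image. The function name `conv_layer` and the random data are illustrative only.

```python
import numpy as np

def conv_layer(image, filters, biases, stride=1):
    """Naive convolutional layer without padding.

    image:   (H, W, C) input volume
    filters: (K, F, F, C) bank of K filters, each spanning the full depth C
    biases:  (K,) one bias per filter
    returns: (H_out, W_out, K) stack of K activation maps
    """
    H, W, C = image.shape
    K, F, _, _ = filters.shape
    H_out = (H - F) // stride + 1
    W_out = (W - F) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for i in range(H_out):
        for j in range(W_out):
            r, c = i * stride, j * stride
            chunk = image[r:r + F, c:c + F, :]
            # One F*F*C-dimensional dot product per filter, plus its bias.
            out[i, j, :] = np.tensordot(filters, chunk, axes=([1, 2, 3], [0, 1, 2])) + biases
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))     # assumed input size
filters = rng.standard_normal((6, 5, 5, 3))  # six 5x5x3 filters
biases = np.zeros(6)

maps = conv_layer(image, filters, biases)
print(maps.shape)  # (28, 28, 6)
```

Note that the filter is applied without flipping, so this is the cross-correlation form used by CNN libraries.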
An example
Image 6x6
Conv filter 3x3
We set stride = 1
Output map 4x4
Another example
Image 7x7 (N = 7), Filter 3x3 (F = 3)
- If stride = 1, output map size 5x5
- If stride = 2, output map size 3x3
Formula for output size:
(N − F)/stride + 1
What happens when stride = 3? (7 − 3)/3 + 1 is not an integer, so the filter does not fit.
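The output-size formula can be written as a small helper. This is a sketch for square inputs and filters without padding; the function name `conv_output_size` is hypothetical, and raising an error when the filter does not tile the input evenly (for instance a stride of 3 on the 7x7 input) is one possible design choice.

```python
def conv_output_size(n, f, stride=1):
    """Output width/height for an NxN input and an FxF filter, no padding.

    Raises ValueError when the stride does not tile the input evenly.
    """
    if (n - f) % stride != 0:
        raise ValueError(f"filter {f} with stride {stride} does not fit input {n}")
    return (n - f) // stride + 1

print(conv_output_size(6, 3, stride=1))  # 4
print(conv_output_size(7, 3, stride=1))  # 5
print(conv_output_size(7, 3, stride=2))  # 3
```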
Zero padding
- In practice, it is common to pad the border with 0
- In this case N = 7 + 2 = 9, F = 3, stride = 3, so the output map size is (9 − 3)/3 + 1 = 3 by the formula
- In general, it is common to see CONV layers with stride = 1, filters of size FxF, and zero-padding of (F − 1)/2 on each side
(N + 2 × (F − 1)/2 − F)/1 + 1 = N, which preserves the size!
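A short sketch of zero padding with NumPy, checking both the padded 7x7 case and the size-preserving choice of (F − 1)/2. The helper name `padded_output_size` is hypothetical, and the check assumes odd filter sizes so that (F − 1)/2 is an integer.

```python
import numpy as np

def padded_output_size(n, f, stride, pad):
    """Output size for an NxN input, FxF filter, given stride and zero padding."""
    return (n + 2 * pad - f) // stride + 1

image = np.arange(49, dtype=float).reshape(7, 7)
# Pad one pixel of zeros on every side: 7x7 becomes 9x9.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)                               # (9, 9)
print(padded_output_size(7, 3, stride=3, pad=1))  # 3

# With stride 1 and pad = (F - 1) / 2, the spatial size is preserved.
for f in (3, 5, 7):
    pad = (f - 1) // 2
    print(f, padded_output_size(7, f, stride=1, pad=pad))  # output size stays 7
```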
In Caffe, convolution is computed as a vector/matrix operation
First we convert image patches to columns (im2col), then calculate 𝒘𝒙 + 𝒃
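The image-to-column idea can be sketched in NumPy as follows. This is an illustration of the technique under assumed conventions (height, width, channel layout; no padding), not Caffe's actual implementation; the names `im2col` and `conv_via_matmul` are chosen here for the example.

```python
import numpy as np

def im2col(image, f, stride=1):
    """Unroll every FxFxC patch of an (H, W, C) image into one column.

    Returns the (F*F*C, H_out*W_out) matrix and the output height and width.
    """
    H, W, C = image.shape
    H_out = (H - f) // stride + 1
    W_out = (W - f) // stride + 1
    cols = np.empty((f * f * C, H_out * W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = image[i * stride:i * stride + f, j * stride:j * stride + f, :]
            cols[:, i * W_out + j] = patch.ravel()
    return cols, H_out, W_out

def conv_via_matmul(image, filters, biases, stride=1):
    """Convolution as one matrix product: w x + b over all patches at once."""
    K, f, _, _ = filters.shape
    cols, H_out, W_out = im2col(image, f, stride)
    w = filters.reshape(K, -1)        # (K, F*F*C), same element order as patch.ravel()
    out = w @ cols + biases[:, None]  # (K, H_out*W_out)
    return out.reshape(K, H_out, W_out).transpose(1, 2, 0)

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))
filters = rng.standard_normal((4, 3, 3, 3))
biases = rng.standard_normal(4)
out = conv_via_matmul(image, filters, biases)
print(out.shape)  # (6, 6, 4)
```

The appeal of this formulation is that the whole layer reduces to a single large matrix multiplication, which optimized linear algebra routines handle efficiently, at the cost of duplicating overlapping pixels in memory.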
Compose the network
A ConvNet is a sequence of conv layers, interspersed with activation functions
- Need to shrink the image step by step to extract higher-level information
- The receptive field should grow larger and larger with depth
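How the receptive field grows through a stack of layers can be illustrated with the standard recurrence: each layer adds (kernel − 1) times the current spacing between neighbouring units, and each stride multiplies that spacing. The layer stack below is a hypothetical example, not a network from the slides.

```python
def receptive_field(layers):
    """Receptive field size (in input pixels) after each layer.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Hypothetical stack: conv3, conv3, pool2/2, conv3, pool2/2
stack = [(3, 1), (3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_field(stack))  # [3, 5, 6, 10, 12]
```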
Max pooling
Max pooling with a 2x2 filter and stride = 2
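A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single activation map. It assumes the height and width are even; the function name `max_pool_2x2` is illustrative.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) map; H and W must be even."""
    H, W = x.shape
    # Split the map into non-overlapping 2x2 blocks and take the max of each block.
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(x)
print(pooled)
# [[ 5  7]
#  [13 15]]
```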
Fully connected layer (FC)
- Connect the conv activation maps to fully connected (FC) layers
- FC layers can also be converted to CONV layers by setting the filter size equal to the size of the input volume
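The FC-to-CONV equivalence can be checked numerically: reshaping each row of the FC weight matrix into a filter as large as the input volume gives the same outputs. The 7x7x16 volume and 10 output units below are hypothetical sizes chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
volume = rng.standard_normal((7, 7, 16))  # hypothetical input volume
num_out = 10                              # hypothetical number of FC outputs

W = rng.standard_normal((num_out, 7 * 7 * 16))
b = rng.standard_normal(num_out)

# Fully connected layer: flatten the volume, then w x + b.
fc_out = W @ volume.ravel() + b

# Same weights viewed as num_out filters of size 7x7x16.
# Each filter covers the whole volume, so the output map is 1x1 per filter.
filters = W.reshape(num_out, 7, 7, 16)
conv_out = np.array([np.sum(f * volume) for f in filters]) + b

print(np.allclose(fc_out, conv_out))  # True
```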
Local Contrast Normalization
- Also performed across features and in the higher layers
- Improves invariance, optimization and sparsity
Local Contrast Normalization Layer
Implementation of LeNet
Training Deep CNN
Batch Normalization (BN)
Troubleshooting the training
AlexNet
GoogLeNet
LeNet
VGGNet
In practice: small scale, novel class
- Often a small problem, e.g., a hundred categories and thousands of samples
- Training a CNN from scratch is not stable
Idea: knowledge transfer via CNN
[Figure: a deep CNN model with shared general low-level features is fine-tuned into a new deep CNN model (domain adaptation)]
- Take a pre-trained model from a model zoo
- Remove the last fully connected layer and connect a new objective
- Fine-tune the new network with a higher learning rate on the FC layers and a lower learning rate on the early CONV layers
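The recipe above can be sketched without any deep learning framework. Everything here is a stand-in: the `pretrained` dictionary plays the role of a model-zoo checkpoint, the layer names and shapes are hypothetical, the gradients are placeholders rather than real backpropagated values, and the learning rates are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pre-trained" parameters standing in for a model-zoo checkpoint.
pretrained = {
    "conv1": rng.standard_normal((6, 5, 5, 3)),
    "conv2": rng.standard_normal((16, 5, 5, 6)),
    "fc_out": rng.standard_normal((1000, 400)),  # old 1000-way classifier
}

def prepare_for_finetuning(params, num_new_classes, rng):
    """Keep the pre-trained layers, replace the last FC layer with a fresh one."""
    new_params = {k: v.copy() for k, v in params.items() if k != "fc_out"}
    in_features = params["fc_out"].shape[1]
    new_params["fc_out"] = 0.01 * rng.standard_normal((num_new_classes, in_features))
    return new_params

# Lower learning rate on the early CONV layers, higher on the new FC layer.
learning_rates = {"conv1": 1e-4, "conv2": 1e-4, "fc_out": 1e-2}

def sgd_step(params, grads, learning_rates):
    """Plain SGD update with a separate learning rate per layer."""
    for name in params:
        params[name] -= learning_rates[name] * grads[name]

params = prepare_for_finetuning(pretrained, num_new_classes=100, rng=rng)
before = {k: v.copy() for k, v in params.items()}
grads = {k: np.ones_like(v) for k, v in params.items()}  # placeholder gradients
sgd_step(params, grads, learning_rates)

print(params["fc_out"].shape)  # (100, 400)
```

Most frameworks expose the same idea as per-layer or per-parameter-group learning rates in their optimizers.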
Application: image retrieval
Application: OCR and logo
Application: texture
Application: object detection
Application: scene parsing
Application: action recognition
DCNN Object Detection
R-CNN
How to extend the CNN classification results to object detection?
- Localization: apply CNNs to region proposals
- Scarce data: fine-tune the pre-trained model
SSD: Single Shot MultiBox Detector
Default boxes and aspect ratios
Each feature map cell has a set of default bounding boxes, and the position of each box relative to its corresponding cell is fixed.
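A sketch of how such default boxes could be generated for one square feature map. Boxes are centred on each cell and given in normalized (cx, cy, w, h) coordinates, with width and height derived from a scale and an aspect ratio. The feature map size, scale and aspect ratios below are hypothetical values, and the function name `default_boxes` is chosen for this example.

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios):
    """Default boxes (cx, cy, w, h) in [0, 1] image coordinates.

    Every cell of the fmap_size x fmap_size map gets one box per aspect ratio,
    centred on the cell, so the offset relative to the cell is fixed.
    """
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes

# Hypothetical settings: 4x4 feature map, scale 0.3, three aspect ratios.
boxes = default_boxes(fmap_size=4, scale=0.3, aspect_ratios=(1.0, 2.0, 0.5))
print(len(boxes))  # 48
print(boxes[0])    # (0.125, 0.125, 0.3, 0.3)
```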
Recurrent Neural Network
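Deep RNN for sequence
[Figure: an RNN cell with input $x_t$, hidden state $h_t$ and output $y_t$, shown equal to its unrolled form over time steps $(x_0, h_0, y_0)$, $(x_1, h_1, y_1)$, $(x_2, h_2, y_2)$, …, $(x_t, h_t, y_t)$]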
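Training: back propagation through time (BPTT)
[Figure: the unrolled network with inputs $x_{t-1}, x_t, x_{t+1}$, hidden states $h_{t-1}, h_t, h_{t+1}$ and per-step errors $\varepsilon_{t-1}, \varepsilon_t, \varepsilon_{t+1}$. Each step contributes $\partial \varepsilon_t / \partial h_t$, and gradients flow backward in time through $\partial h_{t+2} / \partial h_{t+1}$, $\partial h_{t+1} / \partial h_t$, $\partial h_t / \partial h_{t-1}$ and $\partial h_{t-1} / \partial h_{t-2}$]
Once unrolled, training is no different from a vanilla neural network!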
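Image Captioning
Before: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$
Now: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + W_{vh} v)$
[Figure: the RNN unrolled over a caption, with the image feature $v$ (labeled V) as an extra input: $x_0$ = START gives “straw” from $h_0$, $x_1$ = “straw” gives “hat” from $h_1$, $x_2$ = “hat” gives END from $h_2$]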
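The image-conditioned recurrence can be sketched in NumPy as a single step function. The dimensions, random weights and random word vectors are hypothetical placeholders; a real captioning model would use learned embeddings, a CNN feature for v, and an output layer over the vocabulary, none of which is shown here.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, v=None, W_vh=None):
    """One recurrence step: h_t = tanh(W_hh h_{t-1} + W_xh x_t [+ W_vh v])."""
    pre = W_hh @ h_prev + W_xh @ x_t
    if v is not None:
        pre = pre + W_vh @ v  # extra term carrying the image feature
    return np.tanh(pre)

rng = np.random.default_rng(0)
hidden, embed, feat = 8, 5, 12  # hypothetical sizes
W_hh = 0.1 * rng.standard_normal((hidden, hidden))
W_xh = 0.1 * rng.standard_normal((hidden, embed))
W_vh = 0.1 * rng.standard_normal((hidden, feat))

v = rng.standard_normal(feat)         # stand-in for a CNN image feature
xs = rng.standard_normal((3, embed))  # stand-ins for START, "straw", "hat"

h = np.zeros(hidden)
states = []
for x_t in xs:
    h = rnn_step(h, x_t, W_hh, W_xh, v=v, W_vh=W_vh)
    states.append(h)

print(len(states), states[-1].shape)  # 3 (8,)
```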
Attention Model
Thank you!