Source: medialab.sjtu.edu.cn/teaching/CV/Lec/Lec7-DP-CNN_Recognition.pdf

Deep Learning for Vision

Part II-CNN and Recognition

Associate Prof. Bingbing Ni （倪冰冰）

Shanghai Jiao Tong University

Convolutional Neural Network


AlphaGo


Input image: 200x200

Consider an image classification problem

“face”

Fully-connected, 400000 hidden units, 16 billion parameters!


Idea: local connection

Locally-connected, 400000 hidden units, 40 million parameters!

1. Captures local 10x10 regions (100 weights each)

2. Weight sharing: the same weights 𝒘 are reused at every location, which leads to the conv filter!

3. Like “convolution”

4. Can have different local filters to generate different responses
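The parameter counts quoted for the 200x200 example can be reproduced with a few lines of arithmetic. This is a minimal sketch that ignores biases; `num_filters` is a hypothetical choice used only to illustrate weight sharing and does not come from the slides.

```python
# Parameter counts for the 200x200 example (biases ignored for simplicity).
image_pixels = 200 * 200          # 40,000 input pixels
hidden_units = 400_000

# Fully connected: every hidden unit sees every pixel.
fully_connected = image_pixels * hidden_units

# Locally connected: every hidden unit sees one 10x10 region.
region_weights = 10 * 10
locally_connected = hidden_units * region_weights

# Weight sharing: one 10x10 filter is reused at every location.
# num_filters is a hypothetical value, not taken from the slides.
num_filters = 100
shared = num_filters * region_weights

print(f"fully connected:   {fully_connected:,}")
print(f"locally connected: {locally_connected:,}")
print(f"shared filters:    {shared:,}")
```

With sharing, the count depends only on the filter size and the number of filters, not on the image size.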


Evidence: biological inspiration

Hubel and Wiesel, 1959



Convolve the filter with the image, i.e., “slide over the image spatially, computing dot products”

Filters always extend the full depth of the input volume

Convolutional filter


- The result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)

- Called convolution for legacy reasons; in fact it is “correlation”

Output a single number
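A minimal NumPy sketch of that single dot product is shown here. The 32x32x3 image size, the random values, the bias and the chunk position are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # assumed 32x32 RGB input
filt = rng.standard_normal((5, 5, 3))      # filter spans the full depth (3)
bias = 0.1                                 # hypothetical bias value

# Take the 5x5x3 chunk whose top-left corner is at (row, col).
row, col = 10, 12
chunk = image[row:row + 5, col:col + 5, :]

# 75-dimensional dot product plus bias gives one number.
# The filter is not flipped, so this is correlation rather than
# convolution in the strict signal-processing sense.
response = float(np.dot(filt.ravel(), chunk.ravel()) + bias)
print(chunk.size, response)
```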

Convolutional filter


Convolve (slide) over all spatial locations

Convolutional layer

- If we have 6 filters of size 5x5x3, we get 6 activation maps

- Stack up these maps to get a new “image” of size 28x28x6 (for a 32x32x3 input with stride 1 and no padding, since 32 - 5 + 1 = 28)

- The set of 6 5x5x3 filters is called a “convolutional layer”
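The stacking of activation maps can be sketched with a naive loop-based layer. This assumes a 32x32x3 input, stride 1 and no padding (which is what yields 28x28); `conv_layer` is a hypothetical helper written for clarity, not speed.

```python
import numpy as np

def conv_layer(image, filters, biases, stride=1):
    """Slide K filters of shape FxFxC over an HxWxC image (no padding)."""
    H, W, C = image.shape
    K, F, _, Cf = filters.shape
    assert C == Cf, "filters must extend the full depth of the input"
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w, K))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            chunk = image[r:r + F, c:c + F, :]
            # One dot product per filter at this spatial location.
            out[i, j, :] = np.tensordot(
                filters, chunk, axes=([1, 2, 3], [0, 1, 2])) + biases
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
filters = rng.standard_normal((6, 5, 5, 3))   # 6 filters of size 5x5x3
biases = rng.standard_normal(6)

maps = conv_layer(image, filters, biases)
print(maps.shape)
```

Each slice `maps[:, :, k]` is the activation map of filter `k`.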


Image 6x6, conv filter 3x3, stride = 1

Output map 4x4

An example


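Image 7x7, Filter 3x3

Another example

- If stride = 1, output map size 5x5

- If stride = 2, output map size 3x3

Formula for output size, for input width N and filter width F:

(N - F)/stride + 1

What happens when stride = 3? (7 - 3)/3 + 1 is not an integer, so the filter does not fit.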
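The output-size formula can be checked with a small helper. `conv_output_size` is a hypothetical name; returning `None` when the stride does not divide (N - F) evenly is a design choice of this sketch, covering cases such as N = 7, F = 3, stride = 3.

```python
def conv_output_size(n, f, stride=1):
    """Return (N - F) / stride + 1, or None if the filter does not fit evenly."""
    span = n - f
    if span < 0 or span % stride != 0:
        return None
    return span // stride + 1

for stride in (1, 2, 3):
    print(f"N=7, F=3, stride={stride}:", conv_output_size(7, 3, stride))
```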


Zero padding

- In practice, it is common to pad the border with zeros

- In this case N = 7 + 2, F = 3, stride = 3, and the output map size is 3 by the formula

- In general, it is common to see CONV layers with stride = 1, filters of size FxF, and zero-padding of (F - 1)/2 on each side:

(N + 2 x (F - 1)/2 - F)/1 + 1 = N, which preserves the size!
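A short sketch of zero padding and its effect on the output size follows. `padded_output_size` is a hypothetical helper, and the 7x7 image values are arbitrary; `np.pad` is the standard NumPy padding routine.

```python
import numpy as np

def padded_output_size(n, f, stride, pad):
    """Return (N + 2*pad - F) / stride + 1, or None if it does not fit evenly."""
    span = n + 2 * pad - f
    if span < 0 or span % stride != 0:
        return None
    return span // stride + 1

image = np.arange(49, dtype=float).reshape(7, 7)
# Pad one ring of zeros around the border: 7x7 becomes 9x9.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)

# Stride 3 now fits: (9 - 3) / 3 + 1 = 3.
print(padded_output_size(7, 3, stride=3, pad=1))

# With stride 1 and pad (F - 1) / 2, the spatial size is preserved.
for f in (1, 3, 5, 7):
    print(f, padded_output_size(7, f, stride=1, pad=(f - 1) // 2))
```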





First we convert the image patches to columns (im2col), then calculate 𝒘𝒙 + 𝒃

In Caffe, the convolution is computed via vector/matrix operations
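The image-to-column idea can be sketched in NumPy. This is a simplified illustration and not Caffe's actual code: the height-width-channel layout, the `im2col` helper written here and the tensor sizes are assumptions of this example.

```python
import numpy as np

def im2col(image, f, stride=1):
    """Stack every f x f x C patch of an HxWxC image as one column."""
    H, W, C = image.shape
    out_h = (H - f) // stride + 1
    out_w = (W - f) // stride + 1
    cols = np.empty((f * f * C, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            cols[:, i * out_w + j] = image[r:r + f, c:c + f, :].ravel()
    return cols, out_h, out_w

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
filters = rng.standard_normal((6, 5, 5, 3))
biases = rng.standard_normal(6)

cols, out_h, out_w = im2col(image, f=5)
w = filters.reshape(6, -1)                 # (6, 75): one row per filter
out = w @ cols + biases[:, None]           # wx + b as one matrix product
out = out.reshape(6, out_h, out_w)         # back to 6 activation maps
print(cols.shape, out.shape)
```

The whole layer becomes a single matrix multiplication, which is why this layout maps well onto optimized linear algebra routines.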


Compose the network

A conv net is a sequence of conv layers, interspersed with activation functions

We need to shrink the image step by step to extract higher-level information

The receptive field should grow larger and larger
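How the receptive field grows through a stack of layers can be sketched with the usual recurrence: each layer adds (kernel - 1) times the accumulated stride. The layer list below is a hypothetical stack chosen for illustration, not an architecture from the slides.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride), ordered from input to output.

    Returns the receptive field width (in input pixels) after each layer.
    """
    rf, jump = 1, 1          # jump = distance in input pixels between neighbours
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Hypothetical stack: conv3x3, pool2x2/2, conv3x3, pool2x2/2, conv3x3.
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_field(layers))
```

Downsampling layers increase `jump`, so every later layer widens the receptive field faster.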


Max pooling

Max pooling with 2x2 filter and stride = 2
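A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single activation map is given here. The reshape trick assumes the height and width are even; the input values are arbitrary.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an HxW map (H and W assumed even)."""
    H, W = x.shape
    # Split rows and columns into blocks of 2, then take the max per block.
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool_2x2(x)
print(pooled)
```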


Connect conv activation maps to fully connected layers (FC)


Fully connected layer (FC)

May also convert FC layers to CONV layers, i.e., by setting the filter size exactly equal to the input volume
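The equivalence between an FC layer and a CONV layer whose filters cover the whole input volume can be checked numerically. The 4x4x8 volume and the 10 outputs are hypothetical sizes picked for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
volume = rng.standard_normal((4, 4, 8))        # hypothetical conv activations
num_out = 10
W_fc = rng.standard_normal((num_out, 4 * 4 * 8))
b = rng.standard_normal(num_out)

# Fully connected layer: flatten the volume, then one matrix-vector product.
fc_out = W_fc @ volume.ravel() + b

# Same weights viewed as 10 filters of size 4x4x8. Each filter covers the
# whole volume, so it fits at exactly one location: a 1x1x10 output.
filters = W_fc.reshape(num_out, 4, 4, 8)
conv_out = np.array([np.sum(f * volume) for f in filters]) + b

print(np.allclose(fc_out, conv_out))
```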


Local Contrast Normalization

- Also performed across features and in the higher layers

- Improves invariance, optimization and sparsity


Local Contrast Normalization Layer


Implementation of LeNet









Training Deep CNN

Batch Normalization (BN)

Troubleshooting the training


AlexNet

GoogLeNet

LeNet

VGGNet








In practice: small scale, novel class

- Often a small problem, e.g., around a hundred categories and thousands of samples

- Training a CNN from scratch is not stable


[Figure: knowledge transfer via CNN. A deep CNN model is fine-tuned into a second deep CNN model (domain adaptation); the general low-level features are shared.]

Idea: knowledge transfer via CNN

- Take a pre-trained model from the model zoo

- Remove the last fully connected layer and connect the network to the new objective

- Fine-tune the new network with a higher learning rate on the FC layers and a lower learning rate on the early CONV layers
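The three steps above can be sketched framework-free with plain NumPy. Everything here is a hypothetical stand-in: the layer names, shapes, learning rates and placeholder gradients are assumptions for illustration, and a real setup would load actual pre-trained weights and compute gradients by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-trained model: layer name -> weights.
pretrained = {
    "conv1": rng.standard_normal((8, 27)),
    "conv2": rng.standard_normal((16, 72)),
    "fc_old": rng.standard_normal((1000, 16)),   # original classifier
}

# 1. Keep the conv layers and drop the old last FC layer.
model = {name: w.copy() for name, w in pretrained.items() if name != "fc_old"}

# 2. Attach a freshly initialized FC layer for the new objective.
num_new_classes = 100
model["fc_new"] = 0.01 * rng.standard_normal((num_new_classes, 16))

# 3. Lower learning rate on the early conv layers, higher on the new FC layer.
learning_rates = {"conv1": 1e-4, "conv2": 1e-4, "fc_new": 1e-2}

def sgd_step(model, grads, learning_rates):
    """One SGD update with a separate learning rate per layer."""
    for name in model:
        model[name] -= learning_rates[name] * grads[name]

# Placeholder gradients (all ones) just to show the per-layer step sizes.
grads = {name: np.ones_like(w) for name, w in model.items()}
before = {name: w.copy() for name, w in model.items()}
sgd_step(model, grads, learning_rates)
```

The pre-trained conv weights move only slightly per step, while the new classifier adapts quickly to the new classes.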


Application: image retrieval


Application: OCR and logo


Application: texture


Application: object detection


Application: scene parsing


Application: action recognition

How to extend the CNN classification results to object detection?

- Localization: apply CNNs to region proposals

- Scarce data: fine-tune the pre-trained model

R-CNN

DCNN Object Detection

SSD: Single Shot MultiBox Detector

Default boxes and aspect ratios

Each feature map cell has a set of default bounding boxes, and the position of each box relative to its corresponding cell is fixed.
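Generating default boxes for one feature map can be sketched as below. The sketch assumes a square feature map, boxes centred on each cell, normalized image coordinates, and the width/height parametrization w = scale * sqrt(ratio), h = scale / sqrt(ratio); the map size, scale and ratios are hypothetical values.

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios):
    """Return (cx, cy, w, h) boxes in [0, 1] image coordinates."""
    boxes = []
    for i in range(fmap_size):            # cell row
        for j in range(fmap_size):        # cell column
            # Box centre is fixed at the centre of its cell.
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            for ratio in aspect_ratios:
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
    return boxes

# Hypothetical 4x4 feature map with three aspect ratios per cell.
boxes = default_boxes(fmap_size=4, scale=0.3, aspect_ratios=[1.0, 2.0, 0.5])
print(len(boxes), boxes[0])
```

Every cell gets the same set of shapes, so the detector only has to predict offsets and class scores relative to these fixed boxes.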

DCNN Object Detection

Recurrent Neural Network

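[Figure: an RNN cell with input x_t, hidden state h_t and output y_t, shown equal to its unrolled form over time steps 0, 1, 2, ..., t, with inputs x_0, x_1, x_2, ..., x_t, hidden states h_0, h_1, h_2, ..., h_t and outputs y_0, y_1, y_2, ..., y_t.]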

Deep RNN for sequence


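[Figure: the unrolled RNN with inputs $x_{t-1}, x_t, x_{t+1}$, hidden states $h_{t-1}, h_t, h_{t+1}$ and per-step errors $\varepsilon_{t-1}, \varepsilon_t, \varepsilon_{t+1}$. Each error sends a gradient $\partial \varepsilon_t / \partial h_t$ into its hidden state, and gradients flow backward between consecutive hidden states through $\partial h_{t+1} / \partial h_t$ (the figure also labels $\partial h_{t+2} / \partial h_{t+1}$, $\partial h_t / \partial h_{t-1}$ and $\partial h_{t-1} / \partial h_{t-2}$).]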

Once unrolled, this is no different from a vanilla neural network!

Training: backpropagation through time (BPTT)
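A compact sketch of BPTT for a tanh RNN is given here. The squared-error loss on the hidden states, the sizes and the random data are assumptions made so the example stays self-contained; the slides do not specify a loss. `forward`, `loss` and `bptt` are hypothetical helper names.

```python
import numpy as np

def forward(W_hh, W_xh, xs, h_init):
    """Unroll h_t = tanh(W_hh h_{t-1} + W_xh x_t); hs[0] is the initial state."""
    hs = [h_init]
    for x in xs:
        hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x))
    return hs

def loss(hs, targets):
    """Assumed loss: sum over time of 0.5 * ||h_t - target_t||^2."""
    return 0.5 * sum(np.sum((h - y) ** 2) for h, y in zip(hs[1:], targets))

def bptt(W_hh, W_xh, xs, targets, h_init):
    hs = forward(W_hh, W_xh, xs, h_init)
    dW_hh = np.zeros_like(W_hh)
    dW_xh = np.zeros_like(W_xh)
    dh_next = np.zeros_like(h_init)          # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        h, h_prev = hs[t + 1], hs[t]
        dh = (h - targets[t]) + dh_next      # local error plus future errors
        da = dh * (1.0 - h ** 2)             # back through tanh
        dW_hh += np.outer(da, h_prev)        # shared weights: gradients add up
        dW_xh += np.outer(da, xs[t])
        dh_next = W_hh.T @ da                # pass gradient to step t-1
    return dW_hh, dW_xh

rng = np.random.default_rng(0)
W_hh = 0.5 * rng.standard_normal((4, 4))
W_xh = 0.5 * rng.standard_normal((4, 3))
xs = [rng.standard_normal(3) for _ in range(5)]
targets = [rng.standard_normal(4) for _ in range(5)]
h_init = np.zeros(4)

dW_hh, dW_xh = bptt(W_hh, W_xh, xs, targets, h_init)
print(dW_hh.shape, dW_xh.shape)
```

The backward loop is ordinary backpropagation applied to the unrolled graph; the only RNN-specific detail is that the gradients for the shared weights are summed over time steps.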


Image Captioning

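$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$

[Figure: the RNN unrolled over three steps. Inputs $x_0$ = START, $x_1$ = “straw”, $x_2$ = “hat” produce hidden states $h_0, h_1, h_2$, which output “straw”, “hat”, END. A vector $v$ (labelled V in the figure) is fed into the RNN.]

Now: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + W_{vh} v)$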
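The recurrence with the extra $W_{vh} v$ term can be sketched as a single step function. The vocabulary, the hidden and feature sizes, the random weights and the random stand-in for $v$ are all hypothetical; word prediction from $h_t$ is left out because the slides do not give that formula.

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh, W_vh=None, v=None):
    """One step of h_t = tanh(W_hh h_{t-1} + W_xh x_t [+ W_vh v])."""
    a = W_hh @ h_prev + W_xh @ x
    if v is not None:
        a = a + W_vh @ v              # extra term carrying the vector v
    return np.tanh(a)

rng = np.random.default_rng(0)
vocab = ["START", "straw", "hat", "END"]
hidden, feat = 8, 16                  # hypothetical sizes
W_hh = 0.1 * rng.standard_normal((hidden, hidden))
W_xh = 0.1 * rng.standard_normal((hidden, len(vocab)))
W_vh = 0.1 * rng.standard_normal((hidden, feat))
v = rng.standard_normal(feat)         # random stand-in for the vector v

def one_hot(word):
    x = np.zeros(len(vocab))
    x[vocab.index(word)] = 1.0
    return x

# Feed START, "straw", "hat" as x_0, x_1, x_2 and collect h_0, h_1, h_2.
h = np.zeros(hidden)
states = []
for word in ["START", "straw", "hat"]:
    h = rnn_step(h, one_hot(word), W_hh, W_xh, W_vh, v)
    states.append(h)
print(len(states), states[-1].shape)
```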


Attention Model


Thank you!
