for Visual (2016S) 6: CNNs for Detection, Tracking, and Segmentationcvlab.postech.ac.kr/~bhhan/class/cse703r_2016s/csed7… · · 2016-06-01Bounding Box Regression • Learning

Lecture 6: CNNs for Detection, Tracking, and Segmentation

Bohyung HanComputer Vision [email protected]

CSED703R: Deep Learning for Visual Recognition (2016S)

2

Object Detection

Region‐based CNN (RCNN)

• Object detection Independent evaluation of each proposal Bounding box regression improves detection accuracy. Mean average precision (mAP): 53.7% with bounding box regression in

VOC 2010 test set

3

[Girshick14] R. Girshick, J. Donahue, S. Guadarrama, T. Darrell, J. Malik: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

Input image Extract regionproposal

Compute CNN features Classification

Any proposal method (e.g., selective search, edgebox)

Any architecture Softmax, SVM

Selective Search

• Motivation Sliding window approach is not feasible for object detection with

convolutional neural networks. We need a more faster method to identify object candidates.

• Finding object proposals Greedy hierarchical superpixel

segmentation Diversification of superpixel

construction and merge• Using a variety of color spaces• Using different similarity measures

• Varying staring regions

4

[Uijlings13] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders: Selective Search for Object Recognition. IJCV 2013

Bounding Box Regression

• Learning a transformation of bounding box Region proposal: , , , Ground‐truth: , , , Transformation: , , ,

5

CNN pool5 feature

∗ argmin

expexp

Detection Results

• VOC 2010 test set

• Feature analysis on VOC 2007 test set

6

Fast RCNN

• Fast version of RCNN 9x faster in training and 213x faster in testing than RCNN A single feature computation and ROI pooling using object proposals Bounding box regression into network Single stage training using multi‐task loss

7[Girshick15] R. Girshick: Fast R‐CNN, ICCV 2015

Faster RCNN

• Fast RCNN + RPN Proposal computation into network Marginal cost of proposals:

10ms

8

[Ren15] S. Ren, K. He, R. Girshick, J. Sun: Faster R‐CNN: Towards Real‐Time Object Detection with Region Proposal Networks. NIPS 2015

Object Detection Performance

9

Pascal VOC 2007 Object Detection mAP (%)

RCNN family achieves the state‐of‐the‐art performance in object detection!

Faster RCNN with ResNet

10

Faster RCNN with ResNet

1112

Visual Tracking with Convolutional Neural Networks

Main Idea

• Training shared features and domain‐specific classifiers jointly.

13

Domain‐specific classifiersDomain 1

Domain 2

Domain 3

Domain 4

Transfer to a new domain

Shared feature representation

Visual Tracking

• MDNet (Multi‐Domain Network) Multi‐domain learning Separating shared and domain‐specific layers

14

[Nam15] Hyeonseob Nam, Bohyung Han: Learning Multi‐Domain Convolutional Neural Networks for Visual Tracking, CVPR 2016

The Winner of Visual Object Tracking Challenge 2015

Multi‐Domain Learning

15

• Iteration #nK+1• Iteration #nK+2• Iteration #nK

Online Tracking using MDNet Features

16

Transfer shared features

New Sequence

Online Tracking using MDNet Features

17

Fine‐Tuning

Transfer shared features

New Sequence

Online Tracking: Overview

18

Update the CNN if needed

Find the optimal state

Collect training samples

Draw target candidates

∗ argmax xFrame 2

Repeat for the next frame

⋅ : positive score

• Short‐Term Update Performed at abrupt appearance

changes ( ∗ 0.5 Using short‐term training samples For Adaptiveness

Online Network Update

19

Frame #

• Long‐Term Update Performed at regular intervals Using long‐term training

samples For Robustness

Long-term update

Short-term update

1 0.82 0.91 0.86 0.93 0.94 0.85 0.73 0.78 0.66 0.38 0.53 0.47 0.62 0.83 0.88∗

• Provide a “hard” minibatch in each training iteration.

Hard Negative Mining

20

Pool of Positive Samples

Pool of Negative Samples

A MINIBATCH

Training CNN

Randomly draw samples

Select ≪samples with

highest scores

Randomly draw samples

Hard Negative Mining

21

1st minibatch 5th minibatch 30th minibatch

Positive sample Negative sample

Training iteration

Bounding Box Regression

• Improve the localization quality. ‐ DPM [Felzenszwalb et al. PAMI’10], R‐CNN [Girshick et al. CVPR’14]

22

Train a bounding box regression model.

Adjust the tracking result by bounding box regression.

Frame 1 Frame

Tracking result

Ground-Truth

Positive samples

Results on OTB100[Wu15]

• Protocol MDNet is trained with 58 sequences from {VOT’13,’14,’15} excluding

{OTB100}. Distance precision and overlap success rate by One‐Pass‐Evaluation (OPE)

23[Wu15] Y. Wu, J. Lim, M.‐H. Yang: Object Tracking Benchmark. TPAMI 2015

Results on VOT2015

24Ground‐truth Our 15 repetitions

25

Semantic Segmentation by Fully Convolutional Network

Semantic Segmentation

• Segmenting images based on its semantic notion

26

Semantic Segmentation using CNN

• Image classification

• Semantic segmentation Given an input image, obtain pixel‐wise segmentation mask using a deep

Convolutional Neural Network (CNN)

27

Query image

Query image

Fully Convolutional Network (FCN)

• Interpreting fully connected layers as convolution layers Each fully connected layer is identical to a convolution layer with a

large spatial filter that covers entire input field.

28

11

4096

409611

51277

Fully connected layers

fc7

fc6

pool5

4096

4096

11

11

7 7512

Convolution layers

pool5

fc6

fc7

51222

22

For the larger Input field

pool5

409616 16

fc6

40961616

fc7

FCN for Semantic Segmentation

• Network architecture[Long15]• End‐to‐end CNN architecture for semantic segmentation• Interpret fully connected layers to convolutional layers

29

[Long15] J. Long, E. Shelhamer, and T. Darrell, Fully Convolutional Network for Semantic Segmentation. CVPR 2015

Deconvolution

16x16x21

500x500x3

Deconvolution Filter

• Bilinear interpolation filter Same filter for every class No filter learning!

• How does this deconvolution work? Deconvolution layer is fixed. Fining‐tuning convolutional layers of

the network with segmentation ground‐truth.

30

64x64 bilinear interpolation

seg ∘Fixed Pretrained on ImageNet

Fine‐tuned for segmentation

Skip Architecture

• Ensemble of three different scales Combining complementary features

31

More semantic

More detailed

Limitations of FCN‐based Semantic Segmentation

• Coarse output score map A single bilinear filter should handle the variations in all kinds of object

classes. Difficult to capture detailed structure of objects in image

• Fixed size receptive field Unable to handle multiple scales Difficult to delineate too small or large objects compared to the size of rec

eptive field

• Noisy predictions due to skip architecture Trade off between details and noises Minor quantitative performance improvement

32

Results and Limitations

33

Input image GT FCN‐32s FCN‐16s FCN‐8s

Results and Limitations

34

Input image GT FCN‐32s FCN‐16s FCN‐8s

35

Documents

for Visual (2016S) 6: CNNs for Detection, Tracking, and Segmentationcvlab.postech.ac.kr/~bhhan/class/cse703r_2016s/csed7… · · 2016-06-01Bounding Box Regression • Learning