REGION-ORIENTED CONVOLUTIONAL NETWORKS FOR OBJECT RETRIEVAL

Author: Eduard Fontdevila

Advisors: Amaia Salvador, Xavier Giró-i-Nieto


ACKNOWLEDGEMENTS

Financial Support

Technical Support: Albert Gil, Josep Pujal

OUTLINE

1. Motivation
2. State of Art
3. Local CNNs for Instance Search
4. Fine-tuning
5. Conclusions

motivation

visual Data is Big Data

libraries need librarians...

... and visual Data needs Computer Vision

applications

...


state of art

from shallow to deep learning

“hand crafted” features: Bag of Words, SIFT, Histograms of gradients

“learned” features: Convolutional Neural Networks (CNNs)

why deep learning now?

large datasets

Powerful GPUs

...

AlexNet

Krizhevsky et al. (Toronto), ImageNet Classification with Deep Convolutional Neural Networks (2012)

CaffeNet

architecture [Krizhevsky’12]

data [Deng’09]

framework [Jia’14]

Slide credit: Xavier Giró-i-Nieto

CaffeNet: input image, convolutional layers, fully connected layers

Babenko et al. (Moscow), Neural Codes for Image Retrieval (2014)

object candidates

Selective Search bounding boxes

Uijlings et al. (Trento), Selective Search for Object Recognition (2013)

MCG segments

Arbeláez et al. (Berkeley), Multiscale Combinatorial Grouping (2014)

R-CNN

Object Detection network

Girshick et al. (Berkeley), Rich feature hierarchies for accurate object detection and semantic segmentation (2014)

SDS

Object Detection + Semantic Segmentation network

Hariharan et al. (Berkeley), Simultaneous Detection and Segmentation (2014)


local CNNs for instance search

TRECVid Instance Search

large collection of videos: 464h

shots: ~470k

frames: 1/4 fps

...in our case, subset of 13k shots (23k frames)

a Big Data scenario

query descriptors

query set: examples of TRECVid query images

descriptors extracted from the query set:

CaffeNet: visual features of the image

Fast R-CNN: visual features of the bbox

SDS: visual features of the region

main scheme

[diagram: three pipelines (CaffeNet, Fast R-CNN, SDS) built from the blocks: frames in 1 shot, object candidates, visual features, pooling, query descriptors, matching, ranking]

global approach

matching: euclidean distance

Babenko et al. (Moscow), Neural Codes for Image Retrieval (2014)
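The matching step can be sketched as follows. This is a minimal illustration, assuming each query and each frame is already described by one fixed-length feature vector; the function names and the toy vectors are hypothetical and do not come from the slides.

```python
import math


def euclidean_distance(a, b):
    """Euclidean (L2) distance between two descriptors of equal length."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_frames(query_descriptor, frame_descriptors):
    """Distance from the query to every frame (smaller means more similar)."""
    return [euclidean_distance(query_descriptor, f) for f in frame_descriptors]


# Toy 3-dimensional descriptors; real CNN features are much longer.
query = [0.0, 1.0, 0.0]
frames = [[0.0, 1.0, 0.0], [3.0, 1.0, 4.0]]
distances = match_frames(query, frames)
```

Sorting the frames by these distances, smallest first, gives the order used for ranking.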

pooling over the frames in 1 shot

distance shot - query = average distance (distance frame 1, distance frame 2, distance frame 3)

Zhu et al. (NII), Multi-image aggregation for better visual object retrieval (2014)
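A small sketch of this pooling: the distance between a shot and the query is taken as the average of the distances of its frames, and shots are then sorted by that value. The helper names and the toy distances below are hypothetical.

```python
def shot_distance(frame_distances):
    """Average the query-to-frame distances of one shot."""
    if not frame_distances:
        raise ValueError("a shot needs at least one frame distance")
    return sum(frame_distances) / len(frame_distances)


def rank_shots(frame_distances_per_shot):
    """Return shot ids sorted from the closest to the farthest shot."""
    scores = {shot: shot_distance(d) for shot, d in frame_distances_per_shot.items()}
    return sorted(scores, key=scores.get)


# Toy input: query-to-frame distances, grouped by shot.
frame_distances_per_shot = {
    "shot_b": [4.0, 4.0],
    "shot_a": [1.0, 2.0, 3.0],
}
ranking = rank_shots(frame_distances_per_shot)
```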

ranking (global approach): only top 1000 shots

local approach

[diagram: the Fast R-CNN and SDS pipelines, built from the blocks: frames in 1 shot, object candidates, visual features, pooling, query descriptors, matching, ranking]
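One way to picture the local approach is sketched below. The slides do not state which pooling is applied over the object candidates of a frame, so keeping the candidate closest to the query (minimum distance) is an assumption made only for illustration; the function name and the toy descriptors are hypothetical as well.

```python
import math


def best_candidate(query_descriptor, candidate_descriptors):
    """Return (distance, index) of the candidate closest to the query.

    Assumption: candidates of a frame are pooled by minimum distance.
    """
    if not candidate_descriptors:
        raise ValueError("a frame needs at least one object candidate")
    return min(
        (math.dist(query_descriptor, c), i)
        for i, c in enumerate(candidate_descriptors)
    )


# Toy 2-dimensional descriptors for the query region and three candidates.
query = [1.0, 0.0]
candidates = [[4.0, 4.0], [1.0, 0.0], [0.0, 0.0]]
distance, index = best_candidate(query, candidates)
```

The returned index also says which candidate matched, which is what makes a local descriptor usable for localization.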

quantitative results: ranking

[bar chart: mAP (%) for SDS and Fast R-CNN; values not recoverable]
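For reference, mAP (mean average precision) can be computed as in the sketch below. It uses the standard non-interpolated average precision; the official TRECVid evaluation may differ in details such as the ranking cut-off, and the query and shot names are hypothetical.

```python
def average_precision(ranked_shots, relevant_shots):
    """Non-interpolated average precision of one ranked list."""
    relevant = set(relevant_shots)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """Mean of the average precision over all queries."""
    if not rankings:
        raise ValueError("at least one query is required")
    aps = [average_precision(rankings[q], ground_truth[q]) for q in rankings]
    return sum(aps) / len(aps)


rankings = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
ground_truth = {"q1": {"a", "c"}, "q2": {"y"}}
map_score = mean_average_precision(rankings, ground_truth)
```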

re-ranking

CaffeNet ranking, then SDS / Fast R-CNN re-ranking

global + local fusion
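A rough sketch of re-ranking: the top shots of the global ranking are re-sorted by a local distance. The slides do not say how global and local scores are fused, so a plain re-sort by local distance is an assumption, and every name below is hypothetical.

```python
def rerank(global_ranking, local_distances, top_k=1000):
    """Re-sort the top_k shots of the global ranking by local distance.

    Shots without a local distance go to the end of the list and keep
    their global order, because sorted() is stable.
    """
    head = global_ranking[:top_k]
    return sorted(head, key=lambda shot: local_distances.get(shot, float("inf")))


# Toy example: the global ranking, and local distances for the same shots.
global_ranking = ["s1", "s2", "s3", "s4"]
local_distances = {"s1": 0.9, "s2": 0.2, "s3": 0.5, "s4": 0.1}
reranked = rerank(global_ranking, local_distances, top_k=3)
```

In this toy run "s4" never appears in the result, although it has the smallest local distance: re-ranking can only reorder what the global ranking already retrieved.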

quantitative results: re-ranking

[bar chart: mAP (%) for SDS, Fast R-CNN and CaffeNet; values not recoverable]

adding context: ~8%

qualitative results: re-ranking

[image grids: query, SDS results, Fast R-CNN results]

as a reminder...

Fast R-CNN: Selective Search bounding boxes

Uijlings et al. (Trento), Selective Search for Object Recognition (2013)

SDS: MCG segments

Arbeláez et al. (Berkeley), Multiscale Combinatorial Grouping (2014)


fine-tuning

training CNNs from scratch is costly...

... instead: fine-tuning

already trained network + new dataset (novel domain): resume training
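The idea of resuming training can be illustrated with a toy model. The sketch below is not a CNN and not the setup used in this work: it fits a one-dimensional linear model by gradient descent on a source dataset, then resumes training from those weights on a new dataset. The datasets, the learning rates and the smaller rate used for fine-tuning are all assumptions chosen for illustration.

```python
def mse(params, data):
    """Mean squared error of the model y = w * x + b on the data."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(params, data, learning_rate, steps):
    """Plain gradient descent on the mean squared error, starting from params."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return (w, b)


# Source domain (y = 2x) and a related novel domain (y = 2x + 1).
source_data = [(x, 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
target_data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]

# "Already trained network": weights learned on the source data.
pretrained = train((0.0, 0.0), source_data, learning_rate=0.05, steps=500)
# "Resume training": start from those weights on the new dataset.
fine_tuned = train(pretrained, target_data, learning_rate=0.01, steps=200)
```

Starting from the pretrained weights, only the part of the model that differs between the two domains has to be relearned.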

a quick trial

CaffeNet + Pascal dataset

results on Pascal (global scale)

accuracy (%): 59.31% (validation subset), 4.14% (validation set)

[figure: Histogram of images per category; x-axis: categories, y-axis: % of images]

Microsoft COCO

● Multiple objects per image

● 80 categories

● > 300k images (80k training)

● > 2M instances

Lin et al. (Cornell - Microsoft), http://vision.ucsd.edu/sites/default/files/coco_eccv.pdf (2015)

fine-tuning SDS on COCO

SDS network + COCO dataset: resume training

... but why?

the more objects the network knows, the better


conclusions

about the results

● Although not outperforming CaffeNet: SDS good for localization!

maybe more suitable for TRECVid localization task?

about fine-tuning

● Networks trained on objects, but not on the objects to retrieve

fine-tuning on a larger dataset is clearly the next step

about object candidates

● Only 100 candidates decreases the likelihood of success

... but using a higher number

Fast SDS would be the key

thank you

annex

visualizing CNNs’ features

more class-specific information

SDS: Proposal Generation

input image

MCG object candidates: segments, not only bounding boxes

SDS: Feature Extraction

object candidate

penultimate fully connected layers

SDS: Region Classification

Linear SVM

SDS: Region Refinement

basic pipeline for retrieval

interactive: Multi-image aggregation

The query images for a topic were used with the min distance to each shot.

The best option with SIFT-BoW is averaging, whether of features (Avg-Pooling) or of similarity scores (Sim-Avg).

Zhu et al. (NII), Multi-image aggregation for better visual object retrieval (2014)
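The three aggregation options named above can be compared on a toy example. This is a hedged sketch: Euclidean distance stands in for the similarity score, so the Sim-Avg variant averages distances here, and all function names and vectors are hypothetical.

```python
import math


def min_distance(query_descriptors, shot_descriptor):
    """Keep the distance of the query image closest to the shot."""
    return min(math.dist(q, shot_descriptor) for q in query_descriptors)


def avg_pooling_distance(query_descriptors, shot_descriptor):
    """Avg-Pooling: average the query features first, then match once."""
    n = len(query_descriptors)
    pooled = [sum(values) / n for values in zip(*query_descriptors)]
    return math.dist(pooled, shot_descriptor)


def score_average_distance(query_descriptors, shot_descriptor):
    """Sim-Avg style: match every query image, then average the scores."""
    total = sum(math.dist(q, shot_descriptor) for q in query_descriptors)
    return total / len(query_descriptors)


# Two query images for one topic and one shot descriptor (toy values).
queries = [[0.0, 0.0], [6.0, 8.0]]
shot = [3.0, 4.0]
results = (
    min_distance(queries, shot),
    avg_pooling_distance(queries, shot),
    score_average_distance(queries, shot),
)
```

On this input the pooled query coincides with the shot descriptor, so Avg-Pooling gives a distance of zero while the other two options do not, which shows that the three choices can rank the same shot differently.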