Lei Wang School of Computing and Information Technology University of Wollongong, Australia 15-Oct-2016 CBIR in the Era of Deep Learning -- A Perspective from Feature Representation



• Introduction of CBIR

• Evolution of CBIR

– Early days (before 2000)

– Days of BoF model (2000 ~ 2012)

– Era of Deep learning (after 2012)

• Conclusion

Outline

Images courtesy of related papers and authors

Introduction

• Retrieval

– Getting back information that has been stored in a database

• Image Retrieval

Introduction

• Text-based image retrieval (TBIR, since the late 1970s)

– Manually associate images with text annotations

– Interpret images with high-level semantics

– Retrieval by matching the associated text annotations

Retrieval result of Google Images for “Airplane”

Introduction

• Issues with text-based image retrieval

– Annotation is time-consuming and labour-intensive

– Annotations only partially describe the visual content

– Human perception is subjective

– Does not support query by example

Drouin Post Office, front desks Iron Ore Fashion

Introduction

• Content-based image retrieval

– Human annotators are replaced by computers

– Text annotations are replaced by visual features

– Retrieval by comparing the associated visual features

Drouin Post Office, front desks Iron Ore Fashion

Introduction

• National Science Foundation (NSF) organised a special workshop on the topic of visual information management (Feb 1992, San Jose, CA)

• "It would be impossible to cope with this explosion of image information, unless the images were organized for retrieval. The fundamental problem is that images, video, and other similar data differ from numeric data and text data format, and hence they require a totally different technique of organization, indexing and query processing."

Introduction

• CBIR categorisation

– No query: Randomly browse similar images

– Query by text (by typing “airplane” or description)

– Query by example

• by using an image, sketch, or graphic of an airplane

Introduction

• CBIR categorisation

– Find images of similar colour, texture or shape

– Find images of similar object, scene, place, event, etc.

Introduction

• CBIR categorisation

– Narrow domain

– Broad domain

Introduction

CBIR

Computer Vision

Information Retrieval

Database

Machine Learning

Introduction

• Applications of CBIR

– Archival photo collection management

– Personal album management

– Crime investigation

– Fashion and design

– Education and entertainment

– Localisation and navigation

– Medical image analysis

– …

Introduction

• CBIR systems – QBIC, Virage, Photobook, VisualSEEk, MARS, etc.

Source: http://vismod.media.mit.edu/vismod/demos/photobook/

Source: http://www.cse.unsw.edu.au/~jas/talks/curveix/notes.html


• Introduction of CBIR

• Evolution of CBIR

– Early days (before 2000)

– Days of BoF model (2000 ~ 2012)

– Era of Deep learning (after 2012)

• Conclusion

Outline

Images courtesy of related papers and authors

Early days

A new research problem that received great interest

CBIR

Application

Semantic gap

Domain knowledge

User model

Query mode

Visual features

Similarity measure

Interaction

Learning from data

System

Evaluation

• Hand-crafted features

– Color, texture, shape, structure, etc.

– Goal: “Invariant and discriminative”

• Similarity or distance measure

– Euclidean distance, Manhattan distance, etc.

– Specific measures designed for specific features
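To make the early pipeline concrete, here is a minimal sketch of a hand-crafted colour feature compared with the two distances named above. The function names, the 4-bins-per-channel quantisation, and the toy pixel lists are assumptions made for illustration; they are not taken from any of the systems mentioned in the slides. Plain Python lists are used to keep the sketch dependency-free.

```python
import math

def colour_histogram(pixels, bins_per_channel=4):
    """Quantise RGB pixels (0-255 per channel) into a normalised joint histogram."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def euclidean(u, v):
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def rank_by_distance(query, database, distance=euclidean):
    """Return database keys sorted from most to least similar to the query."""
    return sorted(database, key=lambda name: distance(query, database[name]))

# Toy "images" given as lists of RGB pixels (hypothetical data).
red_img = [(250, 10, 10)] * 8
dark_red_img = [(200, 30, 30)] * 6 + [(10, 10, 10)] * 2
blue_img = [(10, 10, 250)] * 8

database = {
    "dark_red": colour_histogram(dark_red_img),
    "blue": colour_histogram(blue_img),
}
query = colour_histogram(red_img)
ranking = rank_by_distance(query, database)
print(ranking)
```

Swapping `distance=manhattan` into `rank_by_distance` changes only the measure, which mirrors how early systems paired each feature with a suitable distance.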

Early days

• Relevance feedback

– Bring user into the loop of CBIR to handle “Semantic Gap”

– A key focus of “machine learning” research in CBIR
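One classic way to bring the user into the loop is a Rocchio-style query update, borrowed from text retrieval. The slides do not name a specific algorithm, so the sketch below is only an illustration of the idea; the weights `alpha`, `beta` and `gamma` are illustrative defaults, not values from the talk.

```python
def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query feature towards images marked relevant and away from
    images marked non-relevant (Rocchio-style update, illustrative weights)."""
    def mean(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    pos = mean(relevant)
    neg = mean(non_relevant)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]

# One feedback round on hypothetical 2-D features.
new_query = rocchio_update(
    query=[1.0, 0.0],
    relevant=[[0.0, 1.0], [0.0, 1.0]],
    non_relevant=[[1.0, 0.0]],
)
print(new_query)
```

After each round the database is re-ranked with the updated query, so the user's judgements gradually compensate for what the low-level features miss.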

Early days

• Relevance feedback

– Learning from small sample

– Semi-supervised learning

– Transductive learning

– Feature selection, dimensionality reduction

– Kernel based learning

– Manifold learning

– Relation learning

– …

Early days

• Achievements

– Researched CBIR from various perspectives

– Identified the key issues and obstacles

– Many initial but insightful observations and attempts

– Machine learning started playing an important role

• To be improved

– Basic, hand-crafted features, limited invariance

– Considerably depend on domain theory

– Small-sized databases for evaluation

• Introduction of CBIR

• Evolution of CBIR

– Early days (before 2000)

– Days of the BoF model (2000 ~ 2012)

– Era of Deep learning (after 2012)

• Conclusion

Outline

Images courtesy of related papers and authors

• SIFT, HOG, SURF, CENTRIST, filter-based, …

– Invariant to view angle, rotation, scale, illumination, ...

Days of the BoF model

Local Invariant Features

http://www.robots.ox.ac.uk/~vgg/software/

Image courtesy of David Lowe, IJCV04

SIFT (Scale Invariant Feature Transform)

Days of the BoF model

Local Invariant Features

http://www.robots.ox.ac.uk/~vgg/research/affine/#software/

Image A Image B

Days of the BoF model

Local Invariant Features

Source: http://ivt.sourceforge.net/examples.html

Image A Image B

Days of the BoF model

Local Invariant Features

Source: http://www.robots.ox.ac.uk/~vgg/share/SearchPractical2012.html

Image A Image B

Days of the BoF model

Local Invariant Features

Days of the BoF model

Bag-of-features (BoF) model is borrowed from text analysis

Days of the BoF model

Interest point detection or

Dense sampling

The cropped detected regions

Bag-of-features (BoF) model is borrowed from text analysis

Days of the BoF model

A close-up view

Days of the BoF model

A close-up view

Days of the BoF model

Extract features from all training/test images

$x \in \mathbb{R}^d$

Days of the BoF model

Cluster all features to generate “Visual Words”

$\mathbb{R}^d$
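The clustering step can be sketched with plain k-means (Lloyd's algorithm). This is a minimal, dependency-free illustration: the function names, the fixed iteration count, and the seeded initialisation are assumptions, and real codebooks are built from far more features with optimised libraries.

```python
import random

def squared_distance(u, v):
    """Squared Euclidean distance between two descriptors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest_word(feature, codebook):
    """Index of the visual word (cluster centre) closest to the feature."""
    return min(range(len(codebook)), key=lambda j: squared_distance(feature, codebook[j]))

def build_codebook(features, k, iterations=20, seed=0):
    """Cluster local descriptors into k visual words with Lloyd's k-means."""
    rng = random.Random(seed)
    # Initialise the centres with k distinct descriptors drawn at random.
    codebook = [list(f) for f in rng.sample(features, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[nearest_word(f, codebook)].append(f)
        for j, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster becomes empty
                codebook[j] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

# Two well-separated groups of hypothetical 2-D descriptors.
features = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
codebook = build_codebook(features, k=2)
print(sorted(codebook))
```

Each row of `codebook` plays the role of one "visual word" in the steps that follow.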

Days of the BoF model

Generated “Visual Words”

Word 1:

Word 2:

Word 3:

Word 4:

…

Word k:

Days of the BoF model

From an image to a histogram

$[n_1, n_2, \ldots, n_k] \in \mathbb{R}^k$

$n_1$: the number of occurrences of the 1st “word” in this image

$[0, 1, 0, \ldots, 0] \in \mathbb{R}^k$

$[1, 0, 0, \ldots, 0] \in \mathbb{R}^k$

$[0, 0, 1, \ldots, 0] \in \mathbb{R}^k$

…
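The step from an image to a histogram can be sketched as hard assignment followed by counting: each local feature votes for its nearest visual word. The codebook and feature values below are hypothetical, and hard assignment is only one of the coding schemes studied for the BoF model.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two descriptors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def bof_histogram(local_features, codebook):
    """Count how many local features fall on each visual word (hard assignment)."""
    hist = [0] * len(codebook)
    for f in local_features:
        word = min(range(len(codebook)), key=lambda j: squared_distance(f, codebook[j]))
        hist[word] += 1  # equivalent to adding a one-hot vector for this feature
    return hist

# A hypothetical codebook with k = 3 words and the local features of one image.
codebook = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
image_features = [(0.5, 0.2), (9.0, 11.0), (0.1, 0.3), (1.0, 9.0)]
histogram = bof_histogram(image_features, codebook)
print(histogram)  # [2, 1, 1]
```

The resulting k-dimensional vector is what gets classified, clustered, or compared at retrieval time.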

Days of the BoF model

Classifying, clustering or retrieving images

$\mathbb{R}^k$

$y = w^\top x + b$

Days of the BoF model

A Bag-of-Features Image Analysis System

Image database

Feature extraction

Codebook generation

Feature coding

Feature pooling

Classification, clustering, or retrieval

Days of the BoF model

99: Local Invariant Features, such as SIFT (Lowe, ICCV99)

03: Video Google (Sivic, ICCV03); Bag-of-keypoints (Csurka, SLCV@ECCV04)

05: Pyramid Match Kernel (Grauman, ICCV05); Dense sampling (Jurie, ICCV05); Compact Codebook (Winn, ICCV05)

06: Vocabulary tree (Nister, CVPR06); Randomized Clustering Forests (Moosmann, NIPS06); Spatial Pyramid Matching (Lazebnik, CVPR06)

07: Comparative Study (Zhang, IJCV07); Coding with Fisher Kernels (Perronnin, CVPR07)

08: Kernel Codebook (van Gemert, ECCV08); In Defense of Nearest Neighbor Classifier (Boiman, CVPR08)

09: Sparse coding for BoF (Yang, CVPR09); Local Coordinate Coding (Yu, NIPS09)

10: Locality-constrained Linear Coding for BoF (Wang, CVPR10); Coding & pooling scheme comparison (Boureau, CVPR10)

11: Local Soft-assignment Coding & Mix-order pooling (Liu, ICCV11); Comparative Study on BoF model (Chatfield, BMVC11)

Days of the BoF model

Key issues of CBIR with the BoF model

Source: Nister and Stewenius, CVPR06

• How to quickly create a large visual codebook

– Hierarchical k-means clustering

– Approximate k-means clustering

Days of the BoF model

Key issues of CBIR with the BoF model

• How to incorporate spatial information

– The BoF model ignores the spatial information of SIFT features

Spatial Pyramid Matching; re-ranking with spatial verification

Days of the BoF model

Key issues of CBIR with the BoF model

Retrieval result before spatial verification

Query:

Days of the BoF model

25 points matched under a consistent spatial relationship

Only 4 points matched under a consistent spatial relationship

• Re-ranking with spatial verification
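A heavily simplified sketch of re-ranking by spatial consistency follows. It assumes a translation-only model and tries every tentative match as a hypothesis; published systems typically fit richer transformations (for example affine) with RANSAC. The function names, the pixel tolerance, and the toy correspondences are all assumptions made for illustration.

```python
def count_spatially_consistent(matches, tolerance=3.0):
    """matches: list of ((xq, yq), (xd, yd)) tentative correspondences between
    a query image and a database image. Returns the largest number of matches
    that agree on one translation, within the given pixel tolerance."""
    best = 0
    for (xq, yq), (xd, yd) in matches:
        dx, dy = xd - xq, yd - yq  # translation hypothesised from this single match
        inliers = sum(
            1
            for (aq, bq), (ad, bd) in matches
            if abs((ad - aq) - dx) <= tolerance and abs((bd - bq) - dy) <= tolerance
        )
        best = max(best, inliers)
    return best

def rerank(shortlist, tolerance=3.0):
    """shortlist: dict mapping image id -> matches. Sort ids by inlier count."""
    return sorted(
        shortlist,
        key=lambda image_id: count_spatially_consistent(shortlist[image_id], tolerance),
        reverse=True,
    )

# Hypothetical matches: "b" is shifted consistently by (5, 5), "a" is not.
consistent = [((0, 0), (5, 5)), ((1, 2), (6, 7)), ((3, 1), (8, 6)), ((4, 4), (9, 9)), ((2, 2), (40, 1))]
scattered = [((0, 0), (5, 5)), ((1, 2), (30, 2)), ((3, 1), (1, 20)), ((4, 4), (15, 15))]
order = rerank({"a": scattered, "b": consistent})
print(order)
```

Only the top of the initial BoF ranking is usually verified this way, since checking geometry is much more expensive than comparing histograms.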

Key issues of CBIR with the BoF model

Days of the BoF model

Retrieval result after spatial verification

Query:

Key issues of CBIR with the BoF model

Days of the BoF model

• Large-scale image retrieval

– Memory, time, precision

– Approximate nearest-neighbor search

Figure: a feature vector $[x_1, x_2, \ldots, x_d]$ is mapped to a binary code 0100101100… How?

Key issues of CBIR with the BoF model

Days of the BoF model

• Locality-sensitive hashing (LSH)

– Random projection, data independent, unsupervised

• Learning compact binary codes

– Preserving sample similarities, data dependent

Figure: LSH bit assignment (1 / 0)
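The random-projection flavour of LSH can be sketched in a few lines: each bit records on which side of a random hyperplane the feature vector falls, and codes are compared with the Hamming distance. The function names, the 16-bit code length, and the seeded Gaussian hyperplanes are assumptions for illustration only.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals (data independent, unsupervised)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_code(x, hyperplanes):
    """One bit per hyperplane: 1 if x lies on its non-negative side, else 0."""
    return tuple(1 if sum(w * v for w, v in zip(h, x)) >= 0 else 0 for h in hyperplanes)

def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return sum(p != q for p, q in zip(a, b))

planes = make_hyperplanes(dim=4, n_bits=16)
x = [0.9, 0.1, 0.3, 0.5]
code_x = lsh_code(x, planes)
code_scaled = lsh_code([2 * v for v in x], planes)   # same direction as x
code_opposite = lsh_code([-v for v in x], planes)    # opposite direction

print(hamming(code_x, code_scaled))    # 0: only the direction of x matters
print(hamming(code_x, code_opposite))  # 16: every bit flips
```

Because the hyperplanes ignore the data, LSH needs no training; the learned binary codes mentioned above instead fit the projections to preserve similarities in a given data set.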

Key issues of CBIR with the BoF model

Days of the BoF model

Retrieval examples from the “Oxford5K” data set

Source: Philbin et al., Object retrieval with large vocabularies and fast spatial matching, CVPR07

Days of the BoF model (Summary)

• Achievements

– Local invariant features play a fundamental role

– Visual codebook creation, feature coding, and feature pooling are extensively studied

– Multiple benchmark data sets are established

– Large-scale image retrieval is also researched

• To be improved

– Feature representation and recognition are separate

– Focused more on object-level retrieval but less on semantic-level retrieval

• Introduction of CBIR

• Evolution of CBIR

– Early days (before 2000)

– Days of the BoF model (2000 ~ 2012)

– Era of Deep learning (after 2012)

• Conclusion

Outline

Images courtesy of related papers and authors

Era of Deep Learning

• Visual: images, videos

• Audio: speech, music

• Text: natural language

• Planning

Era of Deep Learning

• Image recognition

– Faces, objects, poses, scenes, …

• Video content analysis

– Action, activities, events, summarization, …

• Visual information management

– Search, retrieval, indexing, browsing, …

• Potential outcome: AI

– Computers can see and understand visual information

– Robotics, self-driving cars, surveillance

– …

Era of Deep Learning

Object detection (Source: Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014)

Face Recognition (Source: DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR 2014)

Era of Deep Learning

Pose estimation (DeepPose: Human Pose Estimation via Deep Neural Networks, CVPR2014)

Image Segmentation (Source: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, IEEE TPAMI 2016)

Era of Deep Learning

• Fine-grained image recognition

• Human attribute classification

[Ning Zhang et al. CVPR 2014]

[Branson et al. arXiv 2014 ]

Era of Deep Learning

• Action Recognition

• Large-scale Video Classification

[Karpathy et al. CVPR 2014]

[Simonyan et al. arXiv 2014]

Era of Deep Learning

• Invariant and discriminative features

Feature Representation

Feature Extraction Classification “Panda”?

Prior Knowledge, Experience

Pose Occlusion Multiple objects

Inter-class similarity

Image courtesy of M. Ranzato

Era of Deep Learning

• From hand-crafted features to automatically learned ones

$\mathbb{R}^d$

$\mathbb{R}^k$

$y = w^\top x + b$

Era of Deep Learning

• Directly learn feature representations from data.

• Jointly learn the feature representation and the classifier.

Low-level Features → Mid-level Features → High-level Features → Classifier

Deep Learning: train layers of features so that the classifier works well.

More abstract representation

“Panda”?

Image courtesy of M. Ranzato

Era of Deep Learning

• Deep Learning

– Inspired by the way the human brain processes information

– Many layers of non-linear information processing stages

Era of Deep Learning

Have we been here before?

Yes.

• Basic ideas common to past neural networks research

• Standard machine learning strategies still relevant

No.

Computational Power + Large-scale Data + New Algorithms → Deep Learning

Era of Deep Learning

Convolutional Neural Networks (CNNs)

• A special multi-stage architecture inspired by the visual system

Era of Deep Learning

Source: slide by Girshick

Fukushima 1980: Neocognitron

Rumelhart, Hinton, Williams 1986: “T” versus “C” problem

LeCun et al. 1989-1998: Hand-written digit reading

...

Krizhevsky, Sutskever, Hinton 2012: ImageNet classification breakthrough, “SuperVision” CNN

Convolutional Neural Networks (CNNs)

Era of Deep Learning

CNNs: ImageNet Breakthrough

● Krizhevsky et al. win 2012 ImageNet classification with a much bigger ConvNet

○ deeper: 7 stages vs 3 before

○ larger: 60 million parameters vs 1 million before

○ 16.4% error (top-5) vs next best 26.2% error

● This was made possible by:

○ fast hardware: GPU-optimized code

○ big dataset: 1.2 million images vs thousands before

○ better regularization: dropout, etc.

[Krizhevsky et al. NIPS 2012]

Image courtesy of Deng et al.

Era of Deep Learning

Learned Features of CNNs

[Matthew D. Zeiler et al. ECCV 2014]

Era of Deep Learning

CBIR: From SIFT to CNNs

• Three main approaches

– Directly use pre-trained CNN models

• to extract feature representations

– Fine-tune pre-trained CNN models

• with side information (pairwise or triplet similarity)

– Bag-of-features model on CNN features

• “Deep SIFT”

Era of Deep Learning

1. Directly use pre-trained CNNs

• How to use the feature representations?

– Which layer?

– How to pool the features in a convolutional layer?

– How to select the features in a convolutional layer?

Era of Deep Learning

1. Directly use pre-trained CNNs

• How to use the feature representations?

– Which layer?

Fully connected layer / Convolutional layer


Era of Deep Learning

1. Directly use pre-trained CNNs

• How to use the feature representations?

– How to pool the features in a convolutional layer?

Figure: a convolutional feature map (Height × Width × Depth) is pooled into a vector $[x_1, x_2, \ldots, x_n]$

How?

• Sum-pooling

• Max-pooling

• Grid-based max-pooling

• Region-based pooling

• Mixed sum & max pooling
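The two simplest options in the list, sum-pooling and max-pooling, can be sketched as follows. The feature map is assumed to be a nested list indexed as `feature_map[row][col][channel]` (Height × Width × Depth); the tiny 2 × 2 × 3 map and the function names are hypothetical, and plain lists are used only to keep the sketch dependency-free.

```python
import math

def sum_pool(feature_map):
    """Sum each channel over all spatial locations -> one value per channel."""
    depth = len(feature_map[0][0])
    return [sum(cell[c] for row in feature_map for cell in row) for c in range(depth)]

def max_pool(feature_map):
    """Take the maximum of each channel over all spatial locations."""
    depth = len(feature_map[0][0])
    return [max(cell[c] for row in feature_map for cell in row) for c in range(depth)]

def l2_normalise(v):
    """Scale a vector to unit length so images can be compared by dot product."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

# A hypothetical 2 x 2 x 3 convolutional feature map (Height x Width x Depth).
feature_map = [
    [[1, 0, 2], [0, 3, 1]],
    [[2, 1, 0], [1, 0, 5]],
]
summed = sum_pool(feature_map)
maxed = max_pool(feature_map)
print(summed)  # [4, 4, 8]
print(maxed)   # [2, 3, 5]
descriptor = l2_normalise(summed)
```

Either pooled vector has one entry per channel, so the descriptor length equals the depth of the chosen layer regardless of image size.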

Era of Deep Learning

1. Directly use pre-trained CNNs

• How to use the feature representations?

– How to select the features in a convolutional layer?

• Weighting

• Activation magnitude

• Region detection

Source: Cao et al., Where to Focus: Query Adaptive Matching for Instance Retrieval Using Convolutional Feature Maps
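One possible reading of weighting by activation magnitude is sketched below: each spatial location is weighted by its total activation across channels before sum-pooling, so strongly responding regions dominate the descriptor. This is an illustrative assumption, not the exact scheme of any cited paper; it also assumes non-negative (post-ReLU) activations and a nested-list map indexed as `feature_map[row][col][channel]`.

```python
def spatial_weights(feature_map):
    """Weight of each location = its total activation, normalised to sum to 1."""
    raw = [[sum(cell) for cell in row] for row in feature_map]
    total = sum(sum(row) for row in raw) or 1.0
    return [[v / total for v in row] for row in raw]

def weighted_sum_pool(feature_map):
    """Sum-pool the channels after scaling every location by its spatial weight."""
    weights = spatial_weights(feature_map)
    depth = len(feature_map[0][0])
    pooled = [0.0] * depth
    for row, weight_row in zip(feature_map, weights):
        for cell, w in zip(row, weight_row):
            for c in range(depth):
                pooled[c] += w * cell[c]
    return pooled

# A hypothetical 1 x 2 x 2 map: the second location responds three times as strongly.
feature_map = [[[1.0, 1.0], [3.0, 3.0]]]
weights = spatial_weights(feature_map)
pooled = weighted_sum_pool(feature_map)
print(weights)  # [[0.25, 0.75]]
print(pooled)   # [2.5, 2.5]
```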

Era of Deep Learning

2. Fine-tune pre-trained CNNs

• To incorporate extra information from a new image data set

– Side information (pairwise or triplet similarity)

– Distance metric learning
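Triplet side information is commonly turned into a training signal with a hinge-style triplet loss: the anchor should be closer to the similar image than to the dissimilar one by at least a margin. The sketch below only evaluates the loss on fixed, hypothetical embeddings; the margin value and function names are assumptions, and fine-tuning would backpropagate this loss through the CNN that produces the embeddings.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero when the negative is farther from the anchor than the positive
    by at least the margin; otherwise the size of the violation."""
    return max(0.0, margin + squared_distance(anchor, positive) - squared_distance(anchor, negative))

anchor = (0.0, 0.0)
positive = (1.0, 0.0)

satisfied = triplet_loss(anchor, positive, negative=(3.0, 0.0))
violated = triplet_loss(anchor, positive, negative=(0.5, 0.0))
print(satisfied)  # 0.0
print(violated)   # 1.75
```

Pairwise similarity can be handled in the same spirit by penalising large distances for similar pairs and small distances for dissimilar ones.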

Era of Deep Learning

2. Fine-tune pre-trained CNNs

Source: MatchNet, CVPR 2015

Source: Learning Fine-Grained Image Similarity with Deep Ranking, CVPR 2014

Era of Deep Learning

3. Bag-of-features model on “Deep SIFT”

SIFT (Scale Invariant Feature Transform)

Source: Multi-scale Orderless Pooling of Deep Convolutional Activation Features, ECCV2014

Era of Deep Learning

3. Bag-of-features model on “Deep SIFT”

SIFT (Scale Invariant Feature Transform)

“Deep SIFT”

Source: Cao et al., Where to Focus: Query Adaptive Matching for Instance Retrieval Using Convolutional Feature Maps

Era of Deep Learning

3. Bag-of-features model on “Deep SIFT”

Codebook generation

Feature coding

Feature pooling

Classification, clustering, or retrieval

Era of Deep Learning

12: Image Classification with DCNN (Krizhevsky, NIPS12)

14: CNN Features off-the-shelf (Razavian, CVPRW14); Neural codes (Babenko, ECCV14); Deep ranking (Wang, CVPR14); Multi-scale orderless pooling (Gong, ECCV14); Encoding High Dimensional Local Features (Liu, NIPS14); Survey: Deep learning for CBIR (Wan, ACMMM14)

15: Deep filter banks (Cimpoi, CVPR15); Exploiting Local Features from DNN (Ng, CVPRW15); SPoC (Babenko, ICCV15); MatchNet (Han, CVPR15)

16: R-MAC (Tolias, ICLR16); CNN IR Learns from BoW (Radenovic, ECCV16); CroW (Kalantidis, ECCVW16); Where to focus (Cao, 2016)

Some papers appeared on arXiv

Summary

• A very limited (and biased) account of CBIR

• CBIR has made significant progress during the past two decades

• The development of feature representation plays a key role

• Issues to be resolved

– How to transfer the benefit of Deep Learning?

– How to deal with the unsupervised learning case?

– How to better handle the semantic gap?

– …

Keywords: Color histogram; Gabor feature; Euclidean distance; User model; Query model; SIFT; Bag-of-features; Hashing; Fine-grained recognition; Deep features; Deep retrieval; Deep ranking; Deep hashing

Images courtesy of Google Images