CBIR in the Era of Deep Learning

Lei Wang School of Computing and Information Technology

University of Wollongong, Australia 15-Oct-2016

CBIR in the Era of Deep Learning -- A Perspective from Feature Representation

• Introduction of CBIR

• Evolution of CBIR

– Early days (before 2000)

– Days of BoF model (2000 ~ 2012)

– Era of Deep learning (after 2012)

• Conclusion

Outline

Images courtesy of related papers and authors

Introduction

• Retrieval

– Getting back information that has been stored in a database

• Image Retrieval

Introduction

• Text-based image retrieval (TBIR, since the late 1970s)

– Manually associate images with text annotations

– Interpret images with high-level semantics

– Retrieval by matching the associated text annotations

Retrieval result of Google Images for “Airplane”

http://images.google.com/imgres?imgurl=http://germes-online.com/direct/dbimage/50200964/Radio_Controlled_Airplane.jpg&imgrefurl=http://germes-online.com/catalog/62/106/page3/131051/balsa_model_airplane.html&h=360&w=360&sz=29&hl=en&start=7&tbnid=vX1ExAYoJxvZiM:&tbnh=121&tbnw=121&prev=/images?q=airplane&gbv=2&svnum=10&hl=en&sa=G

http://images.google.com/imgres?imgurl=http://www.hickerphoto.com/data/media/186/nice-airport-airplane_12202.jpg&imgrefurl=http://www.hickerphoto.com/nice-airport-airplane-12202-pictures.htm&h=311&w=468&sz=17&hl=en&start=8&tbnid=IAvjcJXp79y4TM:&tbnh=85&tbnw=128&prev=/images?q=airplane&gbv=2&svnum=10&hl=en&sa=G

http://images.google.com/imgres?imgurl=http://www.cepolina.com/freephoto/f/other.objects.science/airplane.jpg&imgrefurl=http://www.cepolina.com/freephoto/va/airplane.htm&h=450&w=600&sz=43&hl=en&start=9&tbnid=a919DDSIGbg64M:&tbnh=101&tbnw=135&prev=/images?q=airplane&gbv=2&svnum=10&hl=en&sa=G

http://images.google.com/imgres?imgurl=http://www.grc.nasa.gov/WWW/K-12/airplane/Images/airplane.gif&imgrefurl=http://www.grc.nasa.gov/WWW/K-12/airplane/airplane.html&h=533&w=709&sz=26&hl=en&start=10&tbnid=9XZS4rGsOZQurM:&tbnh=105&tbnw=140&prev=/images?q=airplane&gbv=2&svnum=10&hl=en&sa=G

http://images.google.com/imgres?imgurl=http://www.icaen.uiowa.edu/~dip/examples/images/airplane.gif&imgrefurl=http://www.icaen.uiowa.edu/~dip/LECTURE/Segmentation1.html&h=512&w=512&sz=57&hl=en&start=11&tbnid=7AvRApu-IrSFAM:&tbnh=131&tbnw=131&prev=/images?q=airplane&gbv=2&svnum=10&hl=en&sa=G

Introduction

• Issues with text-based image retrieval

– Annotation is time-consuming and labour-intensive

– Annotations only partially describe the visual content

– Human perception is subjective

– Query by example is not supported

Drouin Post Office, front desks Iron Ore Fashion

Introduction

• Content-based image retrieval

– Human annotators are replaced by computers

– Text annotations are replaced by visual features

– Retrieval by comparing the associated visual features

Drouin Post Office, front desks Iron Ore Fashion

Introduction

• National Science Foundation (NSF) organised a special workshop on the topic of visual information management (Feb 1992, San Jose, CA)

• "It would be impossible to cope with this explosion of image information, unless the images were organized for retrieval. The fundamental problem is that images, video, and other similar data differ from numeric data and text data format, and hence they require a totally different technique of organization, indexing and query processing."

Introduction

• CBIR categorisation

– No query: Randomly browse similar images

– Query by text (by typing “airplane” or description)

– Query by example

• by using an image, sketch, or graphic of airplane

Introduction


– Find images of similar colour, texture or shape

– Find images of similar object, scene, place, event, etc.

Introduction


– Narrow domain

– Broad domain

Introduction

CBIR

Image matching

Image Recognition

Image Segmentation

Object detection

Image annotation

More tasks …

http://www.vision.caltech.edu/Image_Datasets/Caltech256/images/018.bowling-pin

http://www.vision.caltech.edu/Image_Datasets/Caltech256/images/023.bulldozer

http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2011/examples/images/car_05.jpg

Introduction

CBIR

Computer Vision

Information Retrieval

Database

Machine Learning

Introduction

• Applications of CBIR

– Archival photo collection management

– Personal album management

– Crime investigation

– Fashion and design

– Education and entertainment

– Localisation and navigation

– Medical Image analysis

– ….

Introduction

• CBIR systems – QBIC, Virage, Photobook, VisualSEEk, MARS, etc.

Source: http://vismod.media.mit.edu/vismod/demos/photobook/

Source: http://www.cse.unsw.edu.au/~jas/talks/curveix/notes.html







Early days

A new research problem received great interest

CBIR

Application

Semantic gap

Domain knowledge

User model

Query mode Visual features

Similarity measure

Interaction

Learning from data

System

Evaluation

• Hand-crafted features

– Color, texture, shape, structure, etc.

– Goal: “Invariant and discriminative”

• Similarity or distance measure

– Euclidean distance, Manhattan distance, etc.

– Specific measures designed for specific features
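A minimal sketch of retrieval with these measures, assuming every image is already described by a fixed-length feature vector. The toy histograms and the function names are illustrative assumptions, not taken from any of the systems mentioned in the slides.

```python
import math

def euclidean(a, b):
    """L2 distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """L1 distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def rank(query, database, distance):
    """Return database indices sorted from most to least similar."""
    return sorted(range(len(database)), key=lambda i: distance(query, database[i]))

# Toy 3-bin colour histograms (hypothetical values).
database = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.5, 0.4, 0.1]]
query = [0.7, 0.2, 0.1]
ranking = rank(query, database, euclidean)
```

Swapping `euclidean` for `manhattan` changes only the measure; the ranking step stays the same, which is why these early systems could pair one retrieval loop with feature-specific measures.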

Early days

• Relevance feedback

– Bring user into the loop of CBIR to handle “Semantic Gap”

– A key point of “machine learning” research in CBIR

Early days

• Relevance feedback

– Learning from small sample

– Semi-supervised learning

– Transductive learning

– Feature selection, dimensionality reduction

– Kernel based learning

– Manifold learning

– Relation learning

– …

Early days

• Achievements

– Researched CBIR from various perspectives

– Identified the key issues and obstacles

– Many initial but insightful observations and attempts

– Machine learning started playing an important role

• To be improved

– Basic, hand-crafted features, limited invariance

– Considerable dependence on domain theory

– Small-sized databases for evaluation






Days of the BoF model

Local Invariant Features

• SIFT, HOG, SURF, CENTRIST, filter-based, …

– Invariant to view angle, rotation, scale, illumination, ...

http://www.robots.ox.ac.uk/~vgg/software/

Image courtesy of David Lowe, IJCV04

SIFT (Scale Invariant Feature Transform)

http://www.robots.ox.ac.uk/~vgg/research/affine/det_eval_files/img1.haraff.gif



http://www.robots.ox.ac.uk/~vgg/research/affine/#software/

Image A Image B



Source: http://ivt.sourceforge.net/examples.html

Image A Image B



Source: http://www.robots.ox.ac.uk/~vgg/share/SearchPractical2012.html

Image A Image B




Bag-of-features (BoF) model is borrowed from text analysis


Interest point detection or

Dense sampling

The cropped detected regions

Bag-of-features model is borrowed from text analysis


A close-up view


A close-up view


Extract features from all training/test images

$x \in \mathbb{R}^d$


Cluster all features to generate “Visual Words”

$\mathbb{R}^d$
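The clustering step can be sketched with plain k-means (Lloyd's algorithm). This is a minimal illustration under stated assumptions: the 2-D toy "descriptors" stand in for real local features, and seeding the centroids with the first k descriptors is a simplification chosen for determinism (practical systems usually seed randomly or with k-means++).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(features, k, iters=20):
    """Plain Lloyd's k-means; returns k centroids (the "visual words")."""
    # Simplified seeding: take the first k descriptors as initial centroids.
    centroids = [list(f) for f in features[:k]]
    for _ in range(iters):
        # Assignment step: group each descriptor with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k), key=lambda j: sq_dist(f, centroids[j]))
            clusters[nearest].append(f)
        # Update step: move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Toy descriptors forming two well-separated groups (hypothetical data).
features = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0], [5.1, 5.0], [0.0, 0.0], [5.0, 5.0]]
codebook = kmeans(features, k=2)
```

Each returned centroid plays the role of one visual word; with real SIFT descriptors the vectors would be 128-dimensional and k would be far larger.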


Generated “Visual Words”

Word 1, Word 2, Word 3, Word 4, …, Word k


From an image to a histogram

$[n_1, n_2, \ldots, n_k] \in \mathbb{R}^k$

$n_1$: the number of occurrences of the 1st “word” in this image

$[0, 1, 0, \ldots, 0] \in \mathbb{R}^k$, $[1, 0, 0, \ldots, 0] \in \mathbb{R}^k$, $[0, 0, 1, \ldots, 0] \in \mathbb{R}^k$, …
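The mapping from an image to its histogram can be sketched as hard assignment followed by counting. The codebook and descriptors below are toy values assumed for illustration; `bof_histogram` is a hypothetical helper name.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bof_histogram(descriptors, codebook):
    """Hard-assign each local descriptor to its nearest visual word and count."""
    hist = [0] * len(codebook)
    for d in descriptors:
        word = min(range(len(codebook)), key=lambda j: sq_dist(d, codebook[j]))
        hist[word] += 1
    return hist

codebook = [[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]]    # k = 3 toy visual words
descriptors = [[0.2, 0.1], [4.8, 5.3], [0.1, 4.9], [5.1, 5.0]]
hist = bof_histogram(descriptors, codebook)        # [n1, n2, n3]
```

Every descriptor contributes one indicator vector, and the histogram is their sum, so the entries always add up to the number of local features in the image.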


Classifying, clustering or retrieving images

$\mathbb{R}^k$

$y = w^\top x + b$
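Once an image is a vector in R^k, the linear decision function above is a dot product plus a bias. A minimal sketch, assuming the weights `w` and bias `b` have already been learned (the values here are made up for illustration):

```python
def linear_score(w, x, b):
    """Compute y = w^T x + b for one k-dimensional histogram x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w = [1.0, -0.5, 0.0]   # hypothetical learned weights, one per visual word
b = -0.5               # hypothetical learned bias
x = [1, 2, 1]          # BoF histogram of one image
y = linear_score(w, x, b)
label = 1 if y >= 0 else -1   # the sign of y gives the predicted class
```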

http://yuhang.rsise.anu.edu.au/db/db689.jpg


A Bag-of-Features Image Analysis System

Image database

Feature extraction

Codebook generation

Feature coding

Feature pooling

Classification, clustering, or retrieval


99: Local Invariant Features, such as SIFT (Lowe, ICCV99)

03: Video Google (Sivic, ICCV03); Bag-of-keypoints (Csurka, SLCV@ECCV04)

05: Pyramid Match Kernel (Grauman, ICCV05); Dense sampling (Jurie, ICCV05); Compact Codebook (Winn, ICCV05)

06: Vocabulary tree (Nister, CVPR06); Randomized Clustering Forests (Moosmann, NIPS06); Spatial Pyramid Matching (Lazebnik, CVPR06)

07: Comparative Study (Zhang, IJCV07); Coding with Fisher Kernels (Perronnin, CVPR07)

08: Kernel Codebook (van Gemert, ECCV08); In Defense of Nearest Neighbor Classifier (Boiman, CVPR08)

09: Sparse coding for BoF (Yang, CVPR09); Local Coordinate Coding (Yu, NIPS09)

10: Locality-constrained Linear Coding for BoF (Wang, CVPR10); Coding & pooling scheme comparison (Boureau, CVPR10)

11: Local Soft-assignment Coding & Mix-order pooling (Liu, ICCV11); Comparative Study on BoF model (Chatfield, BMVC11)


Key issues of CBIR with the BoF model

Source: Nister and Stewenius, CVPR06

• How to quickly create a large visual codebook

– Hierarchical k-means clustering

– Approximate k-means clustering



• How to incorporate spatial information

– The BoF model ignores the spatial information of SIFT features

Spatial Pyramid Matching

Re-ranking with spatial verification



Retrieval result before spatial verification

Query:


25 points matched under a consistent spatial relationship

Only 4 points matched under a consistent spatial relationship

• Re-ranking with spatial verification
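A deliberately simplified sketch of spatial verification: it assumes the only geometric change between query and database image is a translation, and counts how many tentative keypoint matches agree with one shared shift. Practical systems typically fit a richer model (for example an affine transformation) with RANSAC; the coordinates, tolerance, and function names here are illustrative assumptions.

```python
def count_inliers(matches, tol=3.0):
    """matches: list of ((xq, yq), (xd, yd)) tentative keypoint correspondences.

    Each match proposes a translation; return the largest number of matches
    that agree with a single proposed translation within `tol` pixels.
    """
    best = 0
    for q, d in matches:
        dx, dy = d[0] - q[0], d[1] - q[1]
        inliers = sum(
            1 for q2, d2 in matches
            if abs((d2[0] - q2[0]) - dx) <= tol and abs((d2[1] - q2[1]) - dy) <= tol
        )
        best = max(best, inliers)
    return best

def rerank(candidates):
    """candidates: dict image_id -> matches. Sort ids by inlier count, descending."""
    return sorted(candidates, key=lambda img: count_inliers(candidates[img]), reverse=True)

candidates = {
    "img_a": [((10, 10), (50, 12)), ((20, 30), (5, 80)), ((40, 15), (90, 90))],
    "img_b": [((10, 10), (30, 20)), ((20, 30), (41, 40)), ((40, 15), (60, 26))],
}
order = rerank(candidates)
```

Both toy images have three tentative matches, so a pure BoF score could not separate them; only `img_b` has matches that move together, which is what pushes it to the top after verification.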



Retrieval result after spatial verification

Query:



• Large-scale image retrieval

– Memory, time, precision

– Approximate nearest-neighbor search

$[x_1, x_2, \ldots, x_d]$ → 0100101100… How?



• Locality-sensitive hashing (LSH)

– Random projection, data independent, unsupervised

• Learning compact binary codes

– Preserving sample similarities, data dependent

[Figure: LSH]
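A minimal sketch of the random-projection idea, using the random-hyperplane variant of LSH: each bit records on which side of a random direction the feature vector falls. The dimensions, number of bits, seed, and function names are assumptions chosen for illustration.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian directions; they do not depend on the data."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_code(x, hyperplanes):
    """One bit per hyperplane: 1 if x lies on its non-negative side, else 0."""
    return [1 if sum(h_i * x_i for h_i, x_i in zip(h, x)) >= 0 else 0
            for h in hyperplanes]

def hamming(a, b):
    """Number of positions where two binary codes differ."""
    return sum(u != v for u, v in zip(a, b))

planes = make_hyperplanes(dim=4, n_bits=16)
x = [1.0, 2.0, 0.5, -1.0]
code_x = hash_code(x, planes)
code_same_dir = hash_code([2.0 * v for v in x], planes)   # same direction as x
code_opposite = hash_code([-v for v in x], planes)        # opposite direction
```

Vectors pointing the same way receive identical codes and opposite vectors differ in every bit, so Hamming distance between codes acts as a cheap proxy for angular distance. Learned binary codes replace the random directions with ones fitted to the data.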



Retrieval examples from the “Oxford5K” data set

Source: Philbin et al., Object retrieval with large vocabularies and fast spatial matching, CVPR07

Days of the BoF model (Summary)

• Achievements

– Local invariant features play a fundamental role

– Visual codebook creation, feature coding, and feature pooling are extensively studied

– Multiple benchmark data sets are established

– Large-scale image retrieval is also researched

• To be improved

– Feature representation and recognition are separate

– Focused more on object-level retrieval and less on semantic-level retrieval






Era of Deep Learning

Visual: images, videos

Audio: speech, music

Text: natural language

Planning

…


• Image recognition

– Faces, objects, poses, scenes, …

• Video content analysis

– Action, activities, events, summarization, …

• Visual information management

– Search, retrieval, indexing, browsing, …

• Potential outcome: AI

– Computers can see and understand visual information

– Robotics, self-driving cars, surveillance

– ….


Object detection (Source: Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014)

Face Recognition (Source: DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR 2014)


Pose estimation (Source: DeepPose: Human Pose Estimation via Deep Neural Networks, CVPR 2014)

Image Segmentation (Source: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, IEEE TPAMI 2016)


• Fine-grained image recognition

• Human attribute classification

[Ning Zhang et al. CVPR 2014]

[Branson et al. arXiv 2014 ]


• Action Recognition

• Large-scale Video Classification

[Karpathy et al. CVPR 2014]

[Simonyan et al. arXiv 2014]


• Invariant and discriminative features

Feature Representation

Feature Extraction Classification “Panda”?

Prior Knowledge, Experience

Pose Occlusion Multiple objects

Inter-class similarity

Image courtesy of M. Ranzato


• From hand-crafted features to automatically learned ones

$\mathbb{R}^d$

$\mathbb{R}^k$

$y = w^\top x + b$


• Directly learn feature representations from data.

• Jointly learn the feature representation and the classifier.

Low-level Features

Mid-level Features

High-level Features Classifier

Deep Learning: train layers of features so that classifier works well.

More abstract representation

“Panda”?

Image courtesy of M. Ranzato


• Deep Learning

– Inspired by the way the human brain processes information

– Many layers of non-linear information processing stages


Have we been here before?

Yes.

• Basic ideas are common to past neural network research

• Standard machine learning strategies are still relevant

No.

Computational power, large-scale data, new algorithms

Deep Learning


Convolutional Neural Networks (CNNs)

• A special multi-stage architecture inspired by visual system


Source: slide by Girshick

Fukushima 1980 Neocognitron

Rumelhart, Hinton, Williams 1986 “T” versus “C” problem

LeCun et al. 1989-1998 Hand-written digit reading

...

Krizhevsky, Sutskever, Hinton 2012 ImageNet classification breakthrough “SuperVision” CNN

Convolutional Neural Networks (CNNs)


CNNs: ImageNet Breakthrough

● Krizhevsky et al. won the 2012 ImageNet classification challenge with a much bigger ConvNet

○ deeper: 7 stages vs 3 before

○ larger: 60 million parameters vs 1 million before

○ 16.4% error (top-5) vs 26.2% for the next best

● This was made possible by:

○ fast hardware: GPU-optimized code

○ big dataset: 1.2 million images vs thousands before

○ better regularization: dropout, etc.

[Krizhevsky et al. NIPS 2012]

Image courtesy of Deng et al.


Learned Features of CNNs

[Matthew D. Zeiler et al. ECCV 2014]


CBIR: From SIFT to CNNs

• Three main approaches

– Directly use pre-trained CNN models

• to extract feature representations

– Fine-tune pre-trained CNN models

• with side information (pairwise or triplet similarity)

– Bag-of-features model on CNN features

• “Deep SIFT”


1. Directly use pre-trained CNNs

• How to use the feature representations?

– Which layer?

– How to pool the features in a convolutional layer?

– How to select the features in a convolutional layer?



• How to use the feature representations?

– Which layer?

Fully connected layer

Convolutional layer






• How to use the feature representations?

– How to pool the features in a convolutional layer?

[Figure: a convolutional feature map (height × width × depth) pooled into a vector $[x_1, x_2, \ldots, x_n]$]

How?

• Sum-pooling

• Max-pooling

• Grid-based max-pooling

• Region-based pooling

• Mixed sum & max pooling
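The two simplest options, sum-pooling and max-pooling, can be sketched as follows. The feature map is represented as nested lists indexed `fmap[h][w][c]`; this layout, the toy activations, and the helper names are assumptions for illustration, not the API of any deep learning framework.

```python
def sum_pool(fmap):
    """fmap[h][w] is a D-dim activation vector; add them up over all positions."""
    depth = len(fmap[0][0])
    return [sum(cell[c] for row in fmap for cell in row) for c in range(depth)]

def max_pool(fmap):
    """Keep, per channel, the strongest activation over all positions."""
    depth = len(fmap[0][0])
    return [max(cell[c] for row in fmap for cell in row) for c in range(depth)]

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)

# Toy 2 x 2 feature map with depth 3 (hypothetical activations).
fmap = [[[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]],
        [[2.0, 1.0, 0.0], [1.0, 0.0, 1.0]]]
summed = sum_pool(fmap)   # one 3-dim descriptor for the whole image
maxed = max_pool(fmap)
descriptor = l2_normalize(summed)
```

Either way the height × width × depth block collapses to a single depth-dimensional vector. Grid-based and region-based variants apply the same operation to sub-windows of the map and combine the results.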



• How to use the feature representations?

– How to select the features in a convolutional layer?

• Weighting

• Activation magnitude

• Region detection

Source: Cao et al., Where to Focus: Query Adaptive Matching for Instance Retrieval Using Convolutional Feature Maps


2. Fine-tune pre-trained CNNs

• To incorporate extra information from a new image data set

– Side information (pairwise or triplet similarity)

– Distance metric learning

√

X
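One common way to turn triplet similarity into a training signal is a hinge-style triplet loss, sketched here on plain embedding vectors. This is an illustration under assumptions: in actual fine-tuning the embeddings would come from the CNN and the loss would be back-propagated through it, which this sketch does not do, and the margin and vectors are made-up values.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: zero once the negative is farther than the positive by `margin`."""
    return max(0.0, margin + sq_dist(anchor, positive) - sq_dist(anchor, negative))

anchor, positive = [0.0, 0.0], [0.0, 1.0]
far_negative = [3.0, 0.0]
near_negative = [1.0, 0.0]
satisfied = triplet_loss(anchor, positive, far_negative)    # constraint already met
violated = triplet_loss(anchor, positive, near_negative)    # constraint violated
```

A triplet contributes to training only while it is violated, so the network is pushed to place the similar image closer to the anchor than the dissimilar one by at least the margin.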


2. Fine-tune pre-trained CNNs

Source: MatchNet, CVPR 2015

Source: Learning Fine-Grained Image Similarity with Deep Ranking, CVPR 2014


3. Bag-of-features model on “Deep SIFT”


Source: Multi-scale Orderless Pooling of Deep Convolutional Activation Features, ECCV2014





“Deep SIFT”

Source: Cao et al., Where to Focus: Query Adaptive Matching for Instance Retrieval Using Convolutional Feature Maps




Codebook generation

Feature coding

Feature pooling

Classification, clustering, or retrieval

Or


12: Image Classification with DCNN (Krizhevsky, NIPS12)

14: CNN Features off-the-shelf (Razavian, CVPRW14); Neural codes (Babenko, ECCV14); Deep ranking (Wang, CVPR14); Multi-scale orderless pooling (Gong, ECCV14); Encoding High Dimensional Local Features (Liu, NIPS14); Survey: Deep learning for CBIR (Wan, ACMMM14)

15: Deep filter banks (Cimpoi, CVPR15); Exploiting Local Features from DNN (Ng, CVPRW15); SPoC (Babenko, ICCV15); MatchNet (Han, CVPR15)

16: R-MAC (Tolias, ICLR16); CNN IR Learns from BoW (Radenovic, ECCV16); CroW (Kalantidis, ECCVW16); Where to focus (Cao, 2016)

Some papers appeared on arXiv

Summary

• A very limited (and biased) account of CBIR

• CBIR has made significant progress over the past two decades

• The development of feature representation plays a key role

• Issues to be resolved

– How to transfer the benefit of Deep Learning?

– How to deal with the unsupervised learning case?

– How to better handle the semantic gap?

– …

Color histogram, Gabor feature, Euclidean distance, user model, query model, …

SIFT, bag-of-features, hashing, fine-grained recognition, …

Deep features, deep retrieval, deep ranking, deep hashing, …

Images courtesy of Google Images