8/3/2019 MIT6870_ORSU_lecture7: Power of 10
http://slidepdf.com/reader/full/mit6870orsulecture7-power-of-10 1/106
Lecture 7: Powers of 10
6.870 Object Recognition and Scene Understanding
http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm
Wednesday
Presenter: Vladimir Bychkovsky
Evaluator: Krista Ehinger
The power of the internet
http://en.wikipedia.org/wiki/One_red_paperclip
Human Vision
Some key properties:
Many input modalities
Active; supervised, unsupervised, and semi-supervised learning. It can look for supervision.
Performance: amazing
How it works: no idea
Robot Vision
Performance: It does not work
How it works: SIFT+SVM+HMM
Some key properties:
Many poor input modalities
Active, but it does not go far
Internet Vision
Performance: The more data, the better
How it works: SIFT+SVM+LSH
Some key properties:
Many input modalities
It can reach everywhere; tons of data
Image credit: Matt Britt
Past and future of image datasets in computer vision

Timeline (number of pictures, from 10^0 to 10^20):
1972: Lena, a dataset in one picture
1996: COREL, 40,000 images
2007: 2 billion images
2020?: Human Click Limit (all humanity taking one picture per second during 100 years)
The extremes of learning

Number of training samples: 1, 10, 10^2, 10^3, 10^4, 10^5, 10^6
Few samples (Lectures 2 and 3): extrapolation problem, generalization, transfer learning
Many samples (Lecture 7): interpolation problem, correspondence, finding the differences
Traditional datasets sit in between.
Scenes are unique
But not all scenes are so original
But not all scenes are so original
Lots of Images
A. Torralba, R. Fergus, W. T. Freeman. PAMI 2008
Lots of Images
A. Torralba, R. Fergus, W. T. Freeman. PAMI 2008
Lots of Images
Automatic Colorization Result
Grayscale input; colorization of input using average; high resolution
A. Torralba, R. Fergus, W.T.Freeman. 2008
Automatic Orientation
Many images have ambiguous orientation
Look at top 25% by confidence:
Examples of high and low confidence images:
Automatic Orientation Examples
A. Torralba, R. Fergus, W.T.Freeman. 2008
How many images are there?
Powers of 10

Number of images on my hard drive: 10^4
Number of images seen during my first 10 years: ~10^8
(3 images/second * 60 * 60 * 16 * 365 * 10 = 630,720,000)
Number of images seen by all humanity: ~10^20
(106,456,367,669 humans¹ * 60 years * 3 images/second * 60 * 60 * 16 * 365)
¹ from http://www.prb.org/Articles/2002/HowManyPeopleHaveEverLivedonEarth.aspx
Number of all images in the universe: 10^243
((10^81 atoms)^3 = 10^243)
Number of all 32x32 images: ~10^7398
(256^(32*32*3) = 2^24576 ≈ 10^7398)
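These back-of-the-envelope numbers are easy to check (a quick sanity-check sketch):

```python
import math

# Images seen during the first 10 years of life,
# at 3 images/second for 16 waking hours a day:
seen_in_10_years = 3 * 60 * 60 * 16 * 365 * 10
print(seen_in_10_years)  # 630720000, i.e. ~10^8.8

# Images seen by all humanity (~1.06e11 people who have ever lived,
# per the PRB figure above), 60 years each at the same rate:
all_humanity = 106_456_367_669 * 60 * 3 * 60 * 60 * 16 * 365
print(int(math.log10(all_humanity)))  # 20, i.e. ~10^20

# All possible 32x32 RGB images: 256 values per channel, 32*32*3 channels.
# The exact exponent is 32*32*3 * log10(256) ≈ 7398.
digits = 32 * 32 * 3 * math.log10(256)
print(round(digits))
```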
How many images are there?
Chandler and Field (2007).
How many images are there?
Torralba, Fergus, Freeman. PAMI 2008
10% of the objects account for 90% of the data (~Zipf's law)
Datasets: Caltech 101, Tiny images, LabelMe
We need transfer learning
10% of the objects account for 90% of the data (~Zipf's law)
Datasets: Caltech 101, Tiny images, LabelMe
Is this something humans do at all?
What's the Capacity of Visual Long Term Memory?

Standing (1973): 10,000 images, 83% recognition.

"Basically, my recollection is that we just separated the pictures into distinct thematic categories: e.g. cars, animals, single-person, 2-people, plants, etc. Only a few slides were selected which fell into each category, and they were visually distinct." (according to Standing)

What we know: people can remember thousands of images; high fidelity visual memory is possible (Hollingworth 2004).
What we don't know: what are people remembering for each item? Sparse details or highly detailed? "Gist" only? (examples: dogs, playing cards)

Slide by Aude Oliva
Massive Memory I: Methods
Showed 14 observers 2500 categorically unique objects
1 at a time, 3 seconds each
800 ms blank between items
Study session lasted about 5.5 hours
Repeat Detection task to maintain focus
1-back
Followed by 300 2-alternative forced choice tests
1024-back
Slide by Aude Oliva
Slide by Aude Oliva
How far can we push the fidelity of visual LTM representation?
Same object, different states
Slide by Aude Oliva
Massive Memory I: Recognition Memory Results
Replication of Standing (1973): 92% (compared with visual cognition experts' predictions)
Slide by Aude Oliva
Massive Memory I: Recognition Memory Results: 92%, 88%, 87%
Slide by Aude Oliva
Extrapolation of Repeat Detection Data
Human performance for n = 1024
Power law (r² = .988); quadratic (r² = .988)
Brady, Konkle, Alvarez, Oliva (submitted) Slide by Aude Oliva
Building datasets
Collecting datasets (towards 10^6-10^7 examples)
ESP game (CMU): Luis von Ahn and Laura Dabbish, 2004
LabelMe (MIT): Russell, Torralba, Freeman, 2005
StreetScenes (CBCL-MIT): Bileschi, Poggio, 2006
WhatWhere (Caltech): Perona et al, 2007
PASCAL challenge: 2006, 2007
Lotus Hill Institute: Song-Chun Zhu et al, 2007
80 million images: Torralba, Fergus, Freeman, 2007
Names and faces
Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee Whye Teh, Erik Learned-Miller, David A. Forsyth. CVPR, 2004
Names and faces
30,281 face images, obtained by applying a face finder to approximately half a million captioned news images
Clustering stage
Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee Whye Teh, Erik Learned-Miller, David A. Forsyth. CVPR, 2004
The general approach:
1. Use unambiguously labeled data items to estimate discriminant coordinates.
2. Use a version of k-means to allocate ambiguously labeled faces to one of their labels.
3. Clean up the clusters by removing data items far from the mean, and re-estimate discriminant coordinates.
4. Merge clusters based on facial similarities.
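The constrained assignment step above (each face may only go to one of the names in its caption) can be sketched as follows; the names and data are purely illustrative, not Berg et al.'s code:

```python
import math

def assign_ambiguous(faces, candidate_labels, means):
    """faces: list of face descriptors (tuples) in discriminant coordinates.
    candidate_labels: for each face, the names allowed by its caption.
    means: dict mapping name -> cluster mean (tuple).
    Returns, per face, the candidate name whose cluster mean is nearest."""
    out = []
    for x, cands in zip(faces, candidate_labels):
        out.append(min(cands, key=lambda c: math.dist(x, means[c])))
    return out

# Hypothetical two-name example:
means = {"Bush": (0.0, 0.0), "Blair": (10.0, 10.0)}
faces = [(0.1, 0.2), (9.5, 10.0)]
print(assign_ambiguous(faces, [["Bush", "Blair"], ["Bush", "Blair"]], means))
# ['Bush', 'Blair']
```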
Semiautomatic labeling

LabelMe statistics: the number of labeled instances per description spans 1 to 10,000 (log-log plot); >1500 descriptions have fewer than 100 samples.

Semi-automatic labeling: Abramson & Freund, CVPR 2005
Labeling Google images: Fergus et al, ECCV 2004, ICCV 2005

Challenge: we need accurate labeling (similar to users'): segments instead of bounding boxes, overlap with ground truth > 90%.
Semiautomatic labeling
Sailboats from LabelMe; sailboats from Google, Altavista, Flickr
Query LabelMe, train a simple detector (SVM), query online search tools, label Google images
Semiautomatic labeling
Precision (object presence) vs. image rank (100, 500, 1000): Google ranking compared with detector ranking
Examples of semi-automatic labeling
Optimol
Li-Jia Li, Gang Wang and Li Fei-Fei. OPTIMOL: automatic Object Picture collecTion via Incremental MOdel Learning. IEEE Computer Vision and Pattern Recognition (CVPR), Minneapolis, 2007
Once a model is learned, it can be used to classify images from the web resource. If an image is classified as belonging to the object category, it is accepted and incorporated into the collected dataset; otherwise, it is discarded. The model is then updated with the newly accepted images from the current round. In this incremental way, the category model gets more and more robust and, as a consequence, the collected dataset grows larger with reliable images.
im2gps
Instead of using object labels, the web provides other kinds of metadata associated with large collections of images.
Hays & Efros. CVPR 2008
20 million geotagged and geographic text-labeled images
Hays & Efros. CVPR 2008
im2gps
Video Google
Sivic, J. and Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. Proceedings of the International Conference on Computer Vision (2003)
Visually defined search
Given an object specified by its image, retrieve all shots containing the object:
must handle viewpoint change etc.
must be efficient at run time
Examples: objects, people, places. Slide by Josef Sivic
Object search in video: why is it hard?
an object's imaged appearance varies…
scale changes
lighting changes
viewpoint changes
partial occlusion
sheer amount of data
feature-length movie: ~100,000-150,000 frames
Slide by Josef Sivic
Image to visual nouns
Visual description: visual words
Slide by Josef Sivic
Visual vocabulary unaffected by scale and viewpoint
The same visual word. Slide by Josef Sivic
Image representation using visual words
Use efficient Google-like search on visual words
Slide by Josef Sivic
Efficient search: In a classical file structure all words are stored in the document they appear in. An inverted file structure has an entry (hit list) for each word where all occurrences of the word in all documents are stored. In our case the inverted file has an entry for each visual word, which stores all the matches, i.e. occurrences of the same word in all frames. The document vector is very sparse and use of an inverted file makes the retrieval very fast. Querying a database of 4k frames takes about 0.1 second with a Matlab implementation on a 2 GHz Pentium.
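The inverted-file retrieval described above can be sketched as follows (a minimal illustrative version, not the authors' implementation):

```python
from collections import defaultdict

def build_inverted_index(frames):
    """frames: dict frame_id -> iterable of visual-word ids.
    Returns word id -> set of frame ids containing it (the 'hit list')."""
    index = defaultdict(set)
    for frame_id, words in frames.items():
        for w in words:
            index[w].add(frame_id)
    return index

def query(index, query_words):
    """Score frames by how many query visual words they share, touching
    only the hit lists of the query words (the sparse lookup that makes
    retrieval fast). Returns frame ids, best match first."""
    scores = defaultdict(int)
    for w in set(query_words):
        for frame_id in index.get(w, ()):
            scores[frame_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

# Toy database of three frames:
index = build_inverted_index({1: [5, 7], 2: [7, 9], 3: [9]})
print(query(index, [7, 9]))  # frame 2 first, it shares both words
```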
Video Google
Query:
Retrieved frames:
Example: Groundhog Day, retrieved shots
Video Google, Sivic & Zisserman, ICCV 2003
Slide by Josef Sivic
Example: Casablanca, retrieved shots
Slide by Josef Sivic
Scalable recognition with a vocabulary tree
David Nister and Henrik Stewenius. CVPR 2006.
With a good image similarity and a lot of data…
Input image and nearest neighbors (22,000 LabelMe scenes)
Hays, Efros, Siggraph 2006; Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
With a good image similarity and a lot of data…
Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
With a good image similarity and a lot of data…
Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
With a good image similarity and a lot of data…
Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
With a good image similarity and a lot of data…
Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
Outputs
Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
road, window, keyboard, screen
Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
Next Wednesday
N. Snavely, S. M. Seitz, R. Szeliski. Photo tourism: Exploring photo collections in 3D. Siggraph 2006 (website) (code)
J. Hays, A. A. Efros. Scene Completion Using Millions of Photographs. SIGGRAPH 2007 (website and code)
A. Torralba, R. Fergus, W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI 2008 (website)
SIFT flow
Dense sampling in time : classical optical flow :: dense sampling in world images : SIFT flow
Ce Liu, Jenny Yuen, Torralba, Sivic, Freeman
Matching frames / views
The two images are taken from the same scene at different times and/or from different perspectives.
Liu, Yuen, Torralba, Sivic, Freeman. ECCV 08
Matching scenes
Two images taken from the same scene category, but different instances; they contain different objects with different scales, perspectives and spatial locations.
Liu, Yuen, Torralba, Sivic, Freeman. ECCV 08
Image representation
128 dimensions/pixel
SIFT visualization: map the 128 dimensions into 3D color space
Same scene instance matching
(a) Query image (b) Dense SIFT (c) Best match (d) SIFT of (c) (e) Warped (c) (f) Warped (d)
Matching different scenes
8/3/2019 MIT6870_ORSU_lecture7: Power of 10
http://slidepdf.com/reader/full/mit6870orsulecture7-power-of-10 71/106
Matching different scenes
Matching: objects
Scene matching
Scene matching
Failures
The nearest neighbors may not contain similar scenes or object categories (SIFT flow tries to match image structures anyway).
Dealing with millions of images
Input image
Binary codes for global scene representation
Short codes allow for storing millions of images
Efficient search: Hamming distance (search millions of images in a few microseconds)
Internet-scale experiments: compute nearest neighbors between all images on the internet
512 bits
Binary codes for images
Want images with similar content to have similar binary codes
Use Hamming distance between codes: the number of bit flips
E.g.: Ham_Dist(10001010, 10001110) = 1; Ham_Dist(10001010, 11101110) = 3
Semantic Hashing [Salakhutdinov & Hinton, 2007]: text documents
Slide Rob Fergus
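The Hamming distance in the slide's examples is just a popcount of the XOR of the two codes:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary codes: the number of
    differing bits, i.e. the popcount of the XOR."""
    return bin(a ^ b).count("1")

# The slide's examples:
print(hamming(0b10001010, 0b10001110))  # 1
print(hamming(0b10001010, 0b11101110))  # 3
```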
Binary codes for images
Permits fast lookup via hashing:
Each code is a memory address
Find neighbors by exploring the Hamming ball around the query address
Lookup time depends on the radius of the ball, NOT on the number of data points
The query image maps, via the semantic hash function, to a query address; semantically similar images sit at nearby addresses in the address space
Figure adapted from Salakhutdinov & Hinton '07
Slide Rob Fergus
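The Hamming-ball lookup can be sketched as follows (illustrative; `table`, `radius`, and the address layout are assumptions, not the paper's code):

```python
from itertools import combinations

def hamming_ball(code: int, nbits: int, radius: int):
    """Yield every address within the given Hamming radius of `code`.
    The number of probes depends only on nbits and radius, not on how
    many images are stored."""
    yield code
    for r in range(1, radius + 1):
        for bits in combinations(range(nbits), r):
            flipped = code
            for b in bits:
                flipped ^= 1 << b
            yield flipped

def lookup(table, code, nbits, radius):
    """table: dict address -> list of images. Gather neighbors by probing
    every address inside the Hamming ball around the query code."""
    hits = []
    for addr in hamming_ball(code, nbits, radius):
        hits.extend(table.get(addr, []))
    return hits

# Toy 4-bit address space:
table = {0b0000: ["a"], 0b0001: ["b"], 0b0011: ["c"]}
print(lookup(table, 0b0000, nbits=4, radius=1))  # ['a', 'b']
```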
Compact Binary Codes
Google has a few billion images (10^9)
A big PC has ~10 GB of memory (10^11 bits)
Codes must fit in memory (disk is too slow): budget of ~10^2 bits/image
A 1-megapixel image is 10^7 bits; a 32x32 color image is 10^4 bits
So the semantic hash function must also reduce dimensionality
Slide Rob Fergus
How many bits do we need?
Goal: to decide when two images are similar (two images are similar if they contain the same large object classes in similar spatial configurations).
First, let's use some hand-wavy arguments to gain intuition about how many bits we need. If we have 1000 object categories, then we would need 1000 bits to tell whether each object is present or not (assuming independent objects). However, if objects are sparse and, on average, only 5 important objects are present in an image, then we need only about 50 bits to describe the image content. Adding spatial location can add another 8*5 = 40 bits.
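The hand-wavy budget can be checked in a couple of lines (assumptions from the slide: 1000 categories, 5 objects per image, 8 location bits per object):

```python
import math

bits_per_name = math.ceil(math.log2(1000))   # ~10 bits to name one category
category_bits = 5 * bits_per_name            # 50 bits for which objects
location_bits = 5 * 8                        # 40 bits for where they are
print(category_bits + location_bits)         # 90
```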
From the image (3 megapixels = 8 million bits), down to 1000 pixels, down to short codes (512 bits)
How many bits do we need?
Code lengths compared: 16, 32, 64, 128, 256, 512, 1024, 2048, and 24576 bits
Measuring image similarity with annotated data: each image is summarized by a histogram of the number of pixels per object class (Building, Sky, Road, Person, Sidewalk, Car, Sculpture, Bicycle), and histograms are compared with spatial pyramid matching [Lazebnik et al.]: S(h1, h2) = sum(min(h1, h2))
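The matching score S(h1, h2) = sum(min(h1, h2)) is a one-liner over per-class pixel counts:

```python
def histogram_intersection(h1, h2):
    """Histogram intersection similarity, S(h1, h2) = sum(min(h1, h2)),
    applied here to per-class pixel-count histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Toy histograms over three object classes:
print(histogram_intersection([3, 0, 2], [1, 4, 2]))  # 3
```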
Hashing
We consider the following learning problem: given a database of images {x_i} and a distance function D(i, j), we seek a binary feature vector y_i = f(x_i) that preserves the nearest neighbor relationships under a Hamming distance.
Salakhutdinov and Hinton [SIGIR 2007], Shakhnarovich et al [ICCV 2003], Athitsos et al. [ICDE 2008], Grauman et al[CVPR 2007], Nascimentio et al [ACM Smyp. App. Computing 2002], Wang [ICME 2006], Wang [PAMI 2008],
Locality Sensitive Hashing
Gionis, A., Indyk, P. & Motwani, R. (1999)
Take random projections of the data
Quantize each projection with a few bits
For our N-bit code:
compute the first N PCA components of the data
each random projection must be a linear combination of the N PCA components
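A minimal sketch of this recipe, under stated assumptions (each projection is quantized to a single bit by thresholding at zero; the mixing matrix and sizes are illustrative):

```python
import numpy as np

def lsh_codes(X, nbits, seed=0):
    """Binary codes from random projections restricted to the span of the
    top-nbits PCA components, one bit per projection."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    # Top principal directions via SVD of the centered data:
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pcs = Vt[:nbits]                          # (nbits, d)
    # Random linear combinations of those components:
    R = rng.standard_normal((nbits, nbits))
    proj = Xc @ (R @ pcs).T                   # (n, nbits)
    return (proj > 0).astype(np.uint8)        # one bit per projection

X = np.random.default_rng(1).standard_normal((10, 5))
codes = lsh_codes(X, nbits=4)
print(codes.shape)  # (10, 4)
```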
Pose estimation
Fast Pose Estimation with Parameter Sensitive Hashing.
Shakhnarovich, Viola, Darrell. ICCV 2003
Learning Hamming distances with boosting
Shakhnarovich and Darrell
Each image is represented by a binary vector with M bits:
y = [h_1(x), h_2(x), ..., h_M(x)]
where x is the vector of image features and each h_n is a function with binary output.
The distance between two images is given by a weighted Hamming distance:
D(i, j) = sum_{n=1}^{M} alpha_n |h_n(x_i) - h_n(x_j)|
The weights alpha_n and the functions h_n that map the input vector x_i into binary features are learned.
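The weighted Hamming distance D(i, j) can be written directly (an illustrative sketch with made-up weights):

```python
def weighted_hamming(y_i, y_j, alpha):
    """D(i, j) = sum_n alpha_n * |h_n(x_i) - h_n(x_j)| for binary vectors
    y = [h_1(x), ..., h_M(x)] and learned weights alpha."""
    return sum(a * abs(p - q) for a, p, q in zip(alpha, y_i, y_j))

# Hypothetical 3-bit codes and weights:
print(weighted_hamming([1, 0, 1], [1, 1, 0], [0.5, 2.0, 1.0]))  # 3.0
```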
Learning
Positive examples are pairs of images x_i, x_j such that x_j is one of the N nearest neighbors of x_i. Negative examples are pairs of images that are not neighbors.
In BoostSSC, each regression stump thresholds a single input coordinate. At each iteration n we select the parameters of f_n: the regression coefficients (alpha_n, beta_n) and the stump parameters (where e_n is a unit vector, so that e_n^T x returns the k-th component of x, and T_n is a threshold), to minimize the square loss, where K is the number of training pairs, z_k is the neighborhood label (z_k = 1 if the two images are neighbors and z_k = -1 otherwise), and w_k^n is the weight for each training pair at iteration n.
Shakhnarovich and Darrell
Compressing the gist descriptor
GIST [Oliva and Torralba '01]: original image mapped to a binary code (e.g. 1 0 1 1 1 0 0 1 …)
Input image; ground truth neighbors; Gist; Gist (32 bits)
How many bits do we need?
Example labels: road, mountain, tree, car, sky
LabelMe: 22,000 images, 32-bit codes
Tiny Images: 10^7 images, 256-bit codes
Scene matching with camera transformations
Image representation
Original image: color layout, GIST [Oliva and Torralba '01]
Scene matching with camera view transformations: Translation
1. Move camera
2. View from the virtual camera
3. Find a match to fill the missing pixels
4. Locally align images
5. Find a seam
6. Blend in the gradient domain
Scene matching with camera view transformations: Camera rotation
1. Rotate camera
2. View from the virtual camera
3. Find a match to fill in the missing pixels
4. Stitched rotation
5. Display on a cylinder
Scene matching with camera view transformations: Forward motion
1. Move camera
2. View from the virtual camera
3. Find a match to replace pixels
Tour from a single image
Navigate the virtual space using intuitive motion controls
Basic camera motions
Exploring famous sites
Direction of forward motion
Input image; query region; best match forward
Forward motion (towards image centre)
Forward motion on the ground plane
[Torralba and Sinha '01, Lalonde '07, Hoiem '07]
If images are from the same place…
Google Street View (controlled image capture); Photo Tourism/Photosynth [Snavely et al., 2006] (register images based on multi-view geometry)
Next Wednesday
N. Snavely, S. M. Seitz, R. Szeliski. Photo tourism: Exploring photo collections in 3D. Siggraph 2006 (website) (code)
J. Hays, A. A. Efros. Scene Completion Using Millions of Photographs. SIGGRAPH 2007 (website and code)
A. Torralba, R. Fergus, W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI 2008 (website)