MIT6870_ORSU_lecture7: Power of 10


8/3/2019 MIT6870_ORSU_lecture7: Power of 10

http://slidepdf.com/reader/full/mit6870orsulecture7-power-of-10 1/106

Lecture 7: Powers of 10

6.870 Object Recognition and Scene Understanding

http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm


Wednesday

Presenter: Vladimir Bychkovsky

Evaluator: Krista Ehinger 


The internet power 


http://en.wikipedia.org/wiki/One_red_paperclip


Human Vision

Some key properties:

Many input modalities

Active

Supervised, unsupervised, semi-supervised learning. It can look for supervision.

Performance: amazing

How it works: no idea


Robot Vision

Performance: It does not work

How it works: SIFT+SVM+HMM

Some key properties:

Many poor input modalities

Active, but it does not go far 


Internet Vision

Performance: The more data, the better 

How it works: SIFT+SVM+LSH

Some key properties:

Many input modalities

It can reach everywhere

Tons of data

Image credit: Matt Britt


Past and future of image datasets in computer vision

[Figure: number of pictures (log scale, 10^0 to 10^20) versus time. Marked points: Lena, a dataset in one picture (1972); COREL, 40,000 images (1996); 2 billion images (2007); 2020 (?). A reference level marks the Human Click Limit: all humanity taking one picture per second for 100 years.]


The extremes of learning

[Figure: axis showing the number of training samples, from 1 to 10^6. Few samples: extrapolation problem (generalization, transfer learning). Many samples: interpolation problem (correspondence, finding the differences). Traditional datasets sit in between. Axis annotations: Lecture 2, Lecture 3, Lecture 7.]


Scenes are unique


But not all scenes are so original


Lots of Images

A. Torralba, R. Fergus, W. T. Freeman. PAMI 2008


Automatic Colorization Result

Grayscale input

High resolution

Colorization of input using average

A. Torralba, R. Fergus, W. T. Freeman. 2008


Automatic Orientation

Many images have ambiguous orientation

Look at top 25% by confidence:

Examples of high and low confidence images:


 Automatic Orientation Examples

 A. Torralba, R. Fergus, W.T.Freeman. 2008


How many images are there?


Powers of 10

Number of images on my hard drive: 10^4

Number of images seen during my first 10 years: 10^8

(3 images/second * 60 * 60 * 16 * 365 * 10 = 630,720,000)

Number of images seen by all humanity: 10^20

(106,456,367,669 humans [1] * 60 years * 3 images/second * 60 * 60 * 16 * 365 ≈ 4 x 10^20)

[1] From http://www.prb.org/Articles/2002/HowManyPeopleHaveEverLivedonEarth.aspx

Number of all images in the universe: 10^243

(10^81 atoms * 10^81 * 10^81 = 10^243)

Number of all 32x32 images: 256^(32*32*3) ~ 10^7398
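The estimates on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant names are made up for illustration, and "order of magnitude" is taken here to mean the floor of the base-10 logarithm, which is an assumption about how the slide rounds.

```python
import math

# Assumptions taken from the slide: 3 images per second, 16 waking hours a day.
IMAGES_PER_SECOND = 3
WAKING_SECONDS_PER_YEAR = 60 * 60 * 16 * 365

def order_of_magnitude(n):
    """Return floor(log10(n)) for a positive number."""
    return math.floor(math.log10(n))

# Images seen by one person during the first 10 years.
first_ten_years = IMAGES_PER_SECOND * WAKING_SECONDS_PER_YEAR * 10

# Images seen by all humans who ever lived, 60 years each.
all_humanity = 106_456_367_669 * 60 * IMAGES_PER_SECOND * WAKING_SECONDS_PER_YEAR

# Base-10 exponent of 256**(32*32*3), via logarithms to avoid a huge integer.
tiny_image_exponent = 32 * 32 * 3 * math.log10(256)

print(first_ten_years, order_of_magnitude(first_ten_years))
print(all_humanity, order_of_magnitude(all_humanity))
print(round(tiny_image_exponent))
```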


How many images are there?

Chandler and Field (2007)


How many images are there?

Torralba, Fergus, Freeman. PAMI 2008


10% of the objects account for 90% of the data

~Zipf's law

Caltech 101

Tiny images

LabelMe

We need transfer learning


Is this something humans do at all?


What's the Capacity of Visual Long-Term Memory?

What we know: people can remember thousands of images.

Standing (1973): 10,000 images, 83% recognition.

According to Standing: "Basically, my recollection is that we just separated the pictures into distinct thematic categories: e.g. cars, animals, single-person, 2-people, plants, etc. Only a few slides were selected which fell into each category, and they were visually distinct."

What we don't know: what are people remembering for each item?

Sparse details ("gist" only) or highly detailed? Example: "Dogs Playing Cards".

High-fidelity visual memory is possible (Hollingworth 2004).

Slide by Aude Oliva


Massive Memory I: Methods

Showed 14 observers 2500 categorically unique objects

1 at a time, 3 seconds each

800 ms blank between items

Study session lasted about 5.5 hours

Repeat detection task to maintain focus (1-back to 1024-back)

Followed by 300 2-alternative forced choice tests

Slide by Aude Oliva


How far can we push the fidelity of visual LTM representation?

Same object, different states

Slide by Aude Oliva


Massive Memory I: Recognition Memory Results

[Figure: visual cognition expert predictions compared with observed recognition performance: 92%]

Replication of Standing (1973)

Slide by Aude Oliva


Massive Memory I: Recognition Memory Results

92%, 88%, 87%

Slide by Aude Oliva


Extrapolation of Repeat Detection Data

Human performance for n = 1024

Power law (r^2 = .988)

Quadratic (r^2 = .988)

Brady, Konkle, Alvarez, Oliva (submitted)

Slide by Aude Oliva


Building datasets


Collecting datasets (towards 10^6 to 10^7 examples)

ESP game (CMU): Luis von Ahn and Laura Dabbish, 2004

LabelMe (MIT): Russell, Torralba, Freeman, 2005

StreetScenes (CBCL-MIT): Bileschi, Poggio, 2006

WhatWhere (Caltech): Perona et al., 2007

PASCAL challenge: 2006, 2007

Lotus Hill Institute: Song-Chun Zhu et al., 2007

80 million images: Torralba, Fergus, Freeman, 2007


Names and faces

Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee Whye Teh, Erik Learned-Miller, David A. Forsyth. CVPR, 2004


30,281 face images, obtained by applying a face finder to approximately half a million captioned news images

Clustering stage

The general approach involves:

using unambiguously labeled data items to estimate discriminant coordinates

using a version of K-means to allocate ambiguously labeled faces to one of their labels

cleaning up the clusters by removing data items far from the mean, and re-estimating discriminant coordinates

merging clusters based on facial similarities


Semiautomatic labeling

LabelMe statistics

[Figure: number of labeled instances per description, on logarithmic axes (1 to 10,000 and 10 to 1000)]

>1500 descriptions with fewer than 100 samples

Semi-automatic labeling: Abramson & Freund, CVPR 2005

Labeling Google images: Fergus et al., ECCV 2004, ICCV 2005

Challenge: we need accurate labeling (similar to users)

Segments instead of bounding boxes

Overlap with ground truth > 90%


Semiautomatic labeling

Query LabelMe: sailboats from LabelMe

Train a simple detector (SVM)

Query online search tools: sailboats from Google, Altavista, Flickr

Label Google images


Semiautomatic labeling

[Figure: precision (object presence) versus image rank (100, 500, 1000), comparing Google ranking with detector ranking]


Examples of semi-automatic labeling


OPTIMOL

Li-Jia Li, Gang Wang and Li Fei-Fei. OPTIMOL: automatic Object Picture collecTion via Incremental MOdel Learning. IEEE Computer Vision and Pattern Recognition (CVPR), Minneapolis, 2007

Once a model is learned, it can be used to do classification on the images from the web resource. If the image is classified as in this object category, it gets accepted and incorporated into the collected dataset. Otherwise, it will be discarded. The model will again be updated by the newly accepted images in the current round. In this incremental way, the category model gets more and more robust. As a consequence, the collected dataset gets larger and larger with reliable images.
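The accept, discard and update cycle described above can be illustrated with a toy loop. This is not the OPTIMOL model: as a stand-in, the "model" below is just the mean of one scalar feature and the "classifier" is a distance threshold. The function name `collect`, the feature values and the threshold are hypothetical.

```python
def collect(seed, batches, threshold):
    """Toy incremental collection over successive batches of candidates.

    seed      -- feature values of the initial, trusted examples
    batches   -- list of lists of candidate feature values (one list per round)
    threshold -- maximum distance to the model for a candidate to be accepted
    """
    dataset = list(seed)
    discarded = []
    for batch in batches:
        # Re-estimate the model from everything accepted so far.
        model = sum(dataset) / len(dataset)
        for feature in batch:
            if abs(feature - model) <= threshold:
                dataset.append(feature)    # accepted: joins the dataset
            else:
                discarded.append(feature)  # rejected: thrown away
    return dataset, discarded


dataset, discarded = collect(
    seed=[1.0, 1.2],
    batches=[[1.1, 5.0], [1.3, 0.9, 4.2]],
    threshold=0.5,
)
print(dataset)
print(discarded)
```

The point of the sketch is only the control flow: the model is refit after each round, so later rounds are judged by a model trained on more data.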


im2gps

Instead of using object labels, the web provides other kinds of metadata associated with large collections of images

Hays & Efros. CVPR 2008

20 million geotagged and geographic text-labeled images


Video Google

Sivic, J. and Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. Proceedings of the International Conference on Computer Vision (2003)


Visually defined search

Given an object specified by its image, retrieve all shots containing the object:

must handle viewpoint change etc.

must be efficient at run time

Examples: people, objects, places

Slide by Josef Sivic


Object search in video: why is it hard?

an object's imaged appearance varies ...

scale changes

lighting changes

viewpoint changes

partial occlusion

sheer amount of data

feature length movie ~ 100,000 to 150,000 frames

Slide by Josef Sivic


Visual description: visual words

Image visual nouns

Slide by Josef Sivic


Visual vocabulary unaffected by scale and viewpoint

The same visual word

Slide by Josef Sivic

Image representation using visual words

Use efficient Google-like search on visual words

Slide by Josef Sivic


Efficient search: In a classical file structure all words are stored in the document they appear in. An inverted file structure has an entry (hit list) for each word where all occurrences of the word in all documents are stored. In our case the inverted file has an entry for each visual word, which stores all the matches, i.e. occurrences of the same word in all frames. The document vector is very sparse and use of an inverted file makes the retrieval very fast. Querying a database of 4k frames takes about 0.1 second with a Matlab implementation on a 2GHz Pentium.
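A minimal sketch of an inverted file over visual words, to make the idea concrete. The frame ids and word ids are invented, and ranking by the raw count of shared words is a simplification chosen for this illustration; it is not claimed to be the weighting used in Video Google.

```python
from collections import Counter, defaultdict


def build_inverted_file(frames):
    """Map each visual word id to the list of frame ids that contain it."""
    inverted = defaultdict(list)
    for frame_id, words in frames.items():
        for word in set(words):          # one hit per (word, frame) pair
            inverted[word].append(frame_id)
    return inverted


def query(inverted, query_words):
    """Rank frames by the number of distinct query words they share."""
    votes = Counter()
    for word in set(query_words):
        # Only frames listed under this word are touched; all others are skipped.
        for frame_id in inverted.get(word, []):
            votes[frame_id] += 1
    return votes.most_common()


# Hypothetical database: frame id -> visual word ids found in that frame.
frames = {0: [3, 7, 7, 9], 1: [1, 2], 2: [7, 9, 11]}
index = build_inverted_file(frames)
print(query(index, [7, 9, 11]))
```

Because each query word only visits its own hit list, the cost depends on how many frames share words with the query rather than on the total number of frames.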

Video Google

Query:

Retrieved frames:


Example: Groundhog Day

Retrieved shots

Video Google, Sivic & Zisserman, ICCV 2003

Slide by Josef Sivic


Example: Casablanca

Retrieved shots

Slide by Josef Sivic


Scalable recognition with a vocabulary tree

David Nister and Henrik Stewenius. CVPR 2006.


With a good image similarity and a lot of data...

Input image

Nearest neighbors

22,000 LabelMe scenes

Hays, Efros, Siggraph 2006

Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007


Outputs

Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007


[Figure: output labels: road, window, keyboard, screen]

Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007


Next Wednesday

N. Snavely, S. M. Seitz, R. Szeliski. Photo tourism: Exploring photo collections in 3D. SIGGRAPH 2006. (website) (code)

J. Hays, A. A. Efros. Scene Completion Using Millions of Photographs. SIGGRAPH 2007. (website and code)

A. Torralba, R. Fergus, W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI 2008. (website)


SIFT flow

Dense sampling in time : classical optical flow :: dense sampling in world images : SIFT flow

Ce Liu, Jenny Yuen, Torralba, Sivic, Freeman


Matching frames / views

The two images are taken from the same scene at different times and/or from different perspectives

Liu, Yuen, Torralba, Sivic, Freeman. ECCV 08


Matching scenes

Two images taken from the same scene category, but different instances

Contain different objects with different scales, perspectives and spatial locations

Liu, Yuen, Torralba, Sivic, Freeman. ECCV 08


Image representation

128 dimensions/pixel

SIFT visualization: map 128 dimensions into 3D color space


Same scene instance matching

(a) Query image (b) Dense SIFT (c) Best match (d) SIFT of (c) (e) Warped (c) (f) Warped (d)


Matching different scenes


Matching: objects


Scene matching


Failures

The nearest neighbors may not contain similar scenes or object categories (SIFT flow tries to match image structures anyway)


Dealing with millions of images

Input image


Binary codes for global scene representation

Short codes allow for storing millions of images

Efficient search: Hamming distance (search millions of images in a few microseconds)

Internet scale experiments: compute nearest neighbors between all images on the internet

512 bits


Binary codes for images

Want images with similar content to have similar binary codes

Use Hamming distance between codes

 - Number of bit flips

 - E.g.:

   Ham_Dist(10001010, 10001110) = 1

   Ham_Dist(10001010, 11101110) = 3

Semantic Hashing [Salakhutdinov & Hinton, 2007]

 - Text documents

Slide: Rob Fergus
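The two Hamming distance examples on this slide can be reproduced with a short function. The name `ham_dist` mirrors the slide's `Ham_Dist`; passing the codes as bit strings is a choice made for this sketch.

```python
def ham_dist(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    # XOR leaves a 1 wherever the two codes differ; count those bits.
    return bin(int(a, 2) ^ int(b, 2)).count("1")


print(ham_dist("10001010", "10001110"))
print(ham_dist("10001010", "11101110"))
```

On packed integer codes the same XOR-and-count step is what makes comparing short binary codes cheap.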


Binary codes for images

Permits fast lookup via hashing

 - Each code is a memory address

 - Find neighbors by exploring Hamming ball around query address

 - Lookup time depends on radius of ball, NOT on # data points

[Figure adapted from Salakhutdinov & Hinton '07: the semantic hash function maps the query image to a query address in the address space; semantically similar images lie at nearby addresses]

Slide: Rob Fergus
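A sketch of the Hamming ball lookup, assuming codes are small integers used as dictionary keys (standing in for memory addresses). The table contents and image names are hypothetical.

```python
from itertools import combinations


def hamming_ball(code, n_bits, radius):
    """Yield every n_bits-wide address within `radius` bit flips of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            address = code
            for p in positions:
                address ^= 1 << p        # flip bit p
            yield address


def lookup(table, code, n_bits, radius):
    """Collect the items stored at every address inside the Hamming ball."""
    found = []
    for address in hamming_ball(code, n_bits, radius):
        found.extend(table.get(address, []))
    return found


# Hypothetical hash table: binary code -> images stored at that address.
table = {
    0b10001010: ["img_a"],
    0b10001110: ["img_b"],
    0b11101110: ["img_c"],
}
print(lookup(table, 0b10001010, n_bits=8, radius=1))
```

The number of addresses probed is the sum of C(n_bits, r) for r up to the radius, so it grows with the radius and code length but does not depend on how many images are stored.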


Compact Binary Codes

Google has a few billion images (10^9)

Big PC has ~10 Gbytes (10^11 bits)

Codes must fit in memory (disk too slow)

Budget of 10^2 bits/image

1 Megapixel image is 10^7 bits

32x32 color image is 10^4 bits

Semantic hash function must also reduce dimensionality

Slide: Rob Fergus
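The budget on this slide follows from a short calculation. The sketch below assumes 8 bits per byte and 3 color channels of 8 bits per pixel; the slide itself only gives the rounded powers of ten.

```python
# Figures quoted on the slide.
n_images = 10**9                 # "a few billion" images, taken as 10^9
memory_bytes = 10 * 10**9        # ~10 Gbytes of RAM

memory_bits = memory_bytes * 8
budget_bits_per_image = memory_bits / n_images

# Raw image sizes, assuming 3 channels x 8 bits per pixel (an assumption).
megapixel_image_bits = 10**6 * 3 * 8
tiny_image_bits = 32 * 32 * 3 * 8

print(budget_bits_per_image)     # on the order of 10^2 bits per image
print(megapixel_image_bits)      # on the order of 10^7 bits
print(tiny_image_bits)           # on the order of 10^4 bits
```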


How many bits do we need?

Goal: to decide when two images are similar (two images are similar if they contain the same large object classes in similar spatial configurations)

First, let's use some hand-wavy arguments to gain some intuition about how many bits we need. If we have 1000 object categories, then we would need 1000 bits to tell if an object is present or not (assuming independent objects). However, if we assume that objects are sparse, and that, on average, there are only 5 important objects present in an image, then we would need only 50 bits to describe the image content (about 10 bits to name each of the 5 objects). Adding spatial location can add another 8*5 = 40 bits.
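The hand-wavy count above can be written out explicitly. The sketch assumes each object is named by an index into the 1000 categories and that each location costs 8 bits, which is how the 8*5 term is read here.

```python
import math

n_categories = 1000
objects_per_image = 5
location_bits_per_object = 8     # assumption: reading of the "8*5" term

# Dense encoding: one presence bit per category.
dense_bits = n_categories

# Sparse encoding: name each present object with a category index.
bits_per_object = math.ceil(math.log2(n_categories))
content_bits = objects_per_image * bits_per_object
location_bits = objects_per_image * location_bits_per_object

print(dense_bits, bits_per_object, content_bits, location_bits)
print(content_bits + location_bits)
```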

[Figure: short image codes. Labels: 3 megapixels; 8 million bits; 1000 pixels; 512 bits]


How many bits do we need?

[Figure: panels labeled by code length: 16, 32, 64, 128, 256, 512, 1024, 2048 and 24576 bits]


Measuring image similarity: annotated data

[Figure: repeated histograms of the number of pixels per object class: Building, Sky, Road, Person, Sidewalk, Car, Sculpture, Bicycle]

Spatial pyramid matching [Lazebn...

S(h1, h2) = sum(min(h1, h2))
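The similarity S(h1, h2) = sum(min(h1, h2)) is the histogram intersection. A minimal sketch follows; the example histograms (pixel counts per object class) are invented.

```python
def histogram_intersection(h1, h2):
    """Sum of bin-wise minima of two histograms with the same bins."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical pixel counts for (building, sky, road) in two images.
h1 = [5, 0, 3]
h2 = [2, 4, 3]
print(histogram_intersection(h1, h2))
```

The score counts only the mass the two histograms have in common, so a class present in one image and absent in the other contributes nothing.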


Hashing


We consider the following learning problem: given a database of images {x_i} and a distance function D(i, j), we seek a binary feature vector y_i = f(x_i) that preserves the nearest neighbor relationships using a Hamming distance.

Salakhutdinov and Hinton [SIGIR 2007], Shakhnarovich et al. [ICCV 2003], Athitsos et al. [ICDE 2008], Grauman et al. [CVPR 2007], Nascimentio et al. [ACM Symp. App. Computing 2002], Wang [ICME 2006], Wang [PAMI 2008]


Locality Sensitive Hashing

Gionis, A. & Indyk, P. & Motwani, R. (1999)

Take random projections of data

Quantize each projection with few bits

For our N bit code:

 - Compute first N PCA components of data

 - Each random projection must be linear combination of the N PCA components
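A rough sketch of the first two steps (random projections, then quantization), using one bit per projection: the sign of the projected value. It deliberately leaves out the PCA constraint listed above, and the function names, seed and Gaussian weights are assumptions made for the illustration.

```python
import random


def make_projections(n_bits, dim, seed=0):
    """Draw n_bits random projection directions with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(x, projections):
    """Quantize each projection of x to one bit: 1 if positive, else 0."""
    return [
        1 if sum(w * v for w, v in zip(row, x)) > 0 else 0
        for row in projections
    ]


projections = make_projections(n_bits=8, dim=3)
x = [1.0, 2.0, -0.5]
code = lsh_code(x, projections)
print(code)
```

With sign quantization the code depends only on the direction of the vector, so rescaling `x` leaves it unchanged while negating `x` flips every bit.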


Pose estimation

Fast Pose Estimation with Parameter Sensitive Hashing.

Shakhnarovich, Viola, Darrell. ICCV 2003

Learning hamming distances with boosting


Shakhnarovich and Darrell

Each image is represented by a binary vector with M bits:

$y = [h_1(x), h_2(x), \ldots, h_M(x)]$

x = vector of image features

h_n = function with binary output

y = binary vector

Distance between two images is given by a weighted Hamming distance:

$D(i, j) = \sum_{n=1}^{M} \alpha_n \, |h_n(x_i) - h_n(x_j)|$

The weights $\alpha_n$ and the functions $h_n(x_i)$ that map the input vector $x_i$ into binary features are learned.
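The weighted Hamming distance can be computed directly once the binary codes and the weights are available. In the sketch below the codes and the weights (called `alpha`) are made-up values; in the method described above they would come out of the learning step.

```python
def weighted_hamming(y_i, y_j, alpha):
    """Sum the weights of the bit positions where the two codes differ."""
    if not (len(y_i) == len(y_j) == len(alpha)):
        raise ValueError("codes and weights must have the same length")
    return sum(a * abs(b_i - b_j) for a, b_i, b_j in zip(alpha, y_i, y_j))


# Hypothetical 4-bit codes and per-bit weights.
y_i = [1, 0, 1, 1]
y_j = [1, 1, 0, 1]
alpha = [0.5, 2.0, 1.0, 0.25]
print(weighted_hamming(y_i, y_j, alpha))
```

Setting every weight to 1 recovers the ordinary Hamming distance.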


Learning

Positive examples are pairs of images $x_i$, $x_j$ such that $x_j$ is one of the N nearest neighbors of $x_i$. Negative examples are pairs of images that are not neighbors.

In BoostSSC, each regression stump has the form:

At each iteration n we select the parameters of $f_n$, the regression coefficients ($\alpha_n$, $\beta_n$) and the stump parameters (where $e_n$ is a unit vector, so that $e_n^T x$ returns the k-th component of x, and $\theta_n$ is a threshold), to minimize the square loss:

where K is the number of training pairs, $z_k$ is the neighborhood label ($z_k = 1$ if the two images are neighbors and $z_k = -1$ otherwise), and $w_k^n$ is the weight for each training pair at iteration n, given by

Shakhnarovich and Darrell

Compressing the gist descriptor 


[Figure: original image, its GIST descriptor [Oliva and Torralba '01], and the resulting binary code 1 0 1 1 1 0 0 1 ...]

[Figure: input image and its neighbors under ground truth, Gist, and Gist (32 bits)]


How many bits do we need?


[Figure labels: road, mountain, tree, car, sky]

LabelMe: 22,000 images, 32 bits

Tiny Images: 10^7 images, 256 bits

Scene matching with camera transformations

Image representation


Color layout

GIST [Oliva and Torralba '01]

Original image

Scene matching with camera view transformations: Translation

1. Move camera

2. View from the virtual camera

3. Find a match to fill the missing pixels

4. Locally align images

5. Find a seam

6. Blend in the gradient domain

Scene matching with camera view transformations: Camera rotation

1. Rotate camera

2. View from the virtual camera

3. Find a match to fill in the missing pixels

4. Stitched rotation

5. Display on a cylinder

Scene matching with camera view transformations: Forward motion

1. Move camera

2. View from the virtual camera

3. Find a match to replace pixels

Tour from a single image

Navigate the virtual space using intuitive motion controls

Basic camera motions


Exploring famous sites


Direction of forward motion


Input image Query region Best match forward

Forward motion (towards image centre)

Forward motion on the ground plane

[Torralba and Sinha '01, Lalonde '07, Hoiem '07]

If images are from the same place...

Google Street View (controlled image capture)

PhotoTourism/PhotoSynth [Snavely et al., 2006] (register images based on multi-view geometry)
