
MIT6870_ORSU_lecture7: Power of 10


Page 1: MIT6870_ORSU_lecture7: Power of 10

8/3/2019 MIT6870_ORSU_lecture7: Power of 10

http://slidepdf.com/reader/full/mit6870orsulecture7-power-of-10 1/106

Lecture 7: Powers of 10

6.870 Object Recognition and Scene Understanding
http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm

Page 2: MIT6870_ORSU_lecture7: Power of 10


Wednesday

Presenter: Vladimir Bychkovsky

Evaluator: Krista Ehinger 

Page 3: MIT6870_ORSU_lecture7: Power of 10


The internet power 

Page 4: MIT6870_ORSU_lecture7: Power of 10


Page 5: MIT6870_ORSU_lecture7: Power of 10


http://en.wikipedia.org/wiki/One_red_paperclip

Page 6: MIT6870_ORSU_lecture7: Power of 10


Page 7: MIT6870_ORSU_lecture7: Power of 10


Human Vision

Some key properties:

Many input modalities

Active

Supervised, unsupervised, semi-supervised learning. It can look for supervision.

Performance: amazing

How it works: no idea

Page 8: MIT6870_ORSU_lecture7: Power of 10


Robot Vision

Performance: It does not work

How it works: SIFT+SVM+HMM

Some key properties:

Many poor input modalities

Active, but it does not go far 

Page 9: MIT6870_ORSU_lecture7: Power of 10


Internet Vision

Performance: The more data, the better 

How it works: SIFT+SVM+LSH

Some key properties:

Many input modalities

It can reach everywhere

Tons of data

Image credit: Matt Britt

Page 10: MIT6870_ORSU_lecture7: Power of 10


Past and future of image datasets in computer vision

[Figure: number of pictures (log scale, ticks at 10^0, 10^5, 10^10, 10^15, 10^20) versus time. Marked points: Lena, a dataset in one picture (1972); COREL, 40,000 (1996); 2 billion (2007); 2020?. Reference level: Human Click Limit (all humanity taking one picture/second during 100 years).]

Page 11: MIT6870_ORSU_lecture7: Power of 10


The extremes of learning

Number of training samples: 1, 10, 10^2, 10^3, 10^4, 10^5, 10^6

Extrapolation problem: generalization, transfer learning

Interpolation problem: correspondence, finding the differences

Traditional datasets

Lecture 2, Lecture 3, Lecture 7

Page 12: MIT6870_ORSU_lecture7: Power of 10


Scenes are unique

Page 13: MIT6870_ORSU_lecture7: Power of 10


But not all scenes are so original

Page 14: MIT6870_ORSU_lecture7: Power of 10


But not all scenes are so original

Page 15: MIT6870_ORSU_lecture7: Power of 10


Page 16: MIT6870_ORSU_lecture7: Power of 10


Lots

Of 

Images

 A. Torralba, R. Fergus, W.T.Freeman. PAMI 2008

Page 17: MIT6870_ORSU_lecture7: Power of 10


Lots

Of 

Images

 A. Torralba, R. Fergus, W.T.Freeman. PAMI 2008

Page 18: MIT6870_ORSU_lecture7: Power of 10


Lots

Of 

Images

Page 19: MIT6870_ORSU_lecture7: Power of 10


 Automatic Colorization Result

Grayscale input High resolution

Colorization of input using average

 A. Torralba, R. Fergus, W.T.Freeman. 2008

Page 20: MIT6870_ORSU_lecture7: Power of 10


 Automatic Orientation

Many images have ambiguous orientation

Look at top 25% by confidence:

Examples of high and low confidence images:

Page 21: MIT6870_ORSU_lecture7: Power of 10


 Automatic Orientation Examples

 A. Torralba, R. Fergus, W.T.Freeman. 2008

Page 22: MIT6870_ORSU_lecture7: Power of 10


How many images are there?

Page 23: MIT6870_ORSU_lecture7: Power of 10


Powers of 10

Number of images on my hard drive: 10^4

Number of images seen during my first 10 years: 10^8

(3 images/second * 60 * 60 * 16 * 365 * 10 = 630720000)

Number of images seen by all humanity: 10^20

(106,456,367,669 humans [1] * 60 years * 3 images/second * 60 * 60 * 16 * 365)

[1] from http://www.prb.org/Articles/2002/HowManyPeopleHaveEverLivedonEarth.aspx

Number of all images in the universe: 10^243

(10^81 atoms * 10^81 * 10^81)

Number of all 32x32 images: 10^7373

(256^(32*32*3) ~ 10^7373)
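The arithmetic behind these estimates can be re-run directly. The sketch below is only a check of the slide's numbers; the variable names are illustrative, and the 32x32 case assumes 3 color channels with 256 levels each, as the expression 256^(32*32*3) suggests.

```python
import math

IMAGES_PER_SECOND = 3
SECONDS_AWAKE_PER_YEAR = 60 * 60 * 16 * 365  # 16 waking hours per day

# Images seen during the first 10 years of life.
seen_in_10_years = IMAGES_PER_SECOND * SECONDS_AWAKE_PER_YEAR * 10

# Images seen by all humanity: every human ever born, 60 years each.
HUMANS_EVER = 106_456_367_669
seen_by_humanity = HUMANS_EVER * 60 * IMAGES_PER_SECOND * SECONDS_AWAKE_PER_YEAR

# "All images in the universe": 10^81 * 10^81 * 10^81, tracked as an exponent.
universe_exponent = 81 * 3

# All 32x32 color images: 256^(32*32*3), expressed as a power of 10.
tiny_exponent = 32 * 32 * 3 * math.log10(256)

print(seen_in_10_years)
print(f"{seen_by_humanity:.2e}")
print(universe_exponent)
print(round(tiny_exponent))
```

Computed this way, the last exponent comes out near 7398, the same order as the slide's rounded ~10^7373.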

Page 24: MIT6870_ORSU_lecture7: Power of 10


How many images are there?

Chandler, and Field. (2007).

Page 25: MIT6870_ORSU_lecture7: Power of 10


How many images are there?

Torralba, Fergus, Freeman. PAMI 2008

Page 26: MIT6870_ORSU_lecture7: Power of 10


10% of the objects account for 90% of the data

~Zipf's law

Caltech 101

Tiny images

LabelMe

We need transfer learning

Page 27: MIT6870_ORSU_lecture7: Power of 10


10% of the objects account for 90% of the data

~Zipf's law

Caltech 101

Tiny images

LabelMe

Page 28: MIT6870_ORSU_lecture7: Power of 10


Is this something humans do at all?

Page 29: MIT6870_ORSU_lecture7: Power of 10


What's the Capacity of Visual Long Term Memory?

What we know...

Standing (1973): 10,000 images, 83% recognition

... people can remember thousands of images

According to Standing: "Basically, my recollection is that we just separated the pictures into distinct thematic categories: e.g. cars, animals, single-person, 2-people, plants, etc.) Only a few slides were selected which fell into each category, and they were visually distinct."

What we don't know...

... what people are remembering for each item?

Sparse Details ("Gist" Only): Dogs Playing Cards

Highly Detailed: Dogs Playing Cards

High Fidelity Visual Memory is possible (Hollingworth 2004)

Slide by Aude Oliva

Slide by Aude Oliva

Page 30: MIT6870_ORSU_lecture7: Power of 10


Massive Memory I: Methods


Showed 14 observers 2500 categorically unique objects

1 at a time, 3 seconds each

800 ms blank between items

Study session lasted about 5.5 hours

Repeat Detection task to maintain focus

1-back

Followed by 300 2-alternative forced choice tests

1024-back

Slide by Aude Oliva

Page 31: MIT6870_ORSU_lecture7: Power of 10


Slide by Aude Oliva

Page 32: MIT6870_ORSU_lecture7: Power of 10


How far can we push the fidelity of visual LTM representation?

Same object, different states

Slide by Aude Oliva

Page 33: MIT6870_ORSU_lecture7: Power of 10


Visual Cognition

Expert Predictions

92%

Massive Memory I: Recognition Memory Results

Replication of Standing (1973)

Slide by Aude Oliva

Page 34: MIT6870_ORSU_lecture7: Power of 10


92% 88% 87%

Massive Memory I: Recognition Memory Results

Slide by Aude Oliva

Page 35: MIT6870_ORSU_lecture7: Power of 10


Extrapolation of Repeat Detection Data

Human performances for n = 1024

Power law (r^2 = .988)

Quadratic (r^2 = .988)

Brady, Konkle, Alvarez, Oliva (submitted) Slide by Aude Oliva

Page 36: MIT6870_ORSU_lecture7: Power of 10


Building datasets

Page 37: MIT6870_ORSU_lecture7: Power of 10


Collecting datasets (towards 10^6-10^7 examples)

ESP game (CMU): Luis Von Ahn and Laura Dabbish, 2004

LabelMe (MIT): Russell, Torralba, Freeman, 2005

StreetScenes (CBCL-MIT): Bileschi, Poggio, 2006

WhatWhere (Caltech): Perona et al, 2007

PASCAL challenge: 2006, 2007

Lotus Hill Institute: Song-Chun Zhu et al, 2007

80 million images: Torralba, Fergus, Freeman, 2007

Page 38: MIT6870_ORSU_lecture7: Power of 10


Names and faces

Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee Whye Teh, Erik Learned-Miller, David A. Forsyth. CVPR, 2004

Page 39: MIT6870_ORSU_lecture7: Power of 10


Names and faces

30,281 face images, obtained by applying a face finder to approximately half a million captioned news images

Clustering stage

Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee Whye Teh, Erik Learned-Miller, David A. Forsyth. CVPR, 2004

The general approach involves:

using unambiguously labeled data items to estimate discriminant coordinates

using a version of K-means to allocate ambiguously labeled faces to one of their labels

cleaning up the clusters by removing data items far from the mean, and re-estimating discriminant coordinates

merging clusters based on facial similarities

Page 40: MIT6870_ORSU_lecture7: Power of 10


Semiautomatic labeling

[Figure: LabelMe statistics. Number of labeled instances per description, on log-log axes. >1500 descriptions with less than 100 samples.]

Semi-automatic labeling: Abramson & Freund, CVPR 2005

Labeling Google images: Fergus et al, ECCV 2004, ICCV 2005

Challenge: we need accurate labeling (similar to users)

Segments instead of bounding boxes

Overlap with ground truth > 90%

Page 41: MIT6870_ORSU_lecture7: Power of 10


Semiautomatic labeling

Query LabelMe: sailboats from LabelMe

Train a simple detector (SVM)

Query online search tools: sailboats from Google, Altavista, Flickr

Label Google images

Page 42: MIT6870_ORSU_lecture7: Power of 10


Semiautomatic labeling

[Figure: precision (object presence) versus image rank (100, 500, 1000), for Google ranking and detector ranking.]

Page 43: MIT6870_ORSU_lecture7: Power of 10


Examples semi-automatic labeling

Page 44: MIT6870_ORSU_lecture7: Power of 10


Optimol

Li-Jia Li, Gang Wang and Li Fei-Fei. OPTIMOL: automatic Object Picture collecTion via Incremental MOdel Learning. IEEE Computer Vision and Pattern Recognition (CVPR), Minneapolis, 2007

Once a model is learned, it can be used to do classification on the images from the web resource. If the image is classified as in this object category, it gets accepted and incorporated into the collected dataset. Otherwise, it will be discarded. The model will again be updated by the newly accepted images in current round. In this incremental way, the category model gets more and more robust. As a consequence, the collected dataset gets larger and larger with reliable images.
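The accept, discard and update cycle described above can be illustrated with a deliberately tiny stand-in. This is not the OPTIMOL model: here the "category model" is assumed to be a running mean over a single hypothetical scalar feature, the acceptance rule is a fixed tolerance, and the model is updated after every accepted image rather than once per round.

```python
def incremental_collect(seed_features, web_features, tolerance):
    """Toy accept/update loop; the 'model' is a running mean of 1-D features."""
    accepted = list(seed_features)
    model_mean = sum(accepted) / len(accepted)
    for feature in web_features:
        if abs(feature - model_mean) <= tolerance:
            # Classified as in-category: incorporate into the dataset
            # and update the model with the newly accepted item.
            accepted.append(feature)
            model_mean = sum(accepted) / len(accepted)
        # Otherwise the candidate is discarded.
    return accepted, model_mean


dataset, mean = incremental_collect([1.0, 1.2], [1.1, 5.0, 1.4, 0.9], tolerance=0.5)
print(dataset, mean)
```

The outlier candidate (5.0) is rejected while the dataset and the model grow together, which is the behavior the paragraph describes.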

Page 45: MIT6870_ORSU_lecture7: Power of 10


im2gps

Instead of using object labels, the web provides other kinds of metadata associated with large collections of images

Hays & Efros. CVPR 2008

20 million geotagged and geographic text-labeled images

Page 46: MIT6870_ORSU_lecture7: Power of 10


Hays & Efros. CVPR 2008

im2gps

Page 47: MIT6870_ORSU_lecture7: Power of 10


Video Google

Sivic, J. and Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. Proceedings of the International Conference on Computer Vision (2003)

Page 48: MIT6870_ORSU_lecture7: Power of 10


Visually defined search

Given an object specified by its image, retrieve all shots containing the object:

must handle viewpoint change etc

must be efficient at run time

people, objects, places

Slide by Josef Sivic

Page 49: MIT6870_ORSU_lecture7: Power of 10


Object search in video: why is it hard?

an object's imaged appearance varies ...

scale changes

lighting changes

viewpoint changes

partial occlusion

sheer amount of data

feature length movie ~ 100,000-150,000 frames

Slide by Josef Sivic

Page 50: MIT6870_ORSU_lecture7: Power of 10

Image visual nouns

Visual description - visual words

Slide by Josef Sivic

Page 51: MIT6870_ORSU_lecture7: Power of 10


Visual vocabulary unaffected by scale and viewpoint

The same visual word Slide by Josef Sivic

Page 52: MIT6870_ORSU_lecture7: Power of 10

Image representation using visual words

Use efficient Google-like search on visual words

Slide by Josef Sivic

Page 53: MIT6870_ORSU_lecture7: Power of 10

Efficient search: In a classical file structure all words are stored in the document they appear in. An inverted file structure has an entry (hit list) for each word where all occurrences of the word in all documents are stored. In our case the inverted file has an entry for each visual word, which stores all the matches, i.e. occurrences of the same word in all frames. The document vector is very sparse and use of an inverted file makes the retrieval very fast. Querying a database of 4k frames takes about 0.1 second with a Matlab implementation on a 2 GHz Pentium.
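As a rough illustration of the inverted file idea, the sketch below maps each visual word to the frames that contain it and ranks frames by the number of query words they share. The frame ids and word ids are made up, and plain vote counting stands in for whatever weighting the original system applies.

```python
from collections import Counter, defaultdict


def build_inverted_file(frames):
    """Map each visual word id to the set of frame ids that contain it."""
    inverted = defaultdict(set)
    for frame_id, words in frames.items():
        for word in words:
            inverted[word].add(frame_id)
    return inverted


def query(inverted, query_words):
    """Rank frames by how many distinct query words they contain."""
    votes = Counter()
    for word in set(query_words):
        for frame_id in inverted.get(word, ()):
            votes[frame_id] += 1
    return votes.most_common()


# Hypothetical frames, each described by a few visual word ids.
frames = {"f1": [3, 17, 42], "f2": [17, 99], "f3": [5, 8]}
index = build_inverted_file(frames)
ranking = query(index, [17, 42, 7])
print(ranking)
```

Only the hit lists of the query words are visited, so frames that share no word with the query (f3 here) are never touched.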

Video Google

Query:

Retrieved

frames:

Page 54: MIT6870_ORSU_lecture7: Power of 10

Example: Groundhog Day

retrieved shots

Video Google, Sivic & Zisserman, ICCV 2003

Slide by Josef Sivic

Page 55: MIT6870_ORSU_lecture7: Power of 10

Example: Casablanca

retrieved shots

Slide by Josef Sivic

Page 56: MIT6870_ORSU_lecture7: Power of 10


Scalable recognition with a vocabulary tree

David Nister and Henrik Stewenius. CVPR 2006.

Page 57: MIT6870_ORSU_lecture7: Power of 10

With a good image similarity and a lot of data...

Input image, nearest neighbors

22,000 LabelMe scenes

Hays, Efros, Siggraph 2006

Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007

Page 58: MIT6870_ORSU_lecture7: Power of 10

With a good image similarity and a lot of data...

Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007

Page 59: MIT6870_ORSU_lecture7: Power of 10

With a good image similarity and a lot of data...

Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007

Page 60: MIT6870_ORSU_lecture7: Power of 10

With a good image similarity and a lot of data...

Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007

Page 61: MIT6870_ORSU_lecture7: Power of 10

With a good image similarity and a lot of data...

Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007

Page 62: MIT6870_ORSU_lecture7: Power of 10

Outputs

Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007

Page 63: MIT6870_ORSU_lecture7: Power of 10


road

window

keyboard

screen

screen

Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007

Page 64: MIT6870_ORSU_lecture7: Power of 10

Next Wednesday

N. Snavely, S. M. Seitz, R. Szeliski, Photo tourism: Exploring photo collections in 3D, Siggraph 2006 (website) (code)

J. Hays, A. A. Efros. Scene Completion Using Millions of Photographs. SIGGRAPH 2007, (website and code)

A. Torralba, R. Fergus, W. T. Freeman, 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI 2008. (website)

Page 65: MIT6870_ORSU_lecture7: Power of 10

SIFT flow

Dense sampling in time : classical optical flow :: dense sampling in world images : SIFT flow

Ce Liu, Jenny Yuen, Torralba, Sivic, Freeman

Page 66: MIT6870_ORSU_lecture7: Power of 10

Matching frames / views

The two images are taken from the same scene with different time and/or perspective

Liu, Yuen, Torralba, Sivic, Freeman. ECCV 08

Page 67: MIT6870_ORSU_lecture7: Power of 10

Matching scenes

Two images taken from the same scene category, but different instances

Contain different objects with different scales, perspectives and spatial location

Liu, Yuen, Torralba, Sivic, Freeman. ECCV 08

Page 68: MIT6870_ORSU_lecture7: Power of 10

Image representation

128 dimensions/pixel

SIFT Visualization: map 128 dimensions in 3D color space

Page 69: MIT6870_ORSU_lecture7: Power of 10


Page 70: MIT6870_ORSU_lecture7: Power of 10


Same scene instance matching

(a) Query image (b) Dense SIFT (c) Best match (d) SIFT of (c) (e) Warped (c) (f) Warped (d)

Page 71: MIT6870_ORSU_lecture7: Power of 10

Matching different scenes

Page 72: MIT6870_ORSU_lecture7: Power of 10

Matching: objects

Page 73: MIT6870_ORSU_lecture7: Power of 10

Scene matching

Page 74: MIT6870_ORSU_lecture7: Power of 10

Scene matching

Page 75: MIT6870_ORSU_lecture7: Power of 10

Failures

The nearest neighbors may not contain similar scenes or object categories (SIFT flow tries to match image structures anyway)

Page 76: MIT6870_ORSU_lecture7: Power of 10

Dealing with millions of images

Input image

Page 77: MIT6870_ORSU_lecture7: Power of 10

Binary codes for global scene representation

Short codes allow for storing millions of images

Efficient search: Hamming distance (search millions of images in a few microseconds)

Internet scale experiments: compute nearest neighbors between all images on the internet

512 bits

Page 78: MIT6870_ORSU_lecture7: Power of 10

Binary codes for images

Want images with similar content to have similar binary codes

Use Hamming distance between codes

 - Number of bit flips

 - E.g.:

Ham_Dist(10001010,10001110)=1

Ham_Dist(10001010,11101110)=3

Semantic Hashing [Salakhutdinov & Hinton, 2007]

 - Text documents
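The two Ham_Dist examples can be reproduced with an XOR followed by a bit count. The function name `ham_dist` and the choice to pass codes as bit strings are illustrative only.

```python
def ham_dist(code_a: str, code_b: str) -> int:
    """Number of bit positions at which two equal-length binary strings differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    # XOR leaves a 1 exactly where the two codes disagree.
    return bin(int(code_a, 2) ^ int(code_b, 2)).count("1")


print(ham_dist("10001010", "10001110"))
print(ham_dist("10001010", "11101110"))
```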

Slide Rob Fergus

Page 79: MIT6870_ORSU_lecture7: Power of 10

Binary codes for images

Permits fast lookup via hashing

 - Each code is a memory address

 - Find neighbors by exploring Hamming ball around query address

 - Lookup time depends on radius of ball, NOT on # data points
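A minimal sketch of Hamming-ball lookup, assuming codes are small integers used as dictionary keys (a stand-in for memory addresses) and that the table contents are hypothetical. The number of addresses probed depends only on the code length and the radius, not on how many items are stored.

```python
from itertools import combinations


def hamming_ball(code: int, n_bits: int, radius: int):
    """Yield every n_bits-wide address within `radius` bit flips of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            neighbor = code
            for p in positions:
                neighbor ^= 1 << p  # flip bit p
            yield neighbor


def lookup(table: dict, code: int, n_bits: int, radius: int):
    """Collect the items stored at any address inside the Hamming ball."""
    found = []
    for address in hamming_ball(code, n_bits, radius):
        found.extend(table.get(address, []))
    return found


# Hypothetical 4-bit address space with three stored images.
table = {0b1010: ["img_a"], 0b1011: ["img_b"], 0b0101: ["img_c"]}
neighbors = lookup(table, 0b1010, n_bits=4, radius=1)
print(neighbors)
```

With 4 bits and radius 1 the ball contains 5 addresses; the count grows quickly with the radius, which is why short codes and small radii matter.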

[Figure adapted from Salakhutdinov & Hinton '07: a semantic hash function maps the query image to a query address in the address space; semantically similar images are stored at nearby addresses.]

Slide Rob Fergus

Page 80: MIT6870_ORSU_lecture7: Power of 10

Compact Binary Codes

Google has few billion images (10^9)

Big PC has ~10 Gbytes (10^11 bits)

Codes must fit in memory (disk too slow)

Budget of 10^2 bits/image

1 Megapixel image is 10^7 bits

32x32 color image is 10^4 bits

Semantic hash function must also reduce dimensionality
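The order-of-magnitude budget can be checked with a few lines. The sketch assumes decimal gigabytes, 3 color channels and 8 bits per channel, none of which are stated on the slide.

```python
import math

n_images = 10**9                # "few billion", rounded to 10^9 as on the slide
memory_bits = 10 * 10**9 * 8    # ~10 Gbytes expressed in bits
budget_bits_per_image = memory_bits / n_images

megapixel_bits = 10**6 * 3 * 8  # 1 megapixel, 3 channels, 8 bits each
tiny_bits = 32 * 32 * 3 * 8     # 32x32 color image

print(budget_bits_per_image)
print(round(math.log10(megapixel_bits)), round(math.log10(tiny_bits)))
```

Under these assumptions the budget is 80 bits per image, i.e. on the order of 10^2, while even a 32x32 color image needs 24576 bits, so the code has to compress by more than two orders of magnitude.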

Slide Rob Fergus

Page 81: MIT6870_ORSU_lecture7: Power of 10

How many bits do we need?

Goal: to decide when two images are similar (two images are similar if they contain the same large object classes in similar spatial configurations)

First, let's use some hand-wavy arguments to gain some intuition about how many bits we need. If we have 1000 object categories, then we would need 1000 bits to tell if an object is present or not (assuming independent objects). However, if we assume that objects are sparse, and that, on average, there are only 5 important objects present in an image, then we would need only 50 bits to describe the image content. Adding spatial location can add another 8*5 = 40 bits.
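One way to arrive at the 50-bit figure is to assume that each of the 5 objects is named by its index among the 1000 categories, which takes about 10 bits. That reading is an assumption, not something the slide states; the 8 location bits per object are taken from the slide's 8*5.

```python
import math

n_categories = 1000
objects_per_image = 5
location_bits_per_object = 8  # from the slide's 8*5 = 40

dense_bits = n_categories                        # one presence bit per category
index_bits = math.ceil(math.log2(n_categories))  # bits to name one category
sparse_bits = objects_per_image * index_bits
location_bits = objects_per_image * location_bits_per_object

print(dense_bits, sparse_bits, location_bits, sparse_bits + location_bits)
```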

Page 82: MIT6870_ORSU_lecture7: Power of 10

[Figure: short image codes. Labels: 3 megapixels, 8 million bits, 1000 pixels, 512 bits.]

Page 83: MIT6870_ORSU_lecture7: Power of 10

How many bits do we need?

[Figure panels: 16 bits, 32 bits, 64 bits, 128 bits, 256 bits, 512 bits, 1024 bits, 2048 bits, 24576 bits]

Page 84: MIT6870_ORSU_lecture7: Power of 10


Measuring image similarity with annotated data

[Figure: per-image histograms of the number of pixels per object class (Building, Sky, Road, Person, Sidewalk, Car, Sculpture, Bicycle).]

Spatial pyramid matching [Lazebn...

S(h1, h2) = sum(min(h1, h2))
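The similarity S(h1, h2) = sum(min(h1, h2)) is histogram intersection. The sketch below shows only that single-level kernel on two made-up pixel-count histograms; a full spatial pyramid match would combine such intersections over several grid resolutions, which is not shown here.

```python
def histogram_intersection(h1, h2):
    """S(h1, h2) = sum over bins of min(h1[b], h2[b])."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical pixel counts for four object classes in two images.
h1 = [120, 300, 80, 0]
h2 = [100, 250, 0, 40]
score = histogram_intersection(h1, h2)
print(score)
```

A class contributes only as many pixels as both images have in common, so classes present in just one image add nothing to the score.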

Page 85: MIT6870_ORSU_lecture7: Power of 10


Hashing

Page 86: MIT6870_ORSU_lecture7: Power of 10


Hashing

We consider the following learning problem - given a database of images {x_i} and a distance function D(i, j) we seek a binary feature vector y_i = f(x_i) that preserves the nearest neighbor relationships using a Hamming distance.

Salakhutdinov and Hinton [SIGIR 2007], Shakhnarovich et al [ICCV 2003], Athitsos et al. [ICDE 2008], Grauman et al [CVPR 2007], Nascimentio et al [ACM Symp. App. Computing 2002], Wang [ICME 2006], Wang [PAMI 2008]

Page 87: MIT6870_ORSU_lecture7: Power of 10

Locality Sensitive Hashing

Gionis, A. & Indyk, P. & Motwani, R. (1999)

Take random projections of data

Quantize each projection with few bits

For our N bit code:

 - Compute first N PCA components of data

 - Each random projection must be linear combination of the N PCA components
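A simplified sketch of the random-projection step: each bit is the sign of one Gaussian random projection. It leaves out the PCA restriction described above and quantizes each projection to a single bit; the vectors, dimensions and seed are arbitrary choices for illustration.

```python
import random


def make_projections(n_bits, dim, seed=0):
    """Draw one Gaussian random direction per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(vector, projections):
    """One bit per projection: 1 if the dot product is positive, else 0."""
    bits = []
    for direction in projections:
        dot = sum(d * v for d, v in zip(direction, vector))
        bits.append(1 if dot > 0 else 0)
    return bits


projections = make_projections(n_bits=16, dim=4)
code_a = lsh_code([1.0, 0.5, -0.2, 0.3], projections)
code_b = lsh_code([2.0, 1.0, -0.4, 0.6], projections)    # same direction as a
code_c = lsh_code([-1.0, -0.5, 0.2, -0.3], projections)  # opposite direction
print(code_a, code_b, code_c)
```

Vectors pointing the same way receive identical codes, and opposite vectors differ in every bit, which is the sense in which nearby inputs tend to get nearby codes.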

Page 88: MIT6870_ORSU_lecture7: Power of 10

Pose estimation

Fast Pose Estimation with Parameter Sensitive Hashing.

Shakhnarovich, Viola, Darrell. ICCV 2003

Page 89: MIT6870_ORSU_lecture7: Power of 10

Learning Hamming distances with boosting

Shakhnarovich and Darrell

Each image is represented by a binary vector with M bits:

y = [h_1(x), h_2(x), ..., h_M(x)]

x = vector of image features

h_i = function with binary output

y = binary vector

Distance between two images is given by a weighted Hamming distance:

D(i, j) = \sum_{n=1}^{M} \alpha_n |h_n(x_i) - h_n(x_j)|

The weights \alpha_n and the functions h_n(x_i) that map the input vector x_i into binary features are learned.
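Once the binary features and weights are available, the weighted Hamming distance is a weighted count of disagreeing bits. In the sketch below the codes and weights are made-up values, not learned ones.

```python
def weighted_hamming(code_i, code_j, weights):
    """D(i, j) = sum_n weights[n] * |code_i[n] - code_j[n]| for binary codes."""
    if not (len(code_i) == len(code_j) == len(weights)):
        raise ValueError("codes and weights must have the same length")
    return sum(w * abs(a - b) for w, a, b in zip(weights, code_i, code_j))


# Hypothetical 4-bit codes and per-bit weights.
y_i = [1, 0, 1, 1]
y_j = [1, 1, 0, 1]
alpha = [0.5, 2.0, 1.0, 0.25]
distance = weighted_hamming(y_i, y_j, alpha)
print(distance)
```

With all weights set to 1 this reduces to the ordinary Hamming distance.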

Page 90: MIT6870_ORSU_lecture7: Power of 10

Learning

Positive examples are pairs of images $x_i$, $x_j$ such that $x_j$ is one of the N nearest neighbors of $x_i$. Negative examples are pairs of images that are not neighbors.

In BoostSSC, each regression stump has the form:

At each iteration n we select the parameters of $f_n$, namely the regression coefficients ($\alpha_n$, $\beta_n$) and the stump parameters (where $e_n$ is a unit vector, so that $e_n^T x$ returns the k-th component of x, and $T_n$ is a threshold), to minimize the square loss:

Where K is the number of training pairs, $z_k$ is the neighborhood label ($z_k = 1$ if the two images are neighbors and $z_k = -1$ otherwise), and $w_k^n$ is the weight for each training pair at iteration n given by

hn(xi)
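A sketch of how the stump, the loss, and the pair weights could be written, using $\alpha_n$, $\beta_n$ for the regression coefficients. This assumes a GentleBoost-style pairwise stump in which $h_n(x) = [e_n^T x > T_n]$; it is a plausible form consistent with the symbols named on this slide, not a transcription of the slide's equations.

```latex
% Assumed stump: rewards pairs whose n-th bits agree.
f_n(x_i, x_j) = \alpha_n \left[ (e_n^T x_i > T_n) = (e_n^T x_j > T_n) \right] + \beta_n

% Assumed weighted square loss over the K training pairs.
\sum_{k=1}^{K} w_k^n \left( z_k - f_n(x_i^k, x_j^k) \right)^2

% Assumed boosting weights: pairs poorly fit by earlier stumps weigh more.
w_k^n = \exp\left( -z_k \sum_{t=1}^{n-1} f_t(x_i^k, x_j^k) \right)
```

Here $(x_i^k, x_j^k)$ denotes the k-th training pair, a notation introduced only for this sketch.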

Shakhnarovich and Darrell

Compressing the gist descriptor 

Figure: original image, its GIST descriptor [Oliva and Torralba '01], and a binary code (1 0 1 1 1 0 0 1 ...)

Figure: input image and its neighbors under ground truth, Gist, and Gist (32 bits)

How many bits do we need?

Figure: example image with labeled regions (road, mountain, tree, car, sky)

LabelMe: 22,000 images, 32 bits

Tiny Images: $10^7$ images, 256 bits
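To put these code lengths in perspective, the arithmetic below computes the memory needed to hold one binary code per image. It assumes the Tiny Images count on the slide is $10^7$ and that codes are stored packed, with no indexing overhead; the helper name is hypothetical.

```python
def code_storage_bytes(n_images, bits_per_image):
    """Bytes needed to store one fixed-length packed binary code per image."""
    return n_images * bits_per_image // 8

labelme_bytes = code_storage_bytes(22_000, 32)     # LabelMe with 32-bit codes
tiny_bytes = code_storage_bytes(10**7, 256)        # Tiny Images with 256-bit codes

print(f"LabelMe: {labelme_bytes / 1e3:.0f} kB")
print(f"Tiny Images: {tiny_bytes / 1e6:.0f} MB")
```

Both totals fit comfortably in the main memory of a single machine, which is the practical appeal of compact codes.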

Scene matching with camera transformations

Image representation

Original image

Color layout

GIST [Oliva and Torralba '01]

Scene matching with camera view transformations: Translation

1. Move camera
2. View from the virtual camera
3. Find a match to fill the missing pixels
4. Locally align images
5. Find a seam
6. Blend in the gradient domain
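Steps 1 to 3 can be sketched on toy arrays as follows. This is a simplified illustration under stated assumptions: the camera move is modeled as a pure horizontal pixel shift smaller than the image width, matching uses a sum of squared differences on raw pixels rather than any scene descriptor, and the function names are hypothetical. Alignment, seam finding, and gradient-domain blending (steps 4 to 6) are not covered.

```python
import numpy as np

def translate_view(image, shift_cols):
    """Simulate a sideways camera move by shifting the image horizontally.

    Returns the shifted image and a boolean mask that is True where pixels
    are missing and must be filled from another image.
    Assumes abs(shift_cols) < image width.
    """
    height, width = image.shape[:2]
    shifted = np.zeros_like(image)
    missing = np.ones((height, width), dtype=bool)
    if shift_cols > 0:        # content moves right, gap opens on the left
        shifted[:, shift_cols:] = image[:, :width - shift_cols]
        missing[:, shift_cols:] = False
    elif shift_cols < 0:      # content moves left, gap opens on the right
        shifted[:, :shift_cols] = image[:, -shift_cols:]
        missing[:, :shift_cols] = False
    else:
        shifted[:] = image
        missing[:] = False
    return shifted, missing

def best_match(view, missing, candidates):
    """Index of the candidate with the smallest SSD on the known pixels."""
    known = ~missing
    costs = [np.sum((c[known].astype(float) - view[known].astype(float)) ** 2)
             for c in candidates]
    return int(np.argmin(costs))

def fill_missing(view, missing, match):
    """Copy pixels from the matched image into the missing region."""
    filled = view.copy()
    filled[missing] = match[missing]
    return filled

# Toy example: a 3x4 grayscale image shifted right by one column.
image = np.arange(12).reshape(3, 4)
view, missing = translate_view(image, 1)
good = view.copy()
good[:, 0] = 99                       # agrees with the view on every known pixel
candidates = [np.zeros_like(image), good]
index = best_match(view, missing, candidates)
filled = fill_missing(view, missing, candidates[index])
```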

Scene matching with camera view transformations: Camera rotation

1. Rotate camera
2. View from the virtual camera
3. Find a match to fill in the missing pixels
4. Stitched rotation
5. Display on a cylinder

Scene matching with camera view transformations: Forward motion

1. Move camera
2. View from the virtual camera
3. Find a match to replace pixels
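One simple way to approximate the view from a camera that has moved forward is to enlarge the central part of the image. The sketch below does this with an integer zoom factor and pixel repetition; both choices, and the function name, are assumptions made for illustration. The enlarged view is blurry, which is one reason a matched image would replace its pixels in step 3.

```python
import numpy as np

def forward_view(image, zoom=2):
    """Approximate a forward camera move by enlarging the central crop.

    The central (height // zoom) x (width // zoom) window is upsampled by
    pixel repetition. zoom is assumed to be a positive integer.
    """
    height, width = image.shape[:2]
    crop_h, crop_w = height // zoom, width // zoom
    top, left = (height - crop_h) // 2, (width - crop_w) // 2
    crop = image[top:top + crop_h, left:left + crop_w]
    return np.repeat(np.repeat(crop, zoom, axis=0), zoom, axis=1)

# Toy example: the central 2x2 block of a 4x4 image is enlarged to 4x4.
image = np.arange(16).reshape(4, 4)
view = forward_view(image, zoom=2)
```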

Navigate the virtual space using intuitive motion controls

Tour from a single image

Basic camera motions

Exploring famous sites

Direction of forward motion

Figure panels: input image, query region, best match forward

Forward motion (towards image centre)

Forward motion on the ground plane

[Torralba and Sinha '01, Lalonde '07, Hoiem '07]

If images are from the same place...

Google Street View (controlled image capture)

PhotoTourism/PhotoSynth [Snavely et al., 2006] (register images based on multi-view geometry)

Next Wednesday

N. Snavely, S. M. Seitz, R. Szeliski. Photo tourism: Exploring photo collections in 3D. SIGGRAPH 2006. (website) (code)

J. Hays, A. A. Efros. Scene Completion Using Millions of Photographs. SIGGRAPH 2007. (website and code)

A. Torralba, R. Fergus, W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI 2008. (website)