8/3/2019 MIT6870_ORSU_lecture7: Power of 10
http://slidepdf.com/reader/full/mit6870orsulecture7-power-of-10 1/106
Lecture 7: Powers of 10
6.870 Object Recognition and Scene Understanding
http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm
Wednesday
Presenter: Vladimir Bychkovsky
Evaluator: Krista Ehinger
The power of the internet
http://en.wikipedia.org/wiki/One_red_paperclip
Human Vision
Some key properties:
Many input modalities
Active; supervised, unsupervised, and semi-supervised learning. It can look for supervision.
Performance: amazing
How it works: no idea
Robot Vision
Performance: It does not work
How it works: SIFT+SVM+HMM
Some key properties:
Many poor input modalities
Active, but it does not go far
Internet Vision
Performance: The more data, the better
How it works: SIFT+SVM+LSH
Some key properties:
Many input modalities
It can reach everywhere; tons of data
Image credit: Matt Britt
Past and future of image datasets in computer vision

Timeline (number of pictures, from 10^0 to 10^20):
1972: Lena, a dataset in one picture
1996: COREL, 40,000 images
2007: 2 billion images
2020?: Human Click Limit (all humanity taking one picture per second during 100 years)
The extremes of learning

Number of training samples: 1, 10, 10^2, 10^3, 10^4, 10^5, 10^6
Few samples (Lectures 2 and 3): extrapolation problem, generalization, transfer learning
Many samples (Lecture 7): interpolation problem, correspondence, finding the differences
Traditional datasets sit in between.
Scenes are unique
But not all scenes are so original
But not all scenes are so original
Lots of Images
A. Torralba, R. Fergus, W. T. Freeman. PAMI 2008
Lots of Images
A. Torralba, R. Fergus, W. T. Freeman. PAMI 2008
Lots of Images
Automatic Colorization Result
Grayscale input; colorization of input using average; high resolution
A. Torralba, R. Fergus, W.T.Freeman. 2008
Automatic Orientation
Many images have ambiguous orientation
Look at top 25% by confidence:
Examples of high and low confidence images:
Automatic Orientation Examples
A. Torralba, R. Fergus, W.T.Freeman. 2008
How many images are there?
Powers of 10

Number of images on my hard drive: 10^4
Number of images seen during my first 10 years: ~10^8
(3 images/second * 60 * 60 * 16 * 365 * 10 = 630,720,000)
Number of images seen by all humanity: ~10^20
(106,456,367,669 humans¹ * 60 years * 3 images/second * 60 * 60 * 16 * 365)
¹ from http://www.prb.org/Articles/2002/HowManyPeopleHaveEverLivedonEarth.aspx
Number of all images in the universe: 10^243
((10^81 atoms)^3 = 10^243)
Number of all 32x32 images: ~10^7398
(256^(32*32*3) = 2^24576 ≈ 10^7398)
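These back-of-the-envelope numbers are easy to check (a quick sanity-check sketch):

```python
import math

# Images seen during the first 10 years of life,
# at 3 images/second for 16 waking hours a day:
seen_in_10_years = 3 * 60 * 60 * 16 * 365 * 10
print(seen_in_10_years)  # 630720000, i.e. ~10^8.8

# Images seen by all humanity (~1.06e11 people who have ever lived,
# per the PRB figure above), 60 years each at the same rate:
all_humanity = 106_456_367_669 * 60 * 3 * 60 * 60 * 16 * 365
print(int(math.log10(all_humanity)))  # 20, i.e. ~10^20

# All possible 32x32 RGB images: 256 values per channel, 32*32*3 channels.
# The exact exponent is 32*32*3 * log10(256) ≈ 7398.
digits = 32 * 32 * 3 * math.log10(256)
print(round(digits))
```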
How many images are there?
Chandler and Field (2007).
How many images are there?
Torralba, Fergus, Freeman. PAMI 2008
10% of the objects account for 90% of the data (~Zipf's law)
Datasets: Caltech 101, Tiny images, LabelMe
We need transfer learning
10% of the objects account for 90% of the data (~Zipf's law)
Datasets: Caltech 101, Tiny images, LabelMe
Is this something humans do at all?
What's the Capacity of Visual Long Term Memory?

Standing (1973): 10,000 images, 83% recognition.

"Basically, my recollection is that we just separated the pictures into distinct thematic categories: e.g. cars, animals, single-person, 2-people, plants, etc. Only a few slides were selected which fell into each category, and they were visually distinct." (according to Standing)

What we know: people can remember thousands of images; high fidelity visual memory is possible (Hollingworth 2004).
What we don't know: what are people remembering for each item? Sparse details or highly detailed? "Gist" only? (examples: dogs, playing cards)

Slide by Aude Oliva
Massive Memory I: Methods
Showed 14 observers 2500 categorically unique objects
1 at a time, 3 seconds each
800 ms blank between items
Study session lasted about 5.5 hours
Repeat Detection task to maintain focus
1-back
Followed by 300 2-alternative forced choice tests
1024-back
Slide by Aude Oliva
Slide by Aude Oliva
How far can we push the fidelity of visual LTM representation?
Same object, different states
Slide by Aude Oliva
Massive Memory I: Recognition Memory Results
Replication of Standing (1973): 92% (compared with visual cognition experts' predictions)
Slide by Aude Oliva
Massive Memory I: Recognition Memory Results: 92%, 88%, 87%
Slide by Aude Oliva
Extrapolation of Repeat Detection Data
Human performance for n = 1024
Power law (r² = .988); quadratic (r² = .988)
Brady, Konkle, Alvarez, Oliva (submitted) Slide by Aude Oliva
Building datasets
Collecting datasets (towards 10^6-10^7 examples)
ESP game (CMU): Luis von Ahn and Laura Dabbish, 2004
LabelMe (MIT): Russell, Torralba, Freeman, 2005
StreetScenes (CBCL-MIT): Bileschi, Poggio, 2006
WhatWhere (Caltech): Perona et al, 2007
PASCAL challenge: 2006, 2007
Lotus Hill Institute: Song-Chun Zhu et al, 2007
80 million images: Torralba, Fergus, Freeman, 2007
Names and faces
Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee Whye Teh, Erik Learned-Miller, David A. Forsyth. CVPR, 2004
Names and faces
30,281 face images, obtained by applying a face finder to approximately half a million captioned news images
Clustering stage
Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee Whye Teh, Erik Learned-Miller, David A. Forsyth. CVPR, 2004
The general approach:
1. Use unambiguously labeled data items to estimate discriminant coordinates.
2. Use a version of k-means to allocate ambiguously labeled faces to one of their labels.
3. Clean up the clusters by removing data items far from the mean, and re-estimate discriminant coordinates.
4. Merge clusters based on facial similarities.
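The constrained assignment step above (each face may only go to one of the names in its caption) can be sketched as follows; the names and data are purely illustrative, not Berg et al.'s code:

```python
import math

def assign_ambiguous(faces, candidate_labels, means):
    """faces: list of face descriptors (tuples) in discriminant coordinates.
    candidate_labels: for each face, the names allowed by its caption.
    means: dict mapping name -> cluster mean (tuple).
    Returns, per face, the candidate name whose cluster mean is nearest."""
    out = []
    for x, cands in zip(faces, candidate_labels):
        out.append(min(cands, key=lambda c: math.dist(x, means[c])))
    return out

# Hypothetical two-name example:
means = {"Bush": (0.0, 0.0), "Blair": (10.0, 10.0)}
faces = [(0.1, 0.2), (9.5, 10.0)]
print(assign_ambiguous(faces, [["Bush", "Blair"], ["Bush", "Blair"]], means))
# ['Bush', 'Blair']
```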
Semiautomatic labeling

LabelMe statistics: the number of labeled instances per description spans 1 to 10,000 (log-log plot); >1500 descriptions have fewer than 100 samples.

Semi-automatic labeling: Abramson & Freund, CVPR 2005
Labeling Google images: Fergus et al, ECCV 2004, ICCV 2005

Challenge: we need accurate labeling (similar to users'): segments instead of bounding boxes, overlap with ground truth > 90%.
Semiautomatic labeling
Sailboats from LabelMe; sailboats from Google, Altavista, Flickr
Query LabelMe, train a simple detector (SVM), query online search tools, label Google images
Semiautomatic labeling
Precision (object presence) vs. image rank (100, 500, 1000): Google ranking compared with detector ranking
Examples of semi-automatic labeling
Optimol
Li-Jia Li, Gang Wang and Li Fei-Fei. OPTIMOL: automatic Object Picture collecTion via Incremental MOdel Learning. IEEE Computer Vision and Pattern Recognition (CVPR), Minneapolis, 2007
Once a model is learned, it can be used to classify images from the web resource. If an image is classified as belonging to the object category, it is accepted and incorporated into the collected dataset; otherwise, it is discarded. The model is then updated with the newly accepted images from the current round. In this incremental way, the category model gets more and more robust and, as a consequence, the collected dataset grows larger with reliable images.
im2gps
Instead of using object labels, the web provides other kinds of metadata associated with large collections of images.
Hays & Efros. CVPR 2008
20 million geotagged and geographic text-labeled images
Hays & Efros. CVPR 2008
im2gps
Video Google
Sivic, J. and Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. Proceedings of the International Conference on Computer Vision (2003)
Visually defined search
Given an object specified by its image, retrieve all shots containing the object:
must handle viewpoint change etc.
must be efficient at run time
Examples: objects, people, places. Slide by Josef Sivic
Object search in video: why is it hard?
an object's imaged appearance varies…
scale changes
lighting changes
viewpoint changes
partial occlusion
sheer amount of data
feature-length movie: ~100,000-150,000 frames
Slide by Josef Sivic
Image to visual nouns
Visual description: visual words
Slide by Josef Sivic
Visual vocabulary unaffected by scale and viewpoint
The same visual word. Slide by Josef Sivic
Image representation using visual words
Use efficient Google-like search on visual words
Slide by Josef Sivic
Efficient search: In a classical file structure all words are stored in the document they appear in. An inverted file structure has an entry (hit list) for each word where all occurrences of the word in all documents are stored. In our case the inverted file has an entry for each visual word, which stores all the matches, i.e. occurrences of the same word in all frames. The document vector is very sparse and use of an inverted file makes the retrieval very fast. Querying a database of 4k frames takes about 0.1 second with a Matlab implementation on a 2 GHz Pentium.
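The inverted-file retrieval described above can be sketched as follows (a minimal illustrative version, not the authors' implementation):

```python
from collections import defaultdict

def build_inverted_index(frames):
    """frames: dict frame_id -> iterable of visual-word ids.
    Returns word id -> set of frame ids containing it (the 'hit list')."""
    index = defaultdict(set)
    for frame_id, words in frames.items():
        for w in words:
            index[w].add(frame_id)
    return index

def query(index, query_words):
    """Score frames by how many query visual words they share, touching
    only the hit lists of the query words (the sparse lookup that makes
    retrieval fast). Returns frame ids, best match first."""
    scores = defaultdict(int)
    for w in set(query_words):
        for frame_id in index.get(w, ()):
            scores[frame_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

# Toy database of three frames:
index = build_inverted_index({1: [5, 7], 2: [7, 9], 3: [9]})
print(query(index, [7, 9]))  # frame 2 first, it shares both words
```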
Video Google
Query:
Retrieved frames:
Example: Groundhog Day, retrieved shots
Video Google, Sivic & Zisserman, ICCV 2003
Slide by Josef Sivic
Example: Casablanca, retrieved shots
Slide by Josef Sivic
Scalable recognition with a vocabulary tree
David Nister and Henrik Stewenius. CVPR 2006.
With a good image similarity and a lot of data…
Input image and nearest neighbors (22,000 LabelMe scenes)
Hays, Efros, Siggraph 2006; Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
With a good image similarity and a lot of data…
Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
With a good image similarity and a lot of data…
Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
With a good image similarity and a lot of data…
Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
With a good image similarity and a lot of data…
Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
Outputs
Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
road, window, keyboard, screen
Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007
Next Wednesday
N. Snavely, S. M. Seitz, R. Szeliski. Photo tourism: Exploring photo collections in 3D. Siggraph 2006 (website) (code)
J. Hays, A. A. Efros. Scene Completion Using Millions of Photographs. SIGGRAPH 2007 (website and code)
A. Torralba, R. Fergus, W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI 2008 (website)
SIFT flow
Dense sampling in time : classical optical flow :: dense sampling in world images : SIFT flow
Ce Liu, Jenny Yuen, Torralba, Sivic, Freeman
Matching frames / views
The two images are taken from the same scene at different times and/or from different perspectives.
Liu, Yuen, Torralba, Sivic, Freeman. ECCV 08
Matching scenes
Two images taken from the same scene category, but different instances; they contain different objects with different scales, perspectives and spatial locations.
Liu, Yuen, Torralba, Sivic, Freeman. ECCV 08
Image representation
128 dimensions/pixel
SIFT visualization: map the 128 dimensions into 3D color space
Same scene instance matching
(a) Query image (b) Dense SIFT (c) Best match (d) SIFT of (c) (e) Warped (c) (f) Warped (d)
Matching different scenes
8/3/2019 MIT6870_ORSU_lecture7: Power of 10
http://slidepdf.com/reader/full/mit6870orsulecture7-power-of-10 71/106
Matching different scenes
Matching: objects
Scene matching
Scene matching
Failures
The nearest neighbors may not contain similar scenes or object categories (SIFT flow tries to match image structures anyway).
Dealing with millions of images
Input image
Binary codes for global scene representation
Short codes allow for storing millions of images
Efficient search: Hamming distance (search millions of images in a few microseconds)
Internet-scale experiments: compute nearest neighbors between all images on the internet
512 bits
Binary codes for images
Want images with similar content to have similar binary codes
Use Hamming distance between codes: the number of bit flips
E.g.: Ham_Dist(10001010, 10001110) = 1; Ham_Dist(10001010, 11101110) = 3
Semantic Hashing [Salakhutdinov & Hinton, 2007]: text documents
Slide Rob Fergus
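The Hamming distance in the slide's examples is just a popcount of the XOR of the two codes:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary codes: the number of
    differing bits, i.e. the popcount of the XOR."""
    return bin(a ^ b).count("1")

# The slide's examples:
print(hamming(0b10001010, 0b10001110))  # 1
print(hamming(0b10001010, 0b11101110))  # 3
```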
Binary codes for images
Permits fast lookup via hashing:
Each code is a memory address
Find neighbors by exploring the Hamming ball around the query address
Lookup time depends on the radius of the ball, NOT on the number of data points
The query image maps, via the semantic hash function, to a query address; semantically similar images sit at nearby addresses in the address space
Figure adapted from Salakhutdinov & Hinton '07
Slide Rob Fergus
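The Hamming-ball lookup can be sketched as follows (illustrative; `table`, `radius`, and the address layout are assumptions, not the paper's code):

```python
from itertools import combinations

def hamming_ball(code: int, nbits: int, radius: int):
    """Yield every address within the given Hamming radius of `code`.
    The number of probes depends only on nbits and radius, not on how
    many images are stored."""
    yield code
    for r in range(1, radius + 1):
        for bits in combinations(range(nbits), r):
            flipped = code
            for b in bits:
                flipped ^= 1 << b
            yield flipped

def lookup(table, code, nbits, radius):
    """table: dict address -> list of images. Gather neighbors by probing
    every address inside the Hamming ball around the query code."""
    hits = []
    for addr in hamming_ball(code, nbits, radius):
        hits.extend(table.get(addr, []))
    return hits

# Toy 4-bit address space:
table = {0b0000: ["a"], 0b0001: ["b"], 0b0011: ["c"]}
print(lookup(table, 0b0000, nbits=4, radius=1))  # ['a', 'b']
```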
Compact Binary Codes
Google has a few billion images (10^9)
A big PC has ~10 GB of memory (10^11 bits)
Codes must fit in memory (disk is too slow): budget of ~10^2 bits/image
A 1-megapixel image is 10^7 bits; a 32x32 color image is 10^4 bits
So the semantic hash function must also reduce dimensionality
Slide Rob Fergus
How many bits do we need?
Goal: to decide when two images are similar (two images are similar if they contain the same large object classes in similar spatial configurations).
First, let's use some hand-wavy arguments to gain intuition about how many bits we need. If we have 1000 object categories, then we would need 1000 bits to tell whether each object is present or not (assuming independent objects). However, if objects are sparse and, on average, only 5 important objects are present in an image, then we need only about 50 bits to describe the image content. Adding spatial location can add another 8*5 = 40 bits.
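The hand-wavy budget can be checked in a couple of lines (assumptions from the slide: 1000 categories, 5 objects per image, 8 location bits per object):

```python
import math

bits_per_name = math.ceil(math.log2(1000))   # ~10 bits to name one category
category_bits = 5 * bits_per_name            # 50 bits for which objects
location_bits = 5 * 8                        # 40 bits for where they are
print(category_bits + location_bits)         # 90
```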
From the image (3 megapixels = 8 million bits), down to 1000 pixels, down to short codes (512 bits)
How many bits do we need?
Code lengths compared: 16, 32, 64, 128, 256, 512, 1024, 2048, and 24576 bits
Measuring image similarity with annotated data: each image is summarized by a histogram of the number of pixels per object class (Building, Sky, Road, Person, Sidewalk, Car, Sculpture, Bicycle), and histograms are compared with spatial pyramid matching [Lazebnik et al.]: S(h1, h2) = sum(min(h1, h2))
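The matching score S(h1, h2) = sum(min(h1, h2)) is a one-liner over per-class pixel counts:

```python
def histogram_intersection(h1, h2):
    """Histogram intersection similarity, S(h1, h2) = sum(min(h1, h2)),
    applied here to per-class pixel-count histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Toy histograms over three object classes:
print(histogram_intersection([3, 0, 2], [1, 4, 2]))  # 3
```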
Hashing
We consider the following learning problem: given a database of images {x_i} and a distance function D(i, j), we seek a binary feature vector y_i = f(x_i) that preserves the nearest neighbor relationships under a Hamming distance.
Salakhutdinov and Hinton [SIGIR 2007], Shakhnarovich et al [ICCV 2003], Athitsos et al. [ICDE 2008], Grauman et al[CVPR 2007], Nascimentio et al [ACM Smyp. App. Computing 2002], Wang [ICME 2006], Wang [PAMI 2008],
Locality Sensitive Hashing
Gionis, A., Indyk, P. & Motwani, R. (1999)
Take random projections of the data
Quantize each projection with a few bits
For our N-bit code:
compute the first N PCA components of the data
each random projection must be a linear combination of the N PCA components
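A minimal sketch of this recipe, under stated assumptions (each projection is quantized to a single bit by thresholding at zero; the mixing matrix and sizes are illustrative):

```python
import numpy as np

def lsh_codes(X, nbits, seed=0):
    """Binary codes from random projections restricted to the span of the
    top-nbits PCA components, one bit per projection."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    # Top principal directions via SVD of the centered data:
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pcs = Vt[:nbits]                          # (nbits, d)
    # Random linear combinations of those components:
    R = rng.standard_normal((nbits, nbits))
    proj = Xc @ (R @ pcs).T                   # (n, nbits)
    return (proj > 0).astype(np.uint8)        # one bit per projection

X = np.random.default_rng(1).standard_normal((10, 5))
codes = lsh_codes(X, nbits=4)
print(codes.shape)  # (10, 4)
```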
Pose estimation
Fast Pose Estimation with Parameter Sensitive Hashing.
Shakhnarovich, Viola, Darrell. ICCV 2003
Learning Hamming distances with boosting
Shakhnarovich and Darrell
Each image is represented by a binary vector with M bits:
y = [h_1(x), h_2(x), ..., h_M(x)]
where x is the vector of image features and each h_n is a function with binary output.
The distance between two images is given by a weighted Hamming distance:
D(i, j) = sum_{n=1}^{M} alpha_n |h_n(x_i) - h_n(x_j)|
The weights alpha_n and the functions h_n that map the input vector x_i into binary features are learned.
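The weighted Hamming distance D(i, j) can be written directly (an illustrative sketch with made-up weights):

```python
def weighted_hamming(y_i, y_j, alpha):
    """D(i, j) = sum_n alpha_n * |h_n(x_i) - h_n(x_j)| for binary vectors
    y = [h_1(x), ..., h_M(x)] and learned weights alpha."""
    return sum(a * abs(p - q) for a, p, q in zip(alpha, y_i, y_j))

# Hypothetical 3-bit codes and weights:
print(weighted_hamming([1, 0, 1], [1, 1, 0], [0.5, 2.0, 1.0]))  # 3.0
```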
Learning
Positive examples are pairs of images x_i, x_j such that x_j is one of the N nearest neighbors of x_i. Negative examples are pairs of images that are not neighbors.
In BoostSSC, each regression stump thresholds a single input coordinate. At each iteration n we select the parameters of f_n: the regression coefficients (alpha_n, beta_n) and the stump parameters (where e_n is a unit vector, so that e_n^T x returns the k-th component of x, and T_n is a threshold), to minimize the square loss, where K is the number of training pairs, z_k is the neighborhood label (z_k = 1 if the two images are neighbors and z_k = -1 otherwise), and w_k^n is the weight for each training pair at iteration n.
Shakhnarovich and Darrell
Compressing the gist descriptor
GIST [Oliva and Torralba '01]: original image mapped to a binary code (e.g. 1 0 1 1 1 0 0 1 …)
Input image; ground truth neighbors; Gist; Gist (32 bits)
How many bits do we need?
Example labels: road, mountain, tree, car, sky
LabelMe: 22,000 images, 32-bit codes
Tiny Images: 10^7 images, 256-bit codes
Scene matching with camera transformations
Image representation
Original image: color layout, GIST [Oliva and Torralba '01]
Scene matching with camera view transformations: Translation
1. Move camera
2. View from the virtual camera
3. Find a match to fill the missing pixels
4. Locally align images
5. Find a seam
6. Blend in the gradient domain
Scene matching with camera view transformations: Camera rotation
1. Rotate camera
2. View from the virtual camera
3. Find a match to fill in the missing pixels
4. Stitched rotation
5. Display on a cylinder
Scene matching with camera view transformations: Forward motion
1. Move camera
2. View from the virtual camera
3. Find a match to replace pixels
Tour from a single image
Navigate the virtual space using intuitive motion controls
Basic camera motions
Exploring famous sites
Direction of forward motion
Input image; query region; best match forward
Forward motion (towards image centre)
Forward motion on the ground plane
[Torralba and Sinha '01, Lalonde '07, Hoiem '07]
If images are from the same place…
Google Street View (controlled image capture); Photo Tourism/Photosynth [Snavely et al., 2006] (register images based on multi-view geometry)
Next Wednesday
N. Snavely, S. M. Seitz, R. Szeliski. Photo tourism: Exploring photo collections in 3D. Siggraph 2006 (website) (code)
J. Hays, A. A. Efros. Scene Completion Using Millions of Photographs. SIGGRAPH 2007 (website and code)
A. Torralba, R. Fergus, W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI 2008 (website)