MIT6870_ORSU_lecture7: Power of 10


Lecture 7: Powers of 10

6.870 Object Recognition and Scene Understanding

http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm

Wednesday

Presenter: Vladimir Bychkovsky

Evaluator: Krista Ehinger

The internet power

http://en.wikipedia.org/wiki/One_red_paperclip

Human Vision

Some key properties:

Many input modalities

Active

Supervised, unsupervised, semi-supervised learning. It can look for supervision.

Performance: amazing

How it works: no idea

Robot Vision

Performance: It does not work

How it works: SIFT+SVM+HMM

Many poor input modalities

Active, but it does not go far

Internet Vision

Performance: The more data, the better

How it works: SIFT+SVM+LSH

Many input modalities

It can reach everywhere

Tons of data

Image credit: Matt Britt

Past and future of image datasets in computer vision

Lena: a dataset in one picture

[Figure: number of pictures vs. time. Labels: 1996; 40,000; 2 billion; Human Click Limit (all humanity taking one picture/second during 100 years)]

The extremes of learning

Number of training samples: 1, 10, 10^2, 10^3, 10^4, 10^5

Extrapolation problem: generalization, transfer learning

Interpolation problem: correspondence, finding the differences

Traditional datasets

Lecture 2, Lecture 3, Lecture 7

Scenes are unique

But not all scenes are so original

Images

A. Torralba, R. Fergus, W.T.Freeman. PAMI 2008

Images

A. Torralba, R. Fergus, W.T.Freeman. PAMI 2008

Images

Automatic Colorization Result

Grayscale input High resolution

Colorization of input using average

A. Torralba, R. Fergus, W.T.Freeman. 2008

Automatic Orientation

Many images have ambiguous orientation

Look at top 25% by confidence:

Examples of high and low confidence images:

Automatic Orientation Examples

A. Torralba, R. Fergus, W.T.Freeman. 2008

How many images are there?

Powers of 10

Number of images on my hard drive: 10^4

Number of images seen during my first 10 years: 10^8

(3 images/second * 60 * 60 * 16 * 365 * 10 = 630,720,000)

Number of images seen by all humanity: 10^20

(106,456,367,669 humans [1] * 60 years * 3 images/second * 60 * 60 * 16 * 365)

[1] from http://www.prb.org/Articles/2002/HowManyPeopleHaveEverLivedonEarth.aspx

Number of all images in the universe: 10^243

(10^81 atoms * 10^81 * 10^81 = 10^243)

Number of all 32x32 images: 10^7373

(256^(32*32*3) ~ 10^7373)
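The counts above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `images_seen` and the constants are introduced here for illustration and assume the slide's figures (3 images per second, 16 waking hours per day). Evaluating 256^(32*32*3) directly gives an exponent of roughly 7398, a little higher than the figure on the slide; the order of magnitude of the argument is unaffected.

```python
import math

SECONDS_PER_WAKING_DAY = 60 * 60 * 16   # 16 waking hours per day
IMAGES_PER_SECOND = 3                   # rate assumed on the slide

def images_seen(years, people=1):
    """Images seen at 3 images/second during 16 waking hours a day."""
    return people * years * 365 * SECONDS_PER_WAKING_DAY * IMAGES_PER_SECOND

first_ten_years = images_seen(10)
all_humanity = images_seen(60, people=106_456_367_669)

# Number of distinct 32x32 RGB images with 8 bits per channel: 256^(32*32*3).
# Work with log10 so that the huge integer never has to be printed.
log10_tiny_images = 32 * 32 * 3 * math.log10(256)

print(first_ten_years)                       # 630720000
print(f"{all_humanity:.2e}")                 # order of 10^20
print(f"10^{log10_tiny_images:.0f}")
```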

Chandler, and Field. (2007).

Torralba, Fergus, Freeman. PAMI 2008

10% of the objects account for 90% of the data

~ Zipf's law

Caltech 101

Tiny images

LabelMe

We need transfer learning

10% of the objects account for 90% of the data

~ Zipf's law

Caltech 101

Tiny images

LabelMe

Is this something humans do at all?

What's the Capacity of Visual Long Term Memory?

"Basically, my recollection is that we just separated the pictures into distinct thematic categories (e.g. cars, animals, single-person, 2-people, plants, etc.). Only a few slides were selected which fell into each category, and they were visually distinct."

According to Standing

Standing (1973)

10,000 images

83% Recognition

What we know… What we don't know…

… people can remember thousands of images

… what people are remembering for each item?

[Figure: "Dogs Playing Cards" remembered as "gist" only (sparse details) vs. highly detailed]

High Fidelity Visual

Memory is possible

(Hollingworth 2004)

Slide by Aude Oliva

Massive Memory I: Methods


Showed 14 observers 2500 categorically unique objects

1 at a time, 3 seconds each

800 ms blank between items

Study session lasted about 5.5 hours

Repeat Detection task to maintain focus

1-back

Followed by 300 2-alternative forced choice tests

1024-back

Slide by Aude Oliva

How far can we push the fidelity of visual LTM representation?

Same object, different states

Slide by Aude Oliva

Visual Cognition

Expert Predictions

Massive Memory I: Recognition Memory Results

Replication of Standing (1973)

Slide by Aude Oliva

92% 88% 87%

Massive Memory I: Recognition Memory Results

Slide by Aude Oliva

Extrapolation of Repeat Detection Data

Human performance for n = 1024

Power law (r^2 = .988)

Quadratic (r^2 = .988)

Brady, Konkle, Alvarez, Oliva (submitted) Slide by Aude Oliva

Building datasets

Collecting datasets

(towards 10^6 to 10^7 examples)

ESP game (CMU): Luis von Ahn and Laura Dabbish, 2004

LabelMe (MIT): Russell, Torralba, Freeman, 2005

StreetScenes (CBCL-MIT): Bileschi, Poggio, 2006

WhatWhere (Caltech): Perona et al., 2007

PASCAL challenge: 2006, 2007

Lotus Hill Institute: Song-Chun Zhu et al., 2007

80 million images

Torralba, Fergus, Freeman, 2007

Names and faces

Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee Whye Teh, Erik Learned-Miller, David A. Forsyth. CVPR, 2004

Names and faces

30,281 face images, obtained by applying a face finder to approximately half a million captioned news images

Clustering stage

Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee Whye Teh, Erik Learned-Miller, David A. Forsyth. CVPR, 2004

The general approach involves using unambiguously labeled data items to estimate discriminant coordinates.

Use a version of K-means to allocate ambiguously labeled faces to one of their labels.

Clean up the clusters by removing data items far from the mean, and re-estimate discriminant coordinates.

Merge clusters based on facial similarities.
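The K-means allocation step can be sketched as follows. This is not the authors' implementation: it covers only the assignment of ambiguously labeled items, uses plain Euclidean distance in place of the discriminant coordinates, and skips the clean-up and merge stages. The function name `assign_ambiguous` and the toy data are hypothetical.

```python
def assign_ambiguous(points, candidates, initial_means, iterations=10):
    """Assign each point to one of its own candidate labels (nearest mean wins).

    points:        list of feature vectors (lists of floats)
    candidates:    list of sets, the labels allowed for each point
    initial_means: dict label -> mean vector (e.g. from unambiguous items)
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    means = dict(initial_means)  # do not modify the caller's dict
    assignment = []
    for _ in range(iterations):
        # Assignment step, restricted to each point's candidate labels.
        assignment = [min(sorted(c), key=lambda lab: sqdist(p, means[lab]))
                      for p, c in zip(points, candidates)]
        # Update step: recompute each mean from the points assigned to it.
        for lab in means:
            members = [p for p, a in zip(points, assignment) if a == lab]
            if members:
                means[lab] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]]
candidates = [{"A"}, {"A", "B"}, {"B"}, {"A", "B"}]
labels = assign_ambiguous(points, candidates, {"A": [0.0, 0.0], "B": [5.0, 5.0]})
print(labels)  # ['A', 'A', 'B', 'B']
```

The only difference from ordinary K-means is that a face may move only between the names that appear in its own caption.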

Semiautomatic labeling

[Figure: number of labeled instances per description (axis ticks 10, 100, 1000); >1500 descriptions with less than 100 samples]

Semi-automatic labeling: Abramson & Freund, CVPR 2005

Labeling Google images: Fergus et al, ECCV 2004, ICCV 2005

Challenge: we need accurate labeling (similar to users)

Segments instead of bounding boxes

Overlap with ground truth > 90%

LabelMe statistics

Semiautomatic labeling

Sailboats from LabelMe

Sailboats from Google, Altavista, Flickr

Train a simple detector

Label Google images

Query LabelMe

Query online search tools

Semiautomatic labeling

[Figure: precision (object presence) vs. image rank (100, 500, 1000), comparing Google ranking and detector ranking]

Examples semi-automatic labeling

Optimol

Li-Jia Li, Gang Wang and Li Fei-Fei. OPTIMOL: automatic Object Picture collecTion via Incremental MOdel Learning. IEEE Computer Vision and Pattern Recognition (CVPR), Minneapolis, 2007

Once a model is learned, it can be used to do classification on the images from the web resource. If the image is classified as in this object category, it gets accepted and incorporated into the collected dataset. Otherwise, it will be discarded. The model will again be updated by the newly accepted images in the current round. In this incremental way, the category model gets more and more robust. As a consequence, the collected dataset gets larger and larger with reliable images.
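The accept/discard/update loop described above can be illustrated with a deliberately tiny stand-in. This is a toy sketch, not OPTIMOL itself: the "model" is just the mean of one-dimensional features, and the `threshold` parameter and the function name are assumptions made for the example.

```python
def collect_incrementally(seed, web_batches, threshold=1.0):
    """Grow a dataset by keeping only items the current model accepts."""
    dataset = list(seed)
    for batch in web_batches:
        # Toy "model": the mean feature of everything collected so far,
        # so it is updated by the items accepted in earlier rounds.
        model = sum(dataset) / len(dataset)
        accepted = [x for x in batch if abs(x - model) <= threshold]
        dataset.extend(accepted)  # rejected items are simply discarded
    return dataset

collected = collect_incrementally(
    seed=[1.0, 1.2],
    web_batches=[[0.9, 7.0], [1.5, -4.0, 1.1]],
)
print(collected)  # [1.0, 1.2, 0.9, 1.5, 1.1]
```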

im2gps

Instead of using object labels, the web provides other kinds of metadata associated with large collections of images

Hays & Efros. CVPR 2008

20 million geotagged and geographic text-labeled images

Hays & Efros. CVPR 2008

im2gps

Video Google

Sivic, J. and Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. Proceedings of the International Conference on Computer Vision (2003)

Visually defined search

Given an object specified by its image, retrieve all shots containing the object:

must handle viewpoint change etc.

must be efficient at run time

people, objects, places

Slide by Josef Sivic

Object search in video: why is it hard?

an object's imaged appearance varies…

scale changes

lighting changes

viewpoint changes

partial occlusion

sheer amount of data

feature length movie ~ 100,000-150,000 frames

Slide by Josef Sivic

Visual description visual words

Image visual nouns

Visual description - visual words

Visual vocabulary unaffected by scale and viewpoint

The same visual word

Slide by Josef Sivic

Image representation using visual words

Use efficient Google-like search on visual words

Efficient search: In a classical file structure all words are stored in the document they appear in. An inverted file structure has an entry (hit list) for each word where all occurrences of the word in all documents are stored. In our case the inverted file has an entry for each visual word, which stores all the matches, i.e. occurrences of the same word in all frames. The document vector is very sparse and use of an inverted file makes the retrieval very fast. Querying a database of 4k frames takes about 0.1 second with a Matlab implementation on a 2 GHz Pentium.
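A minimal sketch of an inverted file over visual words is shown below. It is simplified relative to the description above: each hit list stores only frame ids rather than every occurrence, and frames are ranked by a plain count of shared words. The function names and the toy word ids are made up for the example.

```python
from collections import Counter, defaultdict

def build_inverted_file(frames):
    """frames: dict frame_id -> list of visual word ids found in that frame."""
    inverted = defaultdict(set)
    for frame_id, words in frames.items():
        for word in words:
            inverted[word].add(frame_id)  # hit list for this visual word
    return inverted

def query(inverted, query_words):
    """Rank frames by how many distinct query words they contain."""
    votes = Counter()
    for word in set(query_words):
        for frame_id in inverted.get(word, ()):
            votes[frame_id] += 1
    # Most shared words first; ties broken by frame id for determinism.
    return [f for f, _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))]

frames = {"f1": [3, 17, 42], "f2": [17, 99], "f3": [5, 8]}
index = build_inverted_file(frames)
ranked = query(index, [17, 42])
print(ranked)  # ['f1', 'f2']
```

Only the hit lists of the query's words are touched, so frames that share no word with the query (here `f3`) are never visited.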

Video Google

Query:

Retrieved frames:

Example: Groundhog Day, retrieved shots

Video Google, Sivic & Zisserman, ICCV 2003

Example: Casablanca, retrieved shots

Scalable recognition with a vocabulary tree

David Nister and Henrik Stewenius. CVPR 2006.

With a good image similarity and a lot of data…

Input image

Nearest neighbors

22,000 LabelMe scenes

Hays, Efros, Siggraph 2006

Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007

and a lot of data…

Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007

and a lot of data…

Outputs

window

keyboard

screen

screen

Next Wednesday

N. Snavely, S. M. Seitz, R. Szeliski. Photo tourism: Exploring photo collections in 3D. SIGGRAPH 2006. (website) (code)

J. Hays, A. A. Efros. Scene Completion Using Millions of Photographs. SIGGRAPH 2007. (website and code)

A. Torralba, R. Fergus, W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI 2008. (website)

SIFT flow

Dense sampling in time : classical optical flow :: Dense sampling in world images : SIFT flow

Ce Liu, Jenny Yuen, Torralba, Sivic, Freeman

Matching frames / views

The two images are taken from the same scene with different time and/or perspective

Liu, Yuen, Torralba, Sivic, Freeman. ECCV 08

Matching scenes

Two images taken from the same scene category, but different instances

Contain different objects with different scales, perspectives and spatial location

Liu, Yuen, Torralba, Sivic, Freeman. ECCV 08

Image representation

128 dimensions/pixel

SIFT visualization: map 128 dimensions into 3D color space

Same scene instance matching

(a) Query image (b) Dense SIFT (c) Best match (d) SIFT of (c) (e) Warped (c) (f) Warped (d)

Matching different scenes

Matching: objects

Scene matching


Scene matching

Failures

The nearest neighbors may not contain similar scenes or object categories (SIFT flow tries to match image structures anyway)

Dealing with millions of images

Input image

Binary codes for global scene representation

Short codes allow for storing millions of images

Efficient search: Hamming distance (search millions of images in a few microseconds)

Internet scale experiments: compute nearest neighbors between all images on the internet

512 bits

Binary codes for images

Want images with similar content to have similar binary codes

Use Hamming distance between codes

- Number of bit flips

- E.g.:

Ham_Dist(10001010,10001110)=1

Ham_Dist(10001010,11101110)=3

Semantic Hashing [Salakhutdinov & Hinton, 07]

- Text documents
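The two Ham_Dist examples can be reproduced with a short function. This is a sketch that takes the codes as bit strings; the lowercase name `ham_dist` is chosen here and is not from the slides.

```python
def ham_dist(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    # XOR leaves a 1 exactly where the two codes differ; count those bits.
    return bin(int(a, 2) ^ int(b, 2)).count("1")

print(ham_dist("10001010", "10001110"))  # 1
print(ham_dist("10001010", "11101110"))  # 3
```

On packed integer codes the same XOR-and-count operation is what makes comparing short binary codes so cheap.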

Slide Rob Fergus

Permits fast lookup via hashing

- Each code is a memory address

- Find neighbors by exploring Hamming ball around query address

- Lookup time depends on radius of ball, not on # data points
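Exploring a Hamming ball around the query address can be sketched with a dictionary standing in for the address space. The names `hamming_ball` and `lookup` and the toy table are assumptions for illustration. The number of addresses probed is the sum of C(n_bits, r) for r up to the radius, so it grows with the radius and the code length but not with the number of stored items.

```python
from itertools import combinations

def hamming_ball(code, n_bits, radius):
    """Yield every n_bits-wide address within `radius` bit flips of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for pos in positions:
                flipped ^= 1 << pos  # flip one bit of the address
            yield flipped

def lookup(table, code, n_bits, radius):
    """Collect the items stored at every address inside the Hamming ball."""
    found = []
    for address in hamming_ball(code, n_bits, radius):
        found.extend(table.get(address, []))
    return found

table = {0b10001010: ["img_a"], 0b10001110: ["img_b"], 0b11101110: ["img_c"]}
neighbors = lookup(table, 0b10001010, n_bits=8, radius=1)
print(neighbors)  # ['img_a', 'img_b']
```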

[Figure: a semantic hash function maps the query image to a query address in the address space; semantically similar images are stored at nearby addresses. Figure adapted from Salakhutdinov & Hinton 07]

Slide Rob Fergus

Compact Binary Codes

Google has a few billion images (10^9)

Big PC has ~10 Gbytes (10^11 bits)

Codes must fit in memory (disk too slow)

Budget of 10^2 bits/image

1 Megapixel image is 10^7 bits

32x32 color image is 10^4 bits

Semantic hash function must also reduce dimensionality
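The memory budget above is order-of-magnitude arithmetic, which the following sketch reproduces. It assumes 8 bits per byte, 3 color channels at 8 bits each, and takes "a few billion" as 10^9; the helper `order_of_magnitude` is introduced only for this example.

```python
import math

def order_of_magnitude(x):
    """Exponent of the nearest power of ten."""
    return round(math.log10(x))

memory_bits = 10 * 10**9 * 8          # ~10 Gbytes of memory, in bits
n_images = 10**9                      # "a few billion", taken here as 10^9
budget_bits = memory_bits / n_images  # bits available per image

megapixel_bits = 10**6 * 3 * 8        # 1 megapixel, 3 channels, 8 bits each
tiny_bits = 32 * 32 * 3 * 8           # 32x32 color image

for name, value in [("memory", memory_bits), ("budget", budget_bits),
                    ("1 Mpixel", megapixel_bits), ("32x32", tiny_bits)]:
    print(f"{name}: ~10^{order_of_magnitude(value)} bits")
```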

Slide Rob Fergus

How many bits do we need?

Goal: to decide when two images are similar (two images are similar if they contain the same large object classes in similar spatial configurations)

First, let's use some hand-wavy arguments to gain some intuition about how many bits we need. If we have 1000 object categories, then we would need 1000 bits to tell whether each object is present or not (assuming independent objects). However, if we assume that objects are sparse, and that, on average, there are only 5 important objects present in an image, then we would need only 50 bits to describe the image content (about 10 bits, roughly log2 of 1000, to name each of the 5 objects). Adding spatial location can add another 8*5 = 40 bits.
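The hand-wavy count can be written out explicitly. The variable names are chosen here, and reading the 50 bits as "about 10 bits to index one of 1000 categories, times 5 objects" is an interpretation of the argument rather than something stated on the slide.

```python
import math

n_categories = 1000
objects_per_image = 5

dense_bits = n_categories                              # one presence bit per category
bits_per_object = math.ceil(math.log2(n_categories))   # index of one category
sparse_bits = objects_per_image * bits_per_object      # which 5 objects are present
location_bits = 8 * objects_per_image                  # 8 bits of location per object
total_bits = sparse_bits + location_bits

print(dense_bits, bits_per_object, sparse_bits, location_bits, total_bits)
# 1000 10 50 40 90
```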

[Figure: image sizes and code lengths. Labels: 3 megapixels, 8 million bits, 1000 pixels, 512 bits; short codes of 16, 32, 64, 128, 256, 512, 1024, 2048 and 24576 bits]

Measuring image similarity: annotated data

[Figure: for each annotated image, a histogram of the number of pixels per object class (Building, Sky, Person, Sidewalk, Sculpture, Bicycle)]

Spatial pyramid matching [Lazebnik et al.]

$S(h_1, h_2) = \sum \min(h_1, h_2)$
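The similarity S(h1, h2) = sum(min(h1, h2)) is histogram intersection, sketched below on per-class pixel counts. The dictionaries are made-up toy data, and the sketch compares a single histogram per image; spatial pyramid matching applies the same intersection over a grid of image regions at several resolutions.

```python
def histogram_intersection(h1, h2):
    """S(h1, h2) = sum over bins of min(h1[bin], h2[bin])."""
    bins = set(h1) | set(h2)
    # A class missing from one image counts as zero pixels there.
    return sum(min(h1.get(b, 0), h2.get(b, 0)) for b in bins)

img1 = {"building": 5000, "sky": 3000, "person": 200}
img2 = {"building": 4000, "sidewalk": 1500, "person": 600}
score = histogram_intersection(img1, img2)
print(score)  # 4200
```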

Hashing

We consider the following learning problem: given a database of images $\{x_i\}$ and a distance function $D(i, j)$, we seek a binary feature vector $y_i = f(x_i)$ that preserves the nearest neighbor relationships using a Hamming distance.

Salakhutdinov and Hinton [SIGIR 2007], Shakhnarovich et al. [ICCV 2003], Athitsos et al. [ICDE 2008], Grauman et al. [CVPR 2007], Nascimentio et al. [ACM Symp. App. Computing 2002], Wang [ICME 2006], Wang [PAMI 2008]

Locality Sensitive Hashing

Gionis, A. & Indyk, P. & Motwani, R. (1999)

Take random projections of data

Quantize each projection with a few bits

For our N bit code:

- Compute first N PCA components of data

- Each random projection must be linear combination of the N PCA components
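A minimal sketch of hashing with random projections follows. It makes two simplifying assumptions that differ from the slide: the PCA step is skipped (projections are drawn directly in the input space), and each projection is quantized to a single bit, its sign. The function name `make_hash` and the seed parameter are hypothetical.

```python
import random

def make_hash(dim, n_bits, seed=0):
    """Return a function mapping a dim-vector to an n_bits binary code."""
    rng = random.Random(seed)
    # One random direction (hyperplane normal) per output bit.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def code(x):
        # One bit per projection: the side of the hyperplane x falls on.
        return [1 if sum(w * xi for w, xi in zip(plane, x)) >= 0 else 0
                for plane in planes]

    return code

h = make_hash(dim=4, n_bits=16)
a = h([1.0, 2.0, 0.5, -1.0])
b = h([2.0, 4.0, 1.0, -2.0])  # same direction as the first vector
print(a == b)  # True: sign bits do not change under positive scaling
```

Vectors separated by a small angle fall on the same side of most random hyperplanes, so their codes differ in few bits.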

Pose estimation

Fast Pose Estimation with Parameter Sensitive Hashing.

Shakhnarovich, Viola, Darrell. ICCV 2003

Learning Hamming distances with boosting

Shakhnarovich and Darrell

Each image is represented by a binary vector with M bits:

$y = [h_1(x), h_2(x), \ldots, h_M(x)]$

x = vector of image features

$h_n$ = function with binary output

y = binary vector

Distance between two images is given by a weighted Hamming distance:

$D(i, j) = \sum_n \alpha_n \, |h_n(x_i) - h_n(x_j)|$

The weights $\alpha_n$ and the functions $h_n(x_i)$ that map the input vector $x_i$ into binary features are learned.
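The weighted Hamming distance between two binary vectors can be computed as in the sketch below. The weights here are arbitrary illustrative values, not learned ones, and the function name is chosen for the example.

```python
def weighted_hamming(y_i, y_j, weights):
    """D(i, j) = sum_n weights[n] * |y_i[n] - y_j[n]| for binary vectors."""
    if not (len(y_i) == len(y_j) == len(weights)):
        raise ValueError("codes and weights must have the same length")
    return sum(w * abs(a - b) for w, a, b in zip(weights, y_i, y_j))

d = weighted_hamming([1, 0, 1, 1], [1, 1, 0, 1], weights=[0.5, 2.0, 1.0, 0.25])
print(d)  # 3.0
```

With all weights equal to 1 this reduces to the plain Hamming distance (2 for the pair above).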

Learning

Positive examples are pairs of images $x_i$, $x_j$ so that $x_j$ is one of the N nearest neighbors of $x_i$. Negative examples are pairs of images that are not neighbors.

In BoostSSC, each regression stump has the form:

[equation not recovered]

At each iteration n we select the parameters of $f_n$, the regression coefficients ($\alpha_n$, $\beta_n$) and the stump parameters (where $e_n$ is a unit vector, so that $e_n^T x$ returns the k-th component of x, and $T_n$ is a threshold), to minimize the square loss:

[equation not recovered]

where K is the number of training pairs, $z_k$ is the neighborhood label ($z_k = 1$ if the two images are neighbors and $z_k = -1$ otherwise), and $w_n^k$ is the weight for each training pair at iteration n, given by

[equation not recovered]

Shakhnarovich and Darrell

Compressing the gist descriptor

[Oliva and Torralba '01]

Original image

[Figure: input image with its ground truth neighbors, gist neighbors and gist (32 bits) neighbors; object labels: mountain, tree, car]

LabelMe: 22,000 images

32 bits

Tiny Images: 10^7 images

256 bits

Scene matching with camera transformations

Color layout

[Oliva and Torralba '01]

Original image

Scene matching with camera view transformations: Translation

1. Move camera

2. View from the virtual camera

3. Find a match to fill the missing pixels

4. Locally align images

5. Find a seam

6. Blend in the gradient domain

Scene matching with camera view transformations: Camera rotation

1. Rotate camera

2. View from the virtual camera

3. Find a match to fill in the missing pixels

4. Stitched rotation

5. Display on a cylinder

Scene matching with camera view transformations: Forward motion

1. Move camera

2. View from the virtual camera

3. Find a match to replace pixels

Tour from a single image

Navigate the virtual space using intuitive motion controls

Tour from a single image

Basic camera motions

Exploring famous sites

Direction of forward motion

Input image Query region Best match forward

Forward motion (towards image centre)

Forward motion on the ground plane

[Torralba and Sinha '01, Lalonde '07, Hoiem '07]

If images are from the same place…

Google Street View (controlled image capture)

PhotoTourism/PhotoSynth [Snavely et al., 2006] (register images based on multi-view geometry)

