Vision

Silvio Savarese 21-Jan-15

Professor Silvio Savarese

Computational Vision and Geometry Lab

CS223A: Vision in Robotics

Sensing is the future



Everything is a sensor

Modern vision sensors

night

Kinect

thermal

w/ gravity

Sensing is not the hard problem

Intelligent understanding of the sensing data is the challenge!

What does intelligent understanding of the sensing data mean?

Sensing device

Extract information

Interpretation

Computer vision

Information: features, 3D structure, motion flows, etc.

Interpretation: recognize objects, scenes, actions, events

Computational device

Computer vision studies the tools and theories that enable the design of machines that can extract useful information from imagery data (images and videos) toward the goal of interpreting the world


Computer vision and Applications

1990 2000 2010

EosSystems

Fingerprint biometrics

Augmentation with 3D computer graphics


3D object prototyping

Photomodeler, EosSystems


1990 2000 2010

EosSystems, Autostitch

- New feature detectors/descriptors
- CV leverages machine learning

Face detection

Web applications


Photometria

Panoramic Photography

kolor


3D modeling of landmarks


1990 2000 2010

Kooaba

A9

Kinect


Google Goggles

- Large scale image matching
- Efficient SLAM/SFM
- Better clouds
- More bandwidth
- Increased computational power

Movies, news, sports

Image search engines

http://www.picsearch.com/

Google Goggles

Visual search and landmarks recognition


Visual search and landmarks recognition


Augmented reality

Motion sensing and gesture recognition


Automotive safety

Mobileye: Vision systems in high-end BMW, GM, Volvo models

Source: A. Shashua, S. Seitz

http://www.mobileye.com/


Vision for robotics, space exploration

Factory inspection

Surveillance

Autonomous driving, robot navigation

Assistive technologies

Sources: K. Grauman, L. Fei-Fei, S. Lazebnik

Security

http://www.scottcamazine.com/photos/SecurityXrays/images/briefcase7_jpg.jpg

1990 2000 2010


Kooaba

A9

Kinect


Google Goggles

EosSystems


1990 2000 2010

2D

3D

Google Goggles


1990 2000 2010

EosSystems

2D

3D

Google Goggles

Computer vision

3D Reconstruction
- 3D shape recovery
- 3D scene reconstruction
- Camera localization
- Pose estimation

2D Recognition
- Object detection
- Texture classification
- Target tracking
- Activity recognition

Camera systems

Establish a mapping from 3D to 2D

Pinhole perspective projection

Pinhole camera

f = focal length

c = center of the camera

$$(x, y, z) \rightarrow \left( f\,\frac{x}{z},\; f\,\frac{y}{z} \right), \qquad \mathbb{E}^3 \rightarrow \mathbb{E}^2$$
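The pinhole mapping above can be sketched in a few lines. This is only an illustration: the function name `pinhole_project` and the sample values are assumptions, and points are taken to be expressed in the camera frame with z > 0.

```python
import numpy as np

def pinhole_project(points_xyz, f):
    """Map 3D points (x, y, z) in the camera frame to (f*x/z, f*y/z)."""
    pts = np.asarray(points_xyz, dtype=float).reshape(-1, 3)
    if np.any(pts[:, 2] <= 0):
        raise ValueError("points must lie in front of the camera (z > 0)")
    return f * pts[:, :2] / pts[:, 2:3]

# The same (x, y) offset seen at twice the depth projects half as far
# from the image center: distant objects look smaller.
near = pinhole_project([[1.0, 2.0, 4.0]], f=2.0)
far = pinhole_project([[1.0, 2.0, 8.0]], f=2.0)
```

Dividing by z is what makes the mapping non-linear in Euclidean coordinates, and it is also why depth cannot be read back from a single projection.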

Projective camera

$$P' = M\,P_w = K\,[R \;\; T]\,P_w$$

Internal parameters (K)
- f = focal length
- uo, vo = offset
- non-square pixels
- θ = skew angle

External parameters
- R, T = rotation, translation (from the world frame [Ow, iw, jw, kw] to the camera frame at Oc)
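As a rough sketch of the projective camera model, the snippet below builds K from a commonly used textbook parameterization and projects a world point through M = K [R T]. The names `alpha` and `beta` (focal length in pixel units along each axis, which is how non-square pixels enter) and the helper names are assumptions for illustration, not notation from the slides.

```python
import numpy as np

def intrinsic_matrix(alpha, beta, u0, v0, theta=np.pi / 2):
    """Build K from focal lengths in pixels (alpha, beta), offset (u0, v0), skew angle theta."""
    return np.array([
        [alpha, -alpha * np.cos(theta) / np.sin(theta), u0],
        [0.0, beta / np.sin(theta), v0],
        [0.0, 0.0, 1.0],
    ])

def camera_matrix(K, R, T):
    """M = K [R T], a 3x4 projection matrix."""
    return K @ np.hstack([R, np.reshape(T, (3, 1))])

def project_point(M, P_w):
    """Project one 3D world point and return pixel coordinates (u, v)."""
    p = M @ np.append(P_w, 1.0)   # homogeneous image point
    return p[:2] / p[2]

K = intrinsic_matrix(alpha=100.0, beta=100.0, u0=320.0, v0=240.0)
M = camera_matrix(K, R=np.eye(3), T=[0.0, 0.0, 5.0])
uv = project_point(M, np.array([1.0, 1.0, 0.0]))
```

With theta = 90 degrees the skew term vanishes and K reduces to the familiar focal-length-plus-offset form.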

Properties of Projection
- Points project to points
- Lines project to lines
- Distant objects look smaller

Properties of Projection
- Angles are not preserved
- Parallel lines meet!

Parallel lines in the world intersect in the image at a vanishing point

One-point perspective: Masaccio, Trinity, Santa Maria Novella, Florence, 1425-28

How to calibrate a camera?

Estimate camera parameters such as pose or focal length

Calibration Problem

P1, ..., Pn with known positions in [Ow, iw, jw, kw]

p1, ..., pn with known positions in the image

Goal: compute intrinsic and extrinsic parameters

[Figure: calibration rig]


Calibration Problem

How many correspondences do we need?

- M has 11 unknowns
- We need 11 equations
- Each correspondence gives 2 equations, so 6 correspondences would do it

[Figure: calibration rig and its image]
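One common way to turn these correspondences into an estimate of M is the direct linear transform (DLT), sketched below. This is an illustrative, noise-free sketch and not necessarily the procedure used in the lecture; the function names and the synthetic camera are assumptions.

```python
import numpy as np

def project(M, P_world):
    """Project Nx3 world points with a 3x4 matrix M; returns Nx2 pixel coordinates."""
    Ph = np.hstack([P_world, np.ones((len(P_world), 1))])
    p = Ph @ M.T
    return p[:, :2] / p[:, 2:3]

def calibrate_dlt(P_world, p_image):
    """Estimate M (up to scale) from n >= 6 correspondences between 3D and 2D points."""
    P = np.asarray(P_world, dtype=float)
    p = np.asarray(p_image, dtype=float)
    n = P.shape[0]
    if n < 6:
        raise ValueError("need at least 6 correspondences (2 equations each, 11 unknowns)")
    Ph = np.hstack([P, np.ones((n, 1))])
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = Ph                      # m1.P - u (m3.P) = 0
    A[0::2, 8:12] = -p[:, 0:1] * Ph
    A[1::2, 4:8] = Ph                      # m2.P - v (m3.P) = 0
    A[1::2, 8:12] = -p[:, 1:2] * Ph
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)            # right singular vector of the smallest singular value

# Synthetic check with an assumed camera and 8 non-coplanar rig points.
rng = np.random.default_rng(0)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
c, s = np.cos(0.2), np.sin(0.2)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
T = np.array([[0.1], [-0.2], [10.0]])
M_true = K @ np.hstack([R, T])
P = rng.uniform(-1.0, 1.0, size=(8, 3))
p = project(M_true, P)
M_est = calibrate_dlt(P, p)
max_err = np.abs(project(M_est, P) - p).max()
```

The recovered matrix matches the true one only up to scale, which is why the check compares reprojections rather than matrix entries. With noisy measurements, coordinate normalization and a non-linear refinement are usually added on top of this.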

Calibration Procedure

Camera Calibration Toolbox for Matlab

J. Bouguet [1998-2000]

http://www.vision.caltech.edu/bouguetj/calib_doc/index.html#examples

Calibration Procedure

Once the camera is calibrated...

$$M = K\,[R \;\; T]$$

- Internal parameters K are known
- R, T are known, but these can only relate C to the calibration rig

[Figure: camera center C, world origin Ow, 3D point P and its image p]

Can I estimate P from the measurement p from a single image?

No - in general [P can be anywhere along the line defined by C and p]

Recovering structure from a single view

[Figure: camera K with center C, calibration rig with origin Ow, and a scene point P with its image p]

Why is it so difficult?

Intrinsic ambiguity of the mapping from 3D to image (2D)

Recovering structure from a single view

Slide courtesy of S. Lazebnik

Two eyes help!

[Figure: two views with centers O1, O2 and image points x1, x2; the 3D point is unknown]

Two eyes help!

This is called triangulation

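Triangulation

K = known for both cameras; R, T between the cameras known

X is the intersection of the two lines of sight l and l'

Find X that minimizes

$$d^2(x_1, M_1 X) + d^2(x_2, M_2 X)$$

[Figure: camera centers O1, O2, image points x1, x2, and the 3D point X]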
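A minimal sketch of triangulation follows, assuming both projection matrices are known. It uses the linear (algebraic) formulation solved with an SVD, which is a common starting point; it does not itself minimize the reprojection distance d² stated above, which would need a non-linear refinement. All names and the two-camera setup are assumptions.

```python
import numpy as np

def project_point(M, X):
    """Project a 3D point X with a 3x4 matrix M; returns pixel coordinates."""
    x = M @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate_linear(M1, M2, x1, x2):
    """Linear triangulation: stack u*m3 - m1 = 0 and v*m3 - m2 = 0 for both views."""
    A = np.stack([
        x1[0] * M1[2] - M1[0],
        x1[1] * M1[2] - M1[1],
        x2[0] * M2[2] - M2[0],
        x2[1] * M2[2] - M2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]                 # homogeneous solution, defined up to scale
    return Xh[:3] / Xh[3]

# Two assumed cameras sharing K, the second shifted along the x axis (a stereo pair).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
M1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
M2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
x1, x2 = project_point(M1, X_true), project_point(M2, X_true)
X_est = triangulate_linear(M1, M2, x1, x2)
```

With exact measurements the two lines of sight intersect and the linear solution recovers X; with noise they generally do not intersect, which is why the problem is posed as a minimization.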

Stereo-view geometry

Correspondence: Given a point in one image, how can I find the corresponding point x in another one?

Camera geometry: Given corresponding points in two images, find camera matrices, position and pose.

Scene geometry: Find coordinates of 3D point from its projection into 2 or multiple images.

Epipolar geometry

- Epipolar plane
- Baseline
- Epipolar lines
- Epipoles e1, e2
  = intersections of baseline with image planes
  = projections of the other camera center

[Figure: camera centers O1, O2, 3D point X, image points x1, x2, and epipoles e1, e2]

Example: Converging image planes

[Figure: converging image planes with centers O1, O2, point X, image points x1, x2, and epipoles e1, e2]

Example: Parallel image planes

Baseline intersects the image plane at infinity

Epipoles are at infinity

Epipolar lines are parallel to x axis

Example: Parallel Image Planes

[Figure: both epipoles e at infinity]

Epipolar Constraint

[Figure: camera centers O1, O2, 3D point P, image points p1, p2, and epipoles e1, e2]

$$p_1^T\,F\,p_2 = 0$$

- F p2 is the epipolar line associated with p2 (l1 = F p2)
- F^T p1 is the epipolar line associated with p1 (l2 = F^T p1)
- F e2 = 0 and F^T e1 = 0
- F is a 3x3 matrix; 7 DOF
- F is singular (rank two)
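The properties above can be checked numerically. The sketch below builds F from assumed camera parameters, under the assumption that a point with coordinates X1 in the first camera frame has coordinates X2 = R X1 + T in the second; the helper names and numeric values are illustrative only.

```python
import numpy as np

def skew(t):
    """Cross-product matrix: skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K1, K2, R, T):
    """F such that p1^T F p2 = 0, assuming X2 = R X1 + T between camera frames."""
    E = skew(T) @ R                              # satisfies X2^T E X1 = 0
    return np.linalg.inv(K1).T @ E.T @ np.linalg.inv(K2)

K1 = K2 = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
T = np.array([1.0, 0.0, 0.2])
F = fundamental_matrix(K1, K2, R, T)

# Project a few 3D points into both views and evaluate the epipolar constraint.
rng = np.random.default_rng(1)
X1 = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(5, 3))
X2 = X1 @ R.T + T
p1 = X1 @ K1.T
p1 = p1 / p1[:, 2:3]
p2 = X2 @ K2.T
p2 = p2 / p2[:, 2:3]
residuals = np.einsum("ni,ij,nj->n", p1, F, p2)  # p1^T F p2 for each pair
e2 = K2 @ T                                      # image of camera 1's center in view 2
```

Each residual is zero up to rounding, and F maps the epipole e2 to the zero vector, consistent with F being rank two.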

Why is F useful?

- Suppose F is known

- No additional information about the scene and camera is given

- Given a point on left image, how can I find the corresponding point on right image?

l' = F^T x (the corresponding point lies on the epipolar line l' in the right image)

Why is F useful?

F captures information about the epipolar geometry of 2 views + camera parameters

MORE IMPORTANTLY: F gives constraints on how the scene changes under view point transformation (without reconstructing the scene!)

Powerful tool in:
- 3D reconstruction
- Multi-view object/scene matching

Multiple view geometry

Structure from motion problem

[Figure: 3D point Xj observed as x1j, x2j, ..., xmj by cameras M1, M2, ..., Mm]

Given m images of n fixed 3D points

$$x_{ij} = M_i X_j, \qquad i = 1, \dots, m, \quad j = 1, \dots, n$$

From the m×n correspondences x_ij, estimate:

- m projection matrices M_i
- n 3D points X_j

Structure from motion problem

[Figure: cameras M1, M2, ..., Mm (motion) observe the point Xj (structure) as x1j, x2j, ..., xmj]

2010.12.18 69

Structure from motion ambiguity

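$$x_{ij} = M_i X_j, \qquad M_i = K_i\,[R_i \;\; T_i]$$

$$X_j \rightarrow H X_j, \qquad M_i \rightarrow M_i H^{-1}$$

SFM can be solved up to an N-degree-of-freedom ambiguity

In the general case (nothing is known) the ambiguity is expressed by an arbitrary affine or projective transformation

$$x_{ij} = M_i X_j = (M_i H^{-1})(H X_j)$$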
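The ambiguity is easy to verify numerically: transforming the points by H and the cameras by the inverse of H leaves every image measurement unchanged. The matrices below are arbitrary values chosen for illustration.

```python
import numpy as np

def apply_ambiguity(M, X_h, H):
    """Return the transformed pair (M H^-1, H X) for a 4x4 invertible H."""
    return M @ np.linalg.inv(H), H @ X_h

M = np.array([[500.0, 0.0, 320.0, 10.0],
              [0.0, 500.0, 240.0, -20.0],
              [0.0, 0.0, 1.0, 5.0]])
X_h = np.array([0.5, -0.3, 2.0, 1.0])       # homogeneous 3D point

# An arbitrary invertible projective transformation (last row is not [0, 0, 0, 1]).
H = np.array([[2.0, 0.1, 0.0, 1.0],
              [0.0, 1.0, 0.3, 0.0],
              [0.0, 0.0, 3.0, -1.0],
              [0.1, 0.0, 0.2, 1.0]])

M_alt, X_alt = apply_ambiguity(M, X_h, H)
x = M @ X_h
x_alt = M_alt @ X_alt
```

Both reconstructions explain the images equally well, so the measurements alone cannot select one; extra knowledge about the cameras or the scene is needed.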


Affine ambiguity

Projective ambiguity

Self-calibration

Condition                                                     N. Views
Constant internal parameters                                  3
Aspect ratio and skew known; focal length and offset vary     4
Aspect ratio and skew known; focal length and offset vary     5
skew = 0, all other parameters vary                           8

- Prior knowledge on cameras or scene can be used to add constraints and remove ambiguities
- Obtain metric reconstruction (up to scale)

Bundle adjustment

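[Figure: 3D point Xj, its observations x1j, x2j, x3j in three views, and its reprojections M1Xj, M2Xj, M3Xj]

Non-linear method for refining structure and motion

Minimizing re-projection error

It can be used before or after metric upgrade

$$E(M, X) = \sum_{i=1}^{m} \sum_{j=1}^{n} D(x_{ij}, M_i X_j)^2$$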
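The cost that bundle adjustment minimizes can be written down directly. The sketch below only evaluates the re-projection error; it assumes D is the Euclidean distance in the image and adds an optional visibility mask (an assumption, one simple way to skip missing observations). A full bundle adjustment would pass this cost to a non-linear least-squares solver, which is not shown.

```python
import numpy as np

def reprojection_error(Ms, Xs, xs, visible=None):
    """E(M, X) = sum_i sum_j D(x_ij, M_i X_j)^2 over the visible observations.

    Ms: list of m 3x4 matrices; Xs: n x 3 points; xs: m x n x 2 observations;
    visible: optional m x n boolean mask (True where x_ij was measured).
    """
    Xs = np.asarray(Xs, dtype=float)
    xs = np.asarray(xs, dtype=float)
    Xh = np.hstack([Xs, np.ones((len(Xs), 1))])
    total = 0.0
    for i, M in enumerate(Ms):
        proj = Xh @ M.T
        proj = proj[:, :2] / proj[:, 2:3]
        sq = np.sum((xs[i] - proj) ** 2, axis=1)   # squared distance per point
        if visible is not None:
            sq = sq[np.asarray(visible[i], dtype=bool)]
        total += float(sq.sum())
    return total

# Two assumed cameras and three points with exact observations.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
Ms = [K @ np.hstack([np.eye(3), np.zeros((3, 1))]),
      K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])]
Xs = np.array([[0.0, 0.0, 4.0], [0.5, 0.2, 5.0], [-0.4, 0.3, 6.0]])
Xh = np.hstack([Xs, np.ones((3, 1))])
xs = np.stack([(Xh @ M.T)[:, :2] / (Xh @ M.T)[:, 2:3] for M in Ms])
E_exact = reprojection_error(Ms, Xs, xs)

# Shift one observation by (3, 4) pixels: the error becomes 3^2 + 4^2.
xs_noisy = xs.copy()
xs_noisy[1, 2] += np.array([3.0, 4.0])
E_noisy = reprojection_error(Ms, Xs, xs_noisy)
```

The number of unknowns grows with both m and n, which is the "large minimization problem" limitation, and the non-linear cost is why a good initial estimate matters.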


Bundle adjustment

Advantages
- Handle large number of views
- Handle missing data

Limitations
- Large minimization problem (parameters grow with number of views)
- Requires good initial condition


Results and applications

Courtesy of Oxford Visual Geometry Group

Levoy et al., 00

Hartley & Zisserman, 00

Dellaert et al., 00

Rusinkiewicz et al., 02

Nistér, 04

Brown & Lowe, 04

Schindler et al, 04

Lourakis & Argyros, 04

Colombo et al. 05

Golparvar-Fard, et al. JAEI 10

Pandey et al. IFAC , 2010

Pandey et al. ICRA 2011

Microsoft's PhotoSynth

Snavely et al., 06-08

Schindler et al., 08

Agarwal et al., 09

Frahm et al., 10

Lucas & Kanade, 81

Chen & Medioni, 92

Debevec et al., 96

Levoy & Hanrahan, 96

Fitzgibbon & Zisserman, 98

Triggs et al., 99

Pollefeys et al., 99

Kutulakos & Seitz, 99

M. Pollefeys et al. 98



Noah Snavely, Steven M. Seitz, Richard Szeliski, "Photo tourism: Exploring photo collections in 3D," ACM Transactions on Graphics (SIGGRAPH Proceedings), 2006,

http://phototour.cs.washington.edu/Photo_Tourism.pdf

Computer vision




Classification: Does this image contain a building? [yes/no]

Yes!

Classification: Is this a beach?

Image Search

Organizing photo collections

http://www.altavista.com/

http://www.picsearch.com/

Detection: Does this image contain a car? [where?]

car

Building

clock

person

car

Detection: Which object does this image contain? [where?]

clock

Detection: Accurate localization (segmentation)

Object detection is useful

Surveillance

Assistive technologies

Security

Assistive driving

Computational photography

http://www.scottcamazine.com/photos/SecurityXrays/images/briefcase7_jpg.jpg

Categorization vs single instance recognition

Which building is this? Marshall Field building in Chicago

Where is the crunchy nut?

Categorization vs single instance recognition

+ GPS

Recognizing landmarks in mobile platforms

Applications of computer vision

Object: Person, back; 1-2 meters away

Object: Police car, side view, 4-5 m away

Object: Building, 45° pose, 8-10 meters away. It has bricks

Detection: Estimating object semantic & geometric attributes

Activity or event recognition

What are these people doing?

Visual Recognition

Design algorithms that are able to

Classify images or videos

Detect and localize objects

Estimate semantic and geometrical attributes

Classify human activities and events

Why is this challenging?

How many object categories are there?

Challenges: viewpoint variation

Michelangelo 1475-1564 slide credit: Fei-Fei, Fergus & Torralba

Challenges: illumination

image credit: J. Koenderink

Challenges: scale

slide credit: Fei-Fei, Fergus & Torralba

Challenges: deformation

Challenges: occlusion

Magritte, 1957 slide credit: Fei-Fei, Fergus & Torralba

Challenges: background clutter

Kilmeny Niland. 1995

Challenges: intra-class variation

Basic properties

Representation

How to represent an object category; which classification scheme?

Learning

How to learn the classifier, given training data

Recognition

How the classifier is to be used on novel data

Representation

- Building blocks: Sampling strategies

- Interest operators
- Multiple interest operators
- Dense, uniformly
- Randomly

Image credits: F-F. Li, E. Nowak, J. Sivic

Representation

- Building blocks: Choice of descriptors

[SIFT, HOG, codewords.]

Representation

Appearance only or location and appearance

Representation

Invariances

View point

Illumination

Occlusion

Scale

Deformation

Clutter

etc.

Representation

To handle intra-class variability, it is convenient to describe object categories using probabilistic models

Object models: Generative vs Discriminative vs hybrid

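Object categorization: the statistical viewpoint

$$p(\text{zebra} \mid \text{image}) \quad \text{vs.} \quad p(\text{no zebra} \mid \text{image})$$

Bayes rule:

$$p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}$$

$$\frac{p(\text{zebra} \mid \text{image})}{p(\text{no zebra} \mid \text{image})} = \frac{p(\text{image} \mid \text{zebra})}{p(\text{image} \mid \text{no zebra})} \cdot \frac{p(\text{zebra})}{p(\text{no zebra})}$$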



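$$p(\text{zebra} \mid \text{image}) \quad \text{vs.} \quad p(\text{no zebra} \mid \text{image})$$

Bayes rule:

$$\underbrace{\frac{p(\text{zebra} \mid \text{image})}{p(\text{no zebra} \mid \text{image})}}_{\text{posterior ratio}} = \underbrace{\frac{p(\text{image} \mid \text{zebra})}{p(\text{image} \mid \text{no zebra})}}_{\text{likelihood ratio}} \cdot \underbrace{\frac{p(\text{zebra})}{p(\text{no zebra})}}_{\text{prior ratio}}$$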



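Bayes rule:

$$\underbrace{\frac{p(\text{zebra} \mid \text{image})}{p(\text{no zebra} \mid \text{image})}}_{\text{posterior ratio}} = \underbrace{\frac{p(\text{image} \mid \text{zebra})}{p(\text{image} \mid \text{no zebra})}}_{\text{likelihood ratio}} \cdot \underbrace{\frac{p(\text{zebra})}{p(\text{no zebra})}}_{\text{prior ratio}}$$

Discriminative methods model the posterior

Generative methods model the likelihood and the prior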
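A small worked example of the ratio form of Bayes rule, using made-up numbers: the likelihoods and the prior below are assumptions chosen only to show how a strong likelihood ratio can still be outweighed by a small prior.

```python
def posterior_ratio(lik_zebra, lik_no_zebra, prior_zebra):
    """posterior ratio = likelihood ratio * prior ratio."""
    likelihood_ratio = lik_zebra / lik_no_zebra
    prior_ratio = prior_zebra / (1.0 - prior_zebra)
    return likelihood_ratio * prior_ratio

# Hypothetical values: the image is 8x more likely under "zebra",
# but zebras appear in only 10% of images.
ratio = posterior_ratio(lik_zebra=0.08, lik_no_zebra=0.01, prior_zebra=0.1)
p_zebra = ratio / (1.0 + ratio)        # convert the ratio back to a probability
decision = "zebra" if ratio > 1.0 else "no zebra"
```

Here the likelihood ratio is 8 and the prior ratio is 1/9, so the posterior ratio is 8/9, just under 1, and the decision is "no zebra". A generative method would estimate the two likelihoods and the prior separately, while a discriminative method would model the posterior directly.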

Discriminative models

- Support Vector Machines: Guyon, Vapnik, Heisele, Serre, Poggio
- Boosting: Viola, Jones 2001; Torralba et al. 2004; Opelt et al. 2006
- Nearest neighbor (10^6 examples): Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005...
- Neural networks: LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998
- Latent SVM, Structural SVM: Felzenszwalb 00; Ramanan 03

Slide adapted from Antonio Torralba. Courtesy of Vittorio Ferrari. Slide credit: Kristen Grauman

Generative models

Naïve Bayes classifier: Csurka, Bray, Dance & Fan, 2004

Hierarchical Bayesian topic models (e.g. pLSA and LDA)
- Object categorization: Sivic et al. 2005, Sudderth et al. 2005
- Natural scene categorization: Fei-Fei et al. 2005

2D part based models
- Constellation models: Weber et al 2000; Fergus et al 200
- Star models: ISM (Leibe et al 05)

3D part based models
- Multi-aspects: Sun et al, 2009

Basic properties

Representation


Learning


Recognition


Learning parameters: What are you maximizing? Likelihood (Gen.) or performance on train/validation set (Disc.)

Learning


Level of supervision: manual segmentation; bounding box; image labels; noisy labels

Learning

Batch/incremental

Priors


Level of supervision: manual segmentation; bounding box; image labels; noisy labels

Learning

Batch/incremental

Training images:
- Issue of overfitting
- Negative images for discriminative methods

Priors

Basic properties

Representation


Learning


Recognition


Recognition task: classification, detection, etc.

Recognition

Recognition

Recognition task

Search strategy: Sliding Windows

Simple

Computational complexity (x,y, S, , N of classes)

- BSW by Lampert et al 08

- Also, Alexe, et al 10

Viola, Jones 2001,
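To make the cost of sliding windows concrete, here is a minimal sketch that enumerates candidate windows over positions and scales. The function name, the stride, and the window sizes are assumptions; a real detector would also score each window with a classifier, once per object class.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride, scales):
    """Yield (x, y, w, h) for every window position at every scale."""
    for s in scales:
        w, h = int(round(win_w * s)), int(round(win_h * s))
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)

# A tiny 64x48 image, 16x16 base window, stride 8, two scales.
windows = list(sliding_windows(64, 48, 16, 16, stride=8, scales=(1.0, 2.0)))
num_windows = len(windows)   # 7*5 windows at scale 1 plus 5*3 at scale 2
```

Even this toy setting produces 50 windows; the count multiplies with finer strides, more scales, and more classes, which is the computational complexity noted on the slide.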

Recognition

Recognition task


Simple


Localization

Objects are not boxes



Viola, Jones 2001,

Segmentation

Bottom up segmentation

Semantic segmentation

Felzenszwalb and Huttenlocher, 2004

Malik et al. 01

Maire et al. 08

Duygulu et al. 02

Recognition

Recognition task


Simple


Localization

Objects are not boxes

Prone to false positive



Non max suppression:

Canny 86

...

Desai et al., 2009

Viola, Jones 2001,
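Non-max suppression for detection windows is often implemented greedily: keep the highest-scoring box and drop any remaining box that overlaps it too much. The sketch below shows that common variant; it is not the specific method of Canny (which addresses edge detection) or of Desai et al. The box format (x1, y1, x2, y2) and the overlap threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

The first two boxes overlap heavily (IoU of 81/119), so only the higher-scoring one survives, while the distant third box is kept.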

Successful methods using sliding windows

[Dalal & Triggs, CVPR 2005]

- Subdivide scanning window
- In each cell compute a histogram of gradient orientations

Code available: http://pascal.inrialpes.fr/soft/olt/

- Subdivide scanning window

- In each cell compute histogram of codewords of adjacent segments

[Ferrari & al, PAMI 2008]

Code available: http://www.vision.ee.ethz.ch/~calvin


Recognition task

Search strategy: Probabilistic heat maps

Recognition

Original

image

Fergus et al 03

Leibe et al 04

Recognition task

Search strategy: Hypothesis generation + verification

Recognition

Recognition

Category: car

Azimuth = 225°

Zenith = 30°

Savarese, 2007

Sun et al 2009

Liebelt et al., 08, 10

Farhadi et al 09

- It has metal

- it is glossy

- has wheels

Farhadi et al 09

Lampert et al 09

Wang & Forsyth 09

Recognition task

Search strategy

Attributes

Recognizing 3D objects

CHAIR

BED

TABLE

Xiang & Savarese, 2012-2014

CAR

Semantic: Torralba et al 03

Rabinovich et al 07

Gupta & Davis 08

Heitz & Koller 08

L-J Li et al 08

Bang & Fei-Fei 10

Recognition

Recognition task

Search strategy

Attributes

Context

Geometric: Hoiem et al 06

Gould et al 09

Bao, Sun, Savarese 10

LabelMe dataset [Russell et al., 08]

Recognition in context

Bao, Sun, Savarese CVPR 2010; BMVC 2010; CIVC 2011 (editor choice); IJCV 2012


Recognition

Recognition task

Search strategy

Attributes

Context

Tracking

State-of-the-art

Object Tracking

Xiang & Savarese, 2012-2014

Object tracking from Lidar


Held, Thrun & Savarese, RSS 2014


Current state of computer vision



Perceiving the World in 3D!

Biederman, Mezzanotte and Rabinowitz, 1982


Sensibility as human perception

V1

Pre-frontal cortex

"where" pathway (dorsal stream)

"what" pathway (ventral stream)

From images to the 3D scenes

Choi & Savarese, 2013

A 3DGP encodes geometric and semantic relationships between groups of objects and space elements which frequently co-occur in spatially consistent configurations.



Training Dataset 3DGPs


Sofa, Coffee Table, Chair, Bed, Dining Table, Side Table

Estimated Layout 3D Geometric Phrases


Car Person Tree Sky

Street Building Else

From images to 3D scenes

Bao & Savarese, 2011-2013


Choi & Savarese, 2011-2014

From videos to 3D dynamic scenes

- Monocular cameras
- Un-calibrated cameras
- Arbitrary motion
- Highly cluttered scenes: occlusion, background clutter

Almost in real time!

Sensors

Objects

Summary

3D physical environment

Documents

Vision