
Queensland University of Technology, CyPhy Lab

Vision-only place recognition

Peter Corke

http://tiny.cc/cyphy

ICRA 2014 Workshop on Visual Place Recognition in Changing Environments

cyphy laboratory

Navigation system

• Integrative component
– dead reckoning: odometry, VO, inertial etc.

cyphy laboratory

Navigation system

• Integrative component
– dead reckoning: odometry, VO, inertial etc.

• Exteroceptive component
– GPS, visual place recognition, landmark recognition

cyphy laboratory

The core problem

• Given a new image of a place, determine which previously seen image is the most similar, from which we infer similarity of place

• Similar to the CV image retrieval problem
– Differences:

• we can assume temporal and spatial continuity (locality) across images in the sequence

• viewpoint might be quite different
• the scene might appear different due to external factors
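As a concrete illustration of this retrieval view, here is a minimal sketch (not any particular published system; the function names, the cosine similarity and the Gaussian locality weighting are illustrative assumptions):

```python
import numpy as np

def match_place(query_desc, db_descs, prev_idx=None, locality_sigma=5.0):
    """Return the index of the most similar previously seen image.

    query_desc     : (D,) descriptor of the current image
    db_descs       : (N, D) descriptors of previously seen images
    prev_idx       : database index matched for the previous frame
    locality_sigma : width (in frames) of the temporal/spatial locality prior
    """
    # cosine similarity between the query and every stored descriptor
    q = query_desc / (np.linalg.norm(query_desc) + 1e-12)
    db = db_descs / (np.linalg.norm(db_descs, axis=1, keepdims=True) + 1e-12)
    sim = db @ q

    # exploit temporal/spatial continuity: down-weight candidates far from
    # the previously matched place
    if prev_idx is not None:
        idx = np.arange(len(db_descs))
        sim = sim * np.exp(-0.5 * ((idx - prev_idx) / locality_sigma) ** 2)

    return int(np.argmax(sim))
```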

cyphy laboratory

Semantic classification

• Given a new image of a place, determine what type of place it is
– e.g. kitchen, bathroom, auditorium

• Can be useful if we have strong priors like a map with labelled places

• Can be useful if place types are unique within the environment

cyphy laboratory

Issue #1: Appearance & geometry

• Geometry is the 3D structure of the world
• Appearance is a 2D projection of the world
• Geometry → appearance (computer graphics)
• Appearance ↛ geometry (the inverse is not unique)

cyphy laboratory

...issue #1: Appearance & geometry

• door or not a door?

cyphy laboratory

Issue #2: Confounding factors

Weather and lighting · Shadows · Seasons

Image credits (L to R): Milford and Wyeth (ICRA 2012), Corke et al. (IROS 2013), Neubert et al. (ECMR 2013).

cyphy laboratory

Issue #3: Distractors

• Many pixels in the scene are not discriminative
– sky
– road
– etc.

cyphy laboratory

Issue #4: Aliasing

• Where am I?
– Can I tell?
– Does it matter if I can’t?

cyphy laboratory

Issue #5: Viewpoint

• What do we actually mean by place?
– Is this the same place?
– What if the same location, but facing the other way?

cyphy laboratory

...issue #5: Viewpoint

• Viewpoint affects the scene globally
– all pixels change

• However, small elements of the scene are unchanged (invariant)
– just shifted

cyphy laboratory

Issue #6: Getting good images

• Robots move
– motion blur

• Huge dynamic range outdoors
– from dark shadows to highlights

• Huge variation in mean illumination
– 0.001 lx moonless with clouds
– 0.27 lx full moon
– 500 lx office lighting
– 100,000 lx direct sunlight

• Color constancy

• Appearance is a function of
– scene 3D geometry
– materials
– viewpoint
– lighting changes (intensity, color)
– exogenous factors (leaves, rain, snow)

• This function is complex, non-linear and not invertible

• Lots of undiscriminative stuff like sky, road etc.

cyphy laboratory

Summary: the nub of the problem

cyphy laboratory

The easy way out - go for geometry

• Roboticists began to use laser scanners in the early 1990s

• Increase in resolution, rotation rate, reflectance data
• Maximum range and cost little changed


cyphy laboratory

Why do we like lasers?

• metric
• sufficient range
• we are suckers for colored 3D models of our world

cyphy laboratory

Measurement principles

• Time of flight
• Phase shift
• Frequency modulated continuous wave (FMCW) or chirp
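For reference (standard textbook relations, not from the slide), each principle recovers range $R$ from a time, phase or frequency measurement, with $c$ the speed of light:

$$R = \frac{c\,\Delta t}{2} \;\;\text{(time of flight)}, \qquad R = \frac{c\,\Delta\phi}{4\pi f_{mod}} \;\;\text{(phase shift, modulo } c/2f_{mod}\text{)}, \qquad R = \frac{c\,\Delta f}{2S} \;\;\text{(FMCW, chirp slope } S\text{)}$$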

cyphy laboratory

2D scanning

• High speed rotating mirror
• Typically a pulse every 0.5 deg

cyphy laboratory

The curse of 1/R⁴
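A hedged aside on the title (assuming it refers to the standard radar-style range equation, in which both the outgoing and returning energy spread): the received power falls off as the fourth power of range,

$$P_r = \frac{P_t\,G_t\,A_r\,\sigma}{(4\pi)^2 R^4},$$

so doubling the range cuts the return by a factor of 16. A well-collimated scanning laser against a diffuse target behaves closer to 1/R², but long range still demands high pulse energy or a large receive aperture.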

cyphy laboratory

3D scanning

• 2-axis scanner
• multi-beam laser
• pushbroom
• flash LIDAR

cyphy laboratory


Long range flash LIDAR

• 1 foot resolution at 1km

US Patent 4,935,616 (1990)

cyphy laboratory

Laser sensing

✓ Clearly sufficient
✓ We have great algorithms for scan matching, building maps, closing loops etc.
✓ Great hardware: SICK, Velodyne

- Price point still too high
- How will we cope with many vehicles using the same sensor?
- Misses out on color and texture.

cyphy laboratory

The perpetual promise of vision

• Visual place recognition is possible

cyphy laboratory

The (amazing) sense of vision

• eye invented 540 million years ago

• 10 different eye designs
• lensed eye invented 7 times

cyphy laboratory

Compound Eyes of a Holocephala fusca Robber Fly 

cyphy laboratory

Anterior Median and Anterior Lateral Eyes of an Adult Female Phidippus putnami Jumping Spider 


cyphy laboratory

Datasheet for the eye/brain system

• 4.5M cone cells
– 150,000 per mm² (~2 µm square)
– daylight only

• 100M rod cells
– night time only
– respond to a single photon

• Total dynamic range 1,000,000:1 (20 bits)

• Human brain
– 1.5 kg
– 10¹¹ neurons
– ~20 W
– ~1/3 for vision

cyphy laboratory

We’ve been here before

• Eureka project 1987-95
• 1000 km on Paris highways, up to 130 km/h
• 1600 km Munich to Copenhagen, overtaking, up to 175 km/h
• distance between interventions: mean 9 km, max 158 km

1987

cyphy laboratory

...we’ve been here before

98% autonomous

1995

cyphy laboratory

Approaches to robust place recognition

• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure
– laser, stereo, active range camera (e.g. Kinect)

cyphy laboratory

Getting good images

underexposed

flare

blurry

$$F = \frac{f}{\phi}, \qquad e \propto G\left(\frac{L\,T\,A\cos^4\theta}{4F^2} + h\right)$$

cyphy laboratory

Pixel brightness

$$e \propto G\left(\frac{L\,T\,A\cos^4\theta}{4F^2} + h\right)$$

where e is the pixel brightness, L the scene luminance, T the exposure time, A the pixel area, θ the angle off the optical axis, F the f-number (F = f/ϕ, focal length over aperture diameter), G the sensor gain and h the noise.
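A quick numeric check of the relation above (using the reconstructed formula, so treat it as illustrative): brightness scales with T and with 1/F², so

$$\frac{e_2}{e_1} \approx \frac{T_2}{T_1}\left(\frac{F_1}{F_2}\right)^2,$$

e.g. stopping down from F = 2.8 to F = 5.6 cuts the light by a factor of 4, which must be recovered by a 4× longer exposure (more motion blur) or 4× more gain (more noise).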

cyphy laboratory

Exposure time T

• T has to be compatible with camera + scene dynamics

cyphy laboratory

Increase ϕ

cyphy laboratory

Photon boosting

cyphy laboratory

Log-response cameras

• Similar dynamic range to the human eye (10⁶), but no slow chemical adaptation time
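The logarithmic response compresses scene dynamic range roughly as (a standard model, consistent with the sensor excerpted below)

$$V \approx V_0 + k\,\log_{10} E,$$

so with k ≈ 250 mV per decade, six decades of illuminance map into only about 1.5 V of output swing.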

[Slide shows pages from the paper cited below: a 384×288-pixel self-calibrating logarithmic CMOS image sensor evaluated over more than 6 decades of intensity, with a response slope of about 250 mV per decade (adjustable between 130 and 720 mV/decade), residual fixed-pattern noise of roughly 0.63% of the total dynamic range after self-calibration, and Fig. 16 comparing a high-dynamic-range scene (a bright bulb plus a printed logo) captured by the logarithmic sensor, which sees both at once, and by a CCD camera, which cannot.]

Markus Loose, “A Self-Calibrating CMOS Image Sensor with Logarithmic Response”, Ph.D. thesis, Institut für Hochenergiephysik, Universität Heidelberg, 1999.

cyphy laboratory

Human visual response

[Figure: spectral responses of the S, M and L cones.]

• Rods
– night time

• Cones
– daytime
– 3 flavours


The silicon equivalent

[Diagram: light passes through a colored filter array (CFA) onto the silicon photosensor, or pixel.]


Dichromats

cyphy laboratory

Why stop at 3 cones?

• FluxData Inc.
• FS-1665
• 3 Bayer + 2 NIR
• 3 CCDs

cyphy laboratory

Multispectral cameras

cyphy laboratory

Assorted pixel arrays

[Figure residue from the cited work: Figure 2, “Nyquist limits of previous assorted designs used with sub-micron pixel image sensors”; (a) a 3-color, 4-exposure CFA and its Nyquist limits, (b) a 7-color, 1-exposure CFA and its Nyquist limits, each plotted against horizontal and vertical spatial frequency together with the optical resolution limit (N = f/5.6, λ = 555 nm). The accompanying text notes that, by the sampling theorem, aliasing does not occur at pixels whose Nyquist limit exceeds the highest frequency of the input signal.]

• Better dynamic range
– 2×2 Bayer filter cells with 3 levels of neutral density filter

• More colors
– 3×3 or 4×4 filter cells ➙ 9 or 16 primaries

Wide field of view

cyphy laboratory

Approaches to robust place recognition

• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure
– laser, stereo, active range camera (e.g. Kinect)

cyphy laboratory

Whole scene descriptors

• GIST
• HoG
• SIFT/SURF on whole image
• Color histograms
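A minimal sketch in the spirit of this list (the 4×4 tiling, bin count and Euclidean distance are illustrative choices, not any particular published descriptor):

```python
import numpy as np

def whole_image_descriptor(gray, grid=(4, 4), bins=8):
    """Concatenate per-tile intensity histograms into one global descriptor.

    gray : 2-D array of grey values scaled to [0, 1]
    """
    H, W = gray.shape
    gh, gw = grid
    desc = []
    for i in range(gh):
        for j in range(gw):
            tile = gray[i * H // gh:(i + 1) * H // gh,
                        j * W // gw:(j + 1) * W // gw]
            hist, _ = np.histogram(tile, bins=bins, range=(0.0, 1.0))
            desc.append(hist / max(tile.size, 1))   # normalise per tile
    return np.concatenate(desc)

def descriptor_distance(d1, d2):
    """Euclidean distance between two whole-image descriptors."""
    return float(np.linalg.norm(d1 - d2))
```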

cyphy laboratory

...issue #5: Viewpoint

• Viewpoint affects the scene globally
– all pixels change

• However, small elements of the scene are unchanged (invariant)
– just shifted

cyphy laboratory

Visual elements

• Bag of visual words (BoW)
– FABMAP, OpenFABMAP

cyphy laboratory

• Feature-detection front ends fail completely across extreme perceptual change

cyphy laboratory

Future work

• Really interesting recent work on learning distinctive elements of a scene

• Contextual priming, choose the features for the situation
– day/night
– indoor/outdoor

cyphy laboratory

Approaches to robust place recognition

• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure
– laser, stereo, active range camera (e.g. Kinect)

cyphy laboratory

Understand variation over time

• Traditional visual localization methods are not robust to appearance change

• How do features change over time?
• Can we predict appearance based on time?

cyphy laboratory

Generalization of Temporal Change over Space

• Assume we have a “training set” of paired image sequences of the same locations at two different times of day

cyphy laboratory

Training Images

• Use known matched images to generate a temporal “codebook” across the two appearance configurations
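A minimal sketch of the codebook idea (assuming aligned patch descriptors from the two conditions; the k-means vocabulary and per-word mean prediction are illustrative simplifications, not the authors' exact method):

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(desc_day, desc_night, n_words=64, seed=0):
    """Learn a word -> appearance-change mapping from matched patch descriptors.

    desc_day, desc_night : (N, D) descriptors of the *same* patches observed
                           under the two conditions (row i pairs with row i).
    """
    km = KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(desc_day)
    night_means = np.zeros((n_words, desc_night.shape[1]))
    for w in range(n_words):
        members = desc_night[km.labels_ == w]
        if len(members):
            night_means[w] = members.mean(axis=0)   # expected night appearance
    return km, night_means

def predict_appearance(desc_day_new, km, night_means):
    """Predict how newly observed day-time patches would look at night."""
    return night_means[km.predict(desc_day_new)]
```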

cyphy laboratory

Generalizing about change

cyphy laboratory

Generalising about change: results

cyphy laboratory

Approaches to robust place recognition

• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure
– laser, stereo, active range camera (e.g. Kinect)

cyphy laboratory

Camera Resolutions...

courtesy Barry Hendy

Similar story for storage and compute

cyphy laboratory

Pixel subtended angle

• 10 Mpixel sensor, 30 deg FOV
– 0.01 deg per pixel

• 64 pixel sensor, 30 deg FOV
– 4 deg per pixel
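The arithmetic behind those numbers, assuming roughly square sensors:

$$\frac{30^\circ}{\sqrt{10^7}} \approx 0.0095^\circ \text{ per pixel}, \qquad \frac{30^\circ}{\sqrt{64}} \approx 3.75^\circ \text{ per pixel}.$$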

cyphy laboratory

Use fewer pixels

cyphy laboratory

How many pixels do you need?

Eynsham dataset or datasets with odometry, it is possible to use a small range or even single value of vk.

By considering the sum of sub-route difference scores s(i) as a sum of normally distributed random variables, each with the same mean and variance, the sum of normalized differences over a sub-route of length n frames has mean zero and variance n, assuming that frames are captured far enough apart to be considered independent. Dividing by the number of frames produces a normalized route difference score with mean zero, variance 1/n. Percentile rank scores can then be used to determine an appropriate sub-route matching threshold. For example, for the primary sub-route length n = 50 used in this paper, a sub-route threshold of -1 yields a 7.7×10⁻¹³ chance of the match occurring by chance.

To determine whether the current sub-route matches to any stored sub-routes, the minimum matching score is compared to a matching threshold sm. If the minimum score is below the threshold, the sub-route is deemed to be a match, otherwise the sub-route is assigned as a new sub-route. An example of the minimum matching scores over every frame of a dataset (the Eynsham dataset described in this paper) is shown in Figure 2. In the second half of the dataset the route is repeated, leading to lower minimum matching scores.

Figure 2. Normalized sub-route difference scores for the Eynsham dataset with the matching threshold sm that yields 100% precision performance.
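A minimal sketch of the sub-route statistic described above (the score matrix D and a unit velocity ratio are assumptions; the tail probability reproduces the quoted 7.7×10⁻¹³ figure for n = 50 and a threshold of −1):

```python
import numpy as np
from scipy.stats import norm

def subroute_scores(D, n=50):
    """Mean z-scored difference along every length-n sub-route (velocity = 1).

    D : (Q, M) array, D[i, j] = normalized difference between query frame i
        and stored frame j (lower means more similar).
    Entry (i, j) of the result pairs query frames i..i+n-1 with stored frames
    j..j+n-1; under the null hypothesis of independent, non-matching frames
    it has mean 0 and variance 1/n.
    """
    Q, M = D.shape
    out = np.empty((Q - n + 1, M - n + 1))
    for i in range(Q - n + 1):
        for j in range(M - n + 1):
            out[i, j] = np.diagonal(D[i:i + n, j:j + n]).mean()
    return out

def false_match_probability(s_m=-1.0, n=50):
    """Chance that a non-matching sub-route scores below the threshold s_m."""
    return float(norm.cdf(s_m * np.sqrt(n)))   # ~7.7e-13 for s_m=-1, n=50
```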

IV. EXPERIMENTAL SETUP

In this section we describe the four datasets used in this work and the image pre-processing for each study.

A. Datasets

A total of four datasets were processed, each of which consisted of two traverses of the same route. The datasets were: a 70 km road journey in Eynsham, the United Kingdom, 2 km of motorbike circuit racing in Rowrah, the United Kingdom, 40 km of off-road racing up Pikes Peak in the Rocky Mountains, the United States, and 100 meters in an Office building (italics indicate dataset names). The Eynsham route was the primary dataset on which extensive quantitative analysis was performed. The other datasets were added to provide additional evidence for the general applicability of the algorithm. Key dataset parameters are provided in Table I, including the storage space required to represent the entire dataset using low resolution images.

Figure 3 shows aerial maps and imagery of the Eynsham, Rowrah and Pikes Peak datasets, with lines showing the route that was traversed twice. The Eynsham dataset consisted of high resolution image captures from a Ladybug2 camera (circular array of five cameras) at 9575 locations spaced along the route. The Rowrah dataset was obtained from an onboard camera mounted on a racing bike. The Pikes Peak dataset was obtained from cameras mounted on two different racing cars racing up the mountain, with the car dashboard and structure cropped from the images. This cropping process could most likely be automated by applying some form of image matching process to small training samples from each of the camera types. The route consisted of heavily forested terrain and switchbacks up the side of a mountain, ending in rocky open terrain partially covered in snow.

TABLE I. DATASETS

Dataset     Distance   Frames   Distance between frames   Image storage
Eynsham     70 km      9575     6.7 m (median)            306 kB
Rowrah      2 km       440      4.5 m (mean)              7 kB
Pikes Peak  40 km      4971     8 m (mean)                159 kB
Office      53 m       832      0.13 m (mean)             1.6 kB

Videos/data: Rowrah http://www.youtube.com/watch?v=_UfLrcVvJ5o; Pikes Peak http://www.youtube.com/watch?v=4UIOq8vaSCc and http://www.youtube.com/watch?v=7VAJaZAV-gQ; Office http://df.arcs.org.au/quickshare/790eb180b9e87d53/data3.mat

Figure 3. The (a) 35 km Eynsham, (b) 1 km Rowrah and (c) 20 km Pikes Peak routes, each of which were repeated twice. Copyright 2011 Google.

Figure 4. (a) The Lego Mindstorms dataset acquisition rig with 2 sideways facing light sensors and GoPro camera for evaluation of matched routes. (b) The 53 meter long route which was repeated twice to create the dataset.

B. Image Pre-Processing

1) Eynsham Resolution Reduced Panoramic Images: For the Eynsham dataset, image processing consisted of image concatenation and resolution reduction (Figure 5). The raw camera images were crudely cropped to remove overlap between images. No additional processing such as camera undistortion, blending or illumination adjustment was performed. The subsequent panorama was then resolution reduced (re-sampling using pixel area relation in OpenCV 2.1.0) to the resolutions shown in Table II.

cyphy laboratory

Example Route Match - Eynsham

cyphy laboratory

Eynsham Resolution Reduction Results

[Plot annotation: “direction of goodness”]

cyphy laboratory

Eynsham Pixel Bit Depth Results

32 pixel images

2 bit image!

cyphy laboratory

Approaches to robust place recognition

• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure
– laser, stereo, active range camera (e.g. Kinect)

cyphy laboratory

Eynsham Sequence Length Results

32 pixel images

cyphy laboratory

Milford and Wyeth, ICRA2012

cyphy laboratory

Remember ALVINN?

• Back in the 70s and 80s, robotics, AI and computer vision research only used low resolution images
– Camera limitations
– Compute limitations
– Algorithm limitations

cyphy laboratory

Approaches to robust place recognition

• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure
– laser, stereo, active range camera (e.g. Kinect)

cyphy laboratory

Shadows are everywhere! Yet, the human visual system is so adept at filtering them out, that we never give shadows a second thought; that is until we need to deal with them in our algorithms. Since the very beginning of computer vision, the presence of shadows has been responsible for wreaking havoc on a variety of applications....

Lalonde, Efros, Narasimhan, ECCV 2010

cyphy laboratory

[Figure: pixel intensity, and the color ratios R/G and B/G, plotted against distance along an image profile (pixels).]

cyphy laboratory

Blackbody illuminants

[Figure: example illuminant colour temperatures T = 2000-3000 K, T = 3000 K, T = 5000-5400 K and T = 8000-10000 K shown in the (log r_R, log r_B) plane.]

$$r_R = \frac{R}{G}, \qquad r_B = \frac{B}{G}$$

$$\log r_R = c_1 - \frac{c_2}{T}, \qquad \log r_B = c_1' - \frac{c_2'}{T}$$

cyphy laboratory

Log-log chromaticity

[Figure: points in the (log r_R, log r_B) plane move along a line of increasing T as the illuminant colour temperature changes, while the orthogonal direction encodes material property. Accompanying plots: the image (u, v axes in pixels) and invariant image variance versus invariant line angle (rad).]

cyphy laboratory

Angle of the projection line

[Figure: the projection line at angle θ in the (log r_R, log r_B) plane.]
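A minimal sketch (not the paper's exact implementation) of computing the invariant image from these log-chromaticities; the angle search uses the simple variance criterion suggested by the plot on the previous slide, whereas the published methods use entropy or Jensen-Rényi divergence:

```python
import numpy as np

def invariant_image(rgb, theta):
    """Project log-chromaticities along direction theta (radians).

    rgb : (H, W, 3) float image with values > 0.
    """
    eps = 1e-6
    r_R = np.log((rgb[..., 0] + eps) / (rgb[..., 1] + eps))   # log(R/G)
    r_B = np.log((rgb[..., 2] + eps) / (rgb[..., 1] + eps))   # log(B/G)
    # a change of Planckian illuminant moves every pixel along one common
    # direction in the (r_R, r_B) plane; theta selects the projection that
    # suppresses that lighting-induced motion
    return np.cos(theta) * r_R + np.sin(theta) * r_B

def estimate_theta(rgb, n_angles=180):
    """Pick the projection angle giving the most compact invariant image."""
    angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    spread = [np.var(invariant_image(rgb, a)) for a in angles]
    return angles[int(np.argmin(spread))]
```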

cyphy laboratory

cyphy laboratory

Car park sequence

cyphy laboratory

Car park sequence

cyphy laboratory

Outdoor localization

[Slide shows a page from the shadow-invariant paper (Corke et al., IROS 2013): Fig. 14 shows that the approach does not compensate for shadows containing light reflected from coloured objects, e.g. next to a coloured wall the shadow is still evident in the invariant image; standard point features (e.g. SIFT, SURF) applied to the invariant image associate mostly with material rather than lighting; the key limitation is the assumed single Planckian illuminant.]

cyphy laboratory

Place 5

cyphy laboratory

Place 8

cyphy laboratory

Image similarity

• 5 places
– 2-9 images of each
– total 28 images (48×64)
– compared using ZNCC (zero-mean normalized cross-correlation)
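For reference, a minimal ZNCC implementation for comparing two equal-size low-resolution images (a standard formulation, not the authors' code):

```python
import numpy as np

def zncc(img1, img2):
    """Zero-mean normalized cross-correlation of two equal-size images.

    Returns a value in [-1, 1]; 1 means identical up to an affine change of
    brightness and contrast, which gives some robustness to exposure changes.
    """
    a = img1.astype(float).ravel()
    b = img2.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```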

cyphy laboratory

PR curve

[Figure: precision-recall curves comparing place matching on greyscale images and on shadow-invariant images.]

cyphy laboratory

Park example

[Figure: three image panels from the park scene, with u and v axes in pixels.]

cyphy laboratory

Fail!

cyphy laboratory

Approaches to robust place recognition

• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure
– laser, stereo, active range camera (e.g. Kinect)

Determining distance
1. Occlusion
2. Height in visual field
3. Relative size
4. Texture density
5. Aerial perspective
6. Binocular disparity
7. Accommodation
8. Convergence
9. Motion perspective

[Slide shows a page and figure from Cutting, J. E. & Vishton, P. M. (1995), “Perceiving layout and knowing distances: The integration, relative potency, and contextual use of different information about depth”, in W. Epstein and S. Rogers (Eds.), Perception of Space and Motion, Academic Press. Figure 1 plots just-discriminable ordinal depth thresholds against the logarithm of distance (0.5 to 10,000 m) for nine sources of information about layout; occlusion provides ordinal depth at all distances without attenuation, while height in the visual field dissipates with distance.]

How do we estimate distance?

1. Occlusion
2. Height in visual field
3. Relative size
4. Texture density
5. Aerial perspective
6. Binocular disparity
7. Accommodation
8. Convergence
9. Motion perspective

[The slide is repeated, each time with an example image or video illustrating one of the cues.]

Eye Movement Terminology, YouTube 2008, Sam Tapsell | Used with permission.
http://www.youtube.com/watch?v=6GliSCGkpZ4

video from handheld camera while walking, with near and far objects moving past

3D camera

cyphy laboratory

Use 3D structure to identify places

cyphy laboratory

• Appearance is a function of
– scene 3D geometry
– materials
– viewpoint
– lighting changes (intensity, color)
– exogenous factors (leaves, rain, snow)

• This function is complex, non-linear and not invertible

• Lots of undiscriminative stuff like sky, road etc.

cyphy laboratory

Summary: the nub of the problem

cyphy laboratory

Approaches to robust visual place recognition

• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure
– laser, stereo, active range camera (e.g. Kinect)

cyphy laboratory

• We’re doing: robust vision, semantic vision, vision & action, algorithms & architectures

• Looking for 16 postdocs

We’re hiring

www.roboticvision.org
