Object Recognition
Widely used in industry for:
- Inspection
- Registration
- Manipulation
- Robot localization and mapping

Current commercial systems use correlation-based template matching:
- Computationally infeasible when object rotation, scale, illumination and 3D pose vary
- Even worse with partial occlusion
Alternative: Local Image Features
Local Image Features
Unaffected by:
- Nearby clutter
- Partial occlusion

Invariant to:
- Illumination changes
- 3D projective transforms
- Common object variations

...but, at the same time, sufficiently distinctive to identify specific objects among many alternatives!
Related work
- Grouping of line segments, edges and regions
  - Detection not reliable enough for recognition
- Peak detection in local image variations
  - Example: Harris corner detector
  - Drawback: the image is examined at only a single scale, so key locations change as the image scale changes
- Eigenspace matching, color and receptive field histograms
  - Successful on isolated objects
  - Do not extend to cluttered or partially occluded images
SIFT Method
Scale Invariant Feature Transform (SIFT)
- Staged filtering approach
- Identifies stable points (image “keys”)
- Computation time under 2 seconds
SIFT Method (2)
Local features:
- Invariant to image translation, scaling and rotation
- Partially invariant to illumination changes and 3D projection (up to 20° of rotation)
- Minimally affected by noise
- Properties similar to those of neurons in the inferior temporal cortex used for object recognition in primate vision
First stage
- Input: original image (512 x 512 pixels)
- Goal: key localization and image description
- Output: SIFT keys, i.e. feature vectors describing the local image region sampled relative to its scale-space coordinate frame
First stage (2)
- Description: represents blurred image gradient locations in multiple orientation planes and at multiple scales
- Approach based on a model of cells in the cerebral cortex of mammalian vision
- Less than 1 second of computation time
- Builds a pyramid of images: the levels are difference-of-Gaussian (DOG) functions, with resampling between each level
Key localization
Algorithm: expand the original image by a factor of 2 using bilinear interpolation. Then, for each pyramid level:
1. Smooth the input image through a convolution with the 1D Gaussian function (horizontal direction):
   $g(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2 / 2\sigma^2}$, with $\sigma = \sqrt{2}$, obtaining Image A
Key localization (2)
2. Smooth Image A through a further convolution with the 1D Gaussian function (vertical direction), obtaining Image B
3. The DOG image of this level is B − A
4. Resample Image B using bilinear interpolation with pixel spacing 1.5 in each direction, and use the result as the input image of the next pyramid level
   - Each new sample is a constant linear combination of 4 adjacent pixels

A sketch of this pyramid construction follows below.
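A minimal sketch of the construction, assuming SciPy's `gaussian_filter1d` and `zoom` as stand-ins for the separable Gaussian convolutions and the bilinear resampling; the level count is an arbitrary choice here, only σ = √2 and the 1.5-pixel spacing come from the slides.

```python
# Sketch of the DOG pyramid described above; sigma = sqrt(2) follows the
# slides, while the level count and library choices are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter1d, zoom

def build_dog_pyramid(image, levels=4, sigma=np.sqrt(2)):
    """Return one difference-of-Gaussian (DOG) image per pyramid level."""
    # Expand the original image by a factor of 2 (order=1 -> bilinear).
    img = zoom(image.astype(float), 2.0, order=1)
    dogs = []
    for _ in range(levels):
        a = gaussian_filter1d(img, sigma, axis=1)  # horizontal pass -> Image A
        b = gaussian_filter1d(a, sigma, axis=0)    # vertical pass   -> Image B
        dogs.append(b - a)                         # DOG image of this level
        # Resample Image B with 1.5-pixel spacing (bilinear) for the next level.
        img = zoom(b, 1.0 / 1.5, order=1)
    return dogs
```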
Key localization (3)
Find the maxima and minima of the DOG images: each point is compared against its neighbours at its own level and at the adjacent pyramid levels (a sketch of the within-level test follows below).

[Figure: detected maxima/minima across the 1st and 2nd pyramid levels]
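As an illustration, a pixel can be flagged as an extremum of its 3×3 neighbourhood with SciPy's rank filters; the comparison against adjacent pyramid levels is omitted here for brevity.

```python
# Extrema of a single DOG level against the 8 surrounding pixels only;
# the full method also compares against the adjacent pyramid levels.
from scipy.ndimage import maximum_filter, minimum_filter

def local_extrema(dog):
    """Boolean mask of pixels that are maxima or minima of their neighbourhood."""
    is_max = dog == maximum_filter(dog, size=3)
    is_min = dog == minimum_filter(dog, size=3)
    return is_max | is_min
```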
Key orientation
1. Extract image gradients and orientations at each pyramid level. For each pixel $A_{ij}$, compute:

   $M_{ij} = \sqrt{(A_{ij} - A_{i+1,j})^2 + (A_{ij} - A_{i,j+1})^2}$  (image gradient magnitude)

   $R_{ij} = \operatorname{atan2}(A_{ij} - A_{i+1,j},\; A_{i,j+1} - A_{ij})$  (image gradient orientation)

2. $M_{ij}$ is thresholded at 0.1 times the maximum possible gradient value
   - Provides robustness to illumination changes
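A direct transcription of the two formulas above; taking the maximum possible gradient value as √2 · 255 assumes 8-bit pixel values, which the slides do not state.

```python
# Pixel-difference gradient magnitude and orientation, as in the formulas
# above; the 8-bit value range used for the threshold is an assumption.
import numpy as np

def gradient_maps(A, ratio=0.1):
    d_i = A[:-1, :-1] - A[1:, :-1]    # A_ij - A_{i+1,j}
    d_j = A[:-1, :-1] - A[:-1, 1:]    # A_ij - A_{i,j+1}
    M = np.sqrt(d_i ** 2 + d_j ** 2)          # gradient magnitude M_ij
    R = np.arctan2(d_i, -d_j)                 # gradient orientation R_ij
    M[M < ratio * np.sqrt(2) * 255] = 0.0     # threshold at 0.1 * max gradient
    return M, R
```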
Key orientation (2)
3. Create an orientation histogram using a circular Gaussian-weighted window with σ = 3 times the current smoothing scale (a sketch follows below)
   - The weights are multiplied by $M_{ij}$
   - The histogram is smoothed prior to peak selection
   - The key orientation is determined by the peak in the histogram
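A sketch of the histogram step; the 36-bin resolution and the non-circular smoothing kernel are simplifying assumptions, only the Gaussian window with σ = 3 × the smoothing scale comes from the slides.

```python
# Orientation histogram around one key; bin count and smoothing kernel
# are illustrative choices, the Gaussian window follows the slides.
import numpy as np

def dominant_orientation(M, R, center, scale, nbins=36):
    ci, cj = center
    ii, jj = np.indices(M.shape)
    # Circular Gaussian-weighted window with sigma = 3 * smoothing scale.
    w = np.exp(-((ii - ci) ** 2 + (jj - cj) ** 2) / (2.0 * (3.0 * scale) ** 2))
    hist, edges = np.histogram(R, bins=nbins, range=(-np.pi, np.pi),
                               weights=M * w)   # weights multiplied by M_ij
    hist = np.convolve(hist, [0.25, 0.5, 0.25], mode='same')  # smooth first
    peak = int(np.argmax(hist))                 # peak gives the key orientation
    return 0.5 * (edges[peak] + edges[peak + 1])
```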
Experimental results
[Figure: keys detected on the original image, and on the image after rotation (15°), scaling (90%), horizontal stretching (110%), change of brightness (−10%) and contrast (90%), and addition of pixel noise; 78% of the keys are matched]
Experimental results (2)
Image transformation         Location and scale match   Orientation match
Decrease contrast by 1.2             89.0 %                  86.6 %
Decrease intensity by 0.2            88.5 %                  85.9 %
Rotate by 20°                        85.4 %                  81.0 %
Scale by 0.7                         85.1 %                  80.3 %
Stretch by 1.2                       83.5 %                  76.1 %
Stretch by 1.5                       77.7 %                  65.0 %
Add 10% pixel noise                  90.3 %                  88.4 %
All previous                         78.6 %                  71.8 %

(20 different images, around 15,000 keys)
Image description
- Approach suggested by the response properties of complex neurons in the visual cortex
- A feature position is allowed to vary over a small region, while orientation and spatial frequency are maintained
- The image is described through 8 orientation planes
- Keys are inserted into the planes according to their orientations (see the sketch below)
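One plausible reading of the orientation planes, sketched below: each gradient sample lands in the plane nearest its orientation, and each plane is blurred so that positions may vary over a small region. The binning rule and blur radius are assumptions, not the paper's exact parameters.

```python
# Gradients binned into 8 orientation planes; the blur that lets feature
# positions vary over a small region uses an assumed sigma.
import numpy as np
from scipy.ndimage import gaussian_filter

def orientation_planes(M, R, nplanes=8, blur_sigma=1.0):
    planes = np.zeros((nplanes,) + M.shape)
    idx = ((R + np.pi) / (2 * np.pi) * nplanes).astype(int) % nplanes
    for p in range(nplanes):
        planes[p][idx == p] = M[idx == p]            # insert keys by orientation
        planes[p] = gaussian_filter(planes[p], blur_sigma)  # allow small shifts
    return planes
```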
Second stage
- Goal: identify candidate object matches
- The best candidate match is the nearest neighbour (i.e., minimum Euclidean distance between descriptor vectors)
- The exact nearest-neighbour solution for high-dimensional vectors is known to have high complexity
Second stage (2)
Algorithm: approximate Best-Bin-First (BBF) search method (Beis and Lowe)
- Modification of the k-d tree algorithm
- Identifies the nearest neighbours with high probability and small computation
- The keys generated at the larger scale are given twice the weight of those at the smaller scale
  - Improves recognition by giving more weight to the least-noisy scale

A matching sketch follows below.
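BBF itself is not in standard libraries, so this sketch falls back on the exact k-d tree from SciPy, which BBF modifies; the descriptor arrays are assumed inputs.

```python
# Nearest-neighbour descriptor matching; SciPy's exact k-d tree stands in
# for the approximate Best-Bin-First search described above.
import numpy as np
from scipy.spatial import cKDTree

def match_keys(model_desc, image_desc):
    tree = cKDTree(model_desc)               # build once per model
    dist, idx = tree.query(image_desc, k=1)  # minimum Euclidean distance
    return dist, idx                         # candidate match per image key
```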
Third stage
Description: final verification
Algorithm: low-residual least-squares fit
- Solution of a linear system: $x = [A^T A]^{-1} A^T b$
- When at least 3 keys agree with low residual, there is strong evidence for the presence of the object
- Since there are dozens of keys in the image, this also works under partial occlusion (see the sketch below)
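The normal-equation solution written out directly; what the rows of A and b encode (e.g. an affine fit of model keys to image keys) is left abstract here.

```python
# Low-residual least-squares fit x = [A^T A]^{-1} A^T b; np.linalg.solve
# avoids forming the explicit inverse. The contents of A and b are assumed.
import numpy as np

def verify_fit(A, b):
    x = np.linalg.solve(A.T @ A, A.T @ b)   # x = [A^T A]^{-1} A^T b
    residual = np.linalg.norm(A @ x - b)    # low residual -> strong evidence
    return x, residual
```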
[Figures: recognition examples under perspective projection and partial occlusion]

Computation time: 1.5 secs on a Sun Sparc 10 (0.9 secs for the first stage)
Connections to human vision
Performance of human vision is obviously far superior to current computer vision...
The brain uses a highly parallel, computationally intensive process instead of a staged filtering approach
Connections to human vision (2)
However... the results are much the same. Recent research in neuroscience showed that the neurons of the inferior temporal cortex:
- Recognize shape features
- The complexity of the features is roughly the same as for SIFT
- They also recognize color and texture properties in addition to shape

Further research:
- 3D structure of objects
- Additional feature types for color and texture
Augmented Reality (AR)
Registration of virtual objects into a live video sequence

Current AR systems:
- Rely on markers strategically placed in the environment
- Need manual camera calibration
Related work
- Harris corner detector and Kanade-Lucas-Tomasi (KLT) tracker
  - Not enough feature invariance
- Tracking of parallelogram-shaped and elliptical image regions
  - Requires planar structures in the viewed scene
- Pre-built user-supplied CAD object models
  - Not always available
  - Limited to objects that can be easily modelled
- Off-line batch processing of the entire video
AR using SIFT
Flexible automated AR. Not needed:
- Camera pre-calibration
- Prior knowledge of scene geometry
- Manual initialization of the tracker
- Placement of special markers
- Special tools or equipment (just a camera)

Short time and small effort to set up. Robust 6 degrees of freedom.
AR using SIFT (2)
Needs only a set of reference images:
- Acquired from unknown, spatially separated viewpoints by a handheld, uncalibrated camera
- At least two images; typically 5 to 20 images separated by at most 45°
- Used to build a 3D model of the viewed scene
AR using SIFT (3)
First (off-line) stage:
1. Extract SIFT features from reference images
2. Establish multi-view correspondences
3. Build a metric model of the real world
4. Compute calibration parameters and camera poses
5. The user places the virtual object
   - The placement is achieved by anchoring the object projection in the first image
   - Then, a second projection is adjusted in the second image
   - Finally, the user fine-tunes position, orientation and size
AR using SIFT (4)
Second (on-line) stage (sketched below):
1. Features are detected in the current frame
2. Features are matched to those of the model using the BBF algorithm
3. The matches are used to compute the current pose of the camera
4. The solution is stabilized by using the values computed for the previous frame
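A hypothetical per-frame loop using OpenCV's SIFT and pose solver as stand-ins; the model arrays, the brute-force matcher and the camera matrix K are assumptions, not the paper's exact pipeline.

```python
# Hypothetical on-line tracking loop; cv2.SIFT_create and cv2.solvePnP are
# real OpenCV APIs, everything else here is an illustrative assumption.
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def track_frame(frame_gray, model_desc, model_pts3d, K, prev_rvec, prev_tvec):
    kp, desc = sift.detectAndCompute(frame_gray, None)   # 1. detect features
    matches = matcher.match(desc, model_desc)            # 2. match to the model
    img_pts = np.float32([kp[m.queryIdx].pt for m in matches])
    obj_pts = np.float32([model_pts3d[m.trainIdx] for m in matches])
    # 3.-4. compute the camera pose, seeded with the previous frame's solution
    ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, None,
                                  prev_rvec, prev_tvec, useExtrinsicGuess=True)
    return rvec, tvec
```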
AR using SIFT: prototype

Software:
- C programming language
- OpenGL and GLUT libraries

Hardware:
- IBM ThinkPad, Pentium 4-M processor (1.8 GHz)
- Logitech QuickCam Pro 4000 camera

Operation                  Computation time
Feature extraction              150 msec
Feature matching                 40 msec
Camera pose computation          25 msec

Total: about 215 msec per frame, i.e. roughly 4 FPS
AR using SIFT: drawbacks
- The tracker is very slow: 4 FPS (frames per second)
  - Too slow for real-time operation (25 FPS)
  - The main bottleneck is feature extraction
- Unable to handle occlusion of inserted virtual content by real objects
  - A full model of the observed scene would be required
AR using SIFT: examples
Videos: mug, tabletop
Conclusions
Object recognition using SIFT:
- Reliable recognition
- Several characteristics in common with human vision

Augmented reality using SIFT:
- Very flexible
- Not possible in real time due to the high computation times
- Possible in the future using faster processors
References
- David G. Lowe, "Object recognition from local scale-invariant features", International Conference on Computer Vision, Corfu, Greece (September 1999), pp. 1150-1157
- Stephen Se, David G. Lowe and Jim Little, "Vision-based mobile robot localization and mapping using scale-invariant features", Proceedings of the IEEE International Conference on Robotics and Automation, Seoul, Korea (May 2001), pp. 2051-2058
- Iryna Gordon and David G. Lowe, "Scene modelling, recognition and tracking with invariant image features", International Symposium on Mixed and Augmented Reality (ISMAR), Arlington, VA (November 2004), pp. 110-119
For any question...
David Lowe
Computer Science Department
2366 Main Mall
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
E-mail: [email protected]