Object Recognition
Widely used in industry for:
- Inspection
- Registration
- Manipulation
- Robot localization and mapping

Current commercial systems use correlation-based template matching:
- Computationally infeasible when object rotation, scale, illumination and 3D pose vary
- Even worse with partial occlusion
Alternative: Local Image Features
Local Image Features
Unaffected by:
- Nearby clutter
- Partial occlusion

Invariant to:
- Illumination changes
- 3D projective transforms
- Common object variations

...but, at the same time, sufficiently distinctive to identify specific objects among many alternatives!
Related work
- Grouping of line segments, edges and regions
  - Detection not reliable enough for recognition
- Peak detection in local image variations
  - Example: Harris corner detector
  - Drawback: the image is examined at only a single scale, so key locations change as the image scale changes
- Eigenspace matching, color and receptive field histograms
  - Successful on isolated objects
  - Do not extend to cluttered or partially occluded images
SIFT Method
Scale Invariant Feature Transform (SIFT)
- Staged filtering approach
- Identifies stable points (image “keys”)
- Computation time under 2 seconds
SIFT Method (2)
Local features:
- Invariant to image translation, scaling and rotation
- Partially invariant to illumination changes and 3D projection (up to 20° of rotation)
- Minimally affected by noise
- Properties similar to those of neurons in the inferior temporal cortex used for object recognition in primate vision
First stage
- Input: original image (512 x 512 pixels)
- Goal: key localization and image description
- Output: SIFT keys, i.e. feature vectors describing the local image region sampled relative to its scale-space coordinate frame
First stage (2)
- Description: represents blurred image gradient locations in multiple orientation planes and at multiple scales
- Approach based on a model of cells in the cerebral cortex of mammalian vision
- Less than 1 second of computation time
- Builds a pyramid of images: the levels are difference-of-Gaussian (DOG) functions, with resampling between each level
Key localization
Algorithm: expand the original image by a factor of 2 using bilinear interpolation. Then, for each pyramid level:
1. Smooth the input image through a convolution with the 1D Gaussian function (horizontal direction):
   $g(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2 / 2\sigma^2}$, with $\sigma = \sqrt{2}$, obtaining Image A
Key localization (2)
2. Smooth Image A through a further convolution with the 1D Gaussian function (vertical direction), obtaining Image B
3. The DOG image of this level is B − A
4. Resample Image B using bilinear interpolation with pixel spacing 1.5 in each direction, and use the result as the input image of the next pyramid level
   - Each new sample is a constant linear combination of 4 adjacent pixels

A sketch of this pyramid construction follows below.
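A minimal sketch of the construction, assuming SciPy's `gaussian_filter1d` and `zoom` as stand-ins for the separable Gaussian convolutions and the bilinear resampling; the level count is an arbitrary choice here, only σ = √2 and the 1.5-pixel spacing come from the slides.

```python
# Sketch of the DOG pyramid described above; sigma = sqrt(2) follows the
# slides, while the level count and library choices are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter1d, zoom

def build_dog_pyramid(image, levels=4, sigma=np.sqrt(2)):
    """Return one difference-of-Gaussian (DOG) image per pyramid level."""
    # Expand the original image by a factor of 2 (order=1 -> bilinear).
    img = zoom(image.astype(float), 2.0, order=1)
    dogs = []
    for _ in range(levels):
        a = gaussian_filter1d(img, sigma, axis=1)  # horizontal pass -> Image A
        b = gaussian_filter1d(a, sigma, axis=0)    # vertical pass   -> Image B
        dogs.append(b - a)                         # DOG image of this level
        # Resample Image B with 1.5-pixel spacing (bilinear) for the next level.
        img = zoom(b, 1.0 / 1.5, order=1)
    return dogs
```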
Key localization (3)
Find the maxima and minima of the DOG images: each point is compared against its neighbours at its own level and at the adjacent pyramid levels (a sketch of the within-level test follows below).

[Figure: detected maxima/minima across the 1st and 2nd pyramid levels]
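As an illustration, a pixel can be flagged as an extremum of its 3×3 neighbourhood with SciPy's rank filters; the comparison against adjacent pyramid levels is omitted here for brevity.

```python
# Extrema of a single DOG level against the 8 surrounding pixels only;
# the full method also compares against the adjacent pyramid levels.
from scipy.ndimage import maximum_filter, minimum_filter

def local_extrema(dog):
    """Boolean mask of pixels that are maxima or minima of their neighbourhood."""
    is_max = dog == maximum_filter(dog, size=3)
    is_min = dog == minimum_filter(dog, size=3)
    return is_max | is_min
```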
Key orientation
1. Extract image gradients and orientations at each pyramid level. For each pixel $A_{ij}$, compute:

   $M_{ij} = \sqrt{(A_{ij} - A_{i+1,j})^2 + (A_{ij} - A_{i,j+1})^2}$  (image gradient magnitude)

   $R_{ij} = \operatorname{atan2}(A_{ij} - A_{i+1,j},\; A_{i,j+1} - A_{ij})$  (image gradient orientation)

2. $M_{ij}$ is thresholded at 0.1 times the maximum possible gradient value
   - Provides robustness to illumination changes
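A direct transcription of the two formulas above; taking the maximum possible gradient value as √2 · 255 assumes 8-bit pixel values, which the slides do not state.

```python
# Pixel-difference gradient magnitude and orientation, as in the formulas
# above; the 8-bit value range used for the threshold is an assumption.
import numpy as np

def gradient_maps(A, ratio=0.1):
    d_i = A[:-1, :-1] - A[1:, :-1]    # A_ij - A_{i+1,j}
    d_j = A[:-1, :-1] - A[:-1, 1:]    # A_ij - A_{i,j+1}
    M = np.sqrt(d_i ** 2 + d_j ** 2)          # gradient magnitude M_ij
    R = np.arctan2(d_i, -d_j)                 # gradient orientation R_ij
    M[M < ratio * np.sqrt(2) * 255] = 0.0     # threshold at 0.1 * max gradient
    return M, R
```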
Key orientation (2)
3. Create an orientation histogram using a circular Gaussian-weighted window with σ = 3 times the current smoothing scale (a sketch follows below)
   - The weights are multiplied by $M_{ij}$
   - The histogram is smoothed prior to peak selection
   - The key orientation is determined by the peak in the histogram
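A sketch of the histogram step; the 36-bin resolution and the non-circular smoothing kernel are simplifying assumptions, only the Gaussian window with σ = 3 × the smoothing scale comes from the slides.

```python
# Orientation histogram around one key; bin count and smoothing kernel
# are illustrative choices, the Gaussian window follows the slides.
import numpy as np

def dominant_orientation(M, R, center, scale, nbins=36):
    ci, cj = center
    ii, jj = np.indices(M.shape)
    # Circular Gaussian-weighted window with sigma = 3 * smoothing scale.
    w = np.exp(-((ii - ci) ** 2 + (jj - cj) ** 2) / (2.0 * (3.0 * scale) ** 2))
    hist, edges = np.histogram(R, bins=nbins, range=(-np.pi, np.pi),
                               weights=M * w)   # weights multiplied by M_ij
    hist = np.convolve(hist, [0.25, 0.5, 0.25], mode='same')  # smooth first
    peak = int(np.argmax(hist))                 # peak gives the key orientation
    return 0.5 * (edges[peak] + edges[peak + 1])
```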
Experimental results
[Figure: keys detected on the original image, and on the image after rotation (15°), scaling (90%), horizontal stretching (110%), change of brightness (−10%) and contrast (90%), and addition of pixel noise; 78% of the keys are matched]
Experimental results (2)
Image transformation         Location and scale match   Orientation match
Decrease contrast by 1.2             89.0 %                  86.6 %
Decrease intensity by 0.2            88.5 %                  85.9 %
Rotate by 20°                        85.4 %                  81.0 %
Scale by 0.7                         85.1 %                  80.3 %
Stretch by 1.2                       83.5 %                  76.1 %
Stretch by 1.5                       77.7 %                  65.0 %
Add 10% pixel noise                  90.3 %                  88.4 %
All previous                         78.6 %                  71.8 %

(20 different images, around 15,000 keys)
Image description
- Approach suggested by the response properties of complex neurons in the visual cortex
- A feature position is allowed to vary over a small region, while orientation and spatial frequency are maintained
- The image is described through 8 orientation planes
- Keys are inserted into the planes according to their orientations (see the sketch below)
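One plausible reading of the orientation planes, sketched below: each gradient sample lands in the plane nearest its orientation, and each plane is blurred so that positions may vary over a small region. The binning rule and blur radius are assumptions, not the paper's exact parameters.

```python
# Gradients binned into 8 orientation planes; the blur that lets feature
# positions vary over a small region uses an assumed sigma.
import numpy as np
from scipy.ndimage import gaussian_filter

def orientation_planes(M, R, nplanes=8, blur_sigma=1.0):
    planes = np.zeros((nplanes,) + M.shape)
    idx = ((R + np.pi) / (2 * np.pi) * nplanes).astype(int) % nplanes
    for p in range(nplanes):
        planes[p][idx == p] = M[idx == p]            # insert keys by orientation
        planes[p] = gaussian_filter(planes[p], blur_sigma)  # allow small shifts
    return planes
```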
Second stage
- Goal: identify candidate object matches
- The best candidate match is the nearest neighbour (i.e., minimum Euclidean distance between descriptor vectors)
- The exact nearest-neighbour solution for high-dimensional vectors is known to have high complexity
Second stage (2)
Algorithm: approximate Best-Bin-First (BBF) search method (Beis and Lowe)
- Modification of the k-d tree algorithm
- Identifies the nearest neighbours with high probability and small computation
- The keys generated at the larger scale are given twice the weight of those at the smaller scale
  - Improves recognition by giving more weight to the least-noisy scale

A matching sketch follows below.
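BBF itself is not in standard libraries, so this sketch falls back on the exact k-d tree from SciPy, which BBF modifies; the descriptor arrays are assumed inputs.

```python
# Nearest-neighbour descriptor matching; SciPy's exact k-d tree stands in
# for the approximate Best-Bin-First search described above.
import numpy as np
from scipy.spatial import cKDTree

def match_keys(model_desc, image_desc):
    tree = cKDTree(model_desc)               # build once per model
    dist, idx = tree.query(image_desc, k=1)  # minimum Euclidean distance
    return dist, idx                         # candidate match per image key
```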
Third stage
Description: final verification
Algorithm: low-residual least-squares fit
- Solution of a linear system: $x = [A^T A]^{-1} A^T b$
- When at least 3 keys agree with low residual, there is strong evidence for the presence of the object
- Since there are dozens of keys in the image, this also works under partial occlusion (see the sketch below)
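The normal-equation solution written out directly; what the rows of A and b encode (e.g. an affine fit of model keys to image keys) is left abstract here.

```python
# Low-residual least-squares fit x = [A^T A]^{-1} A^T b; np.linalg.solve
# avoids forming the explicit inverse. The contents of A and b are assumed.
import numpy as np

def verify_fit(A, b):
    x = np.linalg.solve(A.T @ A, A.T @ b)   # x = [A^T A]^{-1} A^T b
    residual = np.linalg.norm(A @ x - b)    # low residual -> strong evidence
    return x, residual
```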
[Figures: recognition examples under perspective projection and partial occlusion]

Computation time: 1.5 secs on a Sun Sparc 10 (0.9 secs for the first stage)
Connections to human vision
Performance of human vision is obviously far superior to current computer vision...
The brain uses a highly parallel, computationally intensive process instead of a staged filtering approach
Connections to human vision (2)
However... the results are much the same. Recent research in neuroscience showed that the neurons of the inferior temporal cortex:
- Recognize shape features
- The complexity of the features is roughly the same as for SIFT
- They also recognize color and texture properties in addition to shape

Further research:
- 3D structure of objects
- Additional feature types for color and texture
Augmented Reality (AR)
Registration of virtual objects into a live video sequence

Current AR systems:
- Rely on markers strategically placed in the environment
- Need manual camera calibration
Related work
- Harris corner detector and Kanade-Lucas-Tomasi (KLT) tracker
  - Not enough feature invariance
- Tracking of parallelogram-shaped and elliptical image regions
  - Requires planar structures in the viewed scene
- Pre-built user-supplied CAD object models
  - Not always available
  - Limited to objects that can be easily modelled
- Off-line batch processing of the entire video
AR using SIFT
Flexible automated AR. Not needed:
- Camera pre-calibration
- Prior knowledge of scene geometry
- Manual initialization of the tracker
- Placement of special markers
- Special tools or equipment (just a camera)

Short time and small effort to set up. Robust 6 degrees of freedom.
AR using SIFT (2)
Needs only a set of reference images:
- Acquired from unknown, spatially separated viewpoints by a handheld, uncalibrated camera
- At least two images; typically 5 to 20 images separated by at most 45°
- Used to build a 3D model of the viewed scene
AR using SIFT (3)
First (off-line) stage:
1. Extract SIFT features from reference images
2. Establish multi-view correspondences
3. Build a metric model of the real world
4. Compute calibration parameters and camera poses
5. The user places the virtual object
   - The placement is achieved by anchoring the object projection in the first image
   - Then, a second projection is adjusted in the second image
   - Finally, the user fine-tunes position, orientation and size
AR using SIFT (4)
Second (on-line) stage (sketched below):
1. Features are detected in the current frame
2. Features are matched to those of the model using the BBF algorithm
3. The matches are used to compute the current pose of the camera
4. The solution is stabilized by using the values computed for the previous frame
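A hypothetical per-frame loop using OpenCV's SIFT and pose solver as stand-ins; the model arrays, the brute-force matcher and the camera matrix K are assumptions, not the paper's exact pipeline.

```python
# Hypothetical on-line tracking loop; cv2.SIFT_create and cv2.solvePnP are
# real OpenCV APIs, everything else here is an illustrative assumption.
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def track_frame(frame_gray, model_desc, model_pts3d, K, prev_rvec, prev_tvec):
    kp, desc = sift.detectAndCompute(frame_gray, None)   # 1. detect features
    matches = matcher.match(desc, model_desc)            # 2. match to the model
    img_pts = np.float32([kp[m.queryIdx].pt for m in matches])
    obj_pts = np.float32([model_pts3d[m.trainIdx] for m in matches])
    # 3.-4. compute the camera pose, seeded with the previous frame's solution
    ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, None,
                                  prev_rvec, prev_tvec, useExtrinsicGuess=True)
    return rvec, tvec
```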
AR using SIFT: prototype

Software:
- C programming language
- OpenGL and GLUT libraries

Hardware:
- IBM ThinkPad, Pentium 4-M processor (1.8 GHz)
- Logitech QuickCam Pro 4000 camera

Operation                  Computation time
Feature extraction              150 msec
Feature matching                 40 msec
Camera pose computation          25 msec

Total: about 215 msec per frame, i.e. roughly 4 FPS
AR using SIFT: drawbacks
- The tracker is very slow: 4 FPS (frames per second)
  - Too slow for real-time operation (25 FPS)
  - The main bottleneck is feature extraction
- Unable to handle occlusion of inserted virtual content by real objects
  - A full model of the observed scene would be required
AR using SIFT: examples
Videos: mug, tabletop
Conclusions
Object recognition using SIFT:
- Reliable recognition
- Several characteristics in common with human vision

Augmented reality using SIFT:
- Very flexible
- Not possible in real time due to the high computation times
- Possible in the future using faster processors
References
- David G. Lowe, "Object recognition from local scale-invariant features", International Conference on Computer Vision, Corfu, Greece (September 1999), pp. 1150-1157
- Stephen Se, David G. Lowe and Jim Little, "Vision-based mobile robot localization and mapping using scale-invariant features", Proceedings of the IEEE International Conference on Robotics and Automation, Seoul, Korea (May 2001), pp. 2051-2058
- Iryna Gordon and David G. Lowe, "Scene modelling, recognition and tracking with invariant image features", International Symposium on Mixed and Augmented Reality (ISMAR), Arlington, VA (November 2004), pp. 110-119
For any question...
David Lowe
Computer Science Department
2366 Main Mall
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
E-mail: [email protected]