An improved feature matching technique for stereo vision applications with the use of self-organizing map

INTERNATIONAL JOURNAL OF PRECISION ENGINEERING AND MANUFACTURING Vol. 13, No. 8, pp. 1359-1368 AUGUST 2012 / 1359 DOI: 10.1007/s12541-012-0179-z

NOMENCLATURE

f = Focal length of the camera d = Baseline i.e the distance between the left and the right

camera z = Depth of the object xl' , yl' = Left camera position xr' , yr' = Right camera position k = Multiplicative factor IL, IR = Left image and Right image captured using the stereo

camera Fl, Fr = Set of features in the left and the right image xi = Set of input variables to the self-organizing map network yi = Activation of the process unit of the feedback layer

wij = Connection strength associated to every processing unit wPi = Weight associated with the winning unit RP = Position of the unit in the grid

α = Learning rate factor σ = Width of the Gaussian function SL, SR = Set of pixels in the left and the right image

kη = Standard learning rate ,hk gkσ σ = Neighborhood parameters to control the disparity rate

,ij ijp qd d = Horizontal and vertical disparity

1. Introduction

Vision-based robot navigation has long been a fundamental goal in the field of robotics and computer vision research. Vision based systems are employed in a wide range of robotic applications such as object recognition,1,2 obstacle avoidance,3-7 navigation,8-10 and more recently in Simultaneous Localization and Mapping (SLAM).11-14 Stereo vision is one of the most important enhancements of

An Improved Feature Matching Technique for Stereo Vision Applications with the Use of Self-Organizing Map

Kajal Sharma1, Sung Gaun Kim1,#, and Manu Pratap Singh2 1 Division of Mechanical and Automotive Engineering, Kongju National University, 275, Budae-dong, Cheonan, Chungnam, Korea, 331-717

2 Department of Computer Science, Institute of Computer and Information Science, Dr. B. R. Ambedkar University, Agra, Uttar Pradesh, India, 282002 # Corresponding Author / E-mail: [email protected], TEL: +82-41-521-9253, FAX: +82-41-521-9547

KEYWORDS: Stereo vision, SIFT, Self-organizing map

Stereo vision cameras are widely used for finding a path for obstacle avoidance in autonomous mobile robots. The Scale Invariant Feature Transform (SIFT) algorithm proposed by Lowe is used to extract distinctive invariant features from images. While it has been successfully applied to a variety of computer vision problems based on feature matching including machine vision, object recognition, image retrieval, and many others, this algorithm has high complexity and long computational time. In order to reduce the computation time, this paper proposes a SIFT improvement technique based on a Self-Organizing Map (SOM) to perform the matching procedure more efficiently for feature matching problems. Matching for multi-dimension SIFT features is implemented with a self-organizing map that introduces competitive learning for matching features. Experimental results on real stereo images show that the proposed algorithm performs feature group matching with lower computation time than the SIFT algorithm proposed by Lowe. We performed experiments on various set of stereo images under dynamic environment with different camera viewpoints that is based on rotation and illumination conditions. The numbers of matched features were increased to double as compared to the algorithm developed by Lowe. The results showing improvement over the SIFT proposed by Lowe are validated through matching examples between different pairs of stereo images. The proposed algorithm can be applied to stereo vision based robot navigation for obstacle avoidance, as well as many other feature matching and computer vision applications.

Manuscript received: June 27, 2011 / Accepted: February 1, 2012

© KSPE and Springer 2012

1360 / AUGUST 2012 INTERNATIONAL JOURNAL OF PRECISION ENGINEERING AND MANUFACTURING Vol. 13, No. 8 computer vision. It can be used in 3D reconstruction, scene analysis, and other depth related applications. Stereo vision can quickly provide 3D information and, when coupled with a Scale-Invariant Feature Transform (SIFT) detector, it can provide distinct landmarks.

A variety of keypoint detectors (Shi and Tomasi,15 SIFT,16 Speed Up Robust Features (SURF)17 descriptor, affine covariant, etc) have been developed in efforts to solve the problem of extracting points of interest in image sequences. These works mainly employ the same approach: extraction of points that represents regions with high intensity gradients. Schmid and Mohr18 used Harris corners to show that invariant local feature matching could be extended to general image recognition problems wherein a feature is matched against a large database of images. They used a rotationally invariant descriptor of the local image regions and allowed features to be matched under arbitrary orientation change between the two images. Although it is rotational invariant, the Harris corner detector is very sensitive to changes in image scale, and as such it does not provide a good basis for matching images of different size. Lowe16 overcomes such problems by detecting the points of interest over the image and scaling through the locations of the local extrema in a pyramidal Difference of Gaussians (DOG). The Lowe’s descriptor, which is based on selecting stable features in the scale space, is named the Scale Invariant Feature Transform (SIFT). The features are invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Though having the aforementioned advantages, the SIFT algorithm has high complexity and requires long computational time, resulting in slow image matching speed. It is therefore difficult to meet the real-time requirements. Some extensions of the SIFT descriptor have been proposed recently, in efforts to improve either matching properties or to reduce computational complexity. For instance, Mikolajczyk and Schmid19

experimentally compared the performances of several currently used local descriptors and found that the SIFT descriptors were the most effective, as they yielded the best matching results. Recently developed techniques to improve SIFT have targeted minimization of the computational time,20 while limited research aimed at improving accuracy has been reported. Nearest-neighbor search methods such as kd-tress search21 which are efficient in low-dimensional spaces, do not perform better than an exhaustive search in the high-dimensional space of the SIFT features. An improvement in computation can be achieved by an approximate best-bin-first method.22 However, even with this improvement, the computation time increases with the number of stored keypoints, while the success in finding the nearest neighbor decreases. It is therefore useful for 3D object recognition to reduce the number of stored keypoints in an efficient manner.

This paper improves the SIFT algorithm matching process, reduces the computational time, and solves the real-time problem. The work presented in this paper demonstrates increased matching process performance robustness with minimization of the computational time with the use of a Self-Organizing Map (SOM). The effectiveness of self-organizing map based feature matching

has motivated the authors to use it for obstacle avoidance in robot navigation applications. It can be applied to a variety of computer vision problems based on feature matching including machine vision, object recognition, image retrieval, and many others.

The most relevant contribution of this paper is the proposal of an improved time efficient SIFT feature matching technique based on Kohonen’s Self-Organizing Map (SOM) neural network methodology in order to reduce the computation time while providing efficient feature matching. The interpretation of visual information feature matching is done using an Artificial Neural Network (ANN). The self-organizing map introduced by Kohonen has been shown to be an effective artificial neural network model for the visualization of high-dimensional data. The unsupervised learning feature of the SOM is used to find the winner neuron and matching is performed by associating similar winning pixels in the left and the right images of stereo pair images. Through experiments it was found that the proposed self-organizing map based feature matching yields highly distinctive SIFT keypoints that are invariant to image scale and rotation. The approach thereby provides better matching between different stereo images in comparison to other algorithms with fast computation time.

The remainder of this paper is organized as follows: Section 2 presents a brief introduction of the stereo vision methodology. Section 3 and Section 4 present a detailed description of the SIFT algorithm and the Self-Organizing Map (SOM) based feature matching technique. Experimental results using stereo images and a discussion of the findings are presented in Section 5. Finally, we conclude and discuss future work in Section 6.

2. Stereo Vision The purpose of stereo vision is to provide a set of features that

will be defined by their 3-D world position in the camera coordinate system (Fig. 1). When a point in the left image and another point in the right image are matched, that is, they are considered to be projections of the same physical identity in the 3-D world, the difference in their relative positions is recorded as the disparity.23 Eq. (1) shows the relationship between disparity (in pixels) and depth.

Fig. 1 Stereo vision methodology with left camera and right camera separated by baseline d

(x, y, z)

(xl', yl' ) Right Camera

Left Camera

d

Baseline

f

(xr' , yr' )

O

INTERNATIONAL JOURNAL OF PRECISION ENGINEERING AND MANUFACTURING Vol. 13, No. 8 AUGUST 2012 / 1361

Eq. (2) can be obtained from the imaging geometry shown in Fig. 1 by considering similar triangles:

Baseline x FocalLengthDepthDisparity

= (1)

' ' ' '/ 2 / 2, ,l r l rx x d x x d y y y

f z f z f f z+ −

= = = = (2)

Solving for (x, y, and z) gives:

' ' ' '

' ' ' ' ' '

( ) ( ), ,2( ) 2( )

l r l r

l r l r l r

d x x d y y dfx y zx x x x x x+ +

= = =− − −

(3)

The quantity ' '( )l rx x− that appears in Eq. (3) is called the disparity, and the quantity z is called the depth.

3. Scale Invariant Feature Transform Technique Scale-Invariant Feature Transform (SIFT) is a method for

extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The following outlines the major stages of computation used to generate the set of image features.

3.1 Scale-space extrema detection

The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. It has been shown by Koenderink24 and Lindeberg25 that, under a variety of reasonable assumptions, the only possible scale-space kernel is the Gaussian function. Let the scale space of an image be L(x, y, σ), resulting from the convolution of a variable-scale Gaussian, G(x, y, σ), with an input image, I(x, y), given by Eq. (4):

( , , ) ( , , ) * ( , )L x y G x y I x yσ σ= (4)

where * is the convolution operation in x and y, and

2 2( ) 22

1( , , ) / 22

x yG x y σ e σσ

− +=∏

Stable keypoint locations in scale space can be computed from the difference of Gaussians separated by a constant multiplicative factor k given by Eq. (5):

( , , ) ( ( , , ) ( , , )) * ( , )

( , , ) ( , , )D x y G x y k G x y I x y

L x y k L x yσ σ σ

σ σ= −= −

(5)

3.2 Keypoint localization

At each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability. Brown26 has recently developed a method for fitting a 3D quadratic function to the local sample points, and his approach uses the Taylor expansion of the scale-space function, D (x, y, σ),

shifted so that the origin is at the sample point and is defined by Eq. (6):

2

2

1( )2

TTD DD x D x x x

x x∂ ∂

= + +∂ ∂

(6)

where D and its derivatives are evaluated at the sample point and x = (x, y, σ)T is the offset from this point. The location of the extremum is x̂ determined by taking the derivative of this function with respect to x and setting it to zero, and can be expressed by Eq. (7):

2 1

2ˆ D Dx

x x

−∂ ∂= −

∂ ∂ (7)

3.3 Orientation assignment

One or more orientations are assigned to each keypoint location based on local image gradient directions. For each image sample, L(x, y), at this scale, the gradient magnitude, m(x, y), and orientation, θ(x, y), is precomputed using pixel differences by Eqs. (8) and (9):

2 2( , ) ( ( 1, ) ( 1, )) ( ( , 1) ( , 1))m x y L x y L x y L x y L x y= + − − + + − − (8)

1( , ) tan (( ( , 1) ( , 1)) / ( 1, ) ( 1, )))x y L x y L x y L x y L x yθ −= + − − + − − (9)

3.4 Keypoint descriptor The local image gradients are measured at the selected scale in

the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.

4. Self-Organizing Map based Feature Matching A Self-Organizing Map (SOM) is a neurocomputational

algorithm to map high-dimensional data to a lower dimensional space through a competitive and unsupervised learning process. In this paper, the feature vectors extracted from the Scale-Invariant Feature Transform (SIFT) are used to compose a topological map. The topological and metric relationship is obtained using a Self-Organizing Map (SOM) that performs vector quantization and simultaneously organizes the quantized vectors on a regular low-dimensional grid. This algorithm is frequently used to visualize and interpret large high-dimensional data sets. Important SOM features include information compression while trying to preserve the topological and metric relationship of the primary data items. The SOM is both a projection method that maps high-dimensional data space into low-dimensional space, and a clustering method that maps similar data samples to nearby neurons. The main characteristic of the projection provided by the algorithm is the preservation of neighborhood relations; as far as possible, nearby data vectors in the input space are mapped onto neighboring locations in the output space. It is desirable to have some order in the activation of a unit in the feedback layer in relation to the activations of its neighboring units.27-29

In the proposed approach, a bumblebee stereo camera is used to provide a stereo view of the environment. Consecutive image pairs

1362 / AUGUST 2012 INTERNATIONAL JOURNAL OF PRECISION ENGINEERING AND MANUFACTURING Vol. 13, No. 8 acquired by the cameras are then matched to estimate the matching between the pair of images. Indeed, due to the high complexity and the long computational time of the original SIFT algorithm proposed by Lowe, a significant advance is the possibility of reducing computation time by improving the matching process using a Self-Organizing Map (SOM). It is found that the SOM cannot be directly applied to feature matching in stereo pair images, and thus certain modifications are applied to deal with stereo constraints.

Our methodology operates using scale invariant feature vectors instead of an image database for the input to the Self-Organizing Map (SOM). The vectors extracted from the SIFT feature are used to compose a topological map. The map is obtained using a self- organizing map based on the Kohonen neural network.27 We thereby obtain a 2D neuron grid. Each neuron is associated with a weight vector with 128 element descriptors. During the matching, feature vectors are presented as the inputs to the Self-Organizing Map (SOM). The learning algorithm is based on the concept of nearest neighbor learning. Once the network is trained, input data are distributed throughout the grid of neurons. The left image is considered as the reference image and the right image is considered as the matching image in the stereo pair, and they are expressed in terms of the winning neurons in the self-organizing map network. The next step is finding the winning neuron for each pixel of the right image. When the winning neuron is found, the pixel is associated to it. The matching is done between the pixels in the left and right stereo pair images and feature matching is performed by associating the similar winning pixels. The same procedure is then performed on the pixels of the left image. In this step, after finding the winning neuron, it is also computed which pixel of the right image is the most similar to the pixels of the left image. The winning data is chosen by this procedure by matching the similar pixels in the left image and the right image. The matching between the pixels of the pair of stereo images is accomplished by iteratively following this procedure.

The Scale-Invariant Feature Transform (SIFT) is a well-known method to provide a set of keypoints detected in the scale-space that are characterized by a descriptor invariant to scale and orientation. Let IL and IR

be the left and right images captured using the stereo

Fig. 2 Steps of the algorithm proposed in this paper

vision camera. The first stage detects the set of SIFT features in two images, F l and F r, left and right, respectively. For each pair of images, we detect the point of interest, compute SIFT descriptors, and perform stereo matching with the Self-Organizing Map (SOM). Both sets of features are the input to the self-organizing map, which computes stereo matching. Next, the feature association stage performs matching between sets of features that belong to the acquired stereo images. Fig. 2 presents the steps of the algorithm proposed in this paper.

Let us consider the set of input variables {xi} defined as the real vector X ={x1, x2, x3, …, xk}∈ Rn. This input pattern vector is applied to the processing elements of the input layer. The processing elements of the input layer are connected with wi = [wi1, wi2, …, win]T ∈ Rn each element in the SOM grid. This grid contains the feedback layer region. We associate connection strength to every processing element of the feedback layer. The initial value of the WT is selected randomly. The input feature pattern vector X is then applied to the processing units of the input layer of the self-organizing map, as shown in Fig. 3. The linear output of these processing units feeds the weighted input through feed forward connection to the SOM grid. The activation of the jth process unit of the feedback layer can be represented by Eq. (10):

1

K

i ij ii

y w x=

= ∑ (10)

where j = 1 to N (number of units in the feedback layer). When a new input arrives, the topological map determines the

neuron that best matches the input vector. A winning unit in Eq. (11), say P, corresponds to the minimal distance, and represents the pixel in the right image that could be a match in the left image selected among all the processing units of the feedback layer as:

1 1( ) min ( )

K n

i P i ji ii i

x w x w= =

− = −∑ ∑ (11)

; for all j

Hence, during learning, the nodes that are topographically close to a certain geometric distance will activate each other to learn from the same input vector X and the weights associated with the winning unit P and its neighboring units r are updated by Eq. (12):

( 1) ( ) ( , )[ ( ) ( )]iP iP i ijw t w t P r x t w t+ = + −D (12)

for i = 1 to K and P = 1 to N; here D (P, r) is the neighborhood function and it can be represented as:

Capture Stereo Images from Bumblebee Camera

Feature Vector generation using SIFT

Winner calculation with Self-Organizing Map (SOM)

Feature association of matched pixels between left and right images

Left Image Right Image

Fig. 3 Self-Organizing Map27


2

2( , ) ( ).exp2 ( )P rR R

P r tt

ασ

⎡ ⎤−= −⎢ ⎥

⎢ ⎥⎣ ⎦D

where RP refers to the position of the Pth unit in the grid, α(t) is the learning rate factor (0 < α(t) < 1), and the parameter σ(t) defines the width of the Gaussian function. σ(t) gradually decreases so as to reduce the neighborhood region in successive iterations of the training process. The neurons are arranged in different topologies, such as a sheet, cylinder, and toroid. In this topological map, the vectors that are similar in the input space are grouped together or clustered. The map is a two dimensional grid of processing elements, called neurons.

The modified algorithm converts the stereo-correspondence problem into an estimation of a mapping from each pixel of one image to the corresponding pixel in the other image of the stereo pair. Let SL represent the set of pixels in the left image, SR the set of pixels in the right image, and Sx = {xx, yx} the set of position vectors of the pixels. The modified algorithm basically entails matching between pixels of IL and IR, in the left and the right image, and associating the similar winning pixels to perform feature matching.

It is found that SOM cannot be directly applied to stereopsis, and certain modifications are called for in order to take care of stereo constraints. We now explain how to modify the SOM to discover a possible isomorphic mapping between two images, the novelty in modification consisting primarily in taking into account the stereo-matching constraints. This leads to the creation of the modified SOM. The steps are as follows:

Step 1: Initialize a node for each feature point in the left image, with the corresponding coordinates and intensity as the initial weights. Let 1 2 3( , , )ij ij ijw w w denote the weights of the node corresponding to pixel at ( , )i j with intensity 3 .ijw These feature vectors serve as inputs to the network. Select a pixel at random from the right image, and feed the corresponding feature vector as input to the self-organizing network. Let 1 2 3( , , )ij ij ijα α α be the input feature vector corresponding to pixel at ( , ).m n

Step 2: Let ( , )x y be the index of the winning neuron in the network for ( , )m n th input, then

3

2

, 1( , ) arg min ( )ij mn

k ki j kx y w α

=

= −∑ (13)

The winner node ( , )x y corresponds to the minimal distance, and represents the pixel in the left image which could be a match for the ( , )m n th in the right image.

Step 3: Let M be the height and N, the width of an image. Update only the first two components of all neuron weight vectors as follows:

( ')( ')( ( ', ') ( )( )ij ij m i n j ijk k k k k kw w h i j g I wα + +← + Δ − (14)

where

'i i x= − ' 2 ' 2

2' ( ', ') exp2k k

hk

i jj j yh i j ησ

⎛ ⎞+= − = −⎜ ⎟

⎝ ⎠

2

3 32

( )( ) exp ( )2

xy ijk

gk

Ig I I w wσ

⎛ ⎞ΔΔ = − Δ = −⎜ ⎟⎜ ⎟

⎝ ⎠

for k = 1, 2, and ∀ i,m∈ {1, 2, …, M}, ∀ j, n∈ {1, 2, …, N}, kη is the standard learning rate, and ,hk gkσ σ are the neighborhood parameters, which control the rate of propagation of disparity in the topological neighborhood and within the range of the object. Repeat the above three steps Nx times where Nx is a predetermined number; typically Nx = 100 x MN.

Step 4: The disparity vector ( , )ij ijp qd d for each pixel ( , )i j is

defined as: 1 2; ,ij ij ij ijq pd i w d j w= − = − corresponding to vertical and

horizontal disparity. The algorithm consists of the three main steps: initialization,

winner neuron selection, and weight update. During initialization, the neurons in the competitive layer are initialized with the feature vectors in the reference image IL. During learning, the input to the network is a randomly selected pixel IR (m, n) of the matching image. The learning phase conducts a search of the global minimum of a Euclidean distance between the input feature vector and the weights of the neuron. If the winning neuron is found, then we can perform matching between the pixels in the left and right images. The procedure for placing a vector from data space onto the map entails finding the node with the closest weight vector to the vector taken from data space and assigning the map coordinates of this node to the vector. The SOM arranges feature vectors according to their internal similarity, creating a continuous topological map of the input space. Each processing element is connected to each element of the input layer and to its neighboring processing elements. The connections between the layers (weights) represent the strength of the connection.

The two pixels to be matched correspond to the centers of the two windows. Several constraints are used to determine the best matching pair. In feature extraction, the Sobel operation is applied to input images to obtain the variation and orientation of each pixel; these data will form the basic input features. During training, the differences in intensity, variation and orientation between two local windows are fed to the SOM network. The SOM network after training should have the ability to differentiate the matched pairs from unmatched pairs. The trained SOM network is first used to generate an initial or primitive disparity map, which will then be used as a reference map for the subsequent matching process. Because the matching problem can be treated as a many-to-one problem, in order to reduce the search space, the process also involves application of some useful constraints so as to effectively extract out the best matching pixel from among multiple candidates.

In stereo vision matching we are only concerned in the detection of true matches. This is because, if a given pair of features is not a true match, it is obviously a false one i.e., only two possible classes are used. According to the modified SOM if the neuron wins, its synaptic weight vector is updated and also the corresponding synaptic weight vectors of the neurons belonging to a certain neighborhood. Each neuron and all neurons surrounding it form the neighborhood required by the modified SOM. Therefore, the cluster center m is updated in the following two cases: 1 when the central neuron results in being the winner and 2 when the central neuron belongs to the neighborhood of any other neuron associated to false matches who result in being the winner. Also, the movement of any

1364 / AUGUST 2012 INTERNATIONAL JOURNAL OF PRECISION ENGINEERING AND MANUFACTURING Vol. 13, No. 8 m is insignificant when it is updated due to the activation of a neuron different from itself including the case in which the winning neuron is the central one. Whereby and taking into account that we are only concerned with the detection of true matches, we have decided that the learning in our model is performed only by the central neuron if it results in being the winner.

After completing the steps above, feature matching between different image pairs acquired by the stereo vision camera can be evaluated. In order to compare the real feature matching performance, different tests are performed, and they are described in the next section.

5. Experimental Details and Results In our experiment, we used the Bumblebee 2 stereo vision

camera developed by Point Grey Research Inc., designed as an entry level camera for professional use (Fig. 4). The main advantage of this camera system was that it comes mechanically pre-calibrated for lens distortion and camera misalignments. The Bumblebee stereo vision camera uses two Sony progressive scan CCDs, each with a HFOV up to 100 degrees, and communicates via an IEEE 1394 connection. It has a 12 cm baseline and it can output 640x480 images at 48 FPS, or 1024x768 images at 20 FPS, via its IEEE-1394 (FireWire) interface. The camera was designed for applications such as people tracking, mobile robotics, navigation, mining, and other computer vision applications. The stereo camera specifications for the Bumblebee camera are summarized in Table 1.

The left and right stereo pair images were captured with the Bumblebee pointgrey camera and feature matching between left and right stereo pair images was performed with the original SIFT algorithm and the modified Self-Organizing Map (SOM) based feature matching algorithm. The left and right images that were captured by the stereo vision Bumblebee pointgrey camera are shown in Fig. 5. Fig. 6-10 show the features that were obtained from five pairs of stereo images acquired by our vision system with the original SIFT matching algorithm and also show the results obtained with the improved self-organizing map based feature

matching algorithm. Experimental results on real stereo images showed that the proposed algorithm performed feature group matching in a more time efficient manner than the original SIFT algorithm. The results of the SIFT improvement were validated through matching examples between different pairs of stereo images in different environmental conditions.

In our experiments, the method proposed by Lowe was used in the Original SIFT feature matching. The components of the SIFT framework for keypoint detection are Scale-space extrema detection, Keypoint localization, Orientation assignment, Keypoint descriptor. The description of the algorithm is given in Section 3. The SIFT keypoints of images were first extracted from a set of reference images and stored in a database. To find corresponding features between the two images, which will lead to feature extraction, feature matching based on Euclidean distance was used. A feature point was recognized in a new image by individually comparing each feature from the new image to the database and finding

Fig. 5 Left and right images captured by stereo vision pointgrey Bumblebee camera

(a) Original SIFT- 30 features matched in 0.07813 sec

(b) Self-Organizing Map based feature matching- 63 features matched in 0.03719 sec

Fig. 6 (a) Shows the result obtained with the original SIFT matching algorithm, consisting of 30 features matched in 0.07813 sec. (b) shows the result obtained with the improved self-organizing map based feature matching algorithm, consisting of 63 features matched in 0.03719 sec. The lines mark the position of features that have a matching feature in the counterpart image

Fig. 4 Pointgrey Research Bumblebee Camera

Table 1 Stereo Camera Specifications Image Sensor Two Sony 1/3” Progressive Scan CCD, Color/BW

Resolution and FPS 640x480 at 48 FPS or 1024x768 at 20 FPS

Focal Length 3.8 mm with 70° HFOV, 6 mm with 50° HFOVBaseline 12 cm

Gain Control Automatic/Manual Shutter Automatic/Manual, 0.01 ms to 66.63 ms at 15 fps.

Dimensions 157x36x47.4 mm Mass 342 grams


candidate matching features based on Euclidean distance of their feature vectors. From the full set of matches, subsets of keypoints that agree on the object and its location, scale, and orientation in the new image were identified to filter out good matches. The determinations


(b) Self-Organizing Map based feature matching- 28 features

matched in 0.03576 sec

Fig. 7 (a) Shows the result obtained with the original SIFT matching algorithm, consisting of 16 features matched in 0.05773 sec. (b) shows the result obtained with the improved self-organizing map based feature matching algorithm, consisting of 28 features matched in 0.03576 sec. The lines mark the position of features that have a matching feature in the counterpart image


(b) Self-Organizing Map based feature matching- 28 features

matched in 0.03462 sec

Fig. 8 (Rotated view) (a) Shows the result obtained with the original SIFT matching algorithm, which matched 17 features in 0.05672 sec. (b) shows the result obtained with the improved self-organizing map based feature matching algorithm, which matched 28 features in 0.03462 sec. The lines mark the positions of features that have matching features in the other image

of consistent clusters were performed rapidly by using an efficient hash table implementation of the generalized Hough transform. Each cluster of 3 or more features that agree on an object and its



Fig. 9 (Rotated view) (a) Shows the result obtained with the original SIFT matching algorithm, consisting of 21 features matched in 0.05885 sec. (b) shows the result obtained with the improved self-organizing map based feature matching algorithm, consisting of 48 features matched in 0.03934 sec. The lines mark the position of features that have a matching feature in the counterpart image



Fig. 10 (Illuminated view) (a) Shows the result obtained with the original SIFT matching algorithm, consisting of 37 features matched in 0.05039 sec. (b) shows the result obtained with the improved self-organizing map based feature matching algorithm, consisting of 61 features matched in 0.04840 sec. The lines mark the position of features that have a matching feature in thecounterpart image

1366 / AUGUST 2012 INTERNATIONAL JOURNAL OF PRECISION ENGINEERING AND MANUFACTURING Vol. 13, No. 8 pose is then subject to further detailed model verification and subsequently outliers are discarded. Finally the probability that a particular set of features indicates the presence of an object is computed, given the accuracy of fit and number of probable false matches.

For image matching, descriptor vectors of all keypoints are stored in a database, and matches between keypoints are found based on Euclidean distance. Once Difference-of-Gaussian (DoG) images have been obtained, keypoints are identified as local minima/maxima of the DoG images across scales. This is done by comparing each pixel in the DoG images to its eight neighbors at the same scale and nine corresponding neighboring pixels in each of the neighboring scales. If the pixel value is the maximum or minimum among all compared pixels, it is selected as a candidate keypoint.

According to the Nearest Neighborhood procedure 1iF or each

feature in the reference image feature set the corresponding 2iF

must be looked for in the test feature set. The corresponding feature is one with the smallest Euclidean distance to the feature 1 2( , ).i jF F A pair of corresponding features is called a match 1 2( , ).i jM F F

In this paper an improvement over the original SIFT algorithm is proposed that is time efficient feature matching technique based on Kohonen’s Self-Organizing Map (SOM) neural network methodology. The detailed algorithm is given in Section 4. The result of the research presented in this paper is consideration of the improvement of the original SIFT algorithm with respect to the processing time and number of matched features.

Lowe16 proposed the using of the ratio between the Euclidean distance to the nearest and the second nearest neighbors as a threshold. Under the condition that the object does not contain repeating patterns, one suitable match is expected and the Euclidean distance to the nearest neighbor is significantly smaller than the Euclidean distance to the second nearest neighbor. If no match is correct, all distances have a similar, small difference from each other. A match is selected as positive only if the distance to the nearest neighbor is 0.8 times larger than that from the second nearest one. Among positive and negative matches, correct as well as false matches can be found. Lowe claims16 that the threshold of 0.8 provides 95% of correct matches as positive and 90% of false matches as negative. The total amount of the correct positive matches must be large enough to provide reliable object recognition. In the proposed algorithm an improvement to the feature matching of the SIFT algorithm with respect to the number of correct positive matches is presented.

Comparisons of the results obtained with the original SIFT matching algorithm and the improved Self-Organizing Map (SOM) based feature matching are shown in Table 2. The results using different set of stereo pair images with different camera viewpoints i.e. rotated, illuminated are presented and corresponding processing time and number of matched features obtained with original SIFT and the proposed algorithm are also presented.

In order to determine the amount of reduced processing time in comparison to original SIFT procedure, it is assumed that the number of extracted features in the lower octave with respect to the

higher octave is decreased 4 times due to the down-sampling by the factor of 2 in both image directions. Hence, the matching time cost in the case of matching stereo images is reduced 1.5 times in comparison to the SIFT algorithm proposed by Lowe. The number of matched features were increased to double as compared to the SIFT algorithm proposed by Lowe. Our methodology generated more number of features in less time and thus the proposed technique is efficient for feature matching in the stereo images. Some of the features generated by Lowe’s algorithm are discarded Table 2 Comparison of the stereo image results obtained with the original SIFT algorithm proposed by Lowe and self-organizing map based feature matching algorithm

Stereo PairImages usedfor analysis

Original SIFT algorithm proposed by Lowe

Improved Self-OrganizingMap (SOM) based feature matching

Computational

Time (sec)

Number of matched

Features

Computational

Time (sec)

Number of matched

Features Stereo Pair 1 0.07813 30 0.03719 63 Stereo Pair 2 0.09130 19 0.03441 29 Stereo Pair 3 0.04816 14 0.03214 25 Stereo Pair 4 0.10483 68 0.07022 120 Stereo Pair 5 0.06730 39 0.03352 51 Stereo Pair 6 0.05773 16 0.03576 28 Stereo Pair 7 0.04773 16 0.04483 25 Stereo Pair 8 0.04374 8 0.04281 9 Stereo Pair 9 0.05611 51 0.05321 74 Stereo Pair 10 0.07562 31 0.04294 68 Stereo Pair 11

(Rotated) 0.04926 25 0.03535 29

Stereo Pair 12(Rotated) 0.08034 19 0.03684 38




Stereo Pair 16(Illumination) 0.05885 21 0.03934 48





Fig. 11 Comparison graph showing the computation time for calculation of each stereo pair with the original SIFT algorithm andthe modified self-organizing map based feature matching algorithm


in orientation assignment step while our methodology consists of winner calculation technique so the rate of number of features increases. Our approach consists of more number of stable features generated with the self-organizing feature matching methodology.

As evident, the experiments were conducted on the different sets of image pair under different conditions such as illumination during the image acquisition, different camera viewpoint, etc. The advantage of the proposed matching technique over the original SIFT matching technique is evident from Figure 11. Analyzing all results, the following conclusions may be inferred: 1. The learning process of the self-organizing map improves the

matching results. As training progresses the results are better. 2. Our self-organizing map based proposed methodology for the

feature matching in stereo images produces better results than the feature matching technique given by Lowe.

3. Due to the definition of learning rates in Eq. 12, we have verified that the approach developed in this paper requires less processing time in comparison to the processing time of SIFT algorithm proposed by Lowe.

4. As stated the processing complexity of the proposed self-organizing map has been reduced and the matching time cost is reduced 1.5 times in comparison to the SIFT method given by Lowe.

Two main experiments were conducted to identify the

differences between the original SIFT algorithm proposed by Lowe and the proposed self-organizing map based feature matching. Fig. 11 shows the performance results using the original SIFT algorithm and the self-organizing map based feature matching. The advantages of the proposed matching technique over the original SIFT matching technique are evident from Fig. 11. Different sets of real time stereo images were matched to evaluate the performance of the computational matching time of the proposed approach with respect to the original SIFT. The proposed approach results in reduced processing time for matching of the stereo images with efficient feature matching.

6. Conclusion and Future Work This work proposed a new approach to feature matching in real-

time stereo images with different camera positions. The original SIFT algorithm developed by Lowe was modified by using the self-organizing map. The proposed approach was evaluated using a set of real scenarios, with different sets of stereo pair images under different camera viewpoints. The results showed the advantages of the self-organizing map in relation to the original SIFT in terms of time performance. The experimental results show the effectiveness of the proposed approach.

Considering time efficiency, our approach can be used as a foundation for the future development of stereo vision and path finding applications in the robotic field. Future works will be focused on the implementation of stereo vision-based robot navigation for the building of simultaneous localization and map.

ACKNOWLEDGEMENT This work was supported by the research grant of the Kongju

National University in 2011.

REFERENCES

1. Belongie, S., Malik, J., and Puzicha, J., “Shape Matching and Object Recognition Using Shape Contexts,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 24, No. 4, pp. 509-521, 2002.

2. Berg, A. C., Berg, T. L., and Malik, J., “Shape matching and object recognition using low distortion correspondence,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 26-33, 2005.

3. Hosoda, K., Sakamoto, K., and Asada, M., “Trajectory generation for obstacle avoidance of uncalibrated stereo visual servoing without 3D reconstruction,” Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vol. 1, pp. 29-34, 1995.

4. Michels, J., Saxena, A., and Ng, A. Y., “High speed obstacle avoidance using monocular vision and reinforcement learning,” Proc. of the 22nd International Conference on Machine Learning, Vol. 2, pp. 593-600, 2005.

5. Sabe, K., Fukuchi, M., Gutmann, J. S., Ohashi, T., Kawamoto, K., and Yoshigahara, T., “Obstacle Avoidance and Path Planning for Humanoid Robots using Stereo Vision,” Proc. of the IEEE International Conference on Robotics and Automation (ICRA), Vol. 1, pp. 592-597, 2004.

6. Bertozzi, M. and Broggi, A., “GOLD: A parallel real-time stereo vision system for generic obstacle and lane detection,” IEEE Trans. on Image Process, Vol. 7, No. 1, pp. 62-81, 1998.

7. Bensrhair, A., Bertozzi, M., Broggi, A., Fascioli, A., Mousset, S., and Toulminet, G., “Stereo vision-based feature extraction for vehicle detection,” Proc. of the IEEE Intell. Vehicle Symp., Vol. 2, pp. 465-470, 2002.

8. Chae, Y., Choi, C., Kim, J., and Jo, S., “Noninvasive sEMG-based Control for Humanoid Robot Teleoperated Navigation,” Int. J. Precis. Eng. Manuf., Vol. 12, No. 6, pp. 1105-1110, 2011.

9. Murray, D. and Little, J., “Using real-time stereo vision for mobile robot navigation,” J. of Autonomous Robots, Vol. 8, No. 2, pp. 161-171, 2000.

10. Ohya, I., Kosaka, A., and Kak, A., “Vision-based navigation by a mobile robot with obstacle avoidance using single-camera vision and ultrasonic sensing,” IEEE Trans. on Robotics and Automation, Vol. 14, No. 6, pp. 969-978, 1998.

11. Choi, K. S. and Lee, S. G., “Enhanced SLAM for a Mobile Robot using Extended Kalman Filter and Neural Networks,” Int. J. Precis. Eng. Manuf., Vol. 11, No. 2, pp. 255-264, 2010.

1368 / AUGUST 2012 INTERNATIONAL JOURNAL OF PRECISION ENGINEERING AND MANUFACTURING Vol. 13, No. 8 12. Davison, A. J., Reid, I. D., Molton, N. D., and Stasse, O.,

“MonoSLAM: Real-time single camera SLAM,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 29, No. 6, pp. 1052-1067, 2007.

13. Paz, L. M., Pinies, P., Tardos, J. D., and Neira, J., “Large scale 6 dof slam with stereo-in-hand,” IEEE Trans. on Robotics, Vol. 24, No. 5, pp. 946-957, 2008.

14. Sim, R., Elinas, P., Griffin, M., and Little, J. J., “Vision-based SLAM using the Rao-Blackwellised particle filter,” Proc. of the IJCAI Workshop on Reasoning with Uncertainty in Robotics (RUR), 2005.

15. Shi, J. and Tomasi, C., “Good Features to Track,” Proc. of the 9th IEEE Conference on Computer Vision and Pattern Recognition, pp. 593-600, 1994.

16. Lowe, D. G., “Distinctive image features from scale-invariant keypoints,” Int. J. of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004.

17. Bay, H., Tuytelaars, T., and Van Gool, L., “SURF: Speeded Up Robust Features,” Proc. of the European Conference on Computer Vision, pp. 404-417, 2006.

18. Schmid, C. and Mohr, R., “Local grayvalue invariants for image retrieval,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 5, pp. 530-534, 1997.

19. Mikolajczyk, K. and Schmid, C., “Scale & affine invariant interest point detectors,” Int. J. of Computer Vision, Vol. 60, No. 1, pp. 63-86, 2004.

20. Ke, Y. and Sukthankar, R., “PCA-SIFT: A More Distinctive Representation for Local Image Descriptors,” Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 506-513, 2004.

21. Friedman, J. H., Bentley, J. L., and Finkel, R. A., “An algorithm for finding best matches in logarithmic expected time,” ACM Trans. on Mathematical Software, Vol. 3, pp. 209-226, 1977.

22. Beis, J. S. and Lowe, D. G., “Shape indexing using approximate nearest-neighbour search in high-dimensional spaces,” Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1000-1006, 1997.

23. Scharstein, D. and Szeliski, R., “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” Int. J. of Computer Vision, Vol. 47, No. 1-3, pp. 7-42, 2002.

24. Koenderink, J. J., “The structure of images,” Biol. Cybern., Vol. 50, pp. 363-370, 1984.

25. Lindeberg, T., “Scale-space theory: a basic tool for analyzing structures at different scales,” J. of Applied Statistics, Vol. 21, No. 2, pp. 224-270, 1994.

26. Brown, M. and Lowe, D. G., “Invariant features from interest point groups,” Proc. of the British Machine Vision Conference, pp. 656-665, 2002.

27. Kohonen, T., “The Self-Organizing Map,” Proc. of the IEEE,

Vol. 78, No. 9, pp. 1464-1480, 1990.

28. Kohonen, T., “Self-Organizing Maps,” Springer, 1995.

29. Chen, O. T.-C., Sheu, B. J., and Fang, W.-C., “Image compression using self organization networks,” Proc. of the IEEE Trans. on Circuits and Systems on Video Tech., Vol. 4, No. 5, pp. 480-489, 1994.

Documents

An improved feature matching technique for stereo vision applications with the use of self-organizing map