

Information Sciences 179 (2009) 4174–4198


Vision-based global localization for mobile robots with hybrid maps of objects and spatial layouts

Soonyong Park a, Soohwan Kim a, Mignon Park b, Sung-Kee Park a,*

a Center for Cognitive Robotics Research, Korea Institute of Science and Technology (KIST), 39-1 Hawolgok-dong, Sungbuk-gu, Seoul, Republic of Korea
b Department of Electrical and Electronic Engineering, Yonsei University, 134 Sinchon-dong, Seodaemun-gu, Seoul, Republic of Korea


Article history:
Received 7 May 2008
Received in revised form 19 June 2009
Accepted 25 June 2009

Keywords:
Hybrid map
Global localization
Mobile robot
Stereo vision
Object recognition

0020-0255/$ - see front matter © 2009 Published by Elsevier Inc.
doi:10.1016/j.ins.2009.06.030

* Corresponding author. Fax: +82 2 958 5629.
E-mail address: [email protected] (S.-K. Park).

This paper presents a novel vision-based global localization that uses hybrid maps of objects and spatial layouts. We model indoor environments with a stereo camera using the following visual cues: local invariant features for object recognition and their 3D positions for object pose estimation. We also use the depth information at the horizontal centerline of the image where the optical axis passes through, which is similar to the data from a 2D laser range finder. This allows us to build our topological node that is composed of a horizontal depth map and an object location map. The horizontal depth map describes the explicit spatial layout of each local space and provides metric information to compute the spatial relationships between adjacent spaces, while the object location map contains the pose information of objects found in each local space and the visual features for object recognition. Based on this map representation, we suggest a coarse-to-fine strategy for global localization. The coarse pose is estimated by means of object recognition and SVD-based point cloud fitting, and then is refined by stochastic scan matching. Experimental results show that our approaches can be used for an effective vision-based map representation as well as for global localization methods.

© 2009 Published by Elsevier Inc.

1. Introduction

How to represent an environmental map is a fundamental problem in mobile robotics, and it has received significant attention for many years [32]. It requires determining the shape and extent of the physical environment and locating spatial entities such as landmarks and objects in a reference frame. Traditionally, robot maps have been classified into two approaches: metric and topological. Metric maps [6,11,25] represent the environment with geometrical features, and they are global representations of the environment in the sense that they use a single reference frame. Topological maps [5,20], on the other hand, describe the spatial relationships between adjacent spaces that have their own reference frames and link relevant data to those spaces. A topological map has a graph-like structure, where nodes correspond to local spaces and arcs connect pairs of neighboring spaces. Recently, the idea of a hybrid map representation has been actively studied [30,33]; its structure combines a local metric approach with a global topological approach. In this method, the former is used for precise navigation in a local space and the latter is applied to move between local spaces.

In recent years, in addition to the metric and topological maps, cognitive maps [13,17] have been actively studied for including semantic information in the maps, which can give robots the ability to communicate with humans using a common set of terms and spatial concepts.




Since the cognitive map was first introduced by Tolman [28,34], there have been many studies [40,41] attempting to understand how a human creates a cognitive map from his or her perceptions of the environment, and these have become the references for cognitive maps for robots. Kuipers [22] initially attempted to conceptualize a human cognitive map for robotic mapping. He proposed a cognitive map that consists of five different types of information, each with its own representation: observations, route descriptions, fixed features, topological relations, and metrical relations. More recently, Yeap and Jefferies [42] examined the nature of the early cognitive map by reviewing and implementing the theories that have been postulated to explain the human cognitive mapping process. They classify conventional cognitive maps into two paradigms: space-based and object-based. They then postulate that the human cognitive mapping process begins by computing a description of the local environment, such as its appearance and extent. In addition, the environment is conveniently abstracted based on the object entities found there. The work proposed in this paper attempts to create such a cognitive map, which is represented by the shape of the local space and the object entities found in the space. In order to validate the proposed map, an attempt is also made to estimate the robot pose by suggesting a localization method.

In vision-based cognitive mapping and localization, the recognition modality is an essential factor because the semantic information is based on its recognition capability. Therefore, most of the recent studies on vision-based mapping and localization have been based on either a place recognition approach or an object recognition approach. Most place-recognition-based systems [8,18,21,29,31,39] represent the environment as representative views captured from the environment, or as a set of specific types of visual features, such as local invariant features [24] extracted from the environmental images. The localization problem can then be solved with place recognition methods: matching the visual features detected in the query image to those contained in a pre-built visual feature database, or finding the reference image in a representative image database that is the most similar in appearance to the query image. On the other hand, object-recognition-based systems [27,35,38] represent a specific place by means of the object entities that exist in that place. Localization is then performed by verifying that the recognized object is contained in the place. The localization method proposed in this paper resembles the object-recognition-based approach. In our localization method, therefore, it is an important prerequisite that each local environment provides a unique identification based on the objects found in it.

This paper proposes a method for cognitive map representation, which models each local space by combining its spatial layout with the object entities included in the space and connects the local spaces with a route-based topological graph. The spatial layout describes the empty geometric shape of the local space and provides metric information that is used to compute the spatial relationships between neighboring spaces. The objects found in each local space give the space a unique identification, and their positions are described with respect to each local space's reference frame. Thus, the proposed map has a hybrid structure; that is, it becomes a globally topological and locally hybrid map. In this paper, we also propose a new method for global localization using the proposed map with object recognition, point cloud fitting, and stochastic scan matching. The object recognition first allows a robot to infer the candidate local spaces, and then the point cloud fitting enables the robot to obtain coarse pose information relative to the reference frame of each local space. Finally, the stochastic scan matching refines the inaccuracies of those coarse poses and simultaneously determines the correct space among the candidate local spaces.

In this paper, we use only stereo vision to extract spatial layouts as well as to perform object recognition. Two kinds of visual features are adopted for this purpose: a PCA (Principal Component Analysis) feature [19] and stereo depth [37]. The PCA feature is a normalized SIFT (Scale-Invariant Feature Transform) descriptor [24] to which Principal Component Analysis is applied, which improves SIFT's matching performance for object recognition.

The remainder of this paper is organized as follows. Section 2 gives a brief description of related work. Section 3 presents the proposed map representation. Our global localization process and detailed algorithms are presented in Section 4. Experimental verifications are presented in Section 5, and some concluding remarks are given in Section 6.

2. Related work

2.1. Hybrid map

Recently, there have been several studies on the hybrid mapping of object entities and spatial layouts. Zender and Kruijff [44] suggested a multi-layered conceptual spatial map composed of metric, topological, and conceptual layers. The metric layer establishes an absolute reference frame for each local space, with the spaces connected to each other by means of a topological graph. The explicit shape of a space is represented by line features. The topological layer divides a set of nodes into areas. The conceptual layer represents information about spatial areas and the objects found within those areas in an ontological reasoning module. Ekvall et al. [10] proposed a SLAM system integrated with object recognition. This system adds the detected objects to a 2D line map. Only objects detected from more than one position have their exact positions estimated by triangulation. The two methods described above represent a spatial layout with line features extracted from laser range scans. Galindo et al. [14] proposed a multi-hierarchical semantic map consisting of spatial and conceptual hierarchies that are connected through anchoring. Their spatial layout is represented as a grid map made from sonar range scans, and the topological relationships between local spaces are expressed in a spatial hierarchy. The conceptual hierarchy describes the relationships between objects and each local space. This method detects localization errors through reasoning about the expected locations of objects, and a naïve technique is used for object recognition.



A recent work presented by Yim et al. [43] used a grid map in which the positions of objects used as visual landmarks were indicated. The center point of an object within an image is considered representative of the object's position. Its 2D position in Euclidean space is then measured from IR range scans. However, this method makes it difficult to compute the exact 3D position of an object because a 2D IR range sensor is used.

One similarity among all the previous studies is that they estimate the robot pose with range scan data or odometric data. The objects detected during localization are simply used to compensate for localization error or uncertainty. Another similarity in their methods is the combined use of a vision sensor and a 2D range sensor; that is, the vision sensor is used for object recognition and the range sensor is applied to extract the spatial layout. Compared with those previous methods, our approach uses only a vision sensor both to extract the spatial layout and to recognize objects.

2.2. Global localization

Global localization is the estimation of a robot's pose (position and orientation) in a previously learned map without knowledge of its initial pose, which gives the robot the ability to handle initialization and the kidnapping problem. Previous approaches to vision-based localization differ according to how the map is represented and what is recognized to identify the current environment.

Over the past years, several place-recognition-based localization systems have been developed, in which a previously learned set of environmental images or visual feature models is used to represent an environment. Se et al. [31] represented a global space as a set of local invariant features and the 3D coordinates of these features, and the Kalman filter was applied to learn the map. Global localization is performed by matching the invariant features detected in the current image to the pre-built feature map. They use a trinocular stereo camera. Karlsson et al. [18] proposed a similar method, in which a single camera was used to generate the 3D coordinates of the local invariant features. Sim et al. [29] suggested a vision-based SLAM method using the Rao-Blackwellised particle filter. They use the same type of map as Se et al. [31]. Košecká et al. [21] developed a topological map, where each local space was characterized by a set of representative views. They employ a gradient orientation histogram to capture the essential appearance of the local space. A sparser representation is obtained in a subsequent learning stage by means of Learning Vector Quantization (LVQ). Wang et al. [39] proposed a coarse-to-fine localization method, in which coarse localization was performed by using a Location Vector Space Model (LVSM) and then a voting algorithm was used for fine localization. Cummins and Newman [8] suggested a new algorithm called FAB-MAP for fast appearance-based localization and mapping. They extend bag-of-words image retrieval systems to capture certain combinations of appearance words that tend to co-occur, because such combinations are generated by common objects in the environment. As these methods are based on the appearance of the environment, many training images have to be stored from the environment. The robot pose is obtained by recovering the relative pose to the most similar location.

Additionally, object-recognition-based approaches have been used to perform localization. Tomono and Yuta [35] proposed an object-based topological map and used 3D line-based polyhedral object models for estimating the robot pose. Vasudevan et al. [38] developed an object-based hierarchical probabilistic representation of space, in which they represented a global topological map with object graphs. The SIFT feature [24] is employed to recognize textured objects. As these environmental maps are represented by means of the relationships between objects, their methods are more compact than the place-recognition-based methods. However, the robot pose can only be obtained by computing the relative pose to the recognized object. Therefore, the robot is required to move around so that it can recognize another object to verify the current space. Ranganathan and Dellaert [27] suggested a 3D generative model for representing places using objects and developed learning and inference algorithms for building these models. They apply the Swendsen-Wang algorithm, a cluster Markov Chain Monte Carlo (MCMC) method, and the Markov Random Field (MRF) technique to infer the place model. The 3D location of each object is represented relative to a coordinate frame of the local space. However, it is unclear how to represent and compute the spatial relationships between local spaces.

This paper aims to develop a cognitive map for indoor environments, in which we represent the map by combining the spatial layout of each local space with the spatial objects found in the space. The spatial layout is described as omni-directional horizontal depth information that represents the shape and extent of the local space. Therefore, there is no need for a large number of training images for environment modeling. Although this work is based on object recognition, the robot pose is estimated with respect to the space in which the recognized object is contained. In other words, the robot pose is not computed relative to the recognized object but rather with respect to the reference frame of the space. In the context of communication with humans, our method enables a robot to carry out complex navigational tasks. For example, the robot can perform a task like "go to the kitchen and fetch milk from the refrigerator" by inferring where the kitchen is and where the refrigerator is simultaneously.

3. Map representation

In the present work, we build our map by a human-augmented method [36] instead of autonomous mapping, which means that the map representation is the central issue in this paper. A user brings the robot to some memorable place, like a living room, and commands it to build a map around its location. The proposed map is composed of a global topological map and local hybrid maps.



The topological map represents the spatial relationships between the local spaces and includes all the spaces, where the local spaces form the nodes of the topological map. A local hybrid map means that a local space is described in terms of the object entities found in the space and the geometric shape of the space. In this paper we assume that the mobile robot navigates only in indoor environments whose floor is a flat and horizontal plane.

3.1. Local hybrid map

The local hybrid map is composed of an omni-directional horizontal depth map and an object location map.

3.1.1. Horizontal depth map

The horizontal depth map represents the spatial layout of each node in the form of a range scan, which is based on depth information from the stereo camera. We compute the depth information using the Absolute Sum of Differences (ASD) method for the stereo image [37]. As shown in Fig. 1a, during the omni-directional panning motion of the stereo camera, the depth information along the horizontal plane is acquired. In this paper, we store the data for every one-degree increment. The gathered data are defined with respect to the node reference frame, as shown in Fig. 1b. The horizontal depth map of node N_k is described by the expression:

Z^{(N_k)} = \{ z^{(N_k)}(\theta) \mid \theta = 0, 1, 2, \ldots, 359 \}    (1)

where Z^{(N_k)} is the set of depth information for node N_k and z^{(N_k)}(θ) is the depth information at θ degrees. This is somewhat similar to the output of a 2D laser range finder. In the case where the focal plane of the stereo camera is perpendicular to the ground, we extract the depth information only at the horizon pixels through which the optical axis passes. Unfortunately, we may not be able to extract the depth information at every centered horizontal image pixel because there is insufficient intensity variation in certain environmental areas. Therefore, the pixels where the depth information cannot be estimated are assigned null data. As shown in Fig. 1a, only the depth information at the horizontal cross-section points, such as "A" and "B", is extracted. This provides an efficient way of establishing the similarity measure between the map data and the sensing data, because the environmental features placed on the horizontal focal plane of the camera are only horizontally shifted, even though the depth information changes with the moving camera. In addition, the horizontal depth map can provide metric information for estimating the spatial relationships between neighboring nodes.

3.1.2. Object location map

The object location map is composed of the specific objects that characterize each node. It has two kinds of information about the objects: appearance and location. The appearance information describes the PCA features for object recognition, which are seen from the origin point of a node reference frame. As shown in Fig. 1b, the location information means the locations of the objects with respect to the node reference frame, and it is represented as a point cloud corresponding to the 3D coordinates of each object's PCA features.

The object location map is generated from omni-directional environmental images and their corresponding depth information from a stereo camera. Fig. 2 shows an example of an object location map.

Fig. 1. (a) Gathering horizontal depth information. (b) Example of a hybrid map, composed of horizontal depth information and object locations. The horizontal depth (z) is extracted in every direction, at one-degree increments.


Fig. 2. Example of an object location map.


The seven images in Fig. 2 are omni-directional environmental images taken from node N_k. Some images, in which an object is seen, are used as the object model views. Node N_k includes objects e_1, e_2, and e_3. From the model views of directions 4, 5, and 7, the PCA features of each object are extracted and registered as object feature models. These feature models are used for object recognition in the localization process. In addition, the stereo depth information associated with each object's PCA features is extracted and transformed into 3D coordinates relative to the reference frame of node N_k. Thus, the 3D coordinates form a point cloud, and its mean value is representative of the object's location in 3D space. The components of the object location map for node N_k are defined as follows:

O^{(N_k)} = \{ E^{(N_k)}, F^{(N_k)}, V^{(N_k)} \}
E^{(N_k)} = \{ e_1, e_2, \ldots, e_i \}
F^{(N_k)} = \{ F_{e_1}, F_{e_2}, \ldots, F_{e_i} \}, \quad F_{e_i} = \{ f_j \mid j = 1, 2, 3, \ldots, n \}
V^{(N_k)} = \{ V_{e_1}, V_{e_2}, \ldots, V_{e_i} \}, \quad V_{e_i} = \{ {}^{N_k}v_j \mid j = 1, 2, 3, \ldots, n \}    (2)

where O^{(N_k)} is the object location map of node N_k, which consists of a set of object entities E^{(N_k)}, a set of PCA feature models for the objects F^{(N_k)}, and a set of point clouds for the objects V^{(N_k)}. The feature model F_{e_i} of object e_i is defined by a set of PCA feature vectors f_j. The point cloud V_{e_i} of object e_i is defined by a set of 3D position vectors ^{N_k}v_j corresponding to the PCA features f_j. The position vectors are defined with respect to the reference frame of node N_k.

In the context of localization, the object location map provides an efficient means of avoiding the problem of symmetries in geometrically non-distinctive environments. Since most local spaces in indoor environments have unique object sets that characterize them, the robot is able to determine its current location if these objects can be recognized. Furthermore, since the robot exploits omni-directional images, it can localize itself even in a dynamic environment where people are moving around.
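As a concrete illustration of the node representation in Eqs. (1) and (2), the following Python sketch shows one possible in-memory layout of a local hybrid map. The class and field names, the NaN convention for null depth data, and the use of numpy are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectModel:
    """One object entity e_i registered in a node's object location map."""
    name: str              # e.g. "refrigerator"
    features: np.ndarray   # PCA feature vectors f_j, shape (n, d)
    points: np.ndarray     # 3D positions ^Nk v_j of the features, shape (n, 3)

    @property
    def location(self) -> np.ndarray:
        # The mean of the point cloud is representative of the object's location.
        return self.points.mean(axis=0)

@dataclass
class HybridNode:
    """Local hybrid map of node N_k: horizontal depth map + object location map."""
    node_id: int
    depth: np.ndarray = field(default_factory=lambda: np.full(360, np.nan))
    objects: dict = field(default_factory=dict)   # name -> ObjectModel

    def set_depth(self, theta_deg: int, z_mm: float) -> None:
        # One depth value per one-degree increment; NaN marks directions where
        # stereo depth could not be estimated (null data).
        self.depth[theta_deg % 360] = z_mm
```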



3.2. Global topological map

The global map is represented as a topological scheme, and the relationships between the nodes are organized in a graph structure [2,30]:

M = \{ N, A \}
N = \{ N_1, N_2, \ldots, N_k \}, \quad N_k = \{ Z^{(N_k)}, O^{(N_k)} \}
A = \{ A_1, A_2, \ldots, A_k \}, \quad A_k = \{ d_{kj}, R_{kj}, t_{kj} \mid k \neq j,\ d_{kj} = d_{jk},\ R_{kj} \neq R_{jk},\ t_{kj} \neq t_{jk} \}    (3)

M is the global topological map, composed of a set of nodes N and a set of arcs A. The node N_k corresponds to a local hybrid map. The arc A_k connecting adjacent nodes N_k and N_j has the distance value d_kj, as well as coordinate transformation parameters consisting of a rotation R_kj and a translation t_kj. These are measured by using the horizontal depth information. Fig. 3 shows an example of a global topological map, and the adjacency matrix [15] in Fig. 4 is used for storing its structure, which can also be used for optimal path finding [26]. Therefore, in the sense of how one local space is connected to the next, the adjacency matrix can provide route information in the form of an optimal path.

In our proposed method, we do not use a unified global reference frame. This means that, rather than using a single metric reference frame, each local space has its own local reference frame. Therefore, our proposed map requires lower space complexity and has higher scalability than other global maps. In addition, this approach avoids the accumulation of propagation errors, which is a crucial problem when using a global reference frame.

Each node has depth information and object location information with respect to its own local reference frame, and the metric relations between neighboring nodes are only locally consistent. Because the coordinate transformation parameters between connected nodes are also necessary for path planning and following [26], these are stored in the form of the adjacency matrix, just like the distance information shown in Fig. 4. In addition, each node can have descriptive information such as room-1 or living room, which can provide effective information for human-robot interaction using voice commands.
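To illustrate how the adjacency matrix can store the arc information and support optimal path finding between nodes, here is a hedged sketch. The function names, the use of a large constant D for unconnected node pairs (matching the convention of Fig. 4), and the choice of Dijkstra's algorithm are assumptions; the paper only states that the adjacency matrix can be used for optimal path finding [26].

```python
import heapq
import numpy as np

def build_adjacency(num_nodes, arcs, d_max=1e9):
    """Adjacency matrix of inter-node distances; d_max (D in Fig. 4) marks 'not connected'."""
    A = np.full((num_nodes, num_nodes), d_max)
    np.fill_diagonal(A, 0.0)
    for (k, j, d_kj) in arcs:           # arcs: list of (node_k, node_j, distance in mm)
        A[k, j] = A[j, k] = d_kj        # d_kj = d_jk
    return A

def shortest_route(A, start, goal, d_max=1e9):
    """Dijkstra over the adjacency matrix: returns the optimal node sequence."""
    n = len(A)
    dist, prev, visited = {start: 0.0}, {}, set()
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == goal:
            break
        for v in range(n):
            if A[u, v] < d_max and v not in visited:
                nd = d + A[u, v]
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(heap, (nd, v))
    route = [goal]
    while route[-1] != start:
        route.append(prev[route[-1]])
    return route[::-1]
```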

For this scheme to work, the Euclidean space covered by a specific node has to include the centers of the adjacent nodes that are connected to that node. This is a necessary condition for estimating the coordinate transformation matrices (R, t) between adjacent nodes. This constraint is shown conceptually in Fig. 5, where {N_{k-1}}, {N_k}, and {N_{k+1}} denote the reference frames of each node, and dashed lines represent the Euclidean space that each node covers.

To compute the coordinate transformation matrices between adjacent nodes, we employ an algorithm very similar to the fine pose estimation based on the stochastic scan matching method, which will be explained in the next section. As shown in Fig. 5, node N_k is connected to node N_{k-1}. The pose of {N_k} relative to {N_{k-1}} is then computed as follows.

Fig. 3. Example of a global topological map of an indoor environment. Each arc has relative positions and angles between neighboring nodes.


Fig. 4. An adjacency matrix of the distances between neighboring nodes for a global topological map. In this figure, D represents a maximum value for the distance setting.

Fig. 5. Constraint condition between adjacent nodes.


The node pose denotes the coordinate transformation parameters (x, y, θ) between {N_{k-1}} and {N_k}. The parameters align the horizontal depth map of N_k to that of N_{k-1}. Fig. 8 shows a graphical representation of the pose estimation strategy, in which {N} and {R} correspond to {N_{k-1}} and {N_k}, respectively. Initially, the samples (i.e., candidates for the node pose) are evenly distributed around the approximate position of the origin of {N_k} with respect to {N_{k-1}}. The approximate position is measured by odometry in advance. Then the weights of the samples are updated through the summation of the differences between the horizontal depth information of node N_{k-1} and that of the samples. Consequently, the node pose is obtained as a weighted summation of the samples. The proposed algorithm is basically the same as the fine pose estimation, except that the samples are initially distributed around the origin of the node reference frame.

4. Global localization

Using the hybrid map defined in Section 3, the global localization process is performed in three stages: perception, coarse pose estimation, and fine pose estimation. In the first stage, the robot recognizes objects from acquired images of its environment and determines the candidate nodes where it is expected to be located. Once the candidate nodes are identified, the coarse poses with respect to these nodes are computed by using 3D point cloud fitting. In the final stage, the correct node is determined and the fine pose relative to this correct node is estimated by means of a Monte Carlo method based on stochastic scan matching.

4.1. Perception

The robot gathers omni-directional environment images for localization and the corresponding depth information from the stereo camera without any moving behavior. The PCA features are then extracted from the collected images. The 3D coordinates of the PCA features are also computed from the stereo depth information and are defined with respect to the robot reference frame {R}.

Object recognition is performed by matching the PCA features extracted from the omni-directional images for localization to the object feature models stored for all of the nodes. By doing so, each node can obtain the set of PCA features that are matched with the PCA features of the objects recognized by the robot. These sets of matched PCA features are then used to compute the coarse pose relative to each candidate node. In this paper, we employ the spectral matching method referenced in [23] for object recognition.

Candidate nodes are determined based on the object recognition results. A candidate node contains one or more objects identical to the objects recognized by the robot.


Fig. 6. Distance and angular relationships between adjacent objects. (a) Objects recognized by the robot. (b) Identical objects included in a node.


In addition, the set of identical objects included in each candidate node should have the same configuration as the set of recognized objects. An identical object means an object that is contained in a candidate node and is the same object that the robot recognizes. Suppose that the robot recognized objects e_1, e_2, e_3, as shown in Fig. 6a, and that these are also included in node N, as shown in Fig. 6b; then the objects included in the node are identical objects.

For each candidate node, we can define a vicinity probability that the robot is located in that node. This probability is applied to remove useless candidate nodes. If the probability of a candidate node is under a threshold value, the node is dropped from the candidate nodes. Experimentally, the threshold value was set to 0.5 in this paper. In the following, we explain how to compute the vicinity probability with the example shown in Fig. 6.

As shown in Fig. 6a, if the robot recognizes the objects e_1, e_2, and e_3, the position vectors of these objects are computed by averaging the 3D coordinates of each object's PCA features. Each object has a distance relationship l_i [mm] and an angular relationship θ_i [°] with only its two neighboring objects. We consider that the vicinity probability should be a function of several factors. First, it should depend on the number of common objects between the recognized objects and the identical objects included in each candidate node. Second, it should depend on the spatial relationships between the recognized objects and the identical objects: the distance and angular relationships. Therefore, the vicinity probability should increase with an increase in the number of common objects and a decrease in the differences of the spatial relationships. The vicinity probability is then computed by

p(z \mid N, m) = \frac{1}{1 + k \cdot z}    (4)

where p(z|N, m) is the vicinity probability that the robot is located in candidate node N, and m denotes the object location map of node N. k is a weighting factor for the number of common objects, defined as k = exp(n_c)^{-0.5}, where n_c denotes the number of common objects between the recognized objects and the identical objects included in node N. The similarity measure z compares the distance and angular relationships between the recognized objects and the identical objects. That is,

z = \alpha \sum_{i=1}^{n_c} | l_i - l'_i | + \beta \sum_{i=1}^{n_c} | \theta_i - \theta'_i |    (5)

where α and β (in this paper α = 0.005 and β = 0.005) are weighting factors that account for the scale differences between the distances (millimeters) and angles (degrees). l'_i and θ'_i are the distance and angular relationships of the identical objects in Fig. 6b, respectively. When n_c = 2, we ignore the angular difference term in (5) (in this case α = 0.01 and β = 0). However, we cannot compute the vicinity probability for a candidate node that includes only one identical object, since it is impossible to define the distance and angular relationships with a single object. Because of this, a node that includes only one identical object is retained as a candidate node.
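A minimal sketch of the vicinity probability of Eqs. (4) and (5), assuming the pairwise distance and angular relations of the common objects have already been extracted and paired between the recognized set and the candidate node. The reading k = exp(n_c)^{-0.5} follows the text above, and all function and variable names are illustrative.

```python
import numpy as np

def vicinity_probability(rec_rel, node_rel, alpha=0.005, beta=0.005):
    """
    rec_rel, node_rel: lists of (l_i [mm], theta_i [deg]) relations of the common
    objects, ordered identically for the recognized set and the node's identical set.
    Returns p(z|N, m) = 1 / (1 + k * z) with k = exp(n_c) ** -0.5, or None when the
    node shares only one object (such a node is simply retained as a candidate).
    """
    n_c = len(rec_rel)
    if n_c < 2:
        return None
    if n_c == 2:
        alpha, beta = 0.01, 0.0          # the angular term is ignored when n_c = 2
    z = sum(alpha * abs(l - lp) + beta * abs(t - tp)
            for (l, t), (lp, tp) in zip(rec_rel, node_rel))
    k = np.exp(n_c) ** -0.5
    return 1.0 / (1.0 + k * z)
```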

4.2. Coarse pose estimation

A coarse pose is computed by fitting the 3D point cloud of the recognized objects to the corresponding 3D point cloud of the identical objects included in a candidate node. We calculate coarse poses with respect to all candidate nodes. The coarse pose is defined in 3D, because a 3D point cloud is used to compute it. As a first step, the 3D coarse pose is described in the form of a homogeneous transformation matrix [7], denoted ^R T_N. This matrix describes the robot reference frame {R} with respect to the node reference frame {N}, and consists of a rotation matrix and an origin vector. The rotation matrix denotes the orientation of the moving frame {R} with respect to the fixed frame {N}. The origin vector is a position vector that locates the origin of the frame {R}. Fig. 7 shows an example of computing the transformation matrix. In Fig. 7a, ^R p_i are the 3D position vectors of the PCA features of the recognized objects, and are defined with respect to the robot reference frame {R}. These vectors form a point cloud.


Fig. 7. (a) Point cloud of the recognized objects. (b) Point cloud of objects included in a candidate node. (c) Point cloud fitting between the recognized objects and the identical objects included in the candidate node.


^N v_i in Fig. 7b are the 3D position vectors of the PCA features of the identical objects included in a candidate node. They correspond to the PCA features of the objects recognized by the robot in Fig. 7a. ^N v_i are defined with respect to the node reference frame {N}, where i = 1, ..., n and n is the number of matched PCA features between the recognized objects and the identical objects included in the candidate node. Consequently, ^R T_N fits ^R p_i onto ^N v_i as shown in Fig. 7c, and this transformation is described by the expression:

{}^N v_i = {}^R T_N \, {}^R p_i    (6)

Since ^R p_i and ^N v_i are already known, we can compute ^R T_N in (6) by using a parameter estimation method. For the work associated with this paper, the fitting method referenced in [3] is employed. This method, which requires the coordinates of three or more non-collinear points, is based on the Singular Value Decomposition (SVD) and provides least-squares estimates of the rigid body transformation parameters. To begin with, let ^N v_i and ^R p_i in (6) be

{}^N v_i = [\, v_{xi}\ \ v_{yi}\ \ v_{zi} \,]^T, \quad {}^R p_i = [\, p_{xi}\ \ p_{yi}\ \ p_{zi} \,]^T    (7)

Expand (7) for all position vectors and take the mean:

{}^N\bar{v} = [\, \bar{v}_x\ \ \bar{v}_y\ \ \bar{v}_z \,]^T = \frac{1}{n} \left[\, \sum_{i=1}^{n} v_{xi}\ \ \sum_{i=1}^{n} v_{yi}\ \ \sum_{i=1}^{n} v_{zi} \,\right]^T    (8)

{}^R\bar{p} = [\, \bar{p}_x\ \ \bar{p}_y\ \ \bar{p}_z \,]^T = \frac{1}{n} \left[\, \sum_{i=1}^{n} p_{xi}\ \ \sum_{i=1}^{n} p_{yi}\ \ \sum_{i=1}^{n} p_{zi} \,\right]^T    (9)

where ^N v̄ and ^R p̄ are the mean position vectors. Then, the relative position vectors of ^N v_i and ^R p_i are computed with the mean vectors as follows:

{}^N v'_i = {}^N v_i - {}^N\bar{v}, \quad {}^R p'_i = {}^R p_i - {}^R\bar{p}    (10)

Now, let

C = \frac{1}{n} \sum_{i=1}^{n} {}^N v'_i \, {}^R p'^{\,T}_i    (11)

where C is the cross-dispersion matrix. The singular value decomposition of matrix C in (11) provides

C = U \cdot W \cdot V^T    (12)

where U and V are orthogonal matrices, and W is a diagonal matrix that contains the singular values of matrix C. The number of non-zero singular values indicates the rank of C. From (12), the rotation matrix is computed as follows:

{}^N_R R = U \cdot \mathrm{diag}\!\left(1,\ 1,\ \det\!\left(U \cdot V^T\right)\right) \cdot V^T    (13)

where ^N_R R is a rotation matrix that describes the unit vectors of {R}'s three principal axes relative to {N}. Namely, the rotation matrix describes the orientation of the robot. From the mean vectors in (8) and (9) and the rotation matrix in (13), the origin vector is given by

{}^N q_R = {}^N\bar{v} - {}^N_R R \, {}^R\bar{p}    (14)



where ^N q_R is the origin vector that locates the origin of {R} relative to {N}. Thus, the origin vector describes the position of the robot. Therefore, the homogeneous transformation matrix is given by pairing the rotation matrix and the origin vector:


{}^R T_N = \begin{bmatrix} {}^N_R R & {}^N q_R \\ 0\ \ 0\ \ 0 & 1 \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & q_x \\ r_{21} & r_{22} & r_{23} & q_y \\ r_{31} & r_{32} & r_{33} & q_z \\ 0 & 0 & 0 & 1 \end{bmatrix}    (15)

From the elements of the rotation matrix ^N_R R and the origin vector ^N q_R in (15), we can compute the 3D robot pose. The robot pose is expressed by 6 DOF (x, y, z, θ_x, θ_y, θ_z) relative to {N}, where (x, y, z) are the position coordinates and are the same as ^N q_R, and (θ_x, θ_y, θ_z) are the rotation angles about the x-axis, y-axis, and z-axis, respectively. From (15), we can compute the rotation angles as follows [7]:

\theta_y = \mathrm{Atan2}\!\left(-r_{31},\ \sqrt{r_{11}^2 + r_{21}^2}\right)
\theta_z = \mathrm{Atan2}\!\left(r_{21}/\cos\theta_y,\ r_{11}/\cos\theta_y\right)
\theta_x = \mathrm{Atan2}\!\left(r_{32}/\cos\theta_y,\ r_{33}/\cos\theta_y\right)    (16)

where the Atan2(y, x) function calculates the arc tangent of the two variables y and x. This is similar to calculating the arc tangent of y/x, except that the signs of both arguments are used to determine the quadrant of the result [7]. Here, we denote the 3D robot pose as a coarse pose not only because it is inaccurate, but also because it is not yet possible to establish the correct node. The inaccuracy of the computed pose (x, y, z, θ_x, θ_y, θ_z) stems from the measurement error of the vision sensor and the matching error between the PCA features of the recognized objects and those of the identical objects included in each candidate node. For instance, as the robot moves only on a 2D horizontal plane, it should satisfy a ground constraint: z = z_c, θ_x = θ_y = 0, where z_c is a constant. In most cases, however, the computed z coordinate varies, and the rotation angles θ_x and θ_y are typically not zero. To compensate for the inaccuracies inherent in the coarse pose, it is necessary to estimate a fine pose, which is more accurate and also satisfies the ground constraint.
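The coarse pose computation of Eqs. (7)-(16) reduces to a few lines of linear algebra. The sketch below assumes the matched feature positions are stacked as n x 3 numpy arrays and implements Eq. (14) as q = v̄ - R p̄, consistent with Eq. (6); the function names are illustrative, not the authors' code.

```python
import numpy as np

def fit_point_clouds(p_R, v_N):
    """
    SVD-based least-squares fit of robot-frame points p_R (n x 3) onto node-frame
    points v_N (n x 3), following Eqs. (7)-(15). Returns the 4x4 homogeneous
    transform T such that v ~= R p + q.
    """
    p_mean = p_R.mean(axis=0)                    # Eq. (9)
    v_mean = v_N.mean(axis=0)                    # Eq. (8)
    p_c = p_R - p_mean                           # Eq. (10)
    v_c = v_N - v_mean
    C = v_c.T @ p_c / len(p_R)                   # cross-dispersion matrix, Eq. (11)
    U, W, Vt = np.linalg.svd(C)                  # Eq. (12)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    R = U @ D @ Vt                               # rotation matrix, Eq. (13)
    q = v_mean - R @ p_mean                      # origin vector, Eq. (14)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, q
    return T

def rotation_angles(R):
    """Extract (theta_x, theta_y, theta_z) from the rotation matrix as in Eq. (16)."""
    theta_y = np.arctan2(-R[2, 0], np.hypot(R[0, 0], R[1, 0]))
    c = np.cos(theta_y)
    theta_z = np.arctan2(R[1, 0] / c, R[0, 0] / c)
    theta_x = np.arctan2(R[2, 1] / c, R[2, 2] / c)
    return theta_x, theta_y, theta_z
```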

4.3. Fine pose estimation

This section discusses our methodology for determining the correct node and estimating the 2D fine pose. Section 4.1 described how we determine the candidate nodes where the robot is expected to be located. However, this does not determine the correct node. If no candidate node obtains an outstanding vicinity probability in (4), it is difficult to determine the correct node from among the candidate nodes. The fine pose estimation therefore uses a Monte Carlo method [1,16] based on stochastic scan matching in conjunction with the omni-directional horizontal depth and coarse pose information. The Monte Carlo algorithm is widely accepted as an inference method that can cope well with decision making in the context of multimodal uncertainty. In addition, since stereo depth information is robust in a variety of environments and even in the presence of noise, the omni-directional horizontal depth information based on stereo depth enables the robot to estimate a more accurate pose.

Fig. 8 shows a graphical representation of the fine pose estimation strategy.

Fig. 8. Graphical representation of the fine pose estimation process. x̂_0 denotes the approximate pose that is used as the seed position for initial random sampling. x̂ represents the estimated pose after sample convergence. (a) Initial boundary of random samples (dotted ellipse). (b) Boundary of random samples after convergence (dotted ellipse).



The basic approach is to iteratively generate random samples (i.e., hypotheses for the robot pose) according to their weights until all of the samples converge from their initial positions, so that the estimated pose maximizes the overlap between the horizontal depth map of the correct node and the horizontal depth information acquired by the robot. In this paper, we employ the sampling importance resampling (SIR) [1,16] approach to construct samples at every iteration. The proposed procedure to estimate the fine pose is as follows:

4.3.1. Initialization

The initial random samples are evenly generated around the coarse poses relative to all of the candidate nodes, under the assumption that the true position of {R} is included in the boundary of the samples (Fig. 8a and Fig. 16 (first)):

s_0^{(n)} = \hat{x}_0^{(N_k)} + \rho \cdot B w_0^{(n)}    (17)

where \hat{x}_0^{(N_k)} = (x^{(N_k)}, y^{(N_k)}, \theta_z^{(N_k)}) is the 2D coarse pose relative to the reference frame of the candidate node N_k. To maintain the ground constraint, the dimensions of the coarse pose are reduced from 3D to 2D. w_t^{(n)} is a vector of standard normal random variables, B is a 3 × 3 diagonal matrix with non-zero elements corresponding to the process noises (e_x, e_y, e_θ), and ρ is a noise scale factor for the initial random sampling. n indexes the random samples, n = 1, ..., n_s. For this paper, we set e_x = e_y = 100 mm, e_θ = 5°, and ρ = 10.
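As a small illustration of the initialization step in Eq. (17), the sketch below spreads pose samples around one candidate node's 2D coarse pose. The sample count n_s = 500 and the use of radians for the angle are assumptions not stated in the text; names are illustrative.

```python
import numpy as np

def initial_samples(coarse_pose, n_s=500, noise=(100.0, 100.0, np.deg2rad(5.0)), rho=10.0):
    """Eq. (17): generate n_s pose samples (x [mm], y [mm], theta [rad]) around the
    2D coarse pose of one candidate node; noise holds (e_x, e_y, e_theta), rho the scale."""
    B = np.diag(noise)
    w = np.random.randn(n_s, 3)        # standard normal variables w_0^(n)
    return np.asarray(coarse_pose) + rho * (w @ B.T)
```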

4.3.2. Iteration

From the old sample set \{ s_{t-1}^{(n)}, \pi_{t-1}^{(n)}, c_{t-1}^{(n)},\ n = 1, \ldots, n_s \} at iteration step t-1, a new sample set \{ s_t^{(n)}, \pi_t^{(n)}, c_t^{(n)} \} is constructed for iteration step t. Here, \pi_t^{(n)} is the weight of the sample s_t^{(n)} at iteration step t, and c_t^{(n)} is the cumulative weight. The nth of the n_s new samples is constructed in the following four steps:

Step-1: Select a sample s'^{(n)}_t as follows:
(a) Generate a random number r ∈ [0, 1], uniformly distributed.
(b) Find, by binary subdivision, the smallest j for which c_{t-1}^{(j)} ≥ r.
(c) Set s'^{(n)}_t = s_{t-1}^{(j)}.

Step-2: Predict the poses of the random samples as follows:

s_t^{(n)} = s'^{(n)}_t + B w_t^{(n)}    (18)

Step-3: Compute the weights of the random samples that have the poses predicted in Step-2.

\pi_t^{(n)} = p(d_t \mid x_t = s_t^{(n)})    (19)

where \pi_t^{(n)} represents a probability density function and d_t is a similarity measure. To constitute d_t, we first define the omni-directional horizontal depth information for localization as follows:

z_s(\theta), \quad \theta = 0, 1, 2, \ldots, 359    (20)

where z_s(θ) denotes the distance with respect to polar coordinates. Assuming that the data are obtained at pose x_t in candidate node N_k, they can be transformed with respect to the node's reference frame:

z_{st}(\theta) = R \cdot \begin{bmatrix} z_s(\theta)\cos\theta \\ z_s(\theta)\sin\theta \end{bmatrix} + t    (21)

where R and t represent the coordinate transformation from the reference frame of node N_k to the assumed robot pose x_t. Here, we transform the vector z_{st}(θ) back into the polar coordinate data z_{st}(θ'). The similarity measure d_t is then defined as follows:

d_t(x_t) = \sum_{i=0}^{359} \left| z^{(N_k)}(i) - z_{st}(i) \right|    (22)

where z^{(N_k)} is the horizontal depth information of node N_k. The similarity measure cannot be defined for every angle, because it is impossible to transform the coordinates at an image pixel where stereo depth information cannot be extracted. Thus, the summation in the above equation is carried out only over the image pixels where depth information can be extracted. The weight is then computed as follows:

\pi_t^{(n)} = \frac{\left(1 + \bar{d}_t(x_t)\right)^{-1}}{\sum_{n=1}^{n_s} \left(1 + \bar{d}_t(x_t)\right)^{-1}}    (23)

where \bar{d}_t(x_t) is the average of d_t(x_t). In order to assure the rapid and accurate convergence of the samples, we modify the weight function in (23) as follows:



\pi'^{(n)}_t = \frac{\exp\!\left\{ -G \left( \pi_{max} - \pi_t^{(n)} \right) \right\}}{\sum_{n=1}^{n_s} \exp\!\left\{ -G \left( \pi_{max} - \pi_t^{(n)} \right) \right\}}    (24)

where \pi_{max} = \max_n \left( \pi_t^{(n)} \right) and G = K / \pi_{max}. K is a weighting factor for the convergence of the samples. This new weight function exponentially increases the weight of each sample. For this paper, we set K = 10. Now, we replace the weight in (23) with the one in (24): \pi_t^{(n)} \leftarrow \pi'^{(n)}_t. For the next iteration, (s_t^{(n)}, \pi_t^{(n)}, c_t^{(n)}) are stored, where

c_t^{(0)} = 0, \quad c_t^{(n)} = c_t^{(n-1)} + \pi_t^{(n)} \quad (n = 1, \ldots, n_s)    (25)

Step-4: Repeat Step-1 through Step-3 until all of the samples converge into a candidate node. In addition, the range of the sample distribution should satisfy a boundary condition such that the mean distance between each sample position and the average position of the samples is less than a threshold value (Fig. 8b and Fig. 16 (third)):

\frac{1}{n_s} \sum_{n=1}^{n_s} \sqrt{ (x_n - \bar{x})^2 + (y_n - \bar{y})^2 } < D_{threshold}    (26)

where (x_n, y_n) represents the position of the nth sample and (x̄, ȳ) denotes the average position of the samples. Experimentally, we set the threshold value to D_threshold = 150 mm.
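The core of Steps 2 and 3 is the depth-based weighting of Eqs. (20)-(24). The sketch below transforms the robot's scan into the node frame, evaluates the depth difference, and sharpens the resulting weights. Taking d̄_t as the mean difference over the angles where depth is defined, the NaN convention for missing depth, and the function names are interpretations of the text, not the authors' implementation.

```python
import numpy as np

def transform_scan(z_s, pose):
    """Eq. (21): map the robot's polar depth scan z_s(theta), theta = 0..359 deg
    (NaN where stereo depth is undefined), into a candidate node's reference frame
    for an assumed robot pose x_t = (x, y, theta [rad])."""
    x, y, th = pose
    ang = np.deg2rad(np.arange(360))
    qx = np.cos(th) * (z_s * np.cos(ang)) - np.sin(th) * (z_s * np.sin(ang)) + x
    qy = np.sin(th) * (z_s * np.cos(ang)) + np.cos(th) * (z_s * np.sin(ang)) + y
    # re-express as polar data z_st(theta') in the node frame
    z_st = np.full(360, np.nan)
    valid = ~np.isnan(qx)
    bins = np.rad2deg(np.arctan2(qy[valid], qx[valid])).round().astype(int) % 360
    z_st[bins] = np.hypot(qx[valid], qy[valid])
    return z_st

def mean_depth_difference(z_node, z_st):
    """Eq. (22), averaged over the angles where both depth values are defined."""
    ok = ~np.isnan(z_node) & ~np.isnan(z_st)
    return np.abs(z_node[ok] - z_st[ok]).mean() if ok.any() else np.inf

def sample_weights(d_bar, K=10.0):
    """Eqs. (23)-(24): convert the per-sample measures d_bar into normalized weights,
    then sharpen them exponentially to speed up sample convergence."""
    pi = 1.0 / (1.0 + np.asarray(d_bar))
    pi /= pi.sum()
    w = np.exp(-(K / pi.max()) * (pi.max() - pi))
    return w / w.sum()
```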

4.3.3. Estimation

After the iteration process, the correct node is established and determined from the convergence of all the samples. The fine pose is then estimated as the weighted summation of the sample poses:

e[x_t] = \sum_{n=1}^{n_s} \pi_t^{(n)} \cdot s_t^{(n)}    (27)

Also, a confidence measure c(x_t) can be defined for deciding whether the estimated pose is accepted or not. To do this, the minimum similarity measure d_{t,min}(x_t) is first found among the samples, and then the pixel number M_{t,min} is counted by using the image pixels for which the depth information is defined. The confidence measure can then be computed as follows:

c(x_t) = \begin{cases} 0, & \text{if } M_{t,min} < 4 \\ \dfrac{1}{1 + c \cdot d_{t,min}}, & \text{if } M_{t,min} \geq 4 \end{cases}    (28)

Here, the estimated localization information is accepted when the confidence measure is over a threshold value. For this paper, we set c = 0.005 and the threshold value to 0.5. However, the confidence measure could fall under the threshold value when it is difficult to extract the depth information for all of the horizontal pixels. In this case, one of the coarse poses computed in the coarse pose estimation stage should be selected to provide the localization information; it is defined relative to the candidate node that has the largest vicinity probability among the candidate nodes. We believe that selecting the coarse pose in such a case is a more appropriate strategy in a real environment.
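Finally, a short sketch of the estimation stage, Eqs. (27) and (28). The argument names are assumptions: d_min is the minimum similarity measure among the converged samples and m_min the number of horizontal pixels with defined depth.

```python
import numpy as np

def estimate_pose(samples, weights):
    """Eq. (27): fine pose as the weighted sum of the converged sample poses.
    samples: (n_s, 3) array of (x, y, theta); weights: (n_s,) normalized weights."""
    return (weights[:, None] * samples).sum(axis=0)

def confidence(d_min, m_min, c=0.005):
    """Eq. (28): confidence from the best sample's similarity d_min and the pixel
    count m_min; the estimated pose is accepted when the value exceeds 0.5."""
    return 0.0 if m_min < 4 else 1.0 / (1.0 + c * d_min)
```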

5. Experimental results

The map representation method and global localization strategy described above were implemented on our mobile robot and tested in two different kinds of indoor environments. The robot (Fig. 9) was equipped with a stereo camera and two laser range finders (SICK LMS-200) [46]. The stereo camera was made by Videre Design (STH-MDI-C) [47]. It was used as the main sensing modality and was fixed on a pan/tilt mount. For estimating depth information, we adopted the SVS software library supported by Videre Design. To obtain a wide field of view and more visual information, we used a lens with a 6 mm focal length.

The goal of the experiments described in the remainder of this section was to demonstrate that this map can effectively represent indoor environments and enable the robot to reliably estimate its pose.

5.1. Indoor environment-1

In this section we explain and demonstrate in detail the proposed map construction and global localization methods in a home-like environment. The experimental environment (Fig. 10) was about 10 m × 5 m in size and contained various household objects, such as a bed, bookshelf, doll, refrigerator, microwave oven, pictures, flower vase, stereo system, calendar, sofa, and tables.


Fig. 10. A home-like experimental environment and the object entities included in the environment. The map of the environment is composed of five nodes.

Fig. 9. Infotainment robot, a mobile robot equipped with a stereo camera fixed on a pan/tilt mount and two laser range finders.


5.1.1. Map construction

A map of the experimental environment was generated by following the proposed map representation method. Fig. 10 shows the actual environment and its layout. The map was composed of five nodes. There were fourteen objects used as visual landmarks, and their positions are shown in Fig. 10. Only objects that have many visual features were chosen, for successful object recognition.

Fig. 11 shows the omni-directional environmental images gathered at the origin point of each node. The PCA feature models for object recognition were learned from all of the images in which any of the objects used as visual landmarks could be seen, but each node only registers the objects present within its own boundary as members of its object location map. A circle was used for each boundary, with the origin point at the center and a radius of 3500 mm. Each boundary was permitted to overlap with those of the neighboring nodes. For example, although object A was seen in the fourth direction of the environmental images of node 2, it was not registered as a member of node 2's object location map since it fell outside the boundary of node 2. Fig. 12 shows the horizontal depth map of each node and the positions of the objects included in each node. These positions were obtained by averaging the 3D coordinates of each object's PCA feature model. The coordinates were defined relative to the reference frame of each node.

As shown in Figs. 11 and 12, each node was allowed to share objects that appeared in the overlapping area between neighboring nodes, whereas the appearance of each common object differed in accordance with the location where each node's omni-directional environmental images were taken. This difference can give the robot useful information for distinguishing the correct node among the candidate nodes, namely the node whose identical objects are most similar in appearance to the objects recognized by the robot. In Fig. 12, it can also be observed that the z coordinates of common objects differed slightly from each other according to the sharing nodes. This occurred because the PCA feature models of common objects, which were extracted from each sharing node's environmental images, differed from each other.

Using the alignment algorithm for horizontal depth maps presented in Section 3, we computed the alignment parameters between adjacent nodes. There was some overlap between the horizontal depth maps of neighboring nodes, and the odometry information of the robot was also provided. Table 1 shows the alignment results (x [mm], y [mm], θ [°]).


Fig. 11. Omni-directional environmental images of each node. Each object's feature models for object recognition are learned from these images. The objects enclosed within the red boxes belong to the node where the images were captured. The horizontal depth map of each node is also built from the disparity images corresponding to the above images. Direction 4 coincides with the front of the robot, i.e. the direction of the x-axis. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


The ground truth poses were measured with a robot pose estimation system that uses the Iterative Closest Point (ICP) algorithm and laser range data [45]. The ICP algorithm is the most popular scan matching method for estimating the relative pose between two successive locations by minimizing the distance error between the range scans obtained at each location. Moreover, laser range data are more accurate than visual data, and localization systems using laser range finders have shown remarkable results. Therefore, we considered the poses computed with ICP and laser range data as the ground truth.

When the horizontal depth information that exceeds the boundary of each node is pruned, the overlap between the horizontal depth maps of neighboring nodes may be insufficient. In such a case, it is difficult to compute the alignment parameters with the proposed alignment algorithm. Thus, for the algorithm to be practical, all of the horizontal depth information should be accepted and applied without pruning when computing the alignment parameters. The alignment results are shown visually in Fig. 13. It is clear that the shape of the aligned map is very similar to the layout of the experimental environment in Fig. 10. It should be noted that we did not use a unified reference coordinate system, which gave us the benefit that we did not need to accurately align all of the nodes' reference frames. Therefore, for alignment mapping, each successive pair of horizontal depth maps was aligned and subsequently combined based on the transformations from node 1 to node 2, from node 2 to node 3, from node 3 to node 4, and from node 3 to node 5.

5.1.2. Global localization

This section presents the experimental verification of the proposed global localization method. The localization experiments were carried out with the obtained map. To analyze error distributions, the ground truth poses were also measured using the ICP algorithm and laser range data.

Four types of experiments were conducted. The first evaluated the proposed global localization algorithm in detail. The second evaluated the performance of the global localization algorithm in a dynamic situation. The third evaluated the performance of the stochastic scan matching algorithm (SSM for short) used in the fine pose estimation. The fourth evaluated the accuracy of the fine pose estimation.

The first experiment consisted of performing global localization at a random location. For this purpose, we captured the omni-directional environmental images for localization at the random position. These images are shown in Fig. 14.


Fig. 12. Horizontal depth map of each node and the positions of the objects included in each node. (a) Node 1 includes objects A, B, and C. (b) Node 2 includes objects D, L, M, and N. (c) Node 3 includes objects D, E, F, K, and L. (d) Node 4 includes objects E, F, G, H, and I. (e) Node 5 includes objects G, H, I, J, and K.


Seven objects were recognized by the object recognition process: D, E, F, H, K, L, and N. Horizontal depth information from the test position was also extracted from the disparity images corresponding to the images in Fig. 14. Fig. 15 shows the horizontal depth information and the positions of the recognized objects.


Fig. 13. Alignment map integrated with the horizontal depth map of each node in Fig. 12. The coordinate systems indicate the metric relationships (rotation and translation) between neighboring nodes.

Fig. 14. Omni-directional environmental images for localization. The images were captured at a random position. Direction 4 coincides with the front of the robot, i.e. the direction of the x-axis.

Table 1
Results from computing alignment parameters.

Nodes   Ground truth (x (mm), y (mm), θ (°))   Estimated pose (x (mm), y (mm), θ (°))   Error (x (mm), y (mm), θ (°))
1–2     (-2645.95, -561.28, 1.83)              (-2613.32, -521.37, 0.97)                (-32.63, -39.91, 0.86)
2–3     (-1688.69, 407.24, -4.11)              (-1734.63, 440.88, -3.39)                (45.94, -33.64, -0.72)
3–4     (-2020.53, 864.23, -178.73)            (-2028.74, 822.09, -180.17)              (8.21, 42.14, 1.44)
3–5     (-2244.95, -1063.07, -89.50)           (-2257.37, -1046.43, -87.88)             (12.42, -16.64, -1.62)


From Figs. 14 and 15, the positions of the recognized objects can be identified. Object E was located in front of the robot. Objects F and H were located on the left-hand side. Objects D, K, L, and N were located on the right-hand side.

Nodes 2, 3, 4, and 5 were selected as candidate nodes. Table 2 shows the common objects between those recognized by the robot and the identical ones included in each candidate node. The vicinity probability in (4) and the coarse pose relative to each candidate node are also listed in the table. Among the candidate nodes, node 3 obtained the largest vicinity probability and the greatest number of common objects. Since the robot only moves on a 2D horizontal plane, the values of z, θx, and θy were expected to be zero. However, not all of the computed coarse poses satisfied this ground constraint. As can be seen in Table 2, among the candidate nodes, the coarse pose with respect to node 3 satisfied the ground constraint relatively well, and the distance from the node to the robot was also the shortest. Node 3 was therefore determined to be the most likely correct node.
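The way such candidates can be ranked using the ground constraint, the vicinity probability, and the distance to the node is sketched below; the thresholds and the combination rule are illustrative assumptions and not the exact decision criteria of the proposed method.

```python
# A minimal sketch of checking the ground constraint (z = theta_x = theta_y = 0)
# on each candidate node's 6-DOF coarse pose and ranking the candidates.
# Thresholds and the ranking rule are assumptions made for illustration.
import math

candidates = {
    # node: (vicinity probability, coarse pose (x, y, z, theta_x, theta_y, theta_z))
    2: (0.66, (-2701.68, 1008.14, 49.32, 3.33, -1.89, 65.13)),
    3: (0.82, (-897.08, 613.88, -12.83, -1.37, -0.33, 68.80)),
    4: (0.69, (-1178.31, 203.15, -40.87, 0.18, -1.24, -105.65)),
    5: (0.61, (-1621.92, 1283.42, 59.72, -4.81, 2.52, 162.05)),
}

def ground_violation(pose, z_max=30.0, tilt_max=2.0):
    """How strongly a coarse pose violates the ground constraint."""
    x, y, z, tx, ty, tz = pose
    return max(abs(z) / z_max, abs(tx) / tilt_max, abs(ty) / tilt_max)

def distance_to_node(pose):
    x, y, *_ = pose
    return math.hypot(x, y)

# Rank by (ground-constraint violation, -vicinity probability, distance).
ranked = sorted(candidates.items(),
                key=lambda kv: (ground_violation(kv[1][1]),
                                -kv[1][0],
                                distance_to_node(kv[1][1])))
best_node = ranked[0][0]
print("most likely node:", best_node)   # node 3 under these assumptions
```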

After the coarse pose estimation, the correct node and fine pose were determined by using the stochastic scan matching algorithm. Fig. 16 shows how the samples converged during the fine pose estimation process. Three snapshots were selected for illustration, and the sample distribution is presented at each step. In the beginning, 300 samples were uniformly distributed over each candidate node; the 2D coarse poses (x, y, θz) in Table 2 were used as the seed poses for this initial random sampling.


Fig. 15. Horizontal depth information for localization and the positions of the recognized objects.

Table 2
Results of coarse pose estimation.

Node   Common objects   p(z|N,m)   x (mm)     y (mm)    z (mm)   θx (°)   θy (°)   θz (°)
2      D, L, N          0.66       -2701.68   1008.14    49.32    3.33    -1.89     65.13
3      D, E, F, K, L    0.82        -897.08    613.88   -12.83   -1.37    -0.33     68.80
4      E, F, H          0.69       -1178.31    203.15   -40.87    0.18    -1.24   -105.65
5      H, K             0.61       -1621.92   1283.42    59.72   -4.81     2.52    162.05

Fig. 16. Converging samples into the correct node during fine pose estimation. At the beginning (first), after the 3rd iteration (second), and after the 8th iteration (third).


Fig. 17. Omni-directional environmental images for localization in a dynamic situation. The images were captured at the same location as Fig. 14.

Fig. 18. Horizontal depth information for localization in a dynamic situation and the position of the object being recognized. Dotted circles indicate people moving in the robot's surroundings.

Table 3
Results of pose estimation relative to node 3 in a dynamic situation.

               x (mm)    y (mm)   θz (°)   z (mm)   θx (°)   θy (°)
Coarse pose    -861.40   621.48   70.11    15.35     1.17    -2.49
Fine pose      -860.26   609.81   70.87
Ground truth   -848.52   596.23   71.03
Error            11.74   -13.58    0.16

Fig. 19. (a) The two sets of horizontal depth information were misaligned before the stochastic scan matching. (b) The aligned horizontal depth information after applying the stochastic scan matching.


After the third iteration, most of the samples were concentrated on node 3 (second snapshot). The final snapshot shows that all of the samples converged into node 3 when the robot uniquely determined the correct node and fine pose. Therefore, node 3 was determined to be the correct node. The fine pose was (x, y, θ) = (-869.24 mm, 605.13 mm, 70.71°), and the corresponding pose estimated from the laser range data was (-848.52 mm, 596.23 mm, 71.03°). The fine pose is described with respect to the reference frame of the correct node.
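The sample-based fine pose search described above can be outlined as follows; the noise scales, the resampling scheme, and the match_score placeholder are assumptions for illustration and do not reproduce the exact stochastic scan matching formulation used in this work.

```python
# A minimal sketch: samples are drawn around each candidate node's coarse pose,
# weighted by how well the robot's horizontal depth information matches the
# node's map, and resampled until they collapse onto one node (illustrative).
import numpy as np

rng = np.random.default_rng(0)

def init_samples(seed_poses, n=300, sigma_xy=300.0, sigma_th=15.0):
    """Spread n samples per candidate node around its (x, y, theta) seed pose."""
    samples = []
    for node, (x, y, th) in seed_poses.items():
        pts = np.column_stack([rng.normal(x, sigma_xy, n),
                               rng.normal(y, sigma_xy, n),
                               rng.normal(th, sigma_th, n)])
        samples += [(node, p) for p in pts]
    return samples

def resample(samples, weights):
    """Importance resampling: keep samples in proportion to their weights."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    idx = rng.choice(len(samples), size=len(samples), p=w)
    return [samples[i] for i in idx]

def match_score(pose, node):
    """Placeholder for the scan-matching likelihood of a pose against a node's
    horizontal depth map (assumed to be provided elsewhere)."""
    raise NotImplementedError

# Usage outline: iterate weight-and-resample until one node dominates.
# samples = init_samples({2: seed2, 3: seed3, 4: seed4, 5: seed5})
# for _ in range(max_iterations):
#     weights = [match_score(pose, node) for node, pose in samples]
#     samples = resample(samples, weights)
```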

The second experiment focused on global localization performance in a dynamic situation: the robustness of the pose estimation in the presence of people and after removing some of the objects included in each node. For this experiment, we captured the omni-directional environmental images for localization at the same location as in the first experiment. Fig. 17 shows the omni-directional environmental images, in which three people appear in directions 3, 6, and 7. Furthermore, compared with Fig. 14, only object E remains visible; some of the objects were occluded by the people and the others were removed from their positions.

The object recognition process recognized object E. The horizontal depth information of the test position was also extracted from the disparity images corresponding to the images in Fig. 17. Fig. 18 shows the horizontal depth information and the position of the recognized object. Since only one object was recognized, we could not define the vicinity probability in (4). Nodes 3 and 4 were selected as candidate nodes, and node 3 was finally determined as the correct node. Table 3 shows the pose estimation results relative to node 3 and the error between the ground truth and the fine pose.

The third experiment focused on two aspects of our stochastic scan matching performance: the accuracy of the pose estimation and the robustness with respect to errors caused by dynamic objects in the scene. We also compared our method with the ICP algorithm [45]. The experiment consisted of matching two consecutive sets of horizontal depth information acquired at the same robot location. These two sets differed due to sensor noise and the presence of people moving in the robot's surroundings. Since both sets were acquired at the same location, the ground truth relative pose was precisely (0 mm, 0 mm, 0°). We used the horizontal depth information shown in Fig. 15 as the first set. Figs. 17 and 18 show, respectively, the omni-directional environmental images in the dynamic situation and the second set of horizontal depth information acquired at the same location.

Fig. 19a shows the two sets of horizontal depth information, in which we added rotation and translation noise (x = y = 400 mm, θ = 45°) to the initial location of the second set. The initial pose and the misaligned pose are denoted by circles with arrows.

Fig. 20. (a) Localization results for different test positions in the home-like environment. (b) Error distribution for the localization results.

Table 4
Means (μ) and standard deviations (σ) of estimated pose errors.

       x error (mm)         y error (mm)         θ error (°)
       μ         σ          μ         σ          μ         σ
SSM    -17.76    26.16      13.51     66.75      -0.16     0.33
ICP     11.79    292.83     43.94     355.96     -0.92     23.48


Fig. 19b shows the result of correcting the pose error and aligning the two sets of horizontal depth information. The aligned pose was (-26.21 mm, 21.84 mm, -0.11°).

We repeated the experiment 500 times, adding random noise of up to ±400 mm in x and y and up to ±45° in θ to the initial location of the second set of horizontal depth information. We also ran the ICP algorithm in the same manner. Table 4 gives the means (μ) and standard deviations (σ) of the pose errors from the SSM and the ICP method. These statistical results indicate that the proposed method is more robust and accurate than ICP under such perturbations.
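The evaluation protocol of this third experiment can be summarized in code as follows; the matcher interface (a stochastic_scan_match routine taking an init argument) is an assumed placeholder rather than the actual routine.

```python
# A minimal sketch of the noise-injection evaluation protocol: perturb the
# second scan's initial pose, run a matcher, and accumulate the pose errors.
import numpy as np

rng = np.random.default_rng(42)

def evaluate(matcher, scan_a, scan_b, trials=500,
             xy_noise=400.0, th_noise=45.0):
    """Return mean and std of (x, y, theta) errors over repeated trials."""
    errors = []
    for _ in range(trials):
        # Random initial offset within +/-400 mm and +/-45 degrees.
        dx, dy = rng.uniform(-xy_noise, xy_noise, 2)
        dth = rng.uniform(-th_noise, th_noise)
        est_x, est_y, est_th = matcher(scan_a, scan_b, init=(dx, dy, dth))
        # Ground truth is (0, 0, 0) because both scans were taken at the
        # same robot location, so the estimate itself is the error.
        errors.append((est_x, est_y, est_th))
    errors = np.asarray(errors)
    return errors.mean(axis=0), errors.std(axis=0)

# Usage outline (scan data and matcher assumed to be provided elsewhere):
# mu, sigma = evaluate(stochastic_scan_match, first_scan, second_scan)
```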

Fig. 21. An office-like experimental environment. The map of the environment is composed of 22 nodes.

Fig. 22. Alignment map and positions of the objects used as visual landmarks. The map is integrated with the horizontal depth map of each node. The coordinate systems indicate the metric relationships (rotation and translation) between neighboring nodes.


The fourth experiment focused on verifying the accuracy of the fine pose estimation in the home-like environment. For this experiment, a total of 150 different test positions were randomly chosen from the whole experimental environment. Unlike conventional Monte Carlo localization [12], the robot did not need additional motion to localize. Fig. 20a shows the localization results for the test positions, indicating the robot's estimated position and orientation relative to its environment. Fig. 20b shows error distribution diagrams, in which the mean and median errors were less than (38.61 mm, 39.95 mm, 2.91°) and (30.54 mm, 33.47 mm, 2.62°), respectively. As these figures demonstrate, the proposed localization algorithm performed satisfactorily in these experiments.

Fig. 23. (a) Omni-directional environmental images of node 2. The objects enclosed within the red boxes belong to node 2. Direction 4 coincides with the front of the robot, i.e. the direction of the x-axis. (b) Horizontal depth map of node 2 and the positions of the objects included in node 2. Node 2 contains objects 3, 5, 6, and 7. (For interpretation of the references in color in this figure legend, the reader is referred to the web version of this article.)

Fig. 24. Omni-directional environmental images for localization. The images were captured at a random position near node 2. Direction 4 coincides with the front of the robot, i.e. the direction of the x-axis.


Fig. 25. Horizontal depth information and the positions of the recognized objects in static and dynamic situations. Dotted circles indicate people in the robot's surroundings.

Table 5
Results of pose estimation relative to node 2.

                                    x (mm)    y (mm)   θz (°)   z (mm)   θx (°)   θy (°)
Ground truth                        331.41    -62.59   26.19
Static situation    Coarse pose     365.76    -48.00   25.54    63.72     0.38     2.54
                    Fine pose       341.88    -57.78   26.79
                    Error           -10.47     -4.81   -0.60
Dynamic situation   Coarse pose     268.99    -39.09   27.07    66.95    -1.58     2.75
                    Fine pose       289.32    -46.37   27.13
                    Error            42.09    -16.22   -0.94


5.2. Indoor environment-2

In this section, we evaluate the performance of our global localization method in a large office-like environment. To analyze the error distributions, the ground truth poses were again measured using laser range data. The experimental environment (Fig. 21) was about 17 m × 7 m and was heavily cluttered. It was composed of two rooms, which were partitioned into a number of similar cubicles. Fig. 21 shows the layout of the experimental environment and the positions of the objects used as visual landmarks. The map consisted of twenty-two nodes, and there were forty-five objects. Fig. 22 shows the alignment map integrated with the horizontal depth map of each node. The positions of the objects are also shown as red circles.

Two types of experiments were performed. The first evaluated the global localization performance in a dynamic situation. The second verified the accuracy of the fine pose estimation in the office-like environment.

The first experiment consisted of performing global localization at the same location with two different sets of sensing data, acquired in static and dynamic situations. The two sets of sensing data thus differed from each other due to sensor noise and dynamic objects. The robot was placed at a random location near node 2. Fig. 23a shows the omni-directional environmental images of node 2. The horizontal depth map and the positions of the objects included in node 2 are shown in Fig. 23b. Then, the robot acquired two different sets of omni-directional environmental images for localization in the static and dynamic situations (Fig. 24). Fig. 25a shows the horizontal depth information and the positions of the recognized objects (3), (6), and (7) in the static situation. Fig. 25b shows the horizontal depth information and the positions of the recognized objects (6) and (7) in the dynamic situation. In Fig. 24a, four people can be seen in directions 1, 2, 4, and 5, and object 3 was occluded by one of these people. Table 5 shows the pose estimation results relative to node 2 in the static and dynamic situations. As can be seen, the robot correctly estimated its pose in both cases.

The second experiment focused on verifying the accuracy of the fine pose estimation in the office-like environment. For this experiment, a total of 220 different test positions were randomly chosen from the whole experimental environment. Fig. 26a shows the localization results for these test positions, indicating the robot's estimated position and orientation relative to its environment. Fig. 26b shows error distribution diagrams, in which the mean and median errors were less than (38.39 mm, 39.68 mm, 5.26°) and (30.16 mm, 34.98 mm, 4.08°), respectively. As these figures demonstrate, the proposed localization algorithm also performed satisfactorily in this larger environment.


Fig. 26. (a) Localization results for different test positions in the office-like environment. (b) Error distribution for the localization results.


6. Conclusions and future work

In this paper, a new approach was presented for global localization with hybrid maps of objects and spatial layouts. Through extensive experiments carried out with a real robot in two different indoor environments, we showed that the proposed method can be used for an effective and accurate vision-based map representation and global localization process.

The proposed map has a hybrid structure that combines a global topological map with local hybrid maps. The global topological map is represented by a graph structure: each node stands for a distinct local space, and these spaces are linked with arcs that provide route information on how the local spaces are interconnected.


The local hybrid map includes detailed information about each local space in terms of the objects found in that space and its spatial layout. It is composed of an object location map and a horizontal depth map. The object location map enables the robot to draw inferences about the local space in which it is expected to be located. In addition, it can assist the robot in localizing within a geometrically non-distinctive environment. The horizontal depth map represents the spatial layout of each local space and provides metric information to compute the spatial relationships between adjacent local spaces.

By using the proposed map, it is possible to reliably estimate the robot pose. Global localization is performed in three steps: perception, coarse pose estimation, and fine pose estimation. In the perception step, the robot infers a local space through the objects it recognizes, which is similar to the way humans are believed to perceive spaces. On the basis of the recognized objects, the coarse pose is computed by means of the point cloud fitting method. The fine pose compensates for any inaccuracies in the coarse pose and is estimated by using a Monte Carlo method based on stochastic scan matching. We believe that this coarse-to-fine, two-step pose estimation is necessary for the following reasons. First, without fine pose estimation, the robot cannot calculate its pose precisely because of occlusions. Second, without coarse pose estimation, we would have to spread samples over every local map, which would take a very long time for the samples to converge. Consequently, the fine pose estimation cannot be applied alone, and the coarse pose estimation is not sufficient on its own; we therefore carry out the two-step estimation to exploit their complementary advantages.
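The three-step procedure can be summarized by the following outline; every function name below is an assumed placeholder standing in for the corresponding module of the system, not an actual interface.

```python
# A minimal sketch of the three-step global localization pipeline
# (perception, coarse pose estimation, fine pose estimation); illustrative only.
def global_localization(images, depth_scan, hybrid_map):
    # 1. Perception: recognize objects in the omni-directional images and
    #    select the candidate nodes that contain them.
    objects = recognize_objects(images, hybrid_map.object_models)
    candidates = select_candidate_nodes(objects, hybrid_map)

    # 2. Coarse pose: fit the recognized objects' 3D positions to each
    #    candidate node's object location map (SVD-based point cloud fitting).
    coarse_poses = {node: fit_point_clouds(objects, node.object_location_map)
                    for node in candidates}

    # 3. Fine pose: refine with Monte Carlo stochastic scan matching on the
    #    horizontal depth maps, which also singles out the correct node.
    correct_node, fine_pose = stochastic_scan_matching(
        depth_scan, candidates, coarse_poses)
    return correct_node, fine_pose
```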

In this paper, the map construction was not carried out fully autonomously: the node positions were selected manually, and the feature model of each object included in each node was trained by a supervised method. Future work is therefore planned on an object- and spatial-layout-based SLAM system, that is, a method whereby a mobile robot can build the type of map described in this paper by itself. The autonomous selection of node positions can be realized by using the vertices of the Generalized Voronoi Graph (GVG) [5]. Since the omni-directional horizontal depth information is similar to that from a 2D laser range sensor, we can extract such vertices. In addition, the Voronoi edges can be used as paths to move between adjacent nodes. Based on the GVG, the robot can explore an unknown environment with a topological exploration strategy [4,5,9]. Feature learning for the objects included in each node can also be made more autonomous. To accomplish this, we will construct an object feature model database; the features of the objects included in each node can then be learned by adopting the features that are extracted from each node's omni-directional environmental images and matched against the object feature model database. The SLAM system will also integrate the localization approach presented above and work in large environments.

Acknowledgement

This work was performed as part of the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Knowledge Economy of Korea.

References

[1] C. Andrieu, N. De Freitas, A. Doucet, M.I. Jordan, An introduction to MCMC for machine learning, Machine Learning 50 (2003) 5–43.
[2] P. Buschka, A. Saffiotti, Some notes on the use of hybrid maps for mobile robots, in: Proceedings of the 8th International Conference of Intelligent Autonomous Systems, 2004, pp. 547–556.
[3] J.H. Challis, A procedure determining rigid body transformation parameters, Journal of Biomechanics 28 (6) (1995) 733–737.
[4] H. Cheong, S. Park, S.-K. Park, Topological map building and exploration based on concave nodes, in: Proceedings of the International Conference on Control, Automation and Systems, 2008, pp. 1115–1120.
[5] H. Choset, K. Nagatani, Topological simultaneous localization and mapping (SLAM): toward exact localization without explicit localization, IEEE Transactions on Robotics and Automation 17 (2) (2001) 125–137.
[6] I.J. Cox, J.J. Leonard, Modeling a dynamic environment using a Bayesian multiple hypothesis approach, Artificial Intelligence 66 (1) (1994) 311–344.
[7] J.J. Craig, Introduction to Robotics: Mechanics and Control, third ed., Prentice-Hall, 2005.
[8] M. Cummins, P. Newman, FAB-MAP: probabilistic localization and mapping in the space of appearance, The International Journal of Robotics Research 27 (6) (2008) 647–665.
[9] N.L. Doh, W.K. Chung, S. Lee, S. Oh, B. You, A robust general Voronoi graph based SLAM for a hyper symmetric environment, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2003, pp. 218–223.
[10] S. Ekvall, D. Kragic, P. Jensfelt, Object detection and mapping for service robot tasks, Robotica 25 (2007) 175–187.
[11] A. Elfes, Occupancy grids: a probabilistic framework for robot perception and navigation, Ph.D. Dissertation, Department of Electrical and Computer Engineering, Carnegie Mellon University, 1989.
[12] D. Fox, W. Burgard, F. Dellaert, S. Thrun, Monte Carlo localization: efficient position estimation for mobile robots, in: Proceedings of the 16th National Conference on Artificial Intelligence, 1999, pp. 896–901.
[13] H. Frezza-Buet, F. Alexandre, From a biological to a computational model for the autonomous behavior of an animat, Information Sciences 144 (1–4) (2002) 1–43.
[14] C. Galindo, A. Saffiotti, S. Coradeschi, P. Buschka, J.A. Fernández-Madrigal, J. González, Multi-hierarchical semantic maps for mobile robotics, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2005, pp. 3492–3497.
[15] E. Horowitz, S. Sahni, S. Rajasekaran, Computer Algorithms with C++, W.H. Freeman and Company, 1997.
[16] M. Isard, A. Blake, Condensation – conditional density propagation for visual tracking, International Journal of Computer Vision 29 (1) (1998) 5–28.
[17] M.E. Jefferies, W.-K. Yeap (Eds.), Robotics and Cognitive Approaches to Spatial Mapping, Springer Tracts in Advanced Robotics, vol. 38, Springer, 2008.
[18] N. Karlsson, E.D. Bernardo, J. Ostrowski, L. Goncalves, P. Pirjanian, M.E. Munich, The vSLAM algorithm for robust localization and mapping, in: Proceedings of the IEEE International Conference on Robotics and Automation, 2005, pp. 24–29.
[19] Y. Ke, R. Sukthankar, PCA-SIFT: a more distinctive representation for local image descriptors, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2004, pp. 506–513.


[20] D. Kortenkamp, T. Weymouth, Topological mapping for mobile robots using a combination of sonar and vision sensing, in: Proceedings of the 12th National Conference on Artificial Intelligence, 1994, pp. 979–984.
[21] J. Košecká, L. Zhou, P. Barber, Z. Duric, Qualitative image based localization in indoors environments, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2003, pp. 3–10.
[22] B. Kuipers, The cognitive map: could it have been any other way?, in: H.L. Pick, L.P. Acredolo (Eds.), Spatial Orientation: Theory, Research, and Application, Plenum Press, New York, 1983.
[23] M. Leordeanu, M. Hebert, A spectral technique for correspondence problems using pairwise constraints, in: Proceedings of the IEEE International Conference on Computer Vision, 2005, pp. 1482–1489.
[24] D.G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[25] F. Lu, E. Milios, Robot pose estimation in unknown environment by matching 2D range scans, Journal of Intelligent and Robotic Systems 18 (3) (1997) 249–275.
[26] S.-K. Park, M. Kim, C.-W. Lee, Mobile robot navigation based on direct depth and color-based environment modeling, in: Proceedings of the IEEE International Conference on Robotics and Automation, 2004, pp. 4253–4258.
[27] A. Ranganathan, F. Dellaert, Semantic modeling of places using objects, in: Proceedings of Robotics: Science and Systems (RSS), 2007.
[28] A.K. Reid, J.E.R. Staddon, A reader for the cognitive map, Information Sciences 100 (1–4) (1997) 217–228.
[29] R. Sim, P. Elinas, J.J. Little, A study of the Rao–Blackwellised particle filter for efficient and accurate vision-based SLAM, International Journal of Computer Vision 74 (3) (2007) 303–318.
[30] S. Simhon, G. Dudek, A global topological map formed by local metric maps, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 1998, pp. 1708–1714.
[31] S. Se, D. Lowe, J. Little, Mobile robot localization and mapping with uncertainty using scale-invariant visual features, The International Journal of Robotics Research 21 (8) (2002) 735–758.
[32] S. Thrun, Robotic mapping: a survey, in: G. Lakemeyer, B. Nebel (Eds.), Exploring Artificial Intelligence in the New Millennium, Morgan Kaufmann, 2002.
[33] S. Thrun, S. Gutmann, D. Fox, W. Burgard, B. Kuipers, Integrating topological and metric maps for mobile robot navigation: a statistical approach, in: Proceedings of the 15th National Conference on Artificial Intelligence, 1998, pp. 989–995.
[34] E.C. Tolman, Cognitive maps in rats and men, Psychological Review 55 (1948) 189–208.
[35] M. Tomono, S. Yuta, Mobile robot localization based on an inaccurate map, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2001, pp. 399–404.
[36] E.A. Topp, H.I. Christensen, Topological modelling for human augmented mapping, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006, pp. 2257–2263.
[37] E. Trucco, A. Verri, Introductory Techniques for 3D Computer Vision, Prentice-Hall, 1998.
[38] S. Vasudevan, S. Gächter, V. Nguyen, R. Siegwart, Cognitive maps for mobile robots – an object based approach, Robotics and Autonomous Systems 55 (5) (2007) 359–371.
[39] J. Wang, H. Zha, R. Cipolla, Coarse-to-fine vision-based localization by indexing scale-invariant features, IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics 36 (2) (2006) 413–422.
[40] R.F. Wang, E.S. Spelke, Human spatial representation: insights from animals, Trends in Cognitive Sciences 6 (9) (2002) 376–382.
[41] W.-K. Yeap, Toward a computational theory of cognitive map, Artificial Intelligence 32 (1988) 297–360.
[42] W.-K. Yeap, M.E. Jefferies, On early cognitive mapping, Spatial Cognition and Computation 2 (2000) 85–116.
[43] B.-D. Yim, Y.-J. Lee, J.-B. Song, W. Chung, Mobile robot localization using fusion of object recognition and range information, in: Proceedings of the IEEE International Conference on Robotics and Automation, 2007, pp. 3533–3538.
[44] H. Zender, G.-J.M. Kruijff, Multi-layered conceptual spatial mapping for autonomous mobile robots, in: Proceedings of the AAAI Spring Symposium on Control Mechanisms for Spatial Knowledge Processing in Cognitive/Intelligent Systems, 2007, pp. 62–66.
[45] Z. Zhang, Iterative point matching for registration of free-form curves and surfaces, International Journal of Computer Vision 13 (2) (1994) 119–152.
[46] <http://www.sick.com/>.
[47] <http://www.videredesign.com/>.