Semantic Mapping using Virtual Sensors and Fusion of Aerial Images with Sensor Data from a Ground Vehicle



Örebro Studies in Technology 30

Martin Persson

Semantic Mapping using Virtual Sensors and Fusion of Aerial Images with Sensor Data from a Ground Vehicle


© Martin Persson, 2008

Title: Semantic Mapping using Virtual Sensors and Fusion of Aerial Images with Sensor Data from a Ground Vehicle

Publisher: Örebro University 2008

www.publications.oru.se

Editor: Maria Alsbjer [email protected]

Printer: Intellecta DocuSys, V Frölunda 04/2008

ISSN 1650-8580    ISBN 978-91-7668-593-8


Abstract

Persson, Martin (2008). Semantic Mapping using Virtual Sensors and Fusion of Aerial Images with Sensor Data from a Ground Vehicle. Örebro Studies in Technology 30, 170 pp.

In this thesis, semantic mapping is understood to be the process of putting a tag or label on objects or regions in a map. This label should be interpretable by and have a meaning for a human. The use of semantic information has several application areas in mobile robotics. The largest area is in human-robot interaction, where the semantics is necessary for a common understanding between robot and human of the operational environment. Other areas include localization through connection of human spatial concepts to particular locations, improving 3D models of indoor and outdoor environments, and model validation.

This thesis investigates the extraction of semantic information for mobile robots in outdoor environments and the use of semantic information to link ground-level occupancy maps and aerial images. The thesis concentrates on three related issues: i) recognition of human spatial concepts in a scene, ii) the ability to incorporate semantic knowledge in a map, and iii) the ability to connect information collected by a mobile robot with information extracted from an aerial image.

The first issue deals with a vision-based virtual sensor for classification of views (images). The images are fed into a set of learned virtual sensors, where each virtual sensor is trained for classification of a particular type of human spatial concept. The virtual sensors are evaluated with images from both ordinary cameras and an omni-directional camera, showing robust properties that can cope with variations such as changing season.

In the second part a probabilistic semantic map is computed based on an occupancy grid map and the output from a virtual sensor. A local semantic map is built around the robot for each position where images have been acquired. This map is a grid map augmented with semantic information in the form of probabilities that the occupied grid cells belong to a particular class. The local maps are fused into a global probabilistic semantic map covering the area along the trajectory of the mobile robot.

In the third part information extracted from an aerial image is used to improve the mapping process. Region and object boundaries taken from the probabilistic semantic map are used to initialize segmentation of the aerial image. Algorithms for both local segmentation related to the borders and global segmentation of the entire aerial image, exemplified with the two classes ground and buildings, are presented. Ground-level semantic information allows focusing of the segmentation of the aerial image to desired classes and generation of a semantic map that covers a larger area than can be built using only the onboard sensors.

Keywords: semantic mapping, aerial image, mobile robot, supervised learning, semi-supervised learning.


Acknowledgements

If it had not been for Stefan Forslund this thesis would never have been written. When Stefan, my former superior at Saab, gets an idea he believes in, he usually finds a way to see it through. He persuaded our management to start financing my Ph.D. studies. I am therefore deeply indebted to you Stefan, for believing in me and making all of this possible.

I would like to thank my two supervisors, Achim Lilienthal and Tom Duckett, for their guidance and encouragement throughout this research project. You both have the valuable experience needed to pinpoint where performed work can be improved in order to reach a higher standard.

Most of the data used in this work have been collected using the mobile robot Tjorven, the Learning Systems Lab's most valuable partner. Of the members of the Learning Systems Lab, I would particularly like to thank Henrik Andreasson, Christoffer Valgren, and Martin Magnusson for helping with data collection and keeping Tjorven up and running. Special thanks to: Henrik, for support with Tjorven and Player, and for reading this thesis; Christoffer, for providing implementations of the flood fill algorithm and the transformation of omni-images to planar images; and Pär Buschka, who knew everything worth knowing about Rasmus, the outdoor mobile robot I first used.

The stay at AASS, Centre of Applied Autonomous Sensor Systems, has been both educational and pleasant. Present and former members of AASS, you'll always be on my mind.

This work could not have been performed without access to aerial images. My appreciation to Jan Eriksson at Örebro Community Planning Office, and Lena Wahlgren at the Karlskoga ditto, for providing the high quality aerial images used in this research project. And thanks to Håkan Wissman for the implementation of the coordinate transformations that connected the GPS positions to the aerial images. The financial support from FMV (Swedish Defence Material Administration), Explora Futurum and the Graduate School of Modelling and Simulation is gratefully acknowledged. I would also like to express gratitude to my employer, Saab, for supporting my part-time Ph.D. studies.

Finally, to my beloved family, who has coped with a distracted husband and father for the last years, thanks for all your love and support.


Contents

I Preliminaries

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 System Overview
  1.4 Main Contributions
  1.5 Thesis Outline
  1.6 Publications

2 Experimental Equipment
  2.1 Navigation Sensors for Mobile Robots
  2.2 Mobile Robot Tjorven
  2.3 Mobile Robot Rasmus
  2.4 Handheld Cameras
  2.5 Aerial Images

II Ground-Based Semantic Mapping

3 Semantic Mapping
  3.1 Mobile Robot Mapping
    3.1.1 Metric Maps
    3.1.2 Topological Maps
    3.1.3 Hybrid Maps
  3.2 Indoor Semantic Mapping
    3.2.1 Object Labelling
    3.2.2 Space Labelling
    3.2.3 Hierarchies for Semantic Mapping
  3.3 Outdoor Semantic Mapping
    3.3.1 3D Modelling of Urban Environments
  3.4 Applications Using Semantic Information
  3.5 Summary and Conclusions

4 Virtual Sensor for Semantic Labelling
  4.1 Introduction
    4.1.1 Outline
  4.2 The Feature Set
    4.2.1 Edge Orientation
    4.2.2 Edge Combinations
    4.2.3 Gray Levels
    4.2.4 Camera Invariance
    4.2.5 Assumptions
  4.3 AdaBoost
    4.3.1 Weak Classifiers
  4.4 Bayes Classifier
  4.5 Evaluation of a Virtual Sensor for Buildings
    4.5.1 Image Sets
    4.5.2 Test Description
    4.5.3 Analysis of the Training Results
    4.5.4 Results
  4.6 A Building Pointer
  4.7 Evaluation of a Virtual Sensor for Windows
    4.7.1 Image Sets and Training
    4.7.2 Result
  4.8 Evaluation of a Virtual Sensor for Trucks
    4.8.1 Image Sets and Training
    4.8.2 Result
  4.9 Summary and Conclusions

5 Probabilistic Semantic Mapping
  5.1 Introduction
    5.1.1 Outline
  5.2 Probabilistic Semantic Map
    5.2.1 Local Semantic Map
    5.2.2 Global Semantic Map
  5.3 Experiments
    5.3.1 Virtual Planar Cameras
    5.3.2 Image Datasets
    5.3.3 Occupancy Maps
    5.3.4 Used Parameters
  5.4 Result
    5.4.1 Evaluation of the Handmade Map
    5.4.2 Evaluation of the Laser-Based Maps
    5.4.3 Robustness Test
  5.5 Summary and Conclusions

III Overhead-Based Semantic Mapping

6 Building Detection in Aerial Imagery
  6.1 Introduction
    6.1.1 Outline
  6.2 Digital Aerial Imagery
    6.2.1 Sensors
    6.2.2 Resolution
    6.2.3 Manual Feature Extraction
  6.3 Automatic Building Detection in Aerial Images
    6.3.1 Using 2D Information
    6.3.2 Using 3D Information
    6.3.3 Using Maps or GIS
  6.4 Summary and Conclusions

7 Local Segmentation of Aerial Images
  7.1 Introduction
    7.1.1 Outline and Overview
  7.2 Related Work
  7.3 Wall Candidates
    7.3.1 Wall Candidates from Ground Perspective
    7.3.2 Wall Candidates in Aerial Images
  7.4 Matching Wall Candidates
    7.4.1 Characteristic Points
    7.4.2 Distance Measure
  7.5 Local Segmentation of Aerial Images
    7.5.1 Edge Controlled Segmentation
    7.5.2 Homogeneity Test
    7.5.3 Alternative Methods
  7.6 Experiments
    7.6.1 Data Collection
    7.6.2 Tests of Local Segmentation
    7.6.3 Result of Local Segmentation
  7.7 Summary and Conclusions

8 Global Segmentation of Aerial Images
  8.1 Introduction
    8.1.1 Outline and Overview
  8.2 Related Work
  8.3 Segmentation
    8.3.1 Training Samples
    8.3.2 Colour Models and Classification
  8.4 The Predictive Map
    8.4.1 Calculating the PM
  8.5 Combination of Local and Global Segmentation
  8.6 Experiments
    8.6.1 Experiment Set-Up
    8.6.2 Result of Global Segmentation
  8.7 Summary and Conclusions
    8.7.1 Discussion

IV Conclusions

9 Conclusions
  9.1 What has been achieved?
  9.2 Limitations
  9.3 Future Work

V Appendices

A Notation and Parameters
  A.1 Abbreviations
  A.2 Parameters

B Implementation Details
  B.1 Line Extraction
  B.2 Geodetic Coordinate Transformation
  B.3 Localization

Bibliography


Part I

Preliminaries


Chapter 1

Introduction

Mobile robots are often unmanned ground vehicles that can be either autonomous, semi-autonomous or teleoperated. The most common way to allow autonomous robots to navigate efficiently is to let the robot use a map as the internal representation of the environment. A lot of research has focused on map building of unknown environments using the mobile robot's onboard sensors. Most of this research has been devoted to robots that operate in planar indoor environments. Outdoor environments are more challenging for the map building process: the ground can no longer be assumed to be flat, the environment contains larger moving objects such as cars, and the operating area has a larger scale that puts higher demands on both mapping and localization algorithms.

This thesis presents work on how a mobile robot can increase its awareness of the surroundings in an outdoor environment. This is done by building semantic maps, where connected regions in the map are annotated with the names of the semantic classes that they belong to. In this process a vision-based virtual sensor is used for the classification. It is also shown how semantic information can be used to extract information from aerial images and use this to extend the map beyond the range of the onboard sensors.

There is a wide range of application areas making use of semantic information in mobile robotics. The most obvious area is human-robot interaction, where a semantic understanding is necessary for a common understanding between human and robot of the operational environment. Other areas include the use of semantics as the link between sensor data collected by a mobile robot and data collected by other means, and the use of semantics for execution monitoring, used to find problems in the execution of a plan.

1.1 Motivation

Occupancy maps can be seen as the standard low-level map in mobile robot applications. These maps often include three types of areas:



1. Free areas - areas where the robot with a high probability can operate (if the area is large enough).

2. Occupied areas - areas where the robot with a high probability cannot be located. In indoor environments occupied areas typically represent walls and furniture.

3. Unexplored areas - areas where the status is unknown to the robot.

Occupancy maps are used for planning and navigating in an environment. The map can be used for localization and path planning, i.e., the mobile robot can determine how to go from A to B in an optimal way. The robot can also use the map to decide how the area shall be further explored in order to reduce the extent of unknown areas.
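The free, occupied and unexplored areas are typically maintained as one value per grid cell, for example a log-odds value that is raised when a laser return hits the cell and lowered when a beam passes through it, while unexplored cells keep their initial value. The following minimal sketch illustrates the idea in Matlab, the language used for the thesis implementation (Section 1.2); the grid size, cell size and update constants are assumed example values, not taken from the thesis.

```matlab
% Minimal occupancy-grid bookkeeping (illustrative; all constants are
% assumed example values). Cells start at log odds 0 = "unexplored";
% positive values mean "occupied", negative values mean "free".
occ = zeros(200, 200);           % log odds per cell
cell_size = 0.5;                 % metres per cell
l_occ  =  0.85;                  % increment for a cell hit by a laser return
l_free = -0.40;                  % decrement for a cell the beam passed through

% Register one laser return at world position (x, y), map origin at (0, 0).
x = 12.3; y = 47.1;
col = floor(x / cell_size) + 1;
row = floor(y / cell_size) + 1;
occ(row, col)     = occ(row, col)     + l_occ;   % endpoint cell
occ(row, col - 1) = occ(row, col - 1) + l_free;  % one cell the beam crossed

% Occupancy probability of the endpoint cell, recovered from the log odds.
p_occ = 1 - 1 / (1 + exp(occ(row, col)));
```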

A semantic map brings a new dimension of knowledge into the map. With a semantic map the robot not only knows that it is close to an object, but also knows what type of object it faces. With semantic information in the map, the abstraction level of operating the robot can be changed. Instead of ordering the robot to go to a coordinate in the map, the robot can be ordered to go to the entrance of a building. To illustrate the benefits of the ability to extract semantic knowledge and of the use of semantic mapping, a number of different situations in outdoor environments are given in the following, where semantic knowledge can support a mobile robot or similar systems:

Follow a route description: Humans often use verbal route descriptions when explaining the way for someone that will visit a location for the first time. If the robot has the possibility to understand its surroundings it could follow the same type of descriptions. A route description could for instance be:

1. Follow the road straight ahead.

2. Pass two buildings on the right side.

3. Stop at the road crossing.

Make a route description: Conversely to the previous example, a robot that travels from A to B using absolute navigation could also produce route descriptions for humans. Stored information can then be used to automatically produce descriptions for tourists, business travellers, etc.

Localization using GIS: When the robot can build maps that not only outline objects, but also label the object types, navigation using GIS (Geographical Information Systems) such as city maps is facilitated. If the robot can distinguish, for example, buildings from other large objects (trees, lorries and ground formations), the correlation between the building information in the robot's map and in a city map may be established as long as the initial pose estimation is good enough. For the case where only one building has been found this “good enough” is related to the inter-house distances, and for the case where several buildings have been mapped the initial pose estimation can be even less restricted.

Navigation using GPS and aerial image: Consider a mobile robot that should go from position A to position B, where the positions are known in global coordinates. If the robot is equipped with a GPS (Global Positioning System) it can navigate from A toward B. What it cannot foresee are possible obstacles in the form of rough terrain, large buildings, etc., and it is therefore not possible to plan the shortest traversable path to B. Now assume that the robot has access to an aerial image and that it has the ability to recognise certain types of objects, such as buildings, trucks and roads, with the onboard sensors. The robot can then build a semantic map of its vicinity, correlate this with estimated buildings and roads in the aerial image and start planning the path to take. As more buildings are detected, the segmentation of the aerial image improves and the final path to the goal can be determined.

Assistance for the visually impaired: The technique of a virtual sensor that uses vision to understand objects in the environment could be used in an assistance system for blind people. With a small wearable camera and an audio interface the system can report on objects detected in the environment, e.g.:

1. Bus stop to the left.

2. Approaching a grey building.

3. Entrance straight ahead.

This case clearly indicates the benefit of using high-level (semantic) information, since the alternative where the environment is only described in terms of objects with no labels is less useful.

Search and Surveillance: Consider a robot that should be used in an urban area that is restricted for people to enter, and assume that the robot has no access to any a priori information. Depending on the task, the robot needs to understand the environment and be able to detect human spatial concepts that are of interest for an operator. This can, for example, be to search for injured people or to find signs of intruders like broken windows. By extracting information with a vision system the robot can report the locations of different objects and send photos of them back to the operator. This gives the operator the possibility to mark interesting locations in the images for further investigation or to give new commands based on the visual information.


From the above situations three desired “skills” related to semantic information can be noted:

1. The ability to recognise certain types of objects in a scene and thereby relate these objects to human spatial concepts,

2. the ability to incorporate semantic knowledge in a map, and

3. the ability to connect information collected by a mobile robot with information extracted from an aerial image.

1.2 Objectives

The main objective of the work presented in this thesis is to propose a framework for semantic mapping in outdoor environments, a framework that can interpret information from vision systems, fuse the information with other sensor modalities and use this to build semantic maps. The performance of the proposed techniques is demonstrated in experiments with data collected by mobile robots in an outdoor environment. The work is structured according to the three “skills” discussed in the previous section. It was decided to use machine learning for the recognition part in order to have a generic system that can adapt to different environments by a training process.

A mobile robot shall, by use of onboard sensors and possibly additional information, include semantic information in a map that is updated by the robot. For the work in this thesis the main information source was selected to be vision sensors. Vision sensors have a number of attractive properties, including:

• They are often available at low cost,

• they are passive, resulting in decreased probability of interference with other sensors,

• they can produce data with rich information (both high resolution and colour information), and

• they can acquire the data quickly.

There are also some drawbacks, especially in comparison to laser range scanners or radar: standard cameras do not allow range to be measured directly, indirect range measurements have low accuracy, and standard cameras are sensitive to brightness, the mix of direct and indirect light, weather conditions, etc.

Another objective of the work presented in this thesis is to develop algorithms that automatically include information from aerial images in the mapping process. With the growing access to high quality aerial images, e.g., from Google Earth and Microsoft's Virtual Earth, it becomes an attractive opportunity for mobile systems to use such images in planning and navigation. Extracting accurate information from monocular aerial images is not a trivial task. Usually digital elevation models are needed in order to separate, e.g., buildings from driveways. An alternative method that can replace digital elevation models by combining the aerial image with data from a mobile robot is suggested and evaluated. The objective is to extract information that can be useful in tasks such as planning and exploration.

The work presented in this thesis is concentrated on extraction of semantic information and on semantic map building. It is assumed that techniques for navigation, planning, etc., are available. The experiments were performed using manually controlled mobile robots and the paths were chosen by a human. The algorithms are implemented in Matlab [The MathWorks] for evaluation and currently work off-line.

1.3 System Overview

The system presented in this thesis consists of three modules that were designed to be applied in a sequential order. The modules can be exchanged or extended separately if new requirements arise or if information can be gathered in alternative ways.

The first module is a virtual sensor for classification of views. In our case the views are images, and together with the robot pose and the orientation of the sensor this module points out the directions toward selected human spatial concepts. Two different vision sensors have been used: an ordinary camera mounted on a pan-and-tilt head (PT-head) and an omni-directional camera giving a 360° view of the surroundings in one single shot. Each omni-image was transformed to a number of planar images, dividing the 360° view into smaller portions. The images are fed into learned virtual sensors, where each virtual sensor is trained for classification of a certain type of human spatial concept.
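Combining the per-view class decisions with the robot pose to point out directions toward a concept can be pictured as in the short sketch below; the number of planar views, the bearings and the classifier outputs are made-up values, and the virtual sensor itself is treated as a black box.

```matlab
% Pointing out world directions toward a human spatial concept from
% per-view virtual-sensor output (made-up example values).
robot_heading = pi/6;                        % robot yaw in the world frame [rad]
view_bearings = (0:7) * (2*pi/8);            % 8 planar views cut from one omni-image
is_building   = logical([1 0 0 1 0 0 0 0]);  % assumed virtual-sensor decisions

% Global bearings in which the virtual sensor reported the concept "building".
building_dirs = mod(robot_heading + view_bearings(is_building), 2*pi);
```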

The second module computes a semantic map based on an occupancy grid map and the knowledge about the objects in the environment, in our case the output from Module 1, the virtual sensor. A local map is built for each robot position where images have been acquired. The local maps are then fused into a global probabilistic semantic map. These operations assume that the robot is able to determine its pose (position and orientation) in the map.
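One simple way to think of this fusion is as an independent-observation Bayes update of the class probability stored in each occupied cell of the global map. The sketch below is only an assumed illustration of that idea; the formulation actually used is derived in Chapter 5.

```matlab
% Fusing the class probability of one grid cell from a new local map into
% the global map under an independence assumption (illustrative only; see
% Chapter 5 for the formulation used in the thesis).
% Example: fuse_cell_probability(0.5, 0.8) returns 0.8.
function p_fused = fuse_cell_probability(p_global, p_local)
    odds = (p_global ./ (1 - p_global)) .* (p_local ./ (1 - p_local));
    p_fused = odds ./ (1 + odds);
end
```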

The third module uses information extracted from an aerial image in the mapping process. Region and object boundaries in the form of line segments taken from the probabilistic semantic map (Module 2) are used to initialize local segmentation of the aerial image. An example is given with the class buildings, where wall estimates constitute the object boundaries. These wall estimates are matched with edges found in the aerial image. Segmentation of the aerial image is based on the matched lines. The results from the local segmentation are used to train colour models, which are further used for global segmentation of the aerial image. In this way the robot acquires information about the surroundings it has not yet visited. The global segmentation is exemplified with two classes, buildings and ground.
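The colour-model step can be pictured as fitting one simple colour distribution per class to the pixels labelled by the local segmentation and then classifying the remaining aerial-image pixels by likelihood. The Gaussian models and the placeholder pixel values below are assumptions made for illustration; the colour models actually used are described in Chapter 8.

```matlab
% Per-class Gaussian colour models trained on locally segmented pixels and
% applied to a new aerial-image pixel (illustrative; see Chapter 8 for the
% colour models used in the thesis).
building_px = rand(500, 3);     % placeholder RGB samples labelled "building"
ground_px   = rand(500, 3);     % placeholder RGB samples labelled "ground"

mu_b = mean(building_px);  C_b = cov(building_px);
mu_g = mean(ground_px);    C_g = cov(ground_px);

px = [0.4 0.5 0.3];             % one pixel from a not yet visited area
ll_b = -0.5 * ((px - mu_b) / C_b) * (px - mu_b)' - 0.5 * log(det(C_b));
ll_g = -0.5 * ((px - mu_g) / C_g) * (px - mu_g)' - 0.5 * log(det(C_g));
pixel_is_building = ll_b > ll_g;
```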

With these three modules, the three “skills” listed at the end of Section 1.1 are addressed.

1.4 Main Contributions

The main contributions of the work presented in this thesis are:

• Definition and evaluation of a learned virtual sensor based on a generic feature set. Together with the pose from the mobile robot this can be used to point out different human spatial concepts.

[Publications 6 and 7]

• A method to build probabilistic semantic maps that handles the uncertainty of the classification with the virtual sensor.

[Publication 5]

• Introduction of ground-based semantic information as an alternative to the use of elevation data in detection of buildings in aerial images.

[Publications 1, 2, 3, and 4]

• The use of aerial images in mobile robot mapping to extend the view of the onboard sensors to, e.g., be able to “see” around the corner.

[Publications 1, 2, 3, and 4]

1.5 Thesis Outline

The presentation of the work is divided into two parts, where the first part (Chapters 3-5) covers ground-based semantic mapping and extraction of semantic information, i.e., Modules 1 and 2. The second part (Chapters 6-8) is based on work that includes aerial images. In detail, the thesis is organized as follows:

Chapter 2 describes the experimental equipment used, consisting of two mobile robots and two handheld digital cameras.

Chapter 3 gives an overview of works that have been published in the area of semantic mapping and works about mobile robot applications in which semantic information is utilized in a number of different ways.


Chapter 4 describes the virtual sensor (Module 1). Two classification methods, AdaBoost and a Bayes classifier, are compared for diverse sets of images of buildings and nonbuildings. Virtual sensors for windows and trucks are learned, and an example is given where the output from the virtual sensor is combined with the mobile robot pose to point out the direction to buildings.

Chapter 5 shows how the information from the virtual sensor can be used to label connected regions in an occupancy grid map and in this way create a probabilistic semantic map (Module 2).

Chapter 6 describes systems for automatic detection of buildings in aerial images and specifically points out problems with monocular images.

Chapter 7 presents a method to overcome problems in detection of buildings in monocular aerial images and at the same time to improve the limited sensor range of the mobile robot. It is shown how the probabilistic semantic map described in Chapter 5 can be used to control the segmentation of the aerial image in order to detect buildings (first part of Module 3).

Chapter 8 extends the work in Chapter 7 by adding a global segmentation step of the aerial image in order to obtain estimates of both building outlines and driveable areas. With this information exploration in unknown areas can be reduced and path planning facilitated (second part of Module 3).

Chapter 9 summarizes the thesis, discusses the limitations of the system and gives proposals for future work.

The appendices contain a list of abbreviations and explanations of the notation used in the thesis (Appendix A), and give details on some of the implementations (Appendix B).

1.6 Publications

A large part of the work presented in this thesis has previously been reported in the following publications:

1. Martin Persson, Tom Duckett and Achim Lilienthal, “Fusion of Aerial Images and Sensor Data from a Ground Vehicle for Improved Semantic Mapping”, accepted for publication in Robotics and Autonomous Systems, Elsevier, 2008

2. Martin Persson, Tom Duckett and Achim Lilienthal, “Improved Mapping and Image Segmentation by Using Semantic Information to Link Aerial Images and Ground-Level Information”, In Recent Progress in Robotics: Viable Robotic Service to Human, Springer-Verlag, Lecture Notes in Control and Information Sciences, Vol. 370, December 2007, pp. 157–169

3. Martin Persson, Tom Duckett and Achim Lilienthal, “Fusion of Aerial Images and Sensor Data from a Ground Vehicle for Improved Semantic Mapping”, In IROS 2007 Workshop: From Sensors to Human Spatial Concepts, November 2, 2007, San Diego, USA, pp. 17–24

4. Martin Persson, Tom Duckett and Achim Lilienthal, “Improved Mapping and Image Segmentation by Using Semantic Information to Link Aerial Images and Ground-Level Information”, In Proceedings of the 13th International Conference on Advanced Robotics (ICAR), August 21–24, 2007, Jeju, Korea, pp. 924–929

5. Martin Persson, Tom Duckett, Christoffer Valgren and Achim Lilienthal, “Probabilistic Semantic Mapping with a Virtual Sensor for Building/Nature Detection”, In Proceedings of the 7th IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), June 21–24, 2007, Jacksonville, FL, USA, pp. 236–242

6. Martin Persson, Tom Duckett and Achim Lilienthal, “Virtual Sensor for Human Concepts – Building Detection by an Outdoor Mobile Robot”, In Robotics and Autonomous Systems, Elsevier, 55:5, May 31, 2007, pp. 383–390

7. Martin Persson, Tom Duckett and Achim Lilienthal, “Virtual Sensor for Human Concepts – Building Detection by an Outdoor Mobile Robot”, In IROS 2006 Workshop: From Sensors to Human Spatial Concepts - Geometric Approaches and Appearance-Based Approaches, October 10, 2006, Beijing, China, pp. 21–26

8. Martin Persson and Tom Duckett, “Automatic Building Detection for Mobile Robot Mapping”, In Book of Abstracts of Third Swedish Workshop on Autonomous Robotics, FOI 2005, Stockholm, September 1–2, 2005, pp. 36–37

9. Martin Persson, Mats Sandvall and Tom Duckett, “Automatic Building Detection from Aerial Images for Mobile Robot Mapping”, In Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), Espoo, Finland, June 27–30, 2005, pp. 273–278


Chapter 2

Experimental Equipment

This chapter contains descriptions of the equipment used in the experiments presented in Chapters 4, 5, 7, and 8. First, navigation sensors for mobile robots are discussed, followed by descriptions of the robots, Tjorven and Rasmus. Then, the two handheld cameras used to take images for training and evaluation of the virtual sensor are introduced, and details about the aerial images used are presented.

2.1 Navigation Sensors for Mobile Robots

During the collection of data with our mobile robots, the robots were manually controlled. Thus, the navigation sensors onboard the robots were not needed in this phase. However, when the data were processed, localization was important. It was used for building the occupancy grid maps and for registration of the position and orientation of the robot.

GPS

In the Global Positioning System (GPS), triangulation of signals sent from satellites with known positions and at known times is used to calculate positions in a global coordinate system. GPS system errors, such as orbit, timing and atmospheric errors, limit the accuracy that can be achieved to approximately 10-15 metres for a standard receiver [El-Rabbany, 2002].

GPS receivers placed in the vicinity of each other often show the same errors. This fact is exploited in differential GPS (DGPS). A GPS receiver is then placed at a known location and the current error can be calculated. This is then transmitted to mobile GPS receivers via radio, and the error can in this way be significantly reduced. The method gives an accuracy of 1-5 m at distances of up to a few hundred kilometres.
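The principle of the correction can be sketched as follows: since the base station's true position is known, the difference between its measured and true position estimates the error shared by nearby receivers, which is then removed from the rover's fix. The position-domain form and the coordinates below are simplifications made for illustration; operational DGPS corrects the per-satellite pseudoranges.

```matlab
% Simplified DGPS correction in the position domain (real systems correct
% per-satellite pseudoranges; the coordinates are made-up example values).
base_true      = [1500.0, 2300.0];   % surveyed base-station position [m]
base_measured  = [1503.1, 2297.6];   % base-station GPS fix [m]
rover_measured = [1780.4, 2410.2];   % rover GPS fix [m]

common_error    = base_measured - base_true;       % error shared by nearby receivers
rover_corrected = rover_measured - common_error;   % corrected rover position
```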

Other methods for improving navigation accuracy include RTK GPS (real-time kinematic GPS) and differential corrections offered as commercial services; these are used, e.g., by agricultural vehicles [García-Alegre et al., 2001].



A quality measure of the position estimate is indirectly available from the GPS receiver. The number of satellites used in the calculation is one measure. At minimum three are needed for a 2D position (latitude/longitude), and four are needed to also calculate an altitude value. Depending on the relative positions of the satellites, the accuracy may vary considerably. This is reported in three parameters: position, horizontal and vertical dilution of precision (PDOP, HDOP, and VDOP).

Odometry

Odometry consists of proprioceptive (self-measurement) sensors that measure the movement of robot wheels using wheel encoders. The encoder information can be used to compute a position estimate. The error in position estimates from this type of sensor accumulates as the robot moves, and the estimates are usually useless for long runs. Errors are due to factors such as slippery surfaces and unequal wheel diameters.
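Turning the two encoder readings into a pose estimate typically follows the standard differential-drive dead-reckoning update sketched below; the wheel base and the increments are assumed example values, and the accumulating error described above stems from repeating exactly these additions.

```matlab
% Dead-reckoning pose update from one pair of wheel-encoder increments
% (standard differential-drive model; wheel base and increments are
% assumed example values).
wheel_base = 0.50;                 % distance between left and right wheels [m]
pose = [0; 0; 0];                  % [x; y; heading]

d_left = 0.052; d_right = 0.049;   % wheel travel since the last update [m]
d_center = (d_left + d_right) / 2;
d_theta  = (d_right - d_left) / wheel_base;

pose(1) = pose(1) + d_center * cos(pose(3) + d_theta / 2);
pose(2) = pose(2) + d_center * sin(pose(3) + d_theta / 2);
pose(3) = pose(3) + d_theta;
```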

IMU, Compass and Inclinometers

Inertial Measurement Unit (IMU), compass, and inclinometers are complementary navigation sensors. An IMU usually consists of three accelerometers, measuring the unit's acceleration in an orthogonal frame, and angular rate gyros that measure the rotation rates around the same axes. From these a relative 6D pose can be calculated. A compass delivers absolute values of the robot heading, and inclinometers measure the pitch and roll angles of the robot.

Integrated Navigation

The sensors described above are often used in integrated navigation systems due to their complementary strengths and weaknesses. GPS is the only sensor that directly gives a global position. Due to the properties of GPS, it is usually combined with sensors that are accurate for short distance motion or do not drift over time. Odometry needs calibration but it can be quite accurate over short distances, and it is not affected by time. Inertial measurements on the other hand suffer from drift directly related to the integration time. Combined, these sensors can constitute a navigation system, giving geo-referenced positions with high accuracy as long as the GPS delivers reliable position estimates.

GPS is best suited for open areas or in the air, where it continuously has a number of GPS satellites in view. In urban terrain, where satellites may be shadowed by buildings, problems due to reflections or multi-path signals arise, especially close to large objects. This has been noted by, e.g., Ohno et al. in their work that addressed fusion of DGPS and odometry [Ohno et al., 2004]. The problem with multi-path signals is impossible to detect using the usual quality measures, such as the number of satellites in sight or the position dilution of precision, and can therefore introduce severe position errors. One way to overcome the problem is to use complementary sensors, e.g., laser range scanners [Kim et al., 2007], since these are suitable for navigation in urban regions. Still, the system needs a correct initial position in order to be able to detect the presence of multi-path signals.

2.2 Mobile Robot Tjorven

This section describes the mobile robot Tjorven, used in most of the experiments presented in this thesis. Tjorven is a Pioneer P3-AT from ActivMedia, equipped with differential GPS, a laser range scanner, cameras and odometry. The robot is equipped with two different types of cameras: an ordinary camera mounted on a pan-tilt head together with the laser, and an omni-directional digital camera. The onboard computer runs Player, a robot server released under the GNU General Public License (http://playerstage.sourceforge.net/), tailored for the used sensor configuration, which handles data collection. The robot is depicted in Figure 2.1 with markings of the used sensors and equipment.

Figure 2.1: The mobile robot Tjorven.



Laser range scanner

The laser range scanner is a SICK (www.sick.com) LMS 200 mounted on the pan-tilt head. It has a maximum scanning range of 80 m, a field of view of 180°, a selectable angular resolution of 0.25°, 0.5°, or 1° (1° was used in the experiments), and a range resolution of 10 mm. A complete scan takes in the order of 10 ms, and scans are usually stored at 20 Hz in our experiments.

GPS

The differential GPS used, a ProPak-G2Plus from NovAtel (www.novatel.com), consists of one GPS receiver, called the base station, placed at a fixed position, and one portable GPS receiver, called the rover station, which is mounted on the robot. These two GPS receivers are connected via a wireless serial modem. The imprecision of the system is around 0.2 m (standard deviation) in good conditions. GPS data are stored at 1 Hz.

Odometry

The odometry measures the rotation of one wheel on the left side of the robot and one wheel on the right side of the robot. Using this it captures both translational and rotational motion. Measurements from odometry are stored at 10 Hz.

Pan-Tilt Head

The rotary pan-tilt head, a PowerCube 70 from Amtec (now integrated in Schunk GmbH, www.schunk.com), allows for rotational motion around two axes. In the horizontal plane it can rotate about three quarters of a revolution, limited by the physical configuration of the robot components, and the head can tilt approximately ±60°.

Planar Camera

The planar camera is mounted on the laser range scanner, giving it the same movability as the laser. The camera is a DFK 41F02 manufactured by ImagingSource (www.theimagingsource.com). It is a FireWire camera with a colour CCD sensor with 1280 × 960 pixel resolution.

Omni-directional Camera

The omni-directional camera gives a 360° view of the surroundings in one single shot. The camera itself is a standard consumer-grade SLR digital camera, an 8 megapixel Canon EOS 350D (www.canon.com). On top of the lens, a curved mirror from 0-360.com (www.0-360.com) is mounted.

2.3 Mobile Robot Rasmus

Rasmus is an outdoor mobile robot, an ATRV JR from iRobot. The robot is equipped with a laser scanner, a stereo vision sensor and navigation sensors.

Figure 2.2: The mobile robot Rasmus.

Camera

The cameras on Rasmus are analogue camera modules XC-999, manufactured by Sony (www.sony.com). They have a 1/2 inch CCD colour sensor with 768 × 494 pixel resolution.


Laser range scanner

The mobile robot is equipped with a fixed 2D SICK laser range scanner of type LMS 200; the specifications are the same as the ones for the laser range scanner on Tjorven, see Section 2.2.

GPS-receiver

The GPS receiver on Rasmus is an Ashtech G12 GPS (www.magellangps.com). The update rate for position computation is selectable between 10 Hz and 20 Hz [Ashtech, 2000]. At 20 Hz the calculation is limited to eight satellites.

Inertial Measurement Unit

The inertial measurement unit is a Crossbow (www.xbow.com) IMU400CA-200. It consists of 3 accelerometers and 3 angular rate sensors; see the manual [Crossbow, 2001] for further information.

Compass

The compass unit is a KVH C100. It measures heading with a resolution of 0.1°, and it has an update rate of 10 Hz [KVH, 1998].

Odometry

The robot uses two wheel encoders that give the rotation of one left and one right wheel. The output is presented as a linear value representing forward distance and a rotation value representing the robot rotation around its vertical axis. The resolution is below 0.1 mm.

2.4 Handheld Cameras

Two handheld digital cameras have been used to collect images for training the virtual sensor. The first is a 5 megapixel Sony DSC-P92 digital camera with autofocus. It has an optical zoom of 38 to 114 mm (measured as for 35 mm film photography).

The second camera is built into a SonyEricsson K750i mobile phone. This camera is also equipped with autofocus, and the image size is 2 megapixels. The fixed focal length is 4.8 mm (equivalent to 40 mm when measured as for 35 mm film photography).



Both cameras store images in JPEG format, and the finest settings (highest resolution and quality) have been used for the collection of images. The cameras are depicted in Figure 2.3.

Figure 2.3: The used digital cameras. The one on the right is the Sony DSC-P92, and the one on the left is the SonyEricsson K750i.

2.5 Aerial Images

The aerial images used in this project are colour images taken from altitudes of 2300 m to 4600 m. The images were taken during summer, in clear weather. The pixel size is 0.5 m or lower (images with higher resolution were converted to 0.5 m). The images are stored in uncompressed TIFF format.
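With a fixed pixel size and a geo-referenced image corner, converting between pixel indices and world coordinates reduces to scaling and offsetting, as in the sketch below. The corner coordinates are made-up example values; the geodetic transformation that actually relates the GPS positions to the aerial images is described in Appendix B.

```matlab
% World position of an aerial-image pixel, given the pixel size and a
% geo-referenced upper-left corner (made-up corner coordinates; see
% Appendix B for the transformation used in the thesis).
pixel_size = 0.5;            % metres per pixel
corner_e = 1466000.0;        % easting of the upper-left corner [m]
corner_n = 6572000.0;        % northing of the upper-left corner [m]

row = 420; col = 815;        % pixel indices, rows counted downward (southward)
easting  = corner_e + (col - 1) * pixel_size;
northing = corner_n - (row - 1) * pixel_size;
```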


Part II

Ground-Based Semantic Mapping


Chapter 3

Semantic Mapping

This chapter presents the state of the art in semantic mapping for mobile robots. The focus is on outdoor semantic mapping, even though it is not restricted to outdoor environments. It was, however, difficult to find relevant literature on this subject since the number of publications on semantic mapping is still quite low. Most of the relevant publications relate to mapping of indoor environments, and only a few consider the problem of the robot itself extracting the semantic labels for the map. The content of this chapter is therefore broader than the topic of this thesis in order to capture related works that can have an influence on research in semantic mapping.

In this thesis, semantic mapping is understood to be the process of putting a tag or label on objects or regions in a map. This label should be interpretable by and have a meaning for a human. In mobile robotics this can also be described as a transformation of sensor readings to a human spatial concept. Alternative interpretations of semantics in this area exist. For instance semantics can, when extracted by a robot, have a meaning for the robot but be hard for humans to interpret [Téllez and Angulo, 2007].

This chapter is organised as follows. Section 3.1 gives a short overview on different types of maps used in mobile robotics. Extraction of semantic information in indoor environments is discussed in Section 3.2. This includes object detection, space classification and systems where semantic maps form a layer in a hierarchical representation of the environment. Examples from outdoor mapping are presented in Section 3.3. As a motivation for using semantic information, Section 3.4 gives examples of how semantic information is used in different applications. The chapter concludes with a summary in Section 3.5.

3.1 Mobile Robot Mapping

A map is a representation of an area, a restricted part of the world. Maps used in mobile robotics can be divided into three groups: metric maps, topological maps and hybrid maps, where the latter are a combination of the first two types. To give a short overview, this section briefly presents these maps. For a more comprehensive survey on mapping for mobile robots see e.g. [Thrun, 2002], and for hybrid maps see [Buschka, 2006].

3.1.1 Metric Maps

A metric map is a map where distances can be measured, distances that relate to the real world. Metric maps built by a mobile robot can be divided into grid maps and feature-based maps [Jensfelt, 2001].

Grid Map: Grid maps are probably the most common environment representation used for indoor mobile robots. The value of a grid cell in a metric grid map represents a measure of occupancy of that specific cell and gives information on whether the cell has been explored or not [Moravec and Elfes, 1985]. A grid map containing metric information is well suited for path planning. Static objects that are observed several times are usually given higher values than dynamic objects that appear at different locations. The main drawbacks of grid maps are that they are space consuming and that they provide a poor interface to most symbolic problem solvers [Thrun, 1998].

Feature-based Map: Feature-based maps represent features or landmarks that can be distinguished by the mobile robot. Examples of commonly used features are edges, planes and corners [Chong and Kleeman, 1997]. Feature-based maps are not used in the work presented in this thesis, but some of the works referred to in, e.g., Section 3.2 use this type of map.

Topographic Map: In a topographic map the elevation of the Earth's surface is shown by contour lines. This type of map also often includes symbols that represent features like different types of terrain, cultural landscapes and urban areas with streets and buildings. Topographic maps are seldom used directly in mobile robotics but they can be used in the creation of schematic maps [Freksa et al., 2000]. The schematic map can be used both for path planning of a mobile robot and as the reference model of the environment during navigation by the mobile robot.

3.1.2 Topological Maps

Topological maps are represented as graphs with nodes and arcs (also called edges), where the nodes represent distinct spatial locations and the arcs describe how the nodes are connected. This allows efficient planning and typically results in lower computational and memory requirements [Thrun, 1998]. Topological maps can be built from metric maps, where the nodes can be found by use of, e.g., Voronoi diagrams [v. Zwynsvoorde et al., 2000].


3.1.3 Hybrid Maps

Hybrid maps are a solution to overcome the shortcomings of using only one specific type of map by combining different types of maps. Most common is the combination of metric and topological maps. In the context of this thesis, the combination of metric maps and semantic information is more relevant.

3.2 Indoor Semantic Mapping

In this section works related to indoor semantic mapping are presented. First, works on object labelling are reported. Second, scientific work on how to classify different areas is described, e.g., where space in an indoor environment is labelled as “kitchen” and “office”. Finally, hierarchical map constructions that include semantic maps are presented.

3.2.1 Object Labelling

Finding doors and gateways is essential for mobile robots in order to navigate in indoor environments, and consequently the largest group of publications addresses door and gateway recognition. In the following, short descriptions of a selection of works where objects are found and classified are given.

In Anguelov's work [Anguelov et al., 2002] movable objects, detected from mapping the environment at different times, are learned. Similar objects can then be detected by the mobile robot in new environments without seeing the object move. The approach is model-based, where the objects are detected using a 2D laser range scanner. This gives the contour and size of the object, which in turn is compared with templates. To detect a correct contour the objects need to be separated from other objects. In the experiments a small number of objects (a sofa, a box and two robots) are used.

Another type of object that often moves is a door. A door can be in any state from closed to fully open. This fact has been exploited in order to detect doors [Anguelov et al., 2004]. The authors assume rectilinear walls and perform consecutive mappings of a corridor using a laser range scanner. The map building process detects when an already mapped door has moved. This door is then used to train a model of the door colour as seen by an omni-directional camera. The vision system uses the colour model to find more doors of the same colour (only one colour is modelled at a time).

Another step toward semantic representations of environments is taken by Limketkai et al. They present a method to classify 2D laser range finder readings into three classes: walls, doors, and others [Limketkai et al., 2005]. A Markov chain Monte Carlo method is used to learn model parameters and the results are metric maps with object labels. The objects are aggregated from primitive line segments. A wall can, for example, be a number of aligned lines. The doors are assumed to be indented and the size of the indentation is learned.


Indoor environments often contain planar surfaces that are parallel or orthogonal with respect to each other. Extracting planes from 3D laser range data has been used to achieve semantic scene interpretations of indoor environments as floors, walls, roofs, and open doors [Nüchter et al., 2003, Weingarten and Siegwart, 2006]. Nüchter et al. use a semantic net with relationships such as parallel, orthogonal, under etc. The planes are classified using these relationships, for example floor is parallel to roof and floor is under roof. The semantic information is used to merge neighboring planes, which in turn leads to refined 3D models with reduced jitter in floors, walls and ceilings. In a similar work the semantic information was used to improve scan matching using a fast variant of Iterative Closest Point (ICP) [Besl and McKay, 1992] by performing individual matching of the different classes, e.g., points belonging to floor in one scan are matched with floor-points in the following scan [Nüchter et al., 2005].

Beeson et al. use extended Voronoi graphs to autonomously detect places in a global metric map [Beeson et al., 2005]. Their work is based on detection of gateways and path fragments in the map. These two concepts are used to detect places. According to their definition, a place is found when there are not exactly two gateways and one path. For instance, a dead end is a place since it has one gateway and one path, and an intersection is also a place since it has more than one path. This type of place detection can be used in topological map building.

In vision based SLAM (Simultaneous Localization and Mapping) different kinds of visual landmarks are used. A semantic approach is to learn and label objects by their appearance using SIFT (Scale-Invariant Feature Transform) features. In order to handle different views of the landmarks, 3D-object models can be built based on a number of views of the object [Jeong et al., 2006]. Integrating the semantic map in SLAM eliminates the need for a specific anchoring technique that connects positions in the map (landmarks) and their associated semantics. Instead, the SIFT features directly constitute the link between the learned objects and objects registered as landmarks in the semantic map. In the work presented by Jeong et al. experiments are performed with 5 different objects that are manually labelled and pre-stored in an object feature database.

Ekvall et al. also use SIFT features to recognise objects in combination with SLAM [Ekvall et al., 2006]. Training is performed by showing the interesting objects to a mobile robot equipped with a vision system. An object is extracted by background subtraction. Semantic information in the form of Receptive Field Cooccurence Histograms (RFCH) and SIFT features of the object are extracted. RFCH is used to locate potential objects and SIFT features are used for verification of the object in zoomed-in images. When the robot performs SLAM and detects an object, a reference to the object is placed in the map and in this way the robot can return and find a requested object. If the object has moved, a search for the object is performed.


3.2.2 Space Labelling

The previous subsection gave examples of object labelling and locating objects in a map. In this subsection, the focus is on classification of areas in the map. Several works on classification of indoor space have been reported. They can be divided into two classes. The first type that is reported here distinguishes between gateways, rooms, and corridors. The second type of methods also classifies what type of room the robot has entered, e.g., “kitchen” or “living room”.

An example of the first type is the virtual sensor for detection of room and corridor transitions presented in [Buschka and Saffiotti, 2002]. The virtual sensor makes use of sonar sensors and both indicates the transitions between different rooms and calculates a set of parameters characterising the individual rooms. Each room can be seen as a node in a topological structure. The set of parameters includes the width and length of the room and is calculated using the central 2nd order moments, resulting in a virtual sensor that is relatively stable to changes of the furniture in the room.

When service robots act in a domestic environment it is important that the definition of regions follows a human representation. In human augmented mapping [Topp and Christensen, 2006] a person guides a mobile robot in a domestic environment, and gives the robot information about the different locations. During this guided home tour the robot learns about the environment from the user or from several users. A hierarchical representation of the environment is created and segmented using the following concepts:

• Objects – things that can be manipulated.

• Locations – areas from where objects can be observed or manipulated, often smaller than a room.

• Regions – contain one or several locations.

• Floor – connects a number of regions with the same height in order to be able to distinguish between similar room configurations at different levels.

The guidance procedure includes dialog between the robot and the user where the robot can ask questions in order to remove ambiguous information.

Another robot system that learns places in a home environment is BIRON [Spexard et al., 2006]. BIRON uses an integration of spoken dialog and visual localization to learn different rooms in an apartment.

Mozos et al. semantically label indoor environments as corridors, rooms, doorways, etc. Features are extracted from range data collected with 180-degree laser range scanners [Mozos et al., 2005, Mozos, 2004]. These features are the input to a classifier learned using the AdaBoost algorithm. The features are based on 360-degree scans and to obtain them two configurations have been used. The first configuration uses two 180-degree laser range scanners and the second configuration uses one 180-degree laser range scanner and the remaining 180 degrees are simulated from a map of the environment. In [Rottmann et al., 2005] additional features extracted from a vision sensor are used. The use of visual features is limited to recognition of a few objects (e.g. monitor, coffee machine, and faces), due to its complexity. Nevertheless, in the environments where the system was evaluated, it was demonstrated that using these visual features made it easier to classify a room as either a seminar room, an office or a kitchen. The method has been tested in different office environments.

Friedman et al. extract the topological structure of indoor environments via place labels (“room”, “doorway”, “hallway”, and “junction”) [Friedman et al., 2007]. A map is built using measurements from a laser range scanner and SLAM. Similar features to those defined by Mozos are extracted at the nodes of a Voronoi graph defined in the map. Voronoi random fields, a technique to extract the topological structure of indoor environments, are introduced. Points on the Voronoi graph are labelled by converting the graph into a conditional random field [Lafferty et al., 2001] and the feature set is extended with connectivity features extracted from the Voronoi graph. With these new features it is possible to differentiate between actual doorways and narrow passages caused by furniture, since it is more likely that short loops are found around furniture than through doorways. Adding these features to a classifier learned with AdaBoost improved the resulting topological map. Further improvement was reported when the Voronoi random fields were used together with the best weak classifiers found by AdaBoost.

A purely visual approach to the classification of indoor environments is presented by Pirri [Pirri, 2004]. The method makes use of a texture database obtained from a large number of images of indoor environments. The textures are processed with a wavelet transform to describe their characteristics. Textures of furniture and wall materials are stored in the database and combined with a statistical memory that includes probability distributions of the likelihood of rooms with respect to the furniture.

3.2.3 Hierarchies for Semantic Mapping

Approaches that use semantic mapping and are intended to handle navigation at different scales and complexity often present maps in the form of a hierarchy. Different levels of refinement are used with at least one layer of semantic information.

The concept of the Spatial Semantic Hierarchy (SSH) evolved during the 1990’s [Kuipers, 2000]. SSH is inspired by properties of cognitive mapping, the principles that humans use to store spatial knowledge of large-scale areas. Spatial knowledge describes environments and is essential for getting from one place to another. SSH consists of several interacting representations of knowledge about large-scale spaces that are divided into five levels:

• The sensory level is the interface to the sensor systems such as vision and laser with the focus to handle motion and exploration.

• The control level uses continuous control laws as world descriptors. The level can create and make use of local geometric maps.

• The causal level contains information similar to what can be obtained from route directions and is essential in SSH.

• The topological level includes an ontology of places, paths, connectivity etc. intended for planning.

• The metrical level contains a global metric map. This map is not an essential part of SSH.

The metrical level can be used in path planning or to distinguish between places that appear to be identical in the other levels, but navigation and exploration can still be possible without this information.

Galindo et al. present a multi-hierarchical semantic map for mobile robots [Galindo et al., 2005]. The map consists of two hierarchies: the spatial hierarchy and the conceptual hierarchy. The spatial hierarchy contains information gathered by the robot sensors. The information is stored in three levels: local grid maps, a topological map and an abstract node that represents the whole spatial environment of the robot. The conceptual hierarchy models the relationship between concepts, where concepts are categories (objects and rooms) and instances (e.g. “room-C” and “sofa-1”). The two hierarchies are integrated to allow the robot to perform tasks like “go to the living room”. This multi-hierarchical semantic map is further developed in [Galindo et al., 2007]. The two hierarchies resemble the previous ones, but are here named the spatio-symbolic hierarchy and the semantic hierarchy. The semantic map is used to discard elements of the domain that should not be considered in the planning phase in order to speed up the planning process. For instance, if the robot should go from the living room to the kitchen and get a fork in a drawer it should not consider objects in the bathroom.

Mozos et al. present a complete system for a service robot using representations of spatial and functional properties in a single hierarchical semantic map [Mozos et al., 2007]. The hierarchy is composed of four layers: the metric map, the navigation map, the topological map and the conceptual map. The navigation map is a graph with nodes that are placed within a maximum distance of each other and the map is used for planning and autonomous navigation. It is similar to the topological map but represents the environment in more detail. The conceptual map contains descriptions of concepts and their relations in the form of “is-a” and “has-a”, e.g., “LivingRoom is-a Room” or “LivingRoom hasObject LivingRoomObj” and “TVSet is-a LivingRoomObj”. The system includes speech synthesis and speech recognition for operation, which is used in turn by human augmented mapping to learn the places in the topological map. Using semantic information is motivated by the intended use of the robot to interact with people that are not trained robot operators.

NavSpace [Ross et al., 2006] is a stratified spatial representation that includes lower tiers for navigation and localization, and upper tiers for human-robot interaction. It was developed to enable navigation of a wheelchair using dialogs with the human. These dialogs can handle concepts such as “left of” and “beside”, place labels (e.g., “kitchen”), action descriptions (e.g., “turn”) and quantitative terms (e.g., “10 metres”).

It can be noted that in the works described above, the conceptual relationships are hand-coded and not learned by the robot itself.

3.3 Outdoor Semantic Mapping

The number of publications related to outdoor semantic mapping is lower than the number of works on indoor semantic mapping that were reported above.

Wolf and Sukhatme [Wolf and Sukhatme, 2007] describe two mapping approaches that create outdoor semantic maps from laser range scanner readings. Two techniques for supervised learning were used: Hidden Markov Models (HMM) and Support Vector Machines (SVM). The first semantic map is based on the activity caused by passing objects of different sizes. Using this information, the area is classified as either road or sidewalk. The resulting map is stored in a two-dimensional grid of symmetric cells. Two robots are placed on each side of the road with overlapping fields of view in order to decrease the influence of occlusion. During data collection the positions of the robots were fixed and known. Four properties were extracted from the laser data and stored in the map: activity, occupancy, average object size and maximum object size. The authors also included one more class, stationary objects, in addition to road and sidewalk and then used a multi-class SVM for the classification. The second type of semantic map classifies ground into two classes: navigable and non-navigable. The classification is based on the roughness of the terrain measured by the laser and is intended to be used for path planning.

Triebel et al. have developed a mapping technique for outdoor environments, called multi-level surface maps, that can handle structures like bridges [Triebel et al., 2006]. Multiple surfaces can be stored in each grid cell and by analysing the neighbouring cells, classification of the terrain into traversable, non-traversable and vertical surfaces is performed.

Closely related work to these terrain mapping approaches concerns detection of drivable areas for mobile robots using vision [Dahlkamp et al., 2006, Guo et al., 2006, Song et al., 2006]. These works do not primarily build semantic maps but they use semantic information for road localization in navigation.


The work performed by Torralba et al. delivers the most extensive semantic mapping system found in the literature. Place recognition is performed in both indoor and outdoor environments with the same system [Torralba et al., 2003]. The system identifies locations (e.g. office 610), categorises new environments (“office”, “corridor”, “street”, etc.) and performs object recognition using the knowledge of the location as extra information. Global image features based on wavelet image decomposition of monochrome images are used and Principal Components Analysis (PCA) reduces the dimensionality to 80 principal components. The presented system recognises over 60 locations and 20 different objects, which is a high number compared with many other reported systems. For training and evaluation, a mobile system consisting of a helmet mounted web camera with resolution 120 × 160 pixels is used. The system is claimed to be robust to a number of difficulties, such as motion blur, saturation and low contrast.

3.3.1 3D Modelling of Urban Environments

Several research projects directed toward automatic modelling of outdoor environments, especially urban environments, have been presented in the last decade. Even though these projects do not explicitly use semantic information, they need to be able to classify data in order to remove data belonging to classes that should not be included in the final model. From our perspective, these types of projects are also interesting as references for building semantic grid maps as described in Chapter 5.

A project for rapid urban modelling, with the aim to automatically construct 3D walk-through models of city centres without objects such as pedestrians, cars and vegetation, is presented in [Früh and Zakhor, 2003]. The system uses laser range scanners and cameras both on ground and in air. A Digital Elevation Model (DEM) is constructed from an airborne laser range scanner mounted on an airplane and overview images are captured. The experimental set-up used for the data acquisition on the ground consists of a digital colour camera and two 2D-laser range scanners mounted so that one measures a horizontal plane and the other measures a vertical plane [Früh and Zakhor, 2001]. The equipment is mounted at 3.6 meters height on a truck that drives along the roads and the collection of data is performed for one side of the road at a time. The horizontal scanner is used for position estimation and the vertical scanner captures the shape of the buildings. Images from the digital camera are used as texture on the 3D-models built from the scanned data. Turns of the truck cause problems and data affected by this are ignored [Früh and Zakhor, 2002]. Vegetation and pedestrians occlude the facades and are therefore removed from the model using semantic segmentation of the data. The scans are divided into a background part including buildings and ground, and a foreground part that should be removed. Removing the foreground will in turn leave holes in the measurements that need to be filled. The missing spatial information is reconstructed and the images from the most direct views for the background are used to fill in the holes. The application has problems with, e.g., vegetation and is therefore not suitable for residential areas.

Another system for urban modelling is AVENUE [Allen et al., 2001]. The main goal of AVENUE is to automate site modelling of urban environments. The system consists of three components: a 3D modelling system, a planning system for deciding where to take the next view and a mobile robot for acquiring data. Range sensing (CYRAX 2400 3D laser range scanner) is used to provide dense geometric information, which is then registered and fused with images to provide photometric information. The planning phase, Next-Best-View, makes sure that the new scanning includes object surfaces not yet modelled. The navigation system of the mobile robot uses odometry and DGPS [Georgiev and Allen, 2004]. Visual localization is performed when the GPS is shadowed. Using coarse knowledge of the position based on previous sensor readings, the robot knows in what direction to search for building models and matches the corresponding model with the current view in order to accurately determine its pose.

An additional project working on 3D-mapping is presented in [Howard et al., 2004]. A large area (one square kilometre) is mapped with a two-wheeled mobile robot (Segway RMP, Robotic Mobility Platform) equipped with both a vertical and a horizontal laser range scanner. The vertical scanner is directed upwards giving readings both to the right and to the left of the robot. Assumptions made are that the altitude is constant and that the environment is partially structured. Two levels of localization are used. The first is the fine-scale localization that uses the horizontal laser range scanner, roll and pitch data and the odometry. This gives a detailed localization with a drift. The second navigation system is the coarse-scale localization. This uses either GPS, good in open areas, or Monte Carlo Localization (MCL) that is good close to buildings. The MCL requires a prior map that can be extracted from an aerial or satellite image. To combine the coarse and fine localization, feature-based fitting of sub-maps is used.

A fourth project with its main focus on the environment close to roads is presented in [Abuhadrous et al., 2004]. A 2D laser range scanner with 270° scanning angle is mounted on the backside of a car. Histograms are used for identification of objects along a road. The system separates three object types: roads, building facades and trees, and illustrates these using simple 3D models.

These four projects (summarized in Table 3.1) all represent semantic information even though it is not explicitly mentioned. Früh removes vegetation and pedestrians from the model of ground and buildings. In AVENUE buildings are used for localization. Other examples are the use of building outlines extracted from aerial images [Howard et al., 2004] and classification of trees and buildings in laser range point clouds [Abuhadrous et al., 2004].


References: [Früh and Zakhor, 2001, 2002, 2003]
Description: 3D-modelling of buildings in urban areas. Occluding objects are removed.
Sensors & Navigation: 2 cross-mounted 2D laser range scanners, vision, and DEM. MCL in aerial photo.

References: [Allen et al., 2001]
Description: 3D-modelling of urban environments.
Sensors & Navigation: Vision and laser range scanner (3D). Odometry and DGPS.

References: [Howard et al., 2004]
Description: Outdoor 3D-modelling, assumes constant altitude.
Sensors & Navigation: 2 cross-mounted 2D laser range scanners. IMU, odometry, laser range scanner for fine localization, GPS or MCL for coarse localization.

References: [Abuhadrous et al., 2004]
Description: 3D-modelling of roads, building facades and trees.
Sensors & Navigation: Vertically mounted 2D laser range scanner with 270° field-of-view. GPS and odometry.

Table 3.1: Summary of 3D-modelling projects, where MCL is the abbreviation of Monte Carlo localization and DEM means Digital Elevation Model. Cross-mounted laser range scanners means a configuration where the planes spanned by the laser beams are perpendicular to each other.

3.4 Applications Using Semantic Information

In this section several examples are given that demonstrate the broad use of semantic information for different applications. The section finishes with an example where the use of semantic information is suggested but not yet well explored.

The importance of semantic information in the form of human spatial concepts is evident in the communication between robots and humans. A project that aims for dialogs with robots is "Talking to Godot" [Theobalt et al., 2002]. The representation of the environment uses three layers: grid map, topological map, and a semantic map that connects regions in the topological map with semantic symbols [Bos et al., 2003]. Other robot systems that use dialog functions are the already mentioned [Topp and Christensen, 2006] and [Spexard et al., 2006]. Skubic et al. discussed the benefits of linguistic spatial descriptions for different types of robot control, and pointed out that this is especially important when there are novice robot users [Skubic et al., 2003]. In these situations it is necessary for the robot to be able to relate its sensor readings to human spatial concepts.

Semantic information extracted from an aerial image has been used for localization [Oh et al., 2004]. Oh et al. used map data to bias a robot motion model in a Bayesian filter to areas with higher probability of robot presence. They assumed that mobile robot trajectories are more likely to follow paths in the map and that semantic information in the form of probable paths was known. Using these assumptions, GPS position errors due to reflections from buildings were compensated for.

Stachniss et al. make use of the space labelling technique introduced by Mozos [Mozos, 2004] for multi-robot exploration [Stachniss, 2006, Stachniss et al., 2006]. The idea is to use the semantic information in the form of space labels to direct the robots in an efficient way. A corridor is a natural place to start mapping an area since it often has connections to different rooms. Stachniss shows by simulation of up to 50 mobile robots that the exploration time is reduced when this type of strategy is applied.

To be able to benchmark SLAM algorithms in urban environments, Wulf et al. use semantic information as a link between a 3D city model and drawings of the buildings [Wulf et al., 2007]. By extracting walls from the 3D data they could obtain a verification of the model accuracy by comparing the 3D data belonging to walls with the walls in the drawings.

Another area where semantic information has been used is in execution monitoring [Bouguerra et al., 2007]. In execution monitoring the goal is to find problems in the execution of a plan. Semantic knowledge was organised in a knowledge base with definitions of concepts and relations between them. Such a relation can, for example, be that a “bedroom has-at-least one bed”. The semantic information was encoded using description logics and a system for knowledge representation and reasoning, LOOM [1], was used for managing the semantic information.

In [Nielsen et al., 2004] semantics is defined to give meaning to something. To give meaning to an environment the authors let a mobile robot take images (snapshots) of the environment and tag these with the present location. A test is performed where novice robot operators should fulfil a certain task using either a standard 2D map or a 3D-map augmented with snapshots. The practical experiments showed that navigation was facilitated with the latter configuration and the operators experienced that they had better control of the robot.

Semantic information can be used in mobile robotics in, e.g., search and exploration, SLAM, and perception [Calisi et al., 2007]. SLAM, as an example, is a well-studied problem that most often takes into account geometrical formations. The combination of SLAM and semantic information has been proposed by Dellaert and Bruemmer [Dellaert and Bruemmer, 2004]. But, while the use of semantic and contextual information has been shown to be promising in many situations, it has rarely been incorporated in the mapping process [Calisi et al., 2007].

[1] http://www.isi.edu/isd/LOOM/


3.5 Summary and Conclusions

This chapter has presented an overview of literature on semantic mapping. The major part of the work that deals with semantic mapping considers an indoor environment and only a few works deal with outdoor environments. The scope of the literature survey was therefore extended to include also projects dealing with 3D-modelling of urban environments where semantic information implicitly has an important role in order to reduce effects from false readings and occlusions.

Semantic information has been shown to be useful in a number of situations. The most obvious is in human robot interaction where the semantic information facilitates communication in both directions. Other examples where semantic knowledge can be used include localization, exploration, evaluation, execution monitoring, and remote control.

Several promising concepts where topological and metric maps are labelled with semantic information have been presented in recent years and interest in semantic mapping seems to be increasing. Hybrid maps in the form of hierarchies where lower layers represent metric and topological information, and upper layers contain the semantic information are used by a number of researchers. Examples of methods that can automatically classify objects and spaces, mainly indoor, into semantic classes have been given.

Based on the content and references presented in this chapter it can be noted that there is an increasing interest in semantic mapping. This interest has so far been concentrated on indoor environments. There is therefore a gap in current research regarding outdoor semantic mapping, a gap that this thesis intends to fill.


Chapter 4

Virtual Sensor for Semantic Labelling

4.1 Introduction

To enable human operators to interact with mobile robots, semantic information is of high value, as pointed out in the previous chapter. Several other applications for the semantic information were also mentioned. To extract the semantic information the robot system needs a mechanism that can transform the sensor readings into human spatial concepts.

In this chapter it is shown that virtual sensors can constitute this mechanism, being the component that processes and interprets information about the surroundings from data collected by mobile robots. A virtual sensor is understood as one or several physical sensors with a dedicated signal processing unit for recognition of real world concepts. An important aspect of the virtual sensor proposed in this work is a general method for learning a particular instance of the virtual sensor from a set of generic features.

As an example, this chapter describes virtual sensors for detection of four human spatial concepts: buildings, nature, windows and trucks. Three of these concepts are man-made structures. The main example is a virtual sensor for building detection using methods for classification of views as buildings or non-buildings based on vision. The purpose of this is twofold. First of all, buildings are one type of very distinct object that is often used in, e.g., textual descriptions of route directions. Second, the virtual sensor is needed as a building detector in Chapter 7, where building outlines identified by a ground-level mobile robot are used for segmentation of aerial images.

4.1.1 Outline

The method to obtain a virtual sensor for building detection is based on learning a mapping from a set of image features to a particular concept. It combines different types of features such as edge orientation, gray level clustering, and corners. Section 4.2 describes the feature extraction.


The AdaBoost algorithm [Freund and Schapire, 1997] is used for learning a classifier for monocular gray scale images based on the features. AdaBoost has an ability to select the best so-called weak classifiers out of many others. The selected weak classifiers are linearly combined by AdaBoost to produce one strong classifier. Due to this selection process it is possible to start with a large feature set, calculate classifiers based on each feature and let AdaBoost select the best ones. This allows the same feature set to be utilized also for identification of other concepts. AdaBoost is presented in Section 4.3.

Many different learning approaches could possibly be used to solve the classification part of the virtual sensor. To compare the performance of AdaBoost with an alternative classifier, the Bayes Optimal Classifier (BOC) was selected. BOC uses the variance and covariance of the features in the training data to weight the importance of each feature. Section 4.4 introduces BOC.

Section 4.5 presents an evaluation of a virtual sensor for building detection based on building and non-building images. It is shown that the virtual sensor robustly establishes the link between sensor data and the particular human spatial concept. Experiments with an outdoor mobile robot show that the method is able to separate the two classes with a high classification rate, and that the method extrapolates well to images collected under different conditions.

In Section 4.6, the virtual sensor for buildings is applied on a mobile robot, combining classifications of sub-images from a panoramic view with spatial information (location and orientation of the robot) in order to estimate the directions of buildings in an outdoor environment. This information could then be communicated, for example, to a remote human operator.

The method can be extended to virtual sensors for other human spatial concepts. This is demonstrated by learning and evaluating two further virtual sensors. Section 4.7 shows that the feature set can be used to distinguish between three human spatial concepts: buildings, nature, and windows, and in Section 4.8 a virtual sensor for trucks complements the first three virtual sensors, giving a total of four different human spatial concepts within the same experiment.

A summary and conclusions of this chapter are given in Section 4.9.

4.2 The Feature Set

A vision sensor onboard the mobile robot acquires the environmental data used by the virtual sensor to determine how the surroundings should be classified. These data are processed and stored as gray scale images and hence the feature set defined in this section contains features that can be extracted from this type of images. The objective of the feature set is to capture significant properties in the images allowing learned classifiers to distinguish between different concepts. To make such a method work for a wide variety of classes requires a generic set of features. To define such a set is hard and therefore a heuristic method has been used in the design of the feature set.

The philosophy regarding the selection of features that should be included in the feature set is to include features that are believed to contribute with a high probability to an increase of the classification rates of the virtual sensors. Since AdaBoost is used as the learning mechanism, only features that in fact contribute will be used in the final classifier. It is therefore not a problem if the feature set contains features that are of modest use for a particular classification task. A benefit of this approach is that whenever a classification problem needs some more features to perform well, these features can be added to the feature set. Adding features can in turn also result in better performance of previously trained virtual sensors if they are retrained utilizing the extended feature set.

The first virtual sensor presented here is the virtual sensor for the concept of buildings. The discussion therefore continues with an analysis of suitable features for building detection. Many systems for building detection, both for aerial and ground-level images, use line and edge related features. Building detection from ground-level images often uses the fact that, in many cases, buildings show mainly horizontal and vertical edges. In nature, on the other hand, edges tend to have more randomly distributed orientations. Inspection of histograms based on edge orientation confirms this observation. Histograms of edge direction can be classified by, e.g., Support Vector Machines (SVM). This has been used for detection of buildings in natural images [Malobabic et al., 2005]. Another method, developed for building detection in content-based image retrieval, uses consistent line clusters with different properties [Li and Shapiro, 2002]. These properties are based on edge orientation, edge colours, and edge positions.

An example of other features that have been used in outdoor environments is the output from Gabor filters. Gabor filters are considered as a robust method for texture estimation and have been used for recognising roads in aerial images and natural objects such as trees and coral [Ramos et al., 2006]. The images were divided into 11 × 11 pixel patches and filtered using four Gabor filters defined for two scales and two orientations.

Another example of features extracted from images taken by a mobile robot was presented in [Morita et al., 2005, 2006]. Features were calculated from image patches (16 × 16 pixels) taken from the upper part of the images and consisted of the average of R, G, and B normalized colours, the edge density, the distribution of edge orientation, and the number of line segments. The patches were classified into three classes: trees, buildings and sky (uniform) with an SVM and the classification result was used for self-localization of the mobile robot along a previously visited path.

For the virtual sensor presented in this section, a large number of image features that are extracted from the whole image have been selected. The features can be divided into three groups. The obvious indication of man-made structures is that they have a high content of vertical and horizontal edges, which is also confirmed in the related work presented above. The first type of features represents this property. The second type of features combines the edges into more complex structures such as corners and rectangles. The third type of features is based on the assumption that certain types of concepts contain surfaces in homogeneous colours, e.g., facades.

The features that are calculated for each image are numbered 1 to 24. All features except 9 and 13 are normalized in order to avoid scaling problems. This feature set was selected with regard to a virtual sensor for building detection. An extended form of the feature set is used in Section 4.8 for a virtual sensor for truck detection.

4.2.1 Edge Orientation

The first group of features is based on edge orientation. The calculation of the orientations includes a few steps. First an edge image is calculated, resulting in a binary image. From this binary image straight line segments representing straight edges are calculated and finally the orientations of these lines are used to compute the features.

For edge detection Canny’s edge detector [Canny, 1986] is used. It includes a Gaussian filter and is less likely than others to be affected by noise. A drawback is that the Gaussian filter can distort straight lines. For line extraction in the edge image an implementation by Peter Kovesi [Kovesi, 2000] was used, see Appendix B.1. The absolute values of the line orientations are calculated and sorted into different histograms. The features based on edge orientation are:

1. 3-bin histogram of absolute edge orientation values.

2. 8-bin histogram of absolute edge orientation values.

3. Fraction of vertical lines out of the total number.

4. Fraction of horizontal lines.

5. Fraction of non-horizontal and non-vertical lines.

6. As 1) but based only on edges longer than 20% of the longest edge.

7. As 1) but based only on edges longer than 10% of the shortest image side.

8. As 1) but weighted with the lengths of the edges.

To be able to compare features 1-3 and 6-8 between images with a different number of lines, each bin value is divided by the total number of lines in the histogram, giving relative values of the orientation distribution.
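As an illustration of the edge and line extraction step that precedes these features, the sketch below produces line segments from a gray scale image. The thesis implementation uses Kovesi's line-fitting code (Appendix B.1); OpenCV's probabilistic Hough transform is used here only as a readily available stand-in, and all thresholds are illustrative assumptions rather than values from the thesis.

```python
import cv2
import numpy as np

def extract_line_segments(gray, canny_lo=50, canny_hi=150):
    """Canny edge detection followed by a probabilistic Hough transform.

    gray: 8-bit gray scale image. Returns an (N, 4) array with rows
    (x1, y1, x2, y2). Thresholds are placeholders; the thesis uses
    Kovesi's line extractor instead of HoughLinesP.
    """
    edges = cv2.Canny(gray, canny_lo, canny_hi)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=30,
                            minLineLength=15, maxLineGap=3)
    if lines is None:
        return np.empty((0, 4))
    return lines.reshape(-1, 4)
```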


Feature 1

Feature 1, denoted as $\vec{f}_1$, is a 3-bin histogram of the absolute edge orientation values, where the edge orientation is defined in the interval $[-\pi/2, \pi/2]$. The histogram has class limits $[0, 0.2, \pi/2 - 0.2, \pi/2]$ radians, giving that $\vec{f}_1$ is a vector with 3 elements.

\vec{H}_3 = \mathrm{Hist}(|\vec{\theta}|, c_3), \quad c_3 = [0,\ 0.2,\ \pi/2 - 0.2,\ \pi/2] \qquad (4.1)

where $\vec{\theta}$ contains the orientations of the edges. The normalized $\vec{f}_1$ is then calculated as

\vec{f}_1 = \vec{H}_3 \Big/ \sum \vec{H}_3. \qquad (4.2)

Feature 2

Feature 2, $\vec{f}_2$, is defined in the same way as $\vec{f}_1$ with the only difference that an 8-bin histogram is used with limits $[0, \pi/16, \ldots, 7\pi/16, \pi/2]$:

\vec{H}_8 = \mathrm{Hist}(|\vec{\theta}|, c_8), \quad c_8 = [0,\ \pi/16,\ \ldots,\ 7\pi/16,\ \pi/2]. \qquad (4.3)

\vec{f}_2 = \vec{H}_8 \Big/ \sum \vec{H}_8. \qquad (4.4)

Features 3 to 5

Feature 3 is the fraction of vertical lines, taken from $\vec{f}_1$, out of the total number of lines:

f_3 = \vec{f}_1(3) \qquad (4.5)

where $\vec{f}_1(3)$ denotes the member value of bin 3. Feature 4, the fraction of horizontal lines, and Feature 5, the fraction of non-horizontal and non-vertical lines, are defined in an analogous way:

f_4 = \vec{f}_1(1) \qquad (4.6)

f_5 = \vec{f}_1(2) = 1 - \vec{f}_1(1) - \vec{f}_1(3). \qquad (4.7)
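To make the mapping from line segments to features 1 and 3-5 concrete, the following sketch computes them with NumPy. It assumes segments in the (x1, y1, x2, y2) form produced by a line extractor such as the one sketched above; it is an illustration of the described computation, not the thesis implementation.

```python
import numpy as np

def orientation_features(segments):
    """Feature f1 (3-bin orientation histogram) and features f3-f5."""
    segments = np.asarray(segments, dtype=float)
    dx = segments[:, 2] - segments[:, 0]
    dy = segments[:, 3] - segments[:, 1]
    # Absolute orientation folded into [0, pi/2].
    theta = np.abs(np.arctan2(dy, dx))
    theta = np.where(theta > np.pi / 2, np.pi - theta, theta)

    # Class limits from Eq. (4.1): [0, 0.2, pi/2 - 0.2, pi/2] radians.
    limits = [0.0, 0.2, np.pi / 2 - 0.2, np.pi / 2]
    counts, _ = np.histogram(theta, bins=limits)
    f1 = counts / counts.sum()     # Eq. (4.2): normalized 3-bin histogram

    f4 = f1[0]                     # fraction of horizontal lines, Eq. (4.6)
    f5 = f1[1]                     # neither horizontal nor vertical, Eq. (4.7)
    f3 = f1[2]                     # fraction of vertical lines, Eq. (4.5)
    return f1, f3, f4, f5
```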

Features 6 to 8

In Features 1 to 5, all line segments found in the image were used. To capture only the major lines in the image, features 6, 7, and 8 decrease the influence of short lines. Feature 6 is calculated like $\vec{f}_1$, but only with the edges longer than 20% of the longest edge. In Feature 7, the calculations are based on the edges that are longer than 10% of the shortest image side and in Feature 8 edge orientation members are weighted with the edge length in the calculation of the histogram.

4.2.2 Edge Combinations

The lines found to compute features 1-8 can be combined to form corners and rectangles. The features based on these combinations are:

9. Number of right-angled corners.

10. 9) divided by the number of edges.

11. Fraction of right-angled corners with direction angles close to 45° + n · 90°, n ∈ {0, . . . , 3}.

12. 11) divided by the number of edges.

13. The number of rectangles.

14. 13) divided by the number of corners.

Features 9 and 10

Feature 9 is the number of right-angled corners in the image. Here, a right-angled corner is defined as two lines with close end points (maximum 4 pixels separation) and an angle of $90^\circ \pm \beta_{dev}$ in between. During the experiments $\beta_{dev} = 20^\circ$ was used. Feature 10 is the number of right-angled corners relative to the number of edges:

f_{10} = f_9 \Big/ \sum \vec{H}_3. \qquad (4.8)

Features 11 and 12

For buildings with vertical and horizontal lines from doors and windows, the corners most often have a direction of 45°, 135°, 225° and 315°, assuming that the camera is oriented correctly, see Section 4.2.5. The direction is defined as the ‘mean’ value of the orientation angle for the two lines defining the corner. This property is captured in feature 11.

f_{11} = C_{45} / f_9 \qquad (4.9)

where $C_{45}$ is the number of right-angled corners with the above defined directions ±15°. Feature 12 is Feature 11 relative to the number of edges:

f_{12} = f_{11} \Big/ \sum \vec{H}_3. \qquad (4.10)


Figure 4.1: Example of extraction of edge combination features. The upper right part shows line segments extracted from the image to the left. The lower left image shows the lines that are connected as corners and the lower right part shows rectangles formed from the corners.

Features 13 and 14

From the lines and corners defined above, rectangles are detected. A rectangle is formed from two corners and the pair of lines constructing these corners. The conditions on the final rectangle are that all four corners shall have an angle of $90^\circ \pm \beta_{dev}$ and that the distance between the line endpoints forming the rectangle corners is at most 4 pixels. The number of rectangles is stored in Feature 13 and Feature 14 is the number of rectangles relative to the number of right-angled corners:

f_{14} = f_{13} / f_9 \qquad (4.11)

An example of the extraction of edges, corners and rectangles is given in Figure 4.1.
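As an illustration of how features 9 and 10 could be computed, the sketch below counts right-angled corners among line segments, using the 4-pixel endpoint distance and the 90° ± 20° tolerance stated above. It is a simplified, brute-force version for illustration only; the corner direction test (features 11-12) and the rectangle search (features 13-14) are omitted, and the names are not from the thesis implementation.

```python
import numpy as np
from itertools import combinations

def count_right_angled_corners(segments, max_gap=4.0, beta_dev=20.0):
    """Count right-angled corners (feature 9) among line segments.

    A corner is a pair of segments whose closest endpoints are at most
    max_gap pixels apart and whose angle difference is 90 deg +/- beta_dev.
    segments: (N, 4) array with rows (x1, y1, x2, y2).
    """
    segments = np.asarray(segments, dtype=float)
    n_corners = 0
    for a, b in combinations(range(len(segments)), 2):
        pa = segments[a].reshape(2, 2)              # endpoints of segment a
        pb = segments[b].reshape(2, 2)
        gap = min(np.linalg.norm(p - q) for p in pa for q in pb)
        if gap > max_gap:
            continue
        ang_a = np.degrees(np.arctan2(pa[1, 1] - pa[0, 1], pa[1, 0] - pa[0, 0]))
        ang_b = np.degrees(np.arctan2(pb[1, 1] - pb[0, 1], pb[1, 0] - pb[0, 0]))
        diff = abs(ang_a - ang_b) % 180.0           # undirected angle difference
        diff = min(diff, 180.0 - diff)
        if abs(diff - 90.0) <= beta_dev:
            n_corners += 1
    return n_corners

# Feature 10 is the corner count divided by the number of line segments.
```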

4.2.3 Gray Levels

Unlike the edge-based features defined above, features 15 to 24 are based on gray levels. Features 15 to 19 use gray level histograms to estimate if dominating gray levels exist. Features 20 to 24 are computed from connected areas to quantify the existence of homogeneous areas. The features are scaled with the image size.

Features 15 to 19

An equally binned gray level histogram with 25 bins is used and the result is normalized with the image size. In order to capture several dominating gray levels the largest bins are accumulated in features 16–19. These features capture the distribution of gray levels in the image.

15. Largest value in gray level histogram.

16. Sum of the 2 largest values in gray level histogram.

17. Sum of the 3 largest values in gray level histogram.

18. Sum of the 4 largest values in gray level histogram.

19. Sum of the 5 largest values in gray level histogram.

Features 20 to 24

With features 20–24, images showing large variations in intensity can be separated from those that have large homogeneous areas. Typical areas that can be homogeneous are, e.g., building facades, roads, lawns, water and the sky. To find local areas with homogeneous gray levels a search for the largest connected areas with similar gray level is performed. Gray levels are considered to be similar if they fall into the same bin of the 25-bin histogram. The largest regions of interest that are 4-connected are calculated and up to the 5 largest are accumulated in features 21–24:

20. Largest 4-connected area.

21. Sum of the 2 largest 4-connected areas.

22. Sum of the 3 largest 4-connected areas.

23. Sum of the 4 largest 4-connected areas.

24. Sum of the 5 largest 4-connected areas.
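A compact sketch of the gray level features 15-24 is given below. It assumes an 8-bit gray scale image and uses scipy.ndimage.label (4-connectivity by default) for the connected-area search; it illustrates the computation described above and is not the thesis code.

```python
import numpy as np
from scipy import ndimage

def gray_level_features(img, n_bins=25):
    """Features 15-24 from a gray scale image (values 0-255).

    Features 15-19: accumulated sizes of the 1-5 largest histogram bins,
    normalized by the image size.
    Features 20-24: accumulated sizes of the 1-5 largest 4-connected regions
    of pixels falling in the same histogram bin, normalized by the image size.
    """
    img = np.asarray(img)
    n_pixels = img.size
    bin_idx = np.minimum((img.astype(int) * n_bins) // 256, n_bins - 1)

    # Features 15-19: dominating gray levels.
    counts = np.bincount(bin_idx.ravel(), minlength=n_bins)
    top_bins = np.sort(counts)[::-1][:5] / n_pixels
    f15_19 = np.cumsum(top_bins)

    # Features 20-24: largest homogeneous 4-connected areas.
    region_sizes = []
    for b in range(n_bins):
        labels, n_regions = ndimage.label(bin_idx == b)   # 4-connected by default
        if n_regions > 0:
            region_sizes.extend(np.bincount(labels.ravel())[1:])  # skip label 0
    top_regions = np.sort(region_sizes)[::-1][:5] / n_pixels
    f20_24 = np.cumsum(top_regions)

    return f15_19, f20_24
```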

4.2.4 Camera Invariance

In the work presented in this chapter, virtual sensors will use input data that were obtained from a number of different cameras. To illustrate that the defined feature set can be utilized with different cameras, experiments using different image sets are performed in Section 4.5.


Figure 4.2: The seven image pairs used for the camera invariance test. The left image in each pair was obtained from the digital camera and the right image in each pair was obtained from the mobile phone camera.

In this paragraph, a direct comparison of images taken by two of the different cameras is performed to illustrate that the defined features are not really affected by the choice of camera. The cameras are a 5 megapixel Sony DSC-P92 digital camera and a 2 megapixel Sony Ericsson K750i mobile phone camera, see Section 2.4.

The evaluation of the feature set is performed on seven image pairs. Each pair was taken at the same place and with the handheld cameras. The images were manually cropped to cover essentially the same area. The cropped images are shown in Figure 4.2.

Features were calculated for each of the 14 images and each image was compared with all images, giving a total of 196 comparisons. The comparison was performed using a similarity measure M calculated from $r_i$, a measure for each individual feature that can vary between 0 (identical features) and 1. The similarity measure M was calculated as the mean value of $r_i$:

M = \frac{1}{24} \sum_{i=1}^{24} r_i \qquad (4.12)

where $i$ denotes the feature number. The measure $r_i$ is calculated as a weighted mean value

r_i = \sum_{j=1}^{N} r_{i_j} w_j \qquad (4.13)

for the features that are histograms. For each bin $j$, $r_{i_j}$ is calculated as

r_{i_j} = \frac{\left| f^c_{i_j} - f^m_{i_j} \right|}{\max\left( f^c_{i_j},\ f^m_{i_j} \right)} \qquad (4.14)

with $c$ and $m$ denoting the digital camera and the mobile phone respectively. The above calculation guarantees that $r_{i_j} \in [0, 1]$. The weights $w_j$ are introduced to put higher weights on the relative differences originating from bins with a large number of hits. They are calculated as

w_j = \frac{f^c_{i_j} + f^m_{i_j}}{\sum_{j=1}^{N} \left( f^c_{i_j} + f^m_{i_j} \right)} \qquad (4.15)

with $N$ being the number of bins. For the single value features, when $N = 1$, Equations 4.13 - 4.15 can be simplified to

r_i = \frac{\left| f^c_i - f^m_i \right|}{\max\left( f^c_i,\ f^m_i \right)} \qquad (4.16)
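The similarity computation of Equations 4.12-4.16 can be summarised in a few lines of code. The sketch below assumes that each image is represented as a list of 24 feature values, where histogram features are 1-D arrays and the rest are scalars; the small epsilon guard against division by zero is an added assumption and not part of the thesis formulation.

```python
import numpy as np

def feature_similarity(fc, fm):
    """Per-feature measure r_i of Eqs. (4.13)-(4.16).

    fc, fm: the i:th feature from the two cameras, either a scalar or a
    1-D histogram array. Returns a value in [0, 1] (0 = identical).
    """
    fc = np.atleast_1d(np.asarray(fc, dtype=float))
    fm = np.atleast_1d(np.asarray(fm, dtype=float))
    denom = np.maximum(np.maximum(fc, fm), np.finfo(float).eps)
    r_bins = np.abs(fc - fm) / denom                  # Eq. (4.14) / (4.16)
    weights = (fc + fm) / (fc + fm).sum()             # Eq. (4.15)
    return float(np.sum(r_bins * weights))            # Eq. (4.13)

def image_similarity(features_c, features_m):
    """Similarity measure M of Eq. (4.12): mean of r_i over the 24 features."""
    r = [feature_similarity(fc, fm) for fc, fm in zip(features_c, features_m)]
    return sum(r) / len(r)
```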

and no weights need to be used.

The results from the comparisons are stored in a symmetric matrix that is illustrated in Figure 4.3 by squares representing the measure M for each image combination. The first seven rows and columns correspond to the digital camera and the following seven rows and columns to the mobile phone. Low M-values from the comparison are represented by dark squares and high M-values are represented by lighter gray shades. The figure shows two distinct diagonals. The diagonal starting in the upper left corner is the self-comparison (features from image 1 are compared with features from image 1 etc.) and all squares therefore got the M-value 0 (black). The other diagonal represents the image pairs taken at the same location (image 1 is in pair with image 14, 2 with 13 etc.) and got M-values between 0.092 and 0.205. All other squares represent pairing of images taken at different locations. The M-values for these squares were in the interval of 0.28 to 0.54. This result implies that the features from the image pairs are more similar than the features from any other combination of images taken with the same camera.

By this test it has been shown that for the selected seven image pairs, the feature set clearly distinguishes between different scenes instead of different cameras. The conclusion is therefore that the extracted features are sufficiently invariant to these cameras, meaning that alternating between these cameras should not adversely affect the performance of the system.


Figure 4.3: Result of the camera invariance test. Rows/columns 1-7 correspond to the digital camera and rows/columns 8-14 to the mobile phone. Dark squares show high similarity.

4.2.5 Assumptions

The calculation of the features is based on the assumptions that the camera orientation is close to horizontal and that the perspective distortion is small. Concerning the first assumption, the presented system is not rotationally invariant, i.e., it cannot handle images with arbitrary rotational angles. A function that can estimate the rotational angle can be implemented by identification of peaks in the edge histograms, assuming that the approximate 90 degree difference between horizontal and vertical edges exists. On the other hand, it is probably not needed to perform that operation, since an outdoor robot that is supposed to navigate in non-flat terrain needs to estimate not only position and orientation, but also pitch and roll angles. It is therefore assumed that the images can be re-rotated with an accuracy that is sufficient for the system (e.g. the smallest bin used for edge orientation classifies an edge as horizontal if it is within ±11°).

The second assumption concerns the perspective of the images. It is assumed that the optical axis of the camera has a limited deviation from the normal of the object surfaces in the images. If this is not the case the use of histograms for edge directions is not optimal for the performed task. In the system proposed in this thesis, small deviations are compensated for by the quantization into only a few histogram bins and the tolerances in the definition of qualitative features such as corners. It is possible to transform an image in order to restore the perspective of the main surface. One way is to find the vanishing point, but this is often a costly operation. Another possibility, if 3D-models of the scene exist, is to use image depth information to obtain the parameters needed for a perspective correction.

4.3 AdaBoost

AdaBoost is the abbreviation for adaptive boosting. It was developed by Freund and Schapire [Freund and Schapire, 1997] and has been used in diverse applications, e.g., as classifiers for image retrieval [Tieu and Viola, 2000] and real-time face detection [Viola and Jones, 2004]. In mobile robotics, AdaBoost has, e.g., been used in ball tracking for soccer-robots [Treptow et al., 2003] and to classify laser scans for learning of places in indoor environments [Mozos et al., 2005]. Mozos’ work is a nice demonstration of the use of machine learning and a set of generic features to transform sensor readings to human spatial concepts. A cascaded classifier was developed where the first steps should be fast and have a very low false negative rate. The following steps include classifiers with increasing complexity that can resolve the ‘hard cases’ from the set of positive classifications found in the first steps.

The main purpose of AdaBoost is to produce a strong classifier by a linear combination of weak classifiers, where weak means that the classification rate has to be only slightly better than 0.5 (better than guessing). Figure 4.4 shows pseudo code for the implemented AdaBoost algorithm (see [Schapire, 1999] for a formal algorithm). The principle of AdaBoost is as follows.

The input to the algorithm is a number, $N$, of positive and negative examples. The training phase is a loop. For each iteration $t$, the best weak classifier $h_t$ is calculated and a distribution $D_t$ is recalculated. The boosting process uses $D_t$ to increase the weights of the hard training examples in order to focus the weak learners on the hard examples. In our case $D_t$ has been used to bias the hard examples by including it in the calculation of a weighted mean value for the MDC, see Section 4.3.1.

Implementing and using AdaBoost, one should note a few things. First, if the evaluation of training examples results in 100% correct classification by a weak classifier, there will be no more change in $D_t$ ($D_{t+1} = D_t$). If this happens in the first iteration only one feature will be used in the strong classifier. Second, the general AdaBoost algorithm does not include rules on how to choose the number of iterations $T$ of the training loop. The training process can be aborted if the distribution $D_t$ does not change, or alternatively a fixed maximum number of iterations $T$ is used. Boosting is known to be not particularly prone to the problem of overfitting [Schapire, 1999]. $T = 30$ was used for training in all the experiments presented and no indications of overfitting were noted when evaluating the performance of the classifier on an independent test set.


• Use $N$ training examples $(x_1, y_1) \ldots (x_N, y_N)$, where $y_n = 1$ for the $N_p$ positive examples and $y_n = -1$ for the $N_n$ negative examples ($N = N_p + N_n$)

• Initialize the distribution: $D_1(n_p) = 1/N_p$ for the positive examples (with indices $n_p$) and $D_1(n_n) = 1/N_n$ for the negative examples (with indices $n_n$)

• For $t = 1, \ldots, T$:

  1. Normalize the distribution $D_t$

  2. Train weak learners using the distribution $D_t$

  3. Find the weak hypothesis $h_j$ that minimizes $\varepsilon_j$ and set $h_t = h_j$ and $\varepsilon_t = \varepsilon_j$. The goodness of a weak hypothesis $h_j$ is measured by its error $\varepsilon_j = \sum_{n:\, h_j(x_n) \neq y_n} D_t(n)$

  4. Set $\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}$

  5. Update $D_t$: $D_{t+1}(n) = D_t(n) \exp(-\alpha_t y_n h_t(x_n))$

• The final strong classifier is given by: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Figure 4.4: General AdaBoost algorithm, where $\varepsilon_t$ is a measure of the performance of the best weak classifier $h_t$ at iteration $t$. Index $j$ points out the used feature, and $\alpha_t$ is the weight of $h_t$ used both in the update of $D_t$ and in the final strong classifier $H(x)$.
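As an illustration of the loop in Figure 4.4, the following is a minimal Python sketch of the boosting procedure. It assumes a hypothetical helper train_weak(X, y, D) that returns the best weak hypothesis (as a callable) together with its ±1 predictions on the training set under the current distribution; it is not the implementation used in the thesis.

```python
import numpy as np

def adaboost_train(X, y, train_weak, T=30):
    """Sketch of the boosting loop in Figure 4.4.

    X: (N, d) feature matrix, y: labels in {+1, -1},
    train_weak: hypothetical callable(X, y, D) -> (h, pred), where h is a
    callable mapping feature matrices to +/-1 and pred are its predictions
    on the training set under the distribution D.
    """
    # D1: 1/Np for positive examples, 1/Nn for negative examples
    D = np.where(y == 1, 1.0 / np.sum(y == 1), 1.0 / np.sum(y == -1))
    alphas, hyps = [], []
    for t in range(T):
        D = D / D.sum()                         # 1. normalize distribution
        h, pred = train_weak(X, y, D)           # 2.-3. best weak hypothesis
        eps = np.sum(D[pred != y])              #       its weighted error
        if eps == 0:                            # perfect weak classifier:
            alphas.append(1.0)                  # D would no longer change,
            hyps.append(h)                      # so stop after this one
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)     # 4. weight of h_t
        D = D * np.exp(-alpha * y * pred)           # 5. re-weight examples
        alphas.append(alpha)
        hyps.append(h)
    return alphas, hyps

def adaboost_predict(X, alphas, hyps):
    """Strong classifier H(x) = sign(sum_t alpha_t * h_t(x))."""
    return np.sign(sum(a * h(X) for a, h in zip(alphas, hyps)))
```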

4.3.1 Weak Classifiers

In its standard form, the weak classifiers in AdaBoost use single value features. To be able to also handle feature arrays from the histogram data, a minimum distance classifier (MDC) was used to calculate a scalar weak classifier. Dt was used to bias the hard training examples by including it in the calculation of a weighted mean value for the MDC prototype vector:

$$m_{l,k,t} = \frac{\sum_{\{n=1 \ldots N \,|\, y_n = k\}} f(n, l)\, D_t(n)}{\sum_{\{n=1 \ldots N \,|\, y_n = k\}} D_t(n)} \qquad (4.17)$$

where $m_{l,k,t}$ is the mean value array for iteration t, class k, and feature l, and $y_n$ is the class of the n:th image. The features for each image are stored in f(n, l) where n is the image number. For evaluation of the MDC at iteration t, a distance value $d_{k,l}(n)$ for each of the two classes k is calculated as

$$d_{k,l}(n) = \| f(n, l) - m_{l,k,t} \| \qquad (4.18)$$


and the shortest distance for each feature l indicates the winning class for that feature.
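A minimal sketch of this distance-based weak learner is given below, assuming scalar-valued features for simplicity (histogram-valued features are handled the same way, with the norm in Eq. 4.18 taken over the whole array). The function name and the per-feature search are illustrative; the sketch can be plugged in as the train_weak argument of the AdaBoost sketch above.

```python
import numpy as np

def mdc_weak_learner(F, y, D):
    """Sketch of the MDC-based weak classifier of Eqs. (4.17)-(4.18).

    F: (N, L) matrix with one scalar feature value per image and feature,
    y: labels in {+1, -1}, D: current AdaBoost distribution over images.
    Returns the best single-feature classifier and its predictions.
    """
    best_err, best_h, best_pred = np.inf, None, None
    for l in range(F.shape[1]):
        # Eq. (4.17): D-weighted mean value (prototype) for each class
        m = {k: np.sum(D[y == k] * F[y == k, l]) / np.sum(D[y == k])
             for k in (+1, -1)}
        # Eq. (4.18): distance to each prototype; the nearest prototype wins
        pred = np.where(np.abs(F[:, l] - m[+1]) < np.abs(F[:, l] - m[-1]),
                        +1, -1)
        err = np.sum(D[pred != y])
        if err < best_err:
            best_err, best_pred = err, pred
            best_h = lambda X, l=l, mp=m[+1], mn=m[-1]: np.where(
                np.abs(X[:, l] - mp) < np.abs(X[:, l] - mn), +1, -1)
    return best_h, best_pred
```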

4.4 Bayes Classifier

It is instructive to compare the result from AdaBoost with another classifier and for that the Bayes Classifier was selected. The Bayes Classifier, or Bayes Optimal Classifier (BOC) as it is sometimes called, classifies normally distributed data with a minimum misclassification rate. The decision function is [Duda et al., 2001]

$$d_k(x) = \ln P(w_k) - \frac{1}{2} \ln |C_k| - \frac{1}{2} \left[ (x - m_k)^T C_k^{-1} (x - m_k) \right] \qquad (4.19)$$

where $P(w_k)$ is the prior probability (in the experiments set to 0.5), $m_k$ is the mean vector of class k, $C_k$ is the covariance matrix of class k calculated on the training set, and x is the feature value to be classified.
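A short sketch of how the decision function in Equation 4.19 can be evaluated follows; the class statistics are estimated from training data and the function names are illustrative rather than the implementation used in the thesis.

```python
import numpy as np

def boc_train(F, y, prior=0.5):
    """Estimate per-class mean vector and covariance matrix for Eq. (4.19)."""
    return {k: (np.mean(F[y == k], axis=0),
                np.cov(F[y == k], rowvar=False),
                prior)
            for k in (+1, -1)}

def boc_decision(x, m, C, prior):
    """d_k(x) of Eq. (4.19) for one class."""
    diff = x - m
    return (np.log(prior)
            - 0.5 * np.log(np.linalg.det(C))
            - 0.5 * diff @ np.linalg.inv(C) @ diff)

def boc_classify(x, stats):
    """Assign the feature vector x to the class with the largest d_k(x)."""
    return max(stats, key=lambda k: boc_decision(x, *stats[k]))
```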

Not all of the defined features can be used in BOC. Linear dependencies between features give numerical problems in the calculation of the decision function. Therefore normalized histograms cannot be used, hence features $\vec{f}_1$, $\vec{f}_2$, $\vec{f}_6$, $\vec{f}_7$, and $\vec{f}_8$ were not considered. The set of features used in BOC was represented by the following vector:

$$\vec{f} = [f_3, f_4, f_9 \ldots f_{15}, f_{17}, f_{20}, f_{23}]^T. \qquad (4.20)$$

This set was constructed by starting with the best individual feature as found by running AdaBoost (see Figure 4.7, Section 4.5.3) and adding the second best feature etc., while observing the condition value of the covariance matrices.

4.5 Evaluation of a Virtual Sensor for Buildings

In this section a virtual sensor for building detection is evaluated using images from two classes: buildings and nonbuildings. For this type of classification it is generally difficult to define representative negative samples for the training set. In order to cope with this, nature images were used as the representation of nonbuildings in the evaluation presented here and, hence, nature is used as the name of the class nonbuildings. This shall not be seen as a limitation since the virtual sensor is learned and the training set can be adapted to an intended operational environment.

Results for AdaBoost are compared with results using BOC, and two different image resolutions are tested and compared.


4.5.1 Image Sets

Three different sources were used for the collection of nature and building images utilized in the experiments. For Set 1, images were taken by an ordinary consumer digital camera. These images were taken over a period of several months in outdoor environments that are potential operational areas for a mobile robot. The virtual sensor was designed to be used to classify images taken by a mobile robot. Therefore, Set 2 consists of images taken on manually controlled runs with a mobile robot, performed on two different occasions. Set 1 and 2 are disjunctive in the sense that the images do not depict the same buildings or the same nature views.

In order to verify the system performance with an independent set of images, Set 3 contains images downloaded1 from the Internet using Google's Image Search. For buildings the search term building was used. The first 50 images with a minimum resolution of 240 × 180 pixels containing a dominant building were downloaded. For nature images, the search terms nature (15 images), vegetation (20 images), and tree (15 images) were used. Only images that directly applied to the search term and were photos of real environments (no art or computer graphics) were used. Borders and text around some images were removed manually. Table 4.1 summarizes the different sets of images and the number of images in each set.

All images were converted to gray scale and stored in two different resolutions (maximum side length 120 pixels and 240 pixels, referred to as size 120 and 240 respectively). In this way the performance for different resolutions can be compared. In the case of comparable performance, using low resolution images has the advantage of faster computations and decreased demands on the used equipment or, alternatively, sub-images depicting objects at longer distances will be possible to use.

Examples of images from Set 1 and 2 are shown in Figure 4.5. Note the difference between the building images in Set 1 and 2. The images taken with the handheld cameras are most often centred on the view while the optical axis of the camera on the robot was mostly horizontal. This resulted in images in Set 2 that only contain building objects in the upper half and where the lower part shows the ground surface.

4.5.2 Test Description

Five tests have been defined for evaluation of the virtual sensor. Test 1 shows whether it is possible to collect training data with a consumer camera and use this for training of a classifier that is evaluated with a different camera on the intended platform, the mobile robot. Test 2 trains and evaluates on a mixture of Set 1 and Set 2. Test 3 shows how well the learned model, trained with images taken in our neighbourhood with known equipment, extrapolates to

1The images were downloaded 7th Sept 2005.


Set | Origin                  | Area             | Buildings | Nature
 1  | Handheld digital camera | Urban and nature | 40        | 40
 2  | Camera on mobile robot  | Örebro Campus    | 66        | 24
 3  | Internet search         | Worldwide        | 50        | 50
    | Total number            |                  | 156       | 114

Table 4.1: Summary of the image sets used for the evaluation of the virtual sensor for buildings. The digital camera is a 5 megapixel Sony (DSC-P92) and the mobile camera is an analogue camera mounted on the mobile robot Rasmus, see Section 2.3.

images taken around the world by different photographers. Test 4 evaluates the performance on the complete collection of images. Table 4.2 summarizes the test cases. These tests have been performed with AdaBoost and BOC separately for each of the two image sizes. For Test 2 and 4, holdout validation [Kohavi, 1995] is performed. A random function is used to select the training partition (90% of the images) and the images not selected are used for the evaluation of the classifiers. This was repeated Nrun times.

No. | Nrun | Train Set       | Test Set
 1  |   1  | 1               | 2
 2  | 100  | 90% of {1,2}    | 10% of {1,2}
 3  |   1  | {1,2}           | 3
 4  | 100  | 90% of {1,2,3}  | 10% of {1,2,3}
 5  | 100  | 90% of {1,2}    | 10% of {1,2}

Table 4.2: Description of the defined tests (Nrun is the number of runs).

A fifth test, Test 5, is designed to test the scale invariant properties of the system. Test 2 is repeated, but now the training is performed with one image size and the evaluation is performed with the other image size.

4.5.3 Analysis of the Training Results

AdaBoost can compute multiple weak classifiers from the same features by means of a different threshold, for example. Figure 4.6 presents statistics on the usage of different features in Test 2. The feature most often used for image size 240 is the orientation histogram ($\vec{f}_2$). For image size 120, features $\vec{f}_2$, f8 (corners), f13 (rectangles) and f14 (rectangles related to corners) dominate. Figure 4.7 shows how well each individual feature manages to classify images in Test 2. Several of the histograms based on edge orientation are in themselves close to


Figure 4.5: Examples of images used for training. The uppermost row shows buildings in Set 1. The following rows show buildings in Set 2, nature in Set 1, and nature in Set 2.

the result achieved for the classifiers presented in the next section. Comparing Figure 4.6 and Figure 4.7, one can note that several features with high classification rates are not used by AdaBoost to the expected extent, e.g., features $\vec{f}_1$, f3, f4, and f5. This is most likely caused by the way in which the distribution Dt is updated. Because the importance of correctly classified examples is decreased after a particular weak classifier is added to the strong classifier, similar weak classifiers might not be selected in subsequent iterations.

As a comparison to the results of Test 1 to Test 4 presented in Section 4.5.4, the result obtained on the training data using combinations of image sets is also presented in Table 4.3. ΦTP is the true positive rate (in this case related to buildings), ΦTN is the true negative rate (related to nature), and ΦAcc is the accuracy2.

4.5.4 Results

Training and evaluation were performed for the tests specified in Table 4.2 with features extracted from images of size 120 and 240 separately. The results are presented in Tables 4.4 and 4.5 respectively.

2The accuracy ΦAcc is based on the sum of the true positives and the true negatives.


Figure 4.6: Histogram describing the frequency with which features are selected for the strong classifier by AdaBoost in Test 2 as an average of 100 runs, using image size 120 (upper) and 240 (lower). Axes: feature number vs. per cent.

The tables show ΦTP (buildings), ΦTN (nature), and ΦAcc (the total classification rate with the corresponding standard deviation). Results from both AdaBoost and BOC using the same training and testing data are given.

In Test 1 a classification rate of over 92% is obtained for image size 240. This shows that it is possible to build a classifier based on digital camera images and achieve very good image classification results with another camera onboard the mobile robot, even though Set 1 and 2 have structural differences, see Section 4.5.1.

Test 2 is the most interesting test for us. Here, images that have been collected in the target environment of the mobile robot are used for both training and evaluation. This test shows high (and the highest) classification rates. For both AdaBoost and BOC they are around 97% using the image size 240. Figure 4.8 shows the distribution of wrongly classified images for AdaBoost compared to BOC. It can be noted that for image size 120 several images give both classifiers problems, while for image size 240 different images cause problems. Figures 4.9 and 4.10 show the images that were wrongly classified. The numbers of the images relate to the image numbers in Figure 4.8.


Figure 4.7: Histogram of the classification rate of individual features in Test 2 as an average of 100 runs with T = 1, image size 120 (upper) and 240 (lower). Axes: feature number vs. per cent.

Test 3 is of the same type as Test 1. They both train on one set of images and then validate on a different set. Test 3 shows lower classification rates than Test 1, with the best result for AdaBoost using image size 240. It is not surprising that lower classification rates are obtained since the properties of the downloaded images differ from the other image sets. The main difference between the image sets is that the buildings in Set 3 often are larger and located at a greater distance from the camera. The same can be noted in the nature images, where Set 3 contains a number of landscape images that do not show close range objects. The conclusions from this test are that AdaBoost generalizes better than BOC and that the classification works very well even though the training images were taken in our neighbourhood and the images used for evaluation were downloaded from the Internet.

Comparing the results of Test 2 and Test 4, it can be noted that the classification rate is lower for Test 4, especially for image size 120. Investigation of the misclassified images in Test 4 shows that the share belonging to image Set 3 (Internet) is large. For both image sizes 60% of the misclassified images came from Set 3, although Set 3 only represents 37% of the total number of images. This shows, once more, that the Internet images are harder to classify due to their different properties.


Sets  | Size | Classifier | ΦTP [%] | ΦTN [%] | ΦAcc [%]
1     | 120  | AdaBoost   | 100.0   | 100.0   | 100.0
      |      | BOC        | 100.0   | 100.0   | 100.0
1,2   | 120  | AdaBoost   |  97.2   | 100.0   |  98.2
      |      | BOC        |  95.3   |  93.8   |  94.7
1,2,3 | 120  | AdaBoost   |  89.7   |  94.7   |  91.9
      |      | BOC        |  86.5   |  94.7   |  90.0
1     | 240  | AdaBoost   | 100.0   | 100.0   | 100.0
      |      | BOC        | 100.0   | 100.0   | 100.0
1,2   | 240  | AdaBoost   | 100.0   | 100.0   | 100.0
      |      | BOC        |  98.1   | 100.0   |  98.8
1,2,3 | 240  | AdaBoost   |  98.7   |  99.1   |  98.9
      |      | BOC        |  95.5   |  98.2   |  96.7

Table 4.3: Results from evaluation of the virtual sensor on the same image sets as used for the training. The image sets are defined in Table 4.1.

Test 5 demonstrates the scale invariance of the system by performing a cross resolution test. Classifiers were trained with images of size 120 and evaluated with images of size 240 and vice versa. The result is presented in Table 4.6 and should be compared to Test 2 in Tables 4.4 and 4.5. The conclusion from this test is that the features used have scale invariant properties over a certain range and that AdaBoost shows notably better scale invariance than BOC, which again demonstrates AdaBoost's better extrapolation capability.


Test no. | Classifier | ΦTP [%] | ΦTN [%] | ΦAcc [%]
    1    | AdaBoost   |  81.8   |  91.7   | 84.4
         | BOC        |  93.9   |  58.3   | 84.4
    2    | AdaBoost   |  93.0   |  91.8   | 92.6 ± 5.8
         | BOC        |  95.7   |  89.0   | 93.4 ± 5.5
    3    | AdaBoost   |  68.0   |  90.0   | 79.0
         | BOC        |  72.0   |  74.0   | 73.0
    4    | AdaBoost   |  86.6   |  89.8   | 87.9 ± 6.2
         | BOC        |  86.4   |  88.5   | 87.3 ± 6.0

Table 4.4: Results for Test 1-4 using images with size 120.

Test no. | Classifier | ΦTP [%] | ΦTN [%] | ΦAcc [%]
    1    | AdaBoost   |  89.4   | 100.0   | 92.2
         | BOC        |  95.5   |  87.5   | 93.3
    2    | AdaBoost   |  96.1   |  98.3   | 96.9 ± 4.3
         | BOC        |  98.1   |  95.7   | 97.2 ± 4.0
    3    | AdaBoost   |  88.0   |  94.0   | 91.0
         | BOC        |  90.0   |  82.0   | 86.0
    4    | AdaBoost   |  94.1   |  95.5   | 94.6 ± 3.8
         | BOC        |  94.8   |  93.4   | 94.2 ± 4.7

Table 4.5: Results for Test 1-4 using images with size 240.

Train | Test | Classifier | ΦTP [%] | ΦTN [%] | ΦAcc [%]
 120  | 240  | AdaBoost   |  94.2   |  96.7   | 95.1 ± 4.2
      |      | BOC        |  93.0   |  94.3   | 93.5 ± 5.3
 240  | 120  | AdaBoost   |  95.1   |  90.8   | 93.6 ± 6.0
      |      | BOC        | 100.0   |  44.8   | 80.5 ± 6.7

Table 4.6: Results for Test 5, the cross resolution test where data from Test 2 were used. Training was performed with images of size 120 and testing with images of size 240 and vice versa.


Figure 4.8: Distribution of the 20 most frequently wrongly classified images from AdaBoost (gray) and BOC (white) in Test 2, using image size 120 (upper) and 240 (lower). Axes: image number (starting with the most times wrongly classified) vs. number of misclassifications.

Figure 4.9: The 16 most frequently misclassified images in Test 2 with image size 120 (numbered 1-16).


Figure 4.10: The 16 misclassified images in Test 2 with image size 240 (numbered 1-16).


4.6 A Building Pointer

The learned building detection algorithm has been used to construct a virtual sensor. This sensor indicates the presence of buildings in different directions related to a mobile robot. In this case the robot performed a sweep with its camera (from −120° to +120° in relation to its heading) at a number of points along its track. The images were then classified into buildings and nonbuildings. The virtual sensor was trained using AdaBoost and the images in Set 1, see Section 4.5. The experiments were performed using Tjorven, a Pioneer robot equipped with GPS and a camera on a PT-head, see Section 2.2 for more details. Figure 4.11 shows the result of a tour in the Campus area. The arrows show the direction towards buildings and the lines point toward nonbuildings. Figure 4.12 shows an example of the captured images and their classes from a sweep with the camera at the first sweep point (the lowest rightmost sweep point in Figure 4.11).

This experiment was conducted with yet another camera than those used in Section 4.5 and during winter, with no leaves on the trees and snow on the ground. An estimation of the performance based on the evaluation by a human expert is shown in Table 4.7. Some images are hard to classify into the two classes since they include large portions belonging to both of the classes. These images are denoted ambiguous and are shown in a separate column of the table. Problems were noted, for example, when a hedge is in front of a building or when the robot drives close to a bicycle stand and sees buildings and vegetation through the bikes. A specific error occurred at the second stop (lower middle stop) when a car stopped in front of the robot. Among the unambiguous images, 89% were correctly classified. Note that the good generalization of AdaBoost together with a suitable feature set is expressed by the fact that the classifier was trained on images taken in a different environment and during a different season.

True  | False | Ambiguous
76.4% |  9.4% | 14.2%

Table 4.7: Results for the building pointer. The column of ambiguous data represents those images that have a mixture of building and nature where neither dominates.


Figure 4.11: Using a virtual sensor to point out the class buildings along the mobile robot path. Arrows indicate buildings and lines nonbuildings. (Aerial image © Örebro Community Planning Office.)


Figure 4.12: Example of a single sweep with the camera. The arrows point at images classified as buildings and the lines point at nonbuildings.


4.7 Evaluation of a Virtual Sensor for Windows

In the previous sections of this chapter, a virtual sensor has been presented. To illustrate the concept, experiments with the aim of detecting buildings were presented. This section shows that the general approach defined for the virtual sensor is not limited to building detection. A third class, windows, is introduced in order to show that the set of features can be used to distinguish between classes that have similar structure. Both buildings and windows have a large content of vertical and horizontal lines, corners and rectangles.

4.7.1 Image Sets and Training

In the evaluation two sets of images, denoted A and B, are used. Set A consists of images taken by handheld digital cameras and includes images of buildings, windows and nature. All nature images from Set 1 (Section 4.5.1) are included in Set A. The building images consist of the building images from Set 1, with the exception that a few images were added to extend the image set, and images that contained only single windows (and therefore belong to the windows class) were removed. Half of the window images were collected using a digital camera and the other half using a mobile phone camera, see Section 2.4. Set B consists of Set A extended with Set 3, consisting of Internet images of buildings and nature3. The two image sets and the number of images in each set are listed in Table 4.8. Examples of images in the window dataset are given in Figure 4.13.

Set | Windows | Buildings | Nature
 A  |   100   |    50     |   50
 B  |   100   |   100∗    |  100∗

Table 4.8: The number of images used for the tests. The images were collected with a digital camera (5 megapixel Sony DSC-P92), with a mobile phone camera (2 megapixel SonyEricsson K750i) and downloaded from the Internet. (∗ includes 50 from the Internet.)

One virtual sensor is trained for each image class, i.e., one virtual sensor for buildings, one for windows and one for nature. In the training process the images from the other two classes constituted the negative examples. The training was based on 90% of the positive and negative examples and the evaluation was performed on the remaining 10%. The selection of the training and test sets was random and was repeated 100 times in the same manner as described in Section 4.5.

3It was not as straightforward to search for windows as for buildings and nature images on the Internet. Hence no images have been collected from the Internet for the windows class.


Figure 4.13: Examples of images in the window dataset.

4.7.2 Result

The result from the evaluation of the different virtual sensors for image Sets A and B is presented in Tables 4.9 and 4.10. Table 4.9 shows the true positive rate, ΦTP, the true negative rate, ΦTN, the accuracy, ΦAcc, and the normalized accuracy4, ΦNacc. Table 4.10 divides the false positive images into their true classes. There are three columns with data extracted from the experiments. One cell in each row refers to the true positive rate, ΦTP from Table 4.9, and the remaining two cells show the false positive rates for the respective classes in the negative dataset. The diagonals in the tables should contain high numbers and the other cells should contain low numbers.

Result for Set A  The best results of the virtual sensors using the images taken with the digital camera are achieved for the nature virtual sensor, with an accuracy of 98%. This shows that the feature set is able to separate man-made structures from natural objects with a high detection rate. The accuracy for the window virtual sensor and for the building virtual sensor is 87% and 89%

4The normalized accuracy ΦNacc is a value that takes the sizes of the positive and negative sets into account and represents a value for the case when the number of positives equals the number of negatives in the evaluated set. ΦNacc can be calculated as the mean value of ΦTP and ΦTN.


Test | VS        | Set | ΦTP [%] | ΦTN [%] | ΦAcc [%]   | ΦNacc [%]
  1  | Windows   |  A  |  94.3   |  79.1   | 86.7 ± 7.2 |  86.7
  2  | Buildings |  A  |  79.0   |  92.7   | 89.3 ± 7.0 |  85.8
  3  | Nature    |  A  |  98.6   |  97.6   | 97.8 ± 3.2 |  98.1
  4  | Windows   |  B  |  94.6   |  87.0   | 89.6 ± 4.7 |  90.8
  5  | Buildings |  B  |  77.9   |  80.3   | 79.5 ± 6.9 |  79.1
  6  | Nature    |  B  |  97.5   |  95.3   | 96.1 ± 3.3 |  96.4

Table 4.9: Results from the evaluation of the windows virtual sensor based on 100 runs. ΦTP is the true positive rate, ΦTN is the true negative rate, ΦAcc is the accuracy, and ΦNacc is the normalized accuracy.

Test | VS        | Set | Wind. [%] | Build. [%] | Nat. [%]
  1  | Windows   |  A  |   94.3    |    35.8    |   6.0
  2  | Buildings |  A  |   11.0    |    79.0    |   0.0
  3  | Nature    |  A  |    1.4    |     4.4    |  98.6
  4  | Windows   |  B  |   94.6    |    23.6    |   2.3
  5  | Buildings |  B  |   24.1    |    77.9    |  15.2
  6  | Nature    |  B  |    2.9    |     6.4    |  97.5

Table 4.10: Results from the evaluation of the three virtual sensors (windows, buildings, and nature), using 100 runs, and with the result separated into the three classes.

respectively. It can be noted that the window virtual sensor has a larger true positive rate than the building virtual sensor. The reason for this is probably that the building images often include parts of the surroundings, like small parts of vegetation and ground, while the surroundings of the windows are homogeneous walls. From Table 4.10 it can be noted that the nature virtual sensor has a low false positive rate and that the other two virtual sensors show low false positive rates for nature images (6% and 0%). The window virtual sensor has substantially more false positives from the building images than from the nature images (36% versus 6%).

Result for Set B  Tests 4-6, which include additional images from the Internet, show the same tendency as Tests 1-3. The figures for Test 4 in Table 4.9 are higher than for Test 1, while Tests 5 and 6 show slightly lower results. The largest change in Table 4.10 between Set A and B occurs for the building virtual sensor (Tests 2 and 5). The true positive rate is approximately the same as for


Set A, but the false positive rate for windows has increased from 11% to 24% and for nature from 0% to 15%.

The sets of building images are more closely related to the images of windows and nature than window images are related to nature images. The separation between the different classes is therefore harder for the building virtual sensor, which is more evident when the Internet images with their larger variety are introduced. For instance, the building images show buildings from larger distances and some images also include vegetation; compare with Tests 3 and 4 in Section 4.5.4.

4.8 Evaluation of a Virtual Sensor for Trucks

In this section, a virtual sensor for trucks is learned and evaluated. The image sets used in the previous section are extended with images of trucks and trailers. For the evaluation, virtual sensors for the four classes windows, buildings, nature, and trucks are learned.

A virtual sensor for trucks is useful in outdoor environments since buildings can be confused with other large objects, and it can be beneficial to distinguish between stationary and movable objects in any mapping process.

4.8.1 Image Sets and Training

Image sets A and B used in Section 4.7 are extended with images of trucks that were collected manually using handheld digital cameras (Sony DSC-P92 and SonyEricsson K750i) and downloaded from the Internet5. The new sets were denoted as C and D, see Table 4.11. Examples of images from the truck dataset are given in Figure 4.14.

Set | Windows | Buildings | Nature | Trucks
 C  |   100   |    50     |   50   |   50
 D  |   100   |   100∗    |  100∗  |  100∗

Table 4.11: The number of different images used for the tests presented in this section. The images were collected with a digital camera (5 megapixel Sony DSC-P92), with a mobile phone camera (2 megapixel SonyEricsson K750i) and from the Internet (∗ includes 50 from the Internet).

As in the previous section, one virtual sensor was trained for each image class, i.e., one virtual sensor for buildings, one for windows, one for nature and one for trucks. In the training process the images from the other three classes constituted the negative examples. The training was based on 90% of

5Search terms “truck”, “lorry” and “lastbil” (lorry in Swedish) were used.


Figure 4.14: Example of images in the truck dataset. The upper two rows show trucks from Set C and the lower two rows show truck images found on the Internet.

the positive and negative examples and the evaluation was performed on the remaining 10%. The selection of the training and test sets was random and was repeated 100 times in the same manner as described in Section 4.5.

The feature set that has been defined so far (features $\vec{f}_1$ to f24) is based on straight edges and gray levels. One typical qualitative feature that could be used to detect vehicles is circular shapes from, e.g., wheels. In order to better capture properties that can be related to vehicles, new features were introduced for an additional test. These features aim to detect the vehicle's tyres and circular shaped fenders. The features added to the previously defined feature set are defined as:

25. Number of circle segments in the image that cover at least 75% of a circle, measured by dividing the length of the edge segment by the estimated perimeter of a found circle.

26. Number of circle segments in the image that cover between 25% and 50% of a circle.

27. Number of circle segments in the image that cover between 50% and 75% of a circle.


28. Radius to distance ratio. The mean value of the two most similar radii is divided by the distance between the circles. This feature is set to zero if less than two circles have been found.

The search for circles makes use of the Random Sample Consensus (RANSAC) algorithm [Fischler and Bolles, 1981] to map line segment points to circles. RANSAC is used to fit a model to data and is known to accept a large portion of outliers in the data. The used model for a circle is defined as a centre coordinate and the radius of the circle. The algorithm randomly selects three points from the dataset (three points are needed to define a circle) and calculates the circle model that fits the selected points. The remaining points in the dataset are evaluated against this model and those points that are within a tolerance t1 form a subset Si. The process repeatedly performs the random initialization, compares the obtained subsets and stores the largest subset Si. The final model of the circle is calculated using the least squares method based on the points in the largest subset. During the experiments t1 = 0.4 was used and the dataset consisted of individual edge segments. Since straight lines can also be identified as circles if the radius is infinite, the radii of the circles and circle segments were limited to half of the image height. A general RANSAC implementation [Kovesi, 2000] was adapted to find circular shapes.
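The following is a minimal sketch of such a RANSAC circle search on the points of one edge segment; it is not the adapted Kovesi implementation used in the thesis. The tolerance t1 and the radius limit follow the description above, while the number of iterations is an arbitrary illustrative choice.

```python
import numpy as np

def fit_circle(pts):
    """Algebraic least-squares circle fit; returns (cx, cy, r)."""
    A = np.column_stack([2 * pts[:, 0], 2 * pts[:, 1], np.ones(len(pts))])
    b = (pts ** 2).sum(axis=1)
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    return cx, cy, np.sqrt(max(c + cx ** 2 + cy ** 2, 0.0))

def ransac_circle(pts, t1=0.4, max_radius=np.inf, iterations=200):
    """Sketch of RANSAC circle fitting on one edge segment.

    pts: (N, 2) edge points, t1: inlier distance tolerance, max_radius:
    e.g. half the image height, used to reject near-straight segments.
    Returns the consensus circle (cx, cy, r) and the inlier mask.
    """
    rng = np.random.default_rng(0)
    best = None
    for _ in range(iterations):
        # three randomly chosen points define a candidate circle
        cx, cy, r = fit_circle(pts[rng.choice(len(pts), 3, replace=False)])
        if not np.isfinite(r) or r > max_radius:
            continue
        # points within tolerance t1 of the candidate form the subset S_i
        inliers = np.abs(np.hypot(pts[:, 0] - cx, pts[:, 1] - cy) - r) < t1
        if best is None or inliers.sum() > best.sum():
            best = inliers
    if best is None:
        return None, None
    # final model: least-squares fit on the largest consensus set
    return fit_circle(pts[best]), best
```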

4.8.2 Result

The result from the evaluation of the different virtual sensors and image sets is presented in Table 4.12. Tests 1-8 refer to tests with the original 24 features and Tests 9-16 to the extended feature set. The rightmost column shows the normalized accuracy (ΦNacc), which is the expected accuracy when using equally large positive and negative evaluation sets.

Result for Set C  From Table 4.12 it can be noted (comparing Tests 1-4 with 9-12) that the results for both ΦAcc and ΦNacc are better for three out of four virtual sensors when using the feature set with 28 features. A small degradation is seen for buildings but, more importantly, a notable improvement is seen for the truck virtual sensor. The true positive rate has in fact decreased, but the true negative rate has improved more, resulting in an increase of ΦAcc from 81.5% to 87.4%. For Tests 9-12, ΦNacc varies from 86% for buildings up to almost 99% for nature.

Result for Set D  Set D includes images collected from the Internet. It was noted in previous sections that the sets of Internet images are not as consistent as the images taken manually with digital cameras. This is clearly reflected in the decreased accuracy for buildings, nature and trucks. The set of window images does not include images from the Internet, and for Tests 5 and 13 the


Test | VS        | Set | ΦTP [%] | ΦTN [%] | ΦAcc [%]   | ΦNacc [%]
  1  | Windows   |  C  |  92.9   |  85.3   | 88.4 ± 5.8 |  89.1
  2  | Buildings |  C  |  81.6   |  91.6   | 89.6 ± 5.8 |  86.6
  3  | Nature    |  C  |  96.6   |  98.3   | 98.0 ± 2.8 |  97.5
  4  | Trucks    |  C  |  91.6   |  79.0   | 81.5 ± 7.8 |  85.3
  5  | Windows   |  D  |  94.1   |  87.8   | 89.4 ± 4.8 |  91.0
  6  | Buildings |  D  |  73.3   |  82.8   | 80.4 ± 5.7 |  78.0
  7  | Nature    |  D  |  96.1   |  95.9   | 96.0 ± 2.8 |  96.0
  8  | Trucks    |  D  |  83.9   |  75.3   | 77.4 ± 5.6 |  79.6
  9  | Windows   |  C  |  93.2   |  85.7   | 88.7 ± 5.5 |  89.4
 10  | Buildings |  C  |  81.4   |  91.0   | 89.1 ± 5.4 |  86.2
 11  | Nature    |  C  |  98.4   |  99.0   | 98.9 ± 2.0 |  98.7
 12  | Trucks    |  C  |  89.6   |  86.8   | 87.4 ± 7.1 |  88.2
 13  | Windows   |  D  |  94.6   |  87.9   | 89.6 ± 4.5 |  91.3
 14  | Buildings |  D  |  78.9   |  81.9   | 81.2 ± 6.5 |  80.4
 15  | Nature    |  D  |  96.0   |  95.6   | 95.7 ± 3.2 |  95.8
 16  | Trucks    |  D  |  79.8   |  81.4   | 81.0 ± 5.5 |  80.6

Table 4.12: Results from the evaluation of the four virtual sensors for 100 runs. Tests 1-8 use the original 24 features and Tests 9-16 use the extended feature set with 28 features. ΦTP is the true positive rate, ΦTN is the true negative rate, ΦAcc is the accuracy, and ΦNacc is the normalized accuracy.

accuracy has in fact increased slightly because the window images in Set D constitute a relatively more homogeneous group of images compared to Set C.

Additional Features  A few features intended to capture properties of trucks were added to the feature set and the results are shown in Table 4.12, Tests 9 to 16. For the truck VS the positive detection rate actually decreased, but the accuracy increased because the true negative rate increased more than the true positive rate decreased. In Set C, the values of ΦAcc are similar for all virtual sensors except for the truck virtual sensor. Here the value has increased from 81.5% to 87.4% and it can therefore be concluded that the use of the additional features actually improved the overall result.

Comparing the results in the previous section, Table 4.9, with the results in Table 4.12, it can be seen that the introduction of the set of truck images did


not introduce any major differences to the overall result for the three other virtual sensors.

To better understand which classes cause problems in the different tests, the classification results have been separated. Table 4.13 shows the corresponding figures. The first three columns are the same as in Table 4.12. Next, there are four columns with data extracted from the experiments. One cell on each row refers to the true positive detection rate (the diagonals), and the remaining three cells show the false positive rates for the respective classes in the negative dataset. The diagonals in the tables should contain high numbers and the other cells should contain low numbers.

From the table it can be noted that the lowest numbers (low false positive rate) occur for nature images classified by the window virtual sensor and for window images classified by the nature virtual sensor. This indicates that windows and nature are well separated by the selected features. One can also note that the introduction of additional features decreased all false positive rates for the truck virtual sensor (cf. Test 4 with 12 and Test 8 with 16).

Test | VS        | Set | Wind. [%] | Build. [%] | Nat. [%] | Truck [%]
  1  | Windows   |  C  |   92.9    |    29.4    |   0.0    |   14.6
  2  | Buildings |  C  |   13.9    |    81.6    |   1.8    |    4.0
  3  | Nature    |  C  |    0.4    |     2.4    |  96.6    |    3.4
  4  | Trucks    |  C  |   16.8    |    18.2    |  32.2    |   91.6
  5  | Windows   |  D  |   94.1    |    21.1    |   0.8    |   14.7
  6  | Buildings |  D  |   24.5    |    73.3    |   9.1    |   18.1
  7  | Nature    |  D  |    1.9    |     5.2    |  96.1    |    5.1
  8  | Trucks    |  D  |   26.1    |    32.1    |  16.0    |   83.9
  9  | Windows   |  C  |   93.2    |    28.6    |   0.0    |   14.4
 10  | Buildings |  C  |   13.7    |    81.4    |   1.6    |    6.8
 11  | Nature    |  C  |    0.0    |     1.8    |  98.4    |    2.0
 12  | Trucks    |  C  |    9.9    |    13.6    |  19.2    |   89.6
 13  | Windows   |  D  |   94.6    |    21.2    |   0.8    |   14.2
 14  | Buildings |  D  |   25.5    |    78.9    |   9.1    |   19.6
 15  | Nature    |  D  |    1.7    |     5.2    |  96.0    |    6.4
 16  | Trucks    |  D  |   16.6    |    24.0    |  15.1    |   79.8

Table 4.13: Results from the evaluation of the four virtual sensors (windows, buildings, nature, and trucks), using 100 runs with the original 24 features (Tests 1-8) and the extended feature set (Tests 9-16), separated into the four classes.


In this section a virtual sensor for trucks was introduced and an evaluation of the virtual sensors for windows, buildings, nature, and trucks was performed. Additional features that notably improved the performance of the truck class (which was the intention) were introduced, but also the overall classification result was improved. For Set C, images taken by the handheld cameras, high accuracies from 87.4% up to 98.9% were achieved.

4.9 Summary and Conclusions

This chapter introduced the concept of a virtual sensor and presented instances of virtual sensors for four classes: windows, buildings, nature, and trucks. These virtual sensors use vision to classify the view and were trained using categorized images (supervised learning). Virtual sensors relate the robot sensor readings to human spatial concepts and are applicable, for example, when semantic information is necessary for communication between robots and humans.

First, two classifiers intended for use on a mobile robot to discriminate buildings from nature were evaluated. The results from the evaluation show that high classification rates can be achieved, and that the Bayes classifier and AdaBoost give similar classification performance in the majority of the performed tests. The number of wrongly classified images is reduced by about 50% when the higher resolution images are used. The features that were used have scale invariant properties, demonstrated by the cross resolution test where the classifier was trained with one image size and tested on another size. The benefits gained from using AdaBoost include the highlighting of strong features and its improved generalization properties over the Bayes classifier. The tests also revealed that the histogram of edge orientation is the best single feature in the feature set for finding building images.

To show the ability to learn other virtual sensors using the defined feature set, virtual sensors for windows and trucks were also learned and evaluated. Without extension of the feature set, they performed very well, although the performance of the virtual sensor for buildings degraded slightly. To even better distinguish between similar concepts, features that capture specific characteristic properties may be added to the feature set. This was shown for the virtual sensor dedicated to detection of trucks. With a few new features, the accuracy of truck detection was notably improved.

It was also shown how a virtual sensor can be used for pointing out buildings along the trajectory of a mobile robot. In the performed experiment it turned out that the feature set could also handle seasonal changes.

The suggested method using machine learning and generic image features makes it possible to extend virtual sensors to a range of other important human spatial concepts.


Chapter 5

Probabilistic Semantic Mapping

5.1 Introduction

In Chapter 3 the importance of semantic information and the possibilities that it gives were discussed. It was noted that semantic information can, for instance, facilitate human-robot interaction (HRI), but several other application fields were also presented. A method to extract semantic information was presented in Chapter 4, where a virtual sensor that relates sensor readings to human spatial concepts was introduced. As one example, a virtual sensor for building detection using methods for classification of views as buildings or nature based on vision was described. The purpose was to detect one very distinctive type of object that is often used by humans, for example, in textual description of route directions.

A semantic map is an additional tool to represent semantic information and can be used, for instance, to improve HRI. This chapter introduces a method that fuses data from different sensor modalities, range sensors and vision sensors, to create a semantic map of the environment. The method was applied to an outdoor environment, but it is expected to also apply to indoor environments with appropriate classifiers for the desired classes of objects to be mapped. The introduced method for probabilistic semantic mapping combines information from a learned virtual sensor with a standard occupancy grid map. Here, this will be exemplified by using the learned virtual sensor for building detection presented in Chapter 4. Pose information (location and orientation) from the mobile robot together with the output from the virtual sensor is used to estimate the directions to detected buildings. These directions are used to update the occupancy grid map with semantic information. The result is a semantic map with two classes: 'buildings' and 'nonbuildings'. A probabilistic approach is applied in order to cope with uncertainty in the output from the virtual sensor, both uncertainty in the classification and uncertainty regarding which part of the image caused the classification.


5.1.1 Outline

Section 5.2 describes the process of building a probabilistic semantic map. Images taken by a mobile robot are fed into a virtual sensor learned for a specific class. Based on the result from the virtual sensor and the robot pose, local maps showing connected regions1 representing the detected class are created for each robot position where images have been acquired. The local maps are then fused into a global probabilistic semantic map.

For the experiments an omni-directional camera mounted on the mobile robot, giving a 360° view of the surroundings, was used. From the omni-directional image Npv planar views are created with a horizontal field-of-view of Δ degrees (the values of Npv and Δ are provided in Table 5.2). These planar images are used as input to the virtual sensor. The process of calculating the planar views and descriptions of the experiments are given in Section 5.3. An evaluation of the results is given in Section 5.4, followed by a discussion in Section 5.5 that concludes this chapter.

5.2 Probabilistic Semantic Map

In this section a probabilistic approach to semantic mapping is presented. The objective is to take an occupancy grid map and introduce semantic labelling of the occupied cells. The semantic information comes from the output of a learned virtual sensor. It is assumed that an occupancy grid map including occupied cells that represent objects of the class of interest exists. An occupancy grid map can be built using a laser scanner or stereo vision, for example.

The result from the virtual sensor applies for the whole input image. This means that all objects within the view are assumed to belong to the same class. To focus the attention on the main objects in the view, probabilities assigned for single objects are adjusted according to their proportions of the view, see Section 5.2.1.

The process of building the semantic map is divided into two steps. In the first step, connected regions that are within view of the virtual sensor are searched in the grid map, and a local semantic map is created around the robot. In the second step the local maps are used to update a global map using a probabilistic method. The result in this case is a global semantic map where connected regions for buildings and nature can be distinguished.

Both the local maps and the global map are grid maps of the same size as the occupancy grid map and with the same cell size. Each cell can have a value Pcell in the interval [0, 1]. The maps are initialised with all cells set to unknown (Pcell = 0.5) and are then incrementally updated as the robot travels along the trajectory and evaluates the views with the virtual sensor. In this

1A connected region is understood as a connected component in a binarized occupancy grid representing the real object that caused the occupied area in the map.


chapter the virtual sensor for building detection is used and the resulting map will then contain the following three classes: buildings (Pcell > 0.5), nature (Pcell < 0.5) and unknown (Pcell = 0.5). Empty and unknown cells in the original occupancy grid map are not affected by the mapping; only the occupied cells are considered.

5.2.1 Local Semantic Map

Throughout this chapter it is assumed that an occupancy map has been built or is built during the mobile robot motion. The local semantic map is a probabilistic representation of a sector in the occupancy map as seen by the robot. The sector is defined by the robot pose, the direction of the virtual planar camera (Section 5.3.1), the opening angle θ of the sector (related to the classified camera image) and the expected maximum range LVS of the virtual sensor, VS. The horizontal covering angles {αi} = α1, α2, . . . , αn of all connected regions within this sector are calculated. A sector is illustrated in Figure 5.1.

Figure 5.1: Illustration of a sector with an opening angle θ, representing the view of the virtual sensor. Two connected regions are found (the grey rectangles) within the sector and their respective sizes are represented by α1 and α2.

The sum of the horizontal covering angles cannot exceed the value of the sector opening angle θ:

$$\sum_{i=1}^{n} \alpha_i \leq \theta \qquad (5.1)$$

The outlines of the connected regions found in the sector constitute the local map. These regions shall be assigned a probability based on the classification performed by the virtual sensor. As previously mentioned, it is assumed that


larger objects are more likely to affect the virtual sensor. The larger regions in view are therefore given a higher probability than smaller regions.

Probabilities $P_i(\text{class} \,|\, VS_{T_s}, \alpha_i)$ are assigned to the n regions in view (defined by the sector) in relation to their perceived size, measured by the horizontal covering angles $\alpha_i$, using the following expression:

$$P_i(\text{class} \,|\, VS_{T_s}, \alpha_i) = \frac{1}{2} + \frac{\alpha_i}{\theta}\left(P(\text{class} \,|\, VS_{T_s}) - \frac{1}{2}\right) \qquad (5.2)$$

where $P(\text{class} \,|\, VS_{T_s})$ is the conditional probability that a view is class when the virtual sensor classification at sensor reading $T_s$ is class. With Equation 5.2, the probabilities $P_i(\text{class} \,|\, VS_{T_s}, \alpha_i)$ are assigned a value that is within the interval 0 to 1 and proportional to the perceived region sizes and $P(\text{class} \,|\, VS_{T_s})$.

$P(\text{class} \,|\, VS_{T_s})$ is assigned different values depending on the output of the VS. The two combinations that are interesting for calculating $P(\text{class} \,|\, VS_{T_s})$, here exemplified with buildings and nonbuildings, are:

$$P(\text{class} \,|\, VS_{T_s}) = \begin{cases} P(\text{build} \,|\, VS{=}\text{build}) > 0.5 \\ P(\text{build} \,|\, VS{=}\neg\text{build}) < 0.5 \end{cases} \qquad (5.3)$$

With this setting $P_i(\text{class} \,|\, VS_{T_s}, \alpha_i)$ will always be assigned a value larger than 0.5 when the virtual sensor detects a building, and a value below 0.5 when a view is classified as a nonbuilding.

One problem in the practical use of the local semantic map is the selection of the parameters $P(\text{class} \,|\, VS_{T_s})$. In situations where the performance of the virtual sensor is well known, the parameters can be determined based on this performance. The internal relation of the parameters should then reflect the probability of false readings in order to handle these in the best way. This approach has not been investigated due to reasons described in Section 5.3.4. Instead, in the experiments described further on in this chapter, the parameters in Equation 5.3 were learned in a test run with the mobile robot.
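As a small illustration of Equations 5.2 and 5.3, the sketch below assigns a probability to each connected region in the sector. The covering angles αi are assumed to have been extracted from the occupancy grid beforehand (that step is not shown), and the parameter values are placeholders rather than the calibrated ones used in the experiments.

```python
def region_probabilities(alphas, vs_says_building, theta,
                         p_pos=0.9, p_neg=0.1):
    """Eqs. (5.2)-(5.3): probability per connected region in the sector.

    alphas: horizontal covering angles alpha_i of the regions [rad],
    vs_says_building: output of the virtual sensor for this view,
    theta: sector opening angle [rad],
    p_pos, p_neg: placeholder values for P(build|VS=build) and
                  P(build|VS=not build); in the thesis these were learned.
    """
    p_class = p_pos if vs_says_building else p_neg               # Eq. (5.3)
    return [0.5 + (a / theta) * (p_class - 0.5) for a in alphas]  # Eq. (5.2)
```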

5.2.2 Global Semantic Map

The second step deals with the global semantic map, which is updated whenever a new local semantic map is available. The standard Bayes update equation (as described in, e.g., Thrun et al. [1998]) is used to update the global semantic map with the local semantic map produced in the previous step. In the following it is shown how an individual grid cell at position (x, y) is updated based on a technique for updating occupancy grid maps.

The probability that grid cell (x, y) is occupied after $T_s$ sensor readings is denoted by $P(occ_{x,y} \,|\, s^1, s^2, \ldots, s^{T_s})$, where $s^i$ denotes a sensor reading. Assuming that the conditional probability $P(s^{(t)} \,|\, occ_{x,y})$ is independent of $P(s^{(\tau)} \,|\, occ_{x,y})$ if $t \neq \tau$ (known as the Markov property), the probability at (x, y) can be computed as:


$$P(occ_{x,y} \,|\, s^{1:T_s}) = 1 - \left(1 + \frac{P_{prior}}{1 - P_{prior}} \prod_{r=1}^{T_s} \frac{P(occ_{x,y} \,|\, s^{(r)})}{1 - P(occ_{x,y} \,|\, s^{(r)})}\,\frac{1 - P_{prior}}{P_{prior}} \right)^{-1}. \qquad (5.4)$$

A second assumption, that the prior probability for occupancy, $P_{prior}$, can be set to 0.5, simplifies Equation 5.4 to:

$$P(occ_{x,y} \,|\, s^{1:T_s}) = 1 - \left(1 + \prod_{r=1}^{T_s} \frac{P(occ_{x,y} \,|\, s^{(r)})}{1 - P(occ_{x,y} \,|\, s^{(r)})} \right)^{-1}. \qquad (5.5)$$

This equation can be rewritten as a recursive update formula. From Equation 5.5 we can write

$$\prod_{r=1}^{T_s-1} \frac{P(occ_{x,y} \,|\, s^{(r)})}{1 - P(occ_{x,y} \,|\, s^{(r)})} = \left(1 - P(occ_{x,y} \,|\, s^{1:T_s-1})\right)^{-1} - 1 = \frac{P(occ_{x,y} \,|\, s^{1:T_s-1})}{1 - P(occ_{x,y} \,|\, s^{1:T_s-1})} \qquad (5.6)$$

Substituting Equation 5.6 in Equation 5.5 results in the recursive update formula

$$P(occ_{x,y} \,|\, s^{1:T_s}) = 1 - \left(1 + \frac{P(occ_{x,y} \,|\, s^{1:T_s-1})}{1 - P(occ_{x,y} \,|\, s^{1:T_s-1})}\,\frac{P(occ_{x,y} \,|\, s^{T_s})}{1 - P(occ_{x,y} \,|\, s^{T_s})} \right)^{-1}. \qquad (5.7)$$

In our case the sensor reading $s^{T_s}$ is the output $VS^{T_s}$ from the virtual sensor at sensor reading $T_s$ and the grid cells are assigned a probability denoting whether they belong to class. Using these notations Equation 5.7 is rewritten as:

$$P(\text{class} \,|\, VS^{1:T_s}) = 1 - \left(1 + \frac{P(\text{class} \,|\, VS^{1:T_s-1})}{1 - P(\text{class} \,|\, VS^{1:T_s-1})}\,\frac{P(\text{class} \,|\, VS^{T_s})}{1 - P(\text{class} \,|\, VS^{T_s})} \right)^{-1} \qquad (5.8)$$

which is the update formula used for the grid cells of the global semantic map (the grid cell index (x, y) has been left out). The resulting global semantic map will contain three different classes:

$$\begin{cases} \text{Building} & \text{if } P(\text{class} \,|\, VS^{1:T_s}) > 0.5 \\ \text{Unknown} & \text{if } P(\text{class} \,|\, VS^{1:T_s}) = 0.5 \\ \text{Nonbuilding} & \text{if } P(\text{class} \,|\, VS^{1:T_s}) < 0.5 \end{cases} \qquad (5.9)$$

where the degree of certainty for Building is higher close to 1 and the degree of certainty for Nonbuilding is higher close to 0.
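A minimal sketch of the per-cell update of Equation 5.8 and the final labelling of Equation 5.9 follows; the function names are illustrative.

```python
def update_cell(p_prev, p_obs):
    """Eq. (5.8): recursive Bayes update of one grid cell.

    p_prev: P(class | VS^{1:Ts-1}) currently stored in the cell
            (0.5 when the cell is still unknown),
    p_obs:  P_i(class | VS^{Ts}, alpha_i) from the local semantic map.
    """
    odds = (p_prev / (1.0 - p_prev)) * (p_obs / (1.0 - p_obs))
    return 1.0 - 1.0 / (1.0 + odds)

def cell_label(p):
    """Eq. (5.9): final class of a cell in the global semantic map."""
    return "building" if p > 0.5 else "nonbuilding" if p < 0.5 else "unknown"
```

Starting from the unknown value 0.5, an observation with p_obs = 0.5 leaves the cell unchanged, while repeated observations above 0.5 drive the cell towards the building class and observations below 0.5 towards nonbuilding.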


5.3 Experiments

In this section the experiments that have been performed to validate the above presented method are described. The semantic information comes from the virtual sensor that was trained using AdaBoost. This training was performed using images with 240 pixels side length, taken from image set 1 defined in Section 4.5.1. In the evaluation of the method, images from virtual planar cameras were used as input to the virtual sensor. The virtual planar cameras are described in Section 5.3.1.

The two datasets that were used in the experiment are described in Section 5.3.2. The calibration of the parameters was based on the first of these datasets. The validation was performed using three different occupancy grid maps, see Section 5.3.3. The calibration is described in Section 5.3.4.

5.3.1 Virtual Planar Cameras

In Chapter 4, a digital camera mounted on a pan-tilt head was used to create a panoramic view with approximately 280° horizontal field-of-view. A sweep of the pan-tilt head requires a few seconds, making the data acquisition time-consuming. In order to acquire a 360° field-of-view panoramic image in one single shot, an omni-directional camera is used here. As an additional benefit, the centre of the image plane is now the same throughout the whole panoramic view.

The robot used in the experiments, a Pioneer P3-AT from ActivMedia, is fitted with an omni-directional camera. The camera is a standard consumer-grade SLR digital camera (Canon EOS350D, 8 megapixels). On top of the lens, a curved mirror from 0-360.com is mounted. Further details on the equipment are given in Section 2.2.

The camera-mirror combination produces omni-directional images that can be unwrapped into high-resolution spherical images (also referred to as panoramic images) by a polar-to-Cartesian conversion. From a large spherical image, smaller planar images are extracted, using projections, as they would appear for a regular camera. Figure 5.2 shows an example of the omni-directional image, the unwrapped image, and some planar images. For details on camera projections, see for example Hartley and Zisserman [2004].
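As an illustration of the polar-to-Cartesian unwrapping step, the sketch below samples the mirror annulus of the omni-image into a panoramic image. The mirror centre, the radial limits and the nearest-neighbour sampling are assumptions made for the example, not the calibration or interpolation used in the thesis.

```python
import numpy as np

def unwrap_omni(omni, centre, r_min, r_max, out_w=2048, out_h=512):
    """Sketch of polar-to-Cartesian unwrapping of an omni-directional image.

    omni: H x W x 3 image array, centre: (cx, cy) of the mirror in pixels,
    r_min / r_max: radial extent of the useful mirror annulus.
    Columns of the output correspond to bearing angles, rows to radii.
    """
    cx, cy = centre
    # one bearing per output column, one radius per output row
    phi = np.linspace(0.0, 2.0 * np.pi, out_w, endpoint=False)
    r = np.linspace(r_max, r_min, out_h)
    rr, pp = np.meshgrid(r, phi, indexing="ij")
    # nearest-neighbour lookup back into the omni-image
    x = np.clip((cx + rr * np.cos(pp)).round().astype(int), 0, omni.shape[1] - 1)
    y = np.clip((cy + rr * np.sin(pp)).round().astype(int), 0, omni.shape[0] - 1)
    return omni[y, x]
```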


Figure 5.2: The upper part of the figure is one of the omni-directional images used in the experiments, the middle part is an unwrapped version of the same image, and the lower part shows some planar images extracted from the unwrapped image. One can note that the omni-directional image only uses about 30% of the available 8 megapixels, so the effective number of pixels in the omni-image is about 2.4 megapixels. The unwrapped image has 2.2 megapixels.


5.3.2 Image Datasets

Two datasets are used for the experiments described in this chapter, see Table 5.1. The datasets consist of omni-images and pose information acquired by the mobile robot. Each omni-image was converted into eight planar images, where each planar image has a resolution of 320×320 pixels and covers a horizontal and vertical field-of-view of 56◦. This means that there is a small overlap between planar images generated from the same omni-image. This overlap was introduced in order to reduce the probability that information on the image border would not be taken fully into account. Examples of the planar images are given in the lower part of Figure 5.2. In total 2384 planar images were used for the experiments. The images were collected at Örebro Campus and the mobile robot trajectories are shown in Figure 5.3. Differential GPS (DGPS) and odometry were used to compute the robot poses (position and orientation) along the trajectory, see Appendix B.3. Other alternatives for pose estimation, such as SLAM (Simultaneous Localization and Mapping), could also have been used.
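Extracting the eight planar views amounts to a perspective (pinhole) projection for eight headings spaced 45◦ apart; with a 56◦ field-of-view this produces the small overlap mentioned above. The following sketch assumes an equirectangular panorama with a linear row-to-elevation mapping, which is a simplification of the real mirror geometry; all names, the vertical coverage and the placeholder image are assumptions, not values from the thesis.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def planar_view(pano, yaw_deg, fov_deg=56.0, size=320, v_fov_deg=90.0):
    """Sample one pinhole view with the given heading from a panoramic image."""
    h, w = pano.shape[:2]
    f = (size / 2.0) / np.tan(np.radians(fov_deg) / 2.0)      # focal length [pixels]
    u, v = np.meshgrid(np.arange(size) - size / 2.0 + 0.5,
                       np.arange(size) - size / 2.0 + 0.5)
    yaw = np.radians(yaw_deg)
    dx = u * np.cos(yaw) + f * np.sin(yaw)                    # ray directions rotated
    dz = -u * np.sin(yaw) + f * np.cos(yaw)                   # about the vertical axis
    dy = v                                                    # image y axis points down
    az = np.degrees(np.arctan2(dx, dz)) % 360.0
    el = np.degrees(np.arctan2(-dy, np.hypot(dx, dz)))
    cols = az / 360.0 * w                                     # (wrap at the 0/360 seam
    rows = (v_fov_deg / 2.0 - el) / v_fov_deg * h             #  is ignored in this sketch)
    chans = [map_coordinates(pano[..., c], [rows, cols], order=1, mode='nearest')
             for c in range(pano.shape[2])]
    return np.stack(chans, axis=-1)

panorama = np.zeros((560, 2240, 3))                           # placeholder panorama
views = [planar_view(panorama, yaw) for yaw in range(0, 360, 45)]   # eight views
```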

Figure 5.3: The figure shows the trajectories for the two datasets used for training (Set 1) and evaluation (Set 2) respectively. Set 1 is the right trajectory (dashed) and Set 2 is the left trajectory (solid). The starting points are marked with circles. (Aerial image ©Örebro Community Planning Office.)


Set Omni-images Planar images Length

1 88 704 146 m

2 210 1680 317 m

Table 5.1: Image datasets used for the semantic mapping experiments.

5.3.3 Occupancy Maps

The evaluation part of the semantic mapping experiment was repeated with three different occupancy maps. These maps cover parts of the area shown in Figure 5.3. The first map is a handmade occupancy map. Using the aerial image presented in Figure 5.3, the occupancy map shown in Figure 5.4 was constructed. The building outlines and groups of trees around the trajectory have been marked as filled polygons by hand and this occupancy map is binary, i.e., a grid cell is either empty or occupied.

Figure 5.4: The handmade occupancy map. The two building regions are marked with a 'B'. All other regions are nonbuildings. This map with the labels also serves as the ground truth in the evaluation.

The second map covers the last two thirds of Set 2, see Figure 5.5(a). This map is created using 2D laser readings from the SICK laser range finder, horizontally mounted in front of the robot, see Section 2.2. The resulting map is shown in Figure 5.5(b). The reason why the first third of the dataset was not used is that it was not feasible to create a consistent map from the 2D laser range data since the robot trajectory included elevated terrain.

The third map is based on 3D-scans obtained by tilting the laser range scanner. In this case only readings with a minimum height of 2 meters were considered in the occupancy map. This eliminates some cars and bushes. The resulting occupancy map, shown in Figure 5.5(c), is based on data from the first half of Set 2 (the part where 3D-data were collected).

(a) The area of the 2D-map (solid) and the 3D-map (dashed).
(b) The occupancy map based on measurements using a 2D-laser range finder.
(c) The occupancy map based on measurements using a 3D-laser range finder.

Figure 5.5: The two occupancy maps based on 2D- and 3D-laser range finders.

In the experiments, binarized occupancy grid maps were used, where either an empty grid cell (0) or an occupied grid cell (1) is represented. However, probabilistic occupancy maps could also be used directly with the suggested algorithm. Since it is assumed that the grid cells in the occupancy map are independent from the output of the virtual sensor, the values of the grid cells in the occupancy map can be seen as another sensor reading and the probabilistic semantic map is updated with an expression corresponding to Equation 5.8. If nonbinarized occupancy grid maps are used, an alternative procedure for defining objects in the sector is required.

5.3.4 Used Parameters

Table 5.2 lists the important parameters for constructing the planar camera views and the probabilistic semantic maps. These parameters can be adjusted to change the desired properties of the system, for instance:

• P(build|VS=build) versus P(build|VS=¬build) is related to the performance of the virtual sensor.

• The sector opening angle θ should be related to the planar camera field of view. By decreasing θ the result from the virtual sensor focuses on the central parts of the planar image.

In our work the last four parameters (planar camera field-of-view to grid cell size) were set according to Table 5.2. The first three parameters (the sector opening angle θ and the probabilities P(build|VS=build) and P(build|VS=¬build)) were varied in order to optimize the performance of the system. It would be preferable to relate P(build|VS=build) and P(build|VS=¬build) directly to the classification rate of the virtual sensor. However, in reality many views contain a mix of buildings and nature, which makes a proper ground truth evaluation difficult. It was therefore decided to set the parameters based on the evaluation of the performance of the complete system, including the virtual sensor and the map building algorithm. Set 1 was used for learning the first three parameters in Table 5.2 and these parameters were then evaluated using Set 2 in Section 5.4.

In total 21 combinations of different fields-of-view θ and probability pairs were evaluated in order to find a good combination of parameters. This evaluation was performed using dataset 1. The following three θ were used: 56◦, 45◦, and 30◦, and the following seven probability pairs (P(b|VS=b), P(b|VS=¬b)): (0.8, 0.3), (0.8, 0.4), (0.8, 0.45), (0.9, 0.3), (0.9, 0.4), (0.9, 0.45), and (0.95, 0.48). With θ=56◦ the field-of-view is the same as for the virtual sensor.


Parameter Value Description

P (build|VS=build) learned (> 0.5) Building probability

P (build|VS=¬build) learned (< 0.5) 1 − Nature (nonbuilding) probability

θ learned (30-56) Sector opening angle [deg]

Δ 56 Planar camera field-of-view [deg]

Npv 8 Number of planar views

L_VS 50 VS maximum range [m]

- 0.5 Grid cell size [m]

Table 5.2: Description of parameters used as settings for the planar camera views, sector size and the probability maps.

With θ=45◦ there is no overlap in the local maps belonging to the same position. Using θ=30◦ it was intended to see how the system works when only the centre of the virtual sensor's field-of-view is used and the border parts are neglected.

One can note that P(build|VS=build) was always selected to be proportionally larger than P(build|VS=¬build). There are several reasons that motivate this asymmetry. The main reason is that the virtual sensor for building classification produces substantially more false negatives than false positives (see Section 4.5.1). A second reason is that it is more common to see nature in front of buildings than vice versa. A third reason, discovered during the experiments, is that open views often were classified as nonbuildings. This affects the result since, as the robot drives along a straight road, both the forward and backward looking views were classified as nature. The problem with this was that parts of buildings close to the road were included in these sectors, giving many false updates. Accordingly, we tend to have little confidence in classifications as nonbuilding, expressed by a value of P(build|VS=¬build) close to 0.5.

For each parameter combination the following measures were calculated:

• The true positive building detection rate, ΦTP: number of cells correctly classified as building / number of building cells included in the sectors.

• The true negative detection rate, ΦTN: number of cells correctly classified as nature / number of nature cells included in the sectors.

The measures were calculated based on the final global semantic map, and the sum of the true rates (ΣT = ΦTP + ΦTN) was used as the primary selection criterion.


Test θ [◦] P (b|VS=b) P (b|VS=¬b) ΦTP [%] ΦTN [%] ΣT [%]

1 56 0.80 0.30 33.1 96.0 129.1

2 56 0.80 0.40 63.8 91.0 154.8

3 56 0.80 0.45 76.8 88.2 165.0

4 56 0.90 0.30 52.0 94.2 146.2

5 56 0.90 0.40 71.2 90.3 161.5

6 56 0.90 0.45 80.2 86.6 166.8

7 56 0.95 0.48 90.1 73.4 163.5

8 45 0.80 0.30 26.5 97.1 122.6

9 45 0.80 0.40 54.9 91.8 146.7

10 45 0.80 0.45 71.3 88.0 159.3

11 45 0.90 0.30 44.5 94.8 139.3

12 45 0.90 0.40 65.0 90.5 155.5

13 45 0.90 0.45 80.8 87.2 168.0

14 45 0.95 0.48 88.3 76.7 165.0

15 30 0.80 0.30 22.9 95.6 118.5

16 30 0.80 0.40 46.5 91.4 137.9

17 30 0.80 0.45 73.8 89.5 163.3

18 30 0.90 0.30 35.9 93.6 129.5

19 30 0.90 0.40 69.1 90.0 159.1

20 30 0.90 0.45 75.4 88.6 164.0

21 30 0.95 0.48 81.1 79.4 160.5

Table 5.3: Parameters and detection rates for Set 1. The letter b in P (b|VS=b) is short for building.

Combination 6 (θ = 56◦, P(build|VS=build) = 0.9, P(build|VS=¬build) = 0.45) and combination 13 (θ = 45◦, P(build|VS=build) = 0.9, P(build|VS=¬build) = 0.45) resulted in the highest detection rates, see Table 5.3, and hence these combinations give the lowest total false rates.
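The two measures and their sum ΣT can be computed compactly. The sketch below is an assumed implementation (array names and the ground-truth encoding are illustrative, not from the thesis); it compares the final global semantic map with a ground-truth labelling over the cells covered by the sectors, following the definitions in the bullet list above.

```python
import numpy as np

def detection_rates(semantic_map, ground_truth, covered):
    """Phi_TP, Phi_TN and their sum Sigma_T.

    semantic_map: per-cell P(building) after all updates
    ground_truth: boolean array, True where a cell belongs to a building
    covered:      boolean array, True for cells included in at least one sector
    """
    pred_building = covered & (semantic_map > 0.5)
    pred_nature = covered & (semantic_map < 0.5)
    phi_tp = pred_building[ground_truth & covered].sum() / (ground_truth & covered).sum()
    phi_tn = pred_nature[~ground_truth & covered].sum() / (~ground_truth & covered).sum()
    return phi_tp, phi_tn, phi_tp + phi_tn
```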

5.4 Result

We use Set 2 to evaluate the semantic map and present the result for the two best parameter settings (combinations 6 and 13) as found in the previous section.


5.4.1 Evaluation of the Handmade Map

The semantic map based on the handmade occupancy map and data from Set 2 (combination 6) is presented in Figure 5.6. It can be noted that most of the outlines of the connected regions were correctly labelled. Small parts that are not correct are the rightmost part of the building (marked with 'a') and a part of a grove close to the left building (marked with 'b'). At 'a' there is a conifer in front of the building and at 'b' the view shows a building behind some tree trunks. Both these problems originate from scenes including a mixture of building and vegetation.


Figure 5.6: The resulting map using dataset 2. The outlines of both building and nature regions are correct to a large extent.

Table 5.4 presents the detection rates for Set 2. The first row shows the result for combination 6 and the second row for combination 13. The evaluation was performed based on all cells in the grid map that are not equal to 0.5. The true detection rates are all equal to or higher than 96.9% and combination 6 gives a slightly better result than combination 13 (it was the other way around for Set 1).

5.4.2 Evaluation of the Laser-Based Maps

The semantic maps created using the occupancy maps obtained from the 2D and 3D laser range measurements are shown in Figures 5.7 and 5.8. From a qualitative visual inspection, one can note that the main problem in the 2D-version is that the occupancy map contains connected regions originating from


Test ΦTP [%] ΦTN [%] ΦFP [%] ΦFN [%]

Handmade (6) 98.3 98.7 1.3 1.7

Handmade (13) 96.9 98.4 1.6 3.1

Laser 2D (6) 87.1 73.1 26.9 12.9

Laser 2D (13) 88.5 72.8 27.2 11.5

Laser 3D (6) 94.1 95.0 5.0 5.9

Laser 3D (13) 96.4 97.3 2.7 3.6

Table 5.4: Results for parameter combinations 6 and 13 using dataset 2 with 3 different occupancy maps. The columns contain the true positive, the true negative, the false positive, and the false negative rates.

objects with low height, indicated by the ellipses in Figure 5.7. Since the images from the camera also include buildings behind these objects, which are mainly bushes, the output from the virtual sensor indicates buildings. These bushes have therefore been classified as buildings in the semantic map, see the marked areas in Figure 5.7. These problems do not occur in the 3D-version. In the 2D map also some single small trees have been classified as buildings where there are buildings in the background of the image. This again shows the problem of mixtures of different classes in the image, in this case between small objects in front of a building.

Quantitatively, the results in Table 5.4 also show that the input of a 2D-laser based occupancy map does not deliver as high true positive rates as the handmade map and the 3D-laser based map. The trajectories only partly cover the same area, so the results are not fully comparable. Still, it can be noted that the false positive rate is around 27% for the test using the 2D-laser based map, while it is below 5% for the test using the 3D-laser based map. This difference originates from the low vegetation mentioned above.

5.4.3 Robustness Test

To evaluate the robustness of the system, two different Monte Carlo simulations [Metropolis and Ulam, 1949] were performed using the handmade occupancy map. First, the sensitivity to changes in robot pose was tested (pose noise) and second, the dependency on variations in the detection rate of the virtual sensor was evaluated (classification noise). The uncertainty is modelled with zero mean Gaussian noise defined by the standard deviation σ for the position, σpos = 2 m, and direction, σdir = 5◦. The position uncertainty is approximately the accuracy of standard GPS. Table 5.5 shows the result for Monte Carlo simulations with 20 runs per test. The first two rows contain results after introducing additional pose uncertainty. The detection rates are


Figure 5.7: The semantic map based on the 2D-laser occupancy map. The ellipses mark bushes classified as buildings.

Figure 5.8: The semantic map based on the 3D-laser occupancy map.

slightly lower than the comparable ones presented in Table 5.4 (and repeated in Table 5.5). The total average detection rate2 has decreased from 98.1% to 96.3%. A degradation is expected when errors are introduced and in this case the pose errors resulted in only a small degradation, showing that the system has a robust behaviour.
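The two noise models can be reproduced with a simple sampling scheme. The sketch below is a generic illustration using the values stated in the text (20 runs, σpos = 2 m, σdir = 5◦, and a 5% flip rate); the map-building and evaluation functions it would feed into are only referenced in a comment and are not part of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_poses(poses, sigma_pos=2.0, sigma_dir_deg=5.0):
    """Add zero-mean Gaussian noise to robot poses given as rows (x, y, heading)."""
    noisy = poses.copy()
    noisy[:, :2] += rng.normal(0.0, sigma_pos, size=(len(poses), 2))
    noisy[:, 2] += rng.normal(0.0, np.radians(sigma_dir_deg), size=len(poses))
    return noisy

def flip_classifications(labels, flip_rate=0.05):
    """Randomly flip a fraction of the virtual sensor outputs (building <-> nature)."""
    labels = labels.copy()
    flip = rng.random(len(labels)) < flip_rate
    labels[flip] = ~labels[flip]
    return labels

# Monte Carlo evaluation (schematic): rebuild the semantic map 20 times with
# noise injected and average the detection rates, e.g.
# rates = [detection_rates(build_map(perturb_poses(poses), labels), gt, covered)
#          for _ in range(20)]
```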

The second two rows contain the result with classification noise. Here 5% of the classifications obtained from the virtual sensor were randomly changed (building to nature and vice versa). One can note that the result for building detection is close to the nominal case (average 97.0% compared to 97.6%), but

2 Average of ΦTP and ΦTN for parameter combinations 6 and 13.


that nature detection is clearly affected by the changed detection rates of the virtual sensor (average 81.7% compared to 98.5%). This indicates that the probabilities P(build|VS=build) and P(build|VS=¬build) are crucial for the system. Introducing an additional error of 5% to the nature classifications results in a considerable drop of the correct nature classifications if this additional error is not reflected in the probability P(build|VS=¬build).

To demonstrate better handling of false nature classifications, the tests with classification noise were repeated for parameter combinations 3 and 10. With the lower building probability (0.8) the ΦFP rates were halved, from 15% and 21% to 7% and 9% respectively.

Test ΦTP [%] ΦTN [%] ΦFP [%] ΦFN [%]

6 (pose noise) 95.8±3.6 97.5±1.0 2.5±1.0 4.2±3.6

13 (pose noise) 94.6±2.8 97.2±1.2 2.8±1.2 5.4±2.8

6 (classification noise) 97.6±1.2 84.7±3.7 15.3±3.7 2.4±1.2

13 (classification noise) 96.5±1.0 78.7±6.2 21.3±6.2 3.5±1.0

6 (nominal case) 98.3 98.7 1.3 1.7

13 (nominal case) 96.9 98.4 1.6 3.1

3 (classification noise) 95.8 ± 1.6 92.6 ± 4.2 7.4 ± 4.2 4.2 ± 1.6

10 (classification noise) 93.3 ± 2.5 90.5 ± 3.6 9.5 ± 3.6 6.7 ± 2.5

Table 5.5: Results for dataset 2. The first two rows show the results with pose uncertainty and the second two rows contain the result with classification noise. The results are presented with the standard deviation over 20 runs per test. The third two rows contain the nominal values taken from Table 5.4 and the fourth two rows include data for comparison with a second building probability.

5.5 Summary and Conclusions

In this chapter it was shown how a virtual sensor for pointing out buildings along the trajectory of a mobile robot can be used in the process of building a probabilistic semantic map of an outdoor environment. The presented results show that with the probabilistic mapping algorithm the uncertainty in the semantic labelling can be reduced. The method handles the wide field-of-view of the planar camera (56◦) including objects that can belong to different classes. Despite the fact that the location of the classified object in the image is not known, an almost correct semantic map is produced. The map produced was found to be very robust in the presence of pose uncertainty, both for buildings and nonbuildings. The experiments where classification noise was added


showed that a correct selection of prior probabilities is essential for the overall performance.

From the experiments described in this chapter, several benefits of using the virtual sensor with its good generalisation properties have been noted. The virtual sensor produces useful results even though:

1. The training set was quite limited. In total only 80 low resolution images with side length 240 pixels were used.

2. The resolution of the planar images (320×320 pixels) was different compared to the one used in the training phase.

3. The training was performed with images taken with a standard digital camera, but the images used during the experiment were planar images extracted from the omni-directional vision system onboard the mobile robot.

In the evaluation of the semantic map it was noted that one type of occupancy map did not produce as good results as the others. This was the occupancy grid map that was based on 2D-laser readings. Such a map, where the data have been collected in a horizontal plane close to the ground, is not optimal for finding building outlines. It is therefore recommended that objects represented by the occupancy grid map should have a minimum height, especially when the objective is to find objects of a particular height, such as buildings.

An extension to the semantic map presented here would be to refine the virtual sensor so that it can point out the parts of an image that are mainly responsible for the classification. This would further improve the separation between different objects. The problem with the separation was most evident in the case with the occupancy map based on 2D-laser, since that map included connected regions representing objects with a low height. For the handmade occupancy map and the 3D-laser-based map, where only readings above a height of 2 m were considered, this problem was not observed. To refine the virtual sensor, classification of sub-images could be used. An example of such a technique is described in [Morita et al., 2005], where small squares of the upper half of images are classified as tree, building and uniform regions. The result from the classifier was used for localization in previously visited areas.

This chapter described the construction of a probabilistic semantic map. Wide angle views were used and produced good results. The objects observed in the experiments have a certain size, which can explain why the wide angle views worked well. If classes of smaller objects should be included in a semantic map, views with a smaller field-of-view should be used in order to better localize the objects of interest. An example of such a map would be to add information from the virtual sensor for windows.


Part III

Overhead-Based Semantic Mapping


Chapter 6

Building Detection in Aerial Imagery

6.1 Introduction

Detection of man-made structures, such as buildings, roads and vehicles, in aerial or satellite images has been an active research topic for many years. Aerial images, with their highly detailed contents, are an important source of information for applications including GIS, surveillance, etc.

This chapter discusses the extraction of man-made objects (buildings) from aerial imagery and gives examples of existing systems for automatic building detection. Without aiming at an exhaustive overview, the purpose is to give a background on extracting information from aerial images and to exemplify the problems when only monocular aerial images are used. The complexity of a system handling multiple view or stereo images is beyond the scope of this work. Monocular images can be accessed more and more easily through the Internet and avoid complications that would arise from using stereo or multiple view images.

The content has been restricted to the fields that are of interest for our research, which include models, strategies, technologies and algorithms for extraction of buildings. Automatic detection of man-made structures is not yet a fully mature subject. Presented systems are often limited to certain types of images, giving a strong dependence on a specific source of input data to obtain good performance.

Extraction of man-made structures from aerial images is a difficult task for many reasons. Aerial images have a high level of structured and unstructured contents. Images differ in scale (resolution), sensor type, orientation, quality, dynamic range, light conditions and due to different weather and seasons. Buildings may have rather complicated structures and can be occluded by other buildings or vegetation. Together this makes building detection a challenging problem.


6.1.1 Outline

The chapter starts by introducing digital aerial imagery in Section 6.2. This includes aspects of image sensors, carriers, and manual feature detection, intended to facilitate the understanding of the extraction procedure and give ideas for improvement of existing implementations. Section 6.3 reviews methods for building detection in aerial images and describes the use of additional information, e.g., maps, to improve the performance. The chapter is concluded in Section 6.4.

6.2 Digital Aerial Imagery

The main information source in this work is a single digital aerial image from which buildings and other relevant structures should be extracted. Digital images can be divided in different ways, e.g., (a) panchromatic (monochromatic) and multi-chromatic, (b) low, medium and high resolution, and (c) stereo and mono.

Aerial images can be captured by cameras onboard satellites, spacecraft, aircraft and unmanned aerial vehicles (UAVs). Images may be of multi-band type including both the infrared (IR) and the visual wavebands. The use of aerial images for mapping is interesting for several reasons:

• Satellites cover and can photograph most interesting areas on the earth, even though some areas are almost always cloudy and some cities are often covered in smog.

• Satellite images are regularly updated.

• Aerial photos, for instance taken by UAVs, can give real-time coverage of an area.

6.2.1 Sensors

Our main interest is in aerial images taken with daylight cameras1. The reason for this is that daylight cameras are probably the most common type of image-based sensors. There are other image-based sensors that can provide more suitable information but these are not so common. In scientific applications sensors are often combined to enhance the operational use. Some properties of four types of sensors used for remote sensing onboard aerial vehicles are listed below:

Daylight cameras, usually monochrome or colour CCD cameras, give high resolution and high update rate. Monochrome cameras used to be most common in existing systems since they traditionally gave higher resolution, but colour cameras are more frequently used in recent systems. Drawbacks of daylight cameras are their sensitivity to light conditions and reduced visibility.

1 Daylight cameras should be understood as cameras that capture the visual wavebands.

IR cameras are often used in remote sensing tasks, for instance in detection of green vegetation. With wavelengths of typically 7–12 μm, IR sensors operate independently of the light conditions and are well suited for, e.g., search operations.

Image radar, SAR (Synthetic Aperture Radar), can give high resolution 3D images and has all-weather capacity, but low update frequency.

Lidar, light detection and ranging, uses laser to give high resolution 3D images independent of light conditions. It gives high accuracy in depth measurement, and airborne Lidar systems can cover large areas (altitudes from 1500 m up to 3500 m are possible) and are useful for distinction between buildings and the ground surface.

For frequent updating of aerial images, aircraft and UAVs are typically used as sensor carriers. The four types of image sensors listed above are available today for use onboard UAVs.

Additional information that can aid the extraction process may be provided by, for instance, elevation data, city maps, and GIS. Airborne laser range scanners have been used to build accurate 3D-models [Söderman et al., 2004], and SAR images, multi-spectral images, and high resolution satellite images have been used in building detection [Tupin and Roux, 2003, Xiao et al., 1998].

6.2.2 Resolution

The resolution of aerial and satellite images is often divided into 'low', 'medium', and 'high'. For satellite images the resolution in meters per pixel is typically 30 m for low resolution, 10 m for medium resolution and 1 m for high resolution. For aerial images, high resolution is below 0.2 m and low resolution is above 1 m pixel size. These numbers differ in the literature but the values given above indicate their normal sizes.

6.2.3 Manual Feature Extraction

When identifying features in an aerial image one has to consider that the photograph is taken from above, with the result that the objects do not look familiar. Below, a number of factors [US Army, 2001] are listed that can be considered during the recognition phase. For identification of an object several of these factors have to be used.


Size The dimension of an object is important in the identification process. The size, either measured from the image scaling or by comparison with known objects in the image, can help to classify different types of objects.

Shape Shape can be used to separate man-made structures from natural ones. Man-made structures often have straight or smooth curved lines while natural structures are typically more irregular. For example, compare a canal and a river.

Shadows Shadows can be very helpful since they show a familiar view of the object, e.g., an antenna tower may be hard to detect in an image but its shadow is revealing. Shadows can also be used in height estimation of objects.

Shade Shade refers to the grey tone or texture (surface) of objects as they appear in the image and can, e.g., be used to find areas with surfacing in an even tone, such as asphalt.

Site An object can be recognized by its location in the image and its relation to surrounding objects. For example, the spacing between objects can reveal man-made influence in a natural environment.

From available papers on automatic extraction, shape, as defined by edges, seems to be the most popular feature in building detection. Techniques for use of shading are difficult to apply to real images [Lin, 1996].

6.3 Automatic Building Detection in Aerial Images

A survey by Mayer [Mayer, 1999] focuses on extraction of buildings. Seven systems developed between 1984 and 1998 were assessed according to a number of criteria. The survey concentrates on models and strategies and concludes that scale, context and 3D structure were the three most important features to consider for object extraction in aerial images.

In the following subsections a few systems that represent different approaches to automatic building detection are presented. They are divided into three groups based on the type of information that is used. The first group deals with systems that use monocular images (2D), the second group consists of systems with access to elevation information (3D), and the third group makes use of maps or GIS for the detection of buildings in the aerial images.

6.3.1 Using 2D Information

A basic system for detection of buildings in monocular images taken from a nadir2 view may include i) edge detection in a greyscale image, ii) line determination, and iii) search for buildings represented as ortho-polygonal lines [Cardoso, 1999].

2 An image taken from a nadir view is taken from a zenith position. The opposite is an oblique view where the image is taken at an angle.

In [Persson et al., 2005] we describe a system for automatic detection of buildings in aerial images, also taken from a nadir view. Our system builds two types of independent hypotheses based on the image content. A colour based segmentation process implemented with ESOM, an Ensemble of Self Organizing Maps, is trained and used to create a segmented image showing different types of roofs, vegetation and sea. A second type of hypotheses is based on an edge image produced from the aerial photo. Here, a line extraction process uses the edge image as input to find straight line segments that represent edges. From these edges, corners and rectangles that represent buildings are constructed. A classification process uses the information from both hypotheses to determine whether the rectangles belong to buildings, unsure buildings or unknown objects.

In the PhD thesis by C. Lin [Lin, 1996], generic 3D rectilinear models are used to model building parts. The system uses both wall and shadow information to verify hypotheses about modelled buildings. In a nadir image the shadow information is preferable to use while walls are hardly detectable, and for an oblique image the walls are more easily found while the shadows may be hard to use. The image angles therefore control how the verification process of the hypotheses weights the extracted information. Edges are identified with a direction, where the direction is dependent on the brightness on the respective side of the edge. In this way, parallel edges belonging to the same object can be connected. Shadows are detected by their darker appearance. Lin's system works well on images of scale models while real images pose problems due to the existence of vegetation, roads, and parking lots. The system for building detection from aerial images has been further improved with a user interface for interaction with the automatic system [Nevatia et al., 1997].

Fuzzy techniques have also been used in building detection. FuzzBuRS [Levitt and Aghdasi, 2000], the Fuzzy Building Recognition System, uses fuzzy variables to describe average region intensity, region size, average edge lengths and building likelihood. Roofs are often split into two halves due to different intensity, and the system is limited to straight lines (as with many other systems), meaning that only polygon shaped buildings can be detected. The benefits of the system, compared with the authors' previous system working on crisp values, BuRS, are that the new system has a more compact representation and is easily understandable.

6.3.2 Using 3D Information

Matthieu Cord et al. presented a method for extraction and modelling of urban buildings [Cord et al., 1999]. By use of high resolution stereo images a Digital Elevation Model (DEM3) is created. The elevation information is then used for classification of the global scene into buildings, ground surface and vegetation. This is followed by detailed modelling of the buildings. The authors believe that altitude is one of the most important sources of information for building detection. This information is necessary to separate buildings from other man-made structures, e.g., parking lots.

Many other authors also use depth information. In [Guo et al., 2001] the depth information is combined with a learning-based second step that corrects false positive detections of buildings. The authors use depth, colour, brightness, texture, and boundary energy4 in a tree classifier. The depth classifier filters out ground objects and low vegetation. The colour and texture classifier filters out vegetation that has the same height as the buildings. Finally, a gradient field, based on a combination of image intensity and depth, is used to determine the size and orientation of the buildings.

6.3.3 Using Maps or GIS

Methods that use maps or GIS in the extraction process are interesting from our perspective since these have similarities to the approaches presented in Chapters 7 and 8. One example of such a method is the use of knowledge represented in digital topographic databases for improvement of automated image analysis regarding extraction of settlement areas [Schilling and Vögtle, 1997]. With this knowledge a model-driven top-down approach can be integrated into the commonly data-driven bottom-up process of satellite image analysis. Objects of the same class stored in the database are used to learn features in the satellite image, under the assumption that the majority of the objects are correct. These features are then used in a classification process for extraction of settlement areas.

Carroll presents a system for change detection, which is used in an application for information updating [Carroll, 2002]. The presented prototype, HouseDiff, combines GIS and edge detection. An edge detection method based on deformable contours (snakes) is used in the system to find the outline of new buildings. Such a method overcomes limitations due to the assumption of rectilinearity in previous works.

In another approach 3D building hypotheses of dense urban areas are generated using scanned maps5 and aerial images [Roux and Maître, 1997]. The maps are analysed in order to obtain a structural description of the scene. This information is then used for the analysis of a disparity image generated from a stereo pair of aerial images. Histograms of the disparity image are calculated and processed. Buildings are extracted under the assumption that the disparity of one building is included in only one histogram mode. According to the authors, other approaches (not using maps) have been shown to perform well for sparse buildings, but not in dense urban areas. Two reasons that may explain this are the high image complexity and that there may be several different interpretations of the same object in the scene.

3 DEM include trees, buildings, the ground surface etc.
4 The boundary energy is used to find an initial size and orientation of a rectangle template model. The best model is the one that maximizes the boundary energy, measured as the average gradient magnitude along model boundaries.
5 Maps scanned by using an image scanner.

6.4 Summary and Conclusions

Automatic detection of man-made structures is not yet a fully mature subject. Developed systems are often limited to certain types of images, giving a strong dependence on the type of input data for good performance. Many systems in this field use line and edge detection, form hypotheses, and connect them to 3D models for verification, while colour and texture are not used so often to extract features for the detection phase. Table 6.1 gives a summary of the systems discussed in this chapter.

This chapter has pointed out several important things. First, to be able to truly distinguish buildings from other man-made objects, information about the elevation of the area in the image is needed, which is why a number of the systems use Digital Elevation Models (DEM). The required elevation information can be obtained by airborne mounted sensors such as laser, radar, stereo vision or by systems operating on the ground, e.g., an outdoor mobile robot. Second, multiple view images give different aspect angles which can help an automated system in the detection phase. Third, an experience from our work [Persson et al., 2005] is that colour models of roofs in some cases overlap with other surfaces on the ground in the absence of elevation data. For instance, roads can be mixed up with gray roofs and tennis courts have the same colour as the red roofs.


References | Type of images | Comments
[Lin, 1996], [Nevatia et al., 1997] | Nadir or oblique, monocular | 3D-models, works on model-board images, problems with real images
[Levitt and Aghdasi, 2000] | Nadir, monocular, gray scale | 2D-models, fuzzy approach
[Cord et al., 1999] | Nadir, stereo | 3D-models, extracted from DEM built by the system
[Guo et al., 2001] | Nadir, stereo, colour | Depth, colour and brightness, texture, and boundary energy are used
[Carroll, 2002] | Nadir, monocular, colour | 2D-models, searching for differences (HouseDiff)
[Schilling and Vögtle, 1997] | Nadir, monocular | Combination of GIS and satellite images
[Roux and Maître, 1997] | Nadir, stereo | Combination of GIS in the form of maps and aerial images, generates 3D building hypotheses
[Cardoso, 1999] | Nadir, monocular, gray scale | Extracts 2D building estimates, uses shape in the form of edge detection
[Persson et al., 2005] | Nadir, monocular, colour | Extracts 2D building estimates, uses both colour and shape, assumption of rectangular buildings

Table 6.1: Summary of building detection systems.


Chapter 7

Local Segmentation of Aerial Images

This chapter investigates the use of monocular aerial images to extend the sensory range of a mobile robot for outdoor mapping. The suggested method relates an aerial image to ground-level information using building outlines (wall estimates). This approach addresses two difficulties simultaneously:

1. buildings are hard to detect in monocular aerial images without elevation data, and

2. the limitation that only those parts of the environment that are in line-of-sight of the sensors onboard the mobile robot can be perceived.

It is shown how wall estimates found by a mobile robot can compensate for the absence of elevation data in segmentation of aerial images. A virtual sensor for building detection mounted on a mobile robot is used in combination with an occupancy map to obtain wall estimates from a ground perspective. These wall estimates are matched with edges detected in an aerial image. The result is used to direct a region- and boundary-based segmentation algorithm for building detection in the aerial image. Experiments demonstrate that the ground-level based wall estimates can focus the segmentation of the aerial image on buildings and that a semantic map, which covers a larger area than the onboard sensors, can be built along the robot trajectory.

7.1 Introduction

A mobile robot has a limited view of its environment. Mapping of the operational area is one way of enhancing this view for visited locations. In this chapter the possibility of using information extracted from aerial images to further improve the mapping process is explored. Semantic information about buildings is used as the link between ground level information and the aerial image. The method can speed up exploration or planning in areas not yet visited by the robot.


Colour image segmentation can be used to extract information about buildings from an aerial image. For directed image segmentation it is necessary to have an input that can point out the image regions (samples) used to train the segmentation algorithm. Examples of segmentation with manually selected samples used for building detection are given in [Dogruer et al., 2007] and in Chapter 6 ([Persson et al., 2005]).

The method presented in this chapter replaces these manually picked training samples with samples directed by the mobile robot. The virtual sensor for building detection described in Chapter 4 is used to determine which parts of an occupancy map belong to a building (wall estimate), resulting in the semantic map described in Chapter 5. Matching of wall estimates found in the semantic map with edges detected in an aerial image, followed by colour segmentation, is utilized to find building hypotheses. The matching is possible since geo-referenced aerial images are used and an absolute positioning system is installed onboard the robot. The matched lines are then used in region- and boundary-based segmentation of the aerial image for detection of buildings. The purpose is to detect building outlines faster than the mobile robot can explore the area by itself. Using this method the robot can estimate the size of found building regions without actually driving around the buildings. The method does not assume a perfectly up to date aerial image, in the sense that buildings may exist although they are not present in the aerial image, and vice versa. It is therefore possible to use globally available1 geo-referenced images.

7.1.1 Outline and Overview

In Section 7.2 a presentation of work related to the use of aerial images in mobile robotics is given. The description of the proposed system is divided into three main parts. The first part, Section 7.3, concerns the estimation of walls by the mobile robot and edge detection in the aerial image. At ground level, wall estimates are extracted from the probabilistic semantic map described in Chapter 5. This map is basically an occupancy map built from range data and labelled using a virtual sensor for building detection (Chapter 4) mounted on the mobile robot. The second part, Section 7.4, describes matching of wall estimates from the mobile robot with the edges found in the aerial image. To determine potential matches between the wall estimates and the roof outlines, geo-referenced aerial images are used and the mobile robot has an onboard absolute positioning system (GPS). The third part, Section 7.5, presents the segmentation of an aerial image based on the matched lines. The matched lines are used in region- and boundary-based segmentation of the aerial image for detection of buildings. This segmentation will be referred to as the local segmentation as opposed to the global segmentation presented in Chapter 8. Descriptions of the experiments performed and the results obtained are found in

1 E.g. Google Earth, Microsoft Virtual Earth, satellite images from IKONOS and its successors.


Section 7.6. Finally, the chapter is concluded and suggestions for future work are given in Section 7.7.

7.2 Related Work

Overhead images have been used in combination with ground vehicles in a number of applications. Oh et al. used map data to bias a robot motion model in a Bayesian filter to areas with higher probability of robot presence [Oh et al., 2004]. It was assumed that probable paths were known in the map. Since mobile robot trajectories are more likely to follow these paths in the map, GPS position errors due to reflections from buildings were compensated using the map priors.

Pictorial information such as aerial photos and city-maps has been used for registration of sub-maps and subsequent loop-closing in SLAM [Chen and Wang, 2006]. Aerial images were used by Früh and Zakhor in Monte Carlo localization of a truck during urban 3D modelling [Früh and Zakhor, 2004].

A method to detect building outlines, also without elevation data, is to fuse SAR (Synthetic Aperture Radar) images and aerial images [Tupin and Roux, 2003]. The building location was established in the overhead SAR image, where walls from one side of buildings can be detected through double reflections on the ground and a wall. The complete building outline was then found using edge detection in the aerial image. Parallel and perpendicular edges were considered and the method belongs to edge-only segmentation approaches. This work is similar to the work presented in this chapter in the sense that it uses a partly found building outline to segment a building from an aerial image.

Combination of edge and region information for segmentation of aerial images has been suggested in several publications. Two papers from which this work took inspiration are [Mueller et al., 2004] and [Freixenet et al., 2002]. Mueller et al. presented a method to detect agricultural fields in satellite images. First, the most relevant edges were detected. These were then used to guide both the smoothing of the image and the following segmentation in the form of region growing. Freixenet et al. investigated different methods for integrating region- and boundary-based segmentation, and also claim that this combination is the best approach for image segmentation.

Commonly used colour image segmentation methods are reviewed in [Cheng et al., 2001]. Concerning the choice of colour space for segmentation, the authors point out that no single colour space surpasses others for all types of images.

From the above discussed references and the references presented in Section 8.2, it can be concluded that aerial images contain information that is useful for mobile robots. For detection of building outlines in aerial images, edge information is the most used feature, which is also confirmed by the references


presented in Section 6.3. All together there is considerable motivation for the approach presented in this chapter.

7.3 Wall Candidates

A major problem for building detection in aerial images is to decide which of the edges in the aerial image correspond to building outlines. The idea of our approach is to match wall estimates extracted from two perspectives in order to increase the probability that a correct segmentation is achieved. In this section the process of extracting wall candidates is described, first from the mobile robot's perspective at ground level and then from aerial images.

7.3.1 Wall Candidates from Ground Perspective

The wall candidates from the ground perspective are extracted from a semantic map acquired by a mobile robot as described in Chapter 5. In Chapter 5, three different occupancy maps (handmade, 2D laser data and 3D laser data) were used to create probabilistic semantic maps with two classes: buildings and nonbuildings. In this chapter, the probabilistic semantic map based on the occupancy grid map built with 2D laser data is used since it is the map built from robot measurements that covers most buildings. This probabilistic semantic map is presented in Figure 7.1.

The lines representing probable building outlines are extracted from the probabilistic semantic map using the same implementation [Kovesi, 2000] as previously used in Section 4.2. The implementation of the line extraction algorithm and the used parameter setting is described in Appendix B.1. The extracted lines representing wall estimates are given in Figure 7.2, which also shows the robot trajectory where data for the probabilistic semantic map were collected.

7.3.2 Wall Candidates in Aerial Images

Edges extracted from an aerial image taken from a nadir view are used as potential building outlines. The edge image is a binary image from which straight lines are extracted to be used as wall candidates in the matching process described in Section 7.4.

Two common categories of colour edge detection methods are output fusion methods and multi-dimensional gradient methods [Ruzon and Tomasi, 1999]. In colour edge detection with output fusion, edge detection is performed separately on the three components of the colour space used and the resulting edges are fused. In multi-dimensional gradient methods, the gradients from the three components are fused and then the edges are defined. Here, the first alternative with the edge detection performed separately on the three RGB-components


Figure 7.1: The probabilistic semantic map used in the experiments. White cells denote high probability of walls and dark cells show outlines of nonbuilding entities.

Figure 7.2: The trajectory of the mobile robot (dashed), the ground level wall estimates (solid) and the aerial image used (©Örebro Community Planning Office). The semantic map in Figure 7.1 covers the upper left part of this figure.


using Canny's edge detector [Canny, 1986] is applied. The resulting edge image Ie is calculated by fusing the binary images obtained for the three colour components with a logical OR-function. Finally, a thinning operation2 is performed to remove points that occur when edges appear slightly shifted in the different components. For line extraction in Ie the same implementation and parameters as in Section 7.3.1 were used. The lines extracted from the edges detected in the aerial image in Figure 7.2 are shown in Figure 7.3.
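A minimal sketch of this output-fusion edge detection is given below, using scikit-image as a stand-in (the thesis implementation used Matlab, cf. the footnote on the thinning operation; the smoothing parameter here is illustrative):

```python
import numpy as np
from skimage import feature, morphology

def colour_edges(rgb, sigma=2.0):
    """Canny edge detection run separately on the R, G and B channels,
    fused with a logical OR and thinned to one-pixel-wide edges."""
    edges = np.zeros(rgb.shape[:2], dtype=bool)
    for c in range(3):
        edges |= feature.canny(rgb[..., c].astype(float), sigma=sigma)
    return morphology.thin(edges)   # plays the role of bwmorph(im,'thin',Inf)
```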

Figure 7.3: The lines extracted from the edge version of the aerial image.

The colour edge detection method is used because it finds more edge points than gray scale edge detection. This is because edges on the border between areas that have different colours but similar intensity are not detected in gray scale versions of the same image. In a test where the two methods had the same segmentation parameters, the colour version produced 19% more edge points, resulting in 17% more detected lines for an aerial image of size ca. 800×1300 pixels (400×650 m). Figure 7.4 gives a close-up from that test to show an example of the differences. The calculation time of the colour edge detection is slightly more than three times longer than ordinary gray scale edge detection. This time is still small in comparison to the routines used for detecting lines in the edge images (for the same example as above: ca. 20%).

2 The Matlab command bwmorph(im,'thin',Inf) was used.



Figure 7.4: Gray scale (b) and colour edge detection (c) in an aerial image (a). In the top, the colour version finds additional edges where light green vegetation meets light gray ground, and in the lower part edges are found around the green football field where the grass meets the red running tracks.

7.4 Matching Wall Candidates

The purpose of the wall matching step is to relate wall estimates, obtained at ground level with the mobile robot, to the edges detected in the aerial image. All wall estimates are represented as line segments. A wall estimate found by the mobile robot is denoted Lg (g indicates ground level) and the N lines representing the edges found in the aerial image are denoted L^i_a with i ∈ {1, ..., N} (a indicates aerial). Both line types are geo-referenced in the same Cartesian coordinate system.

The lines from both the aerial image and the semantic map may be erroneous, especially concerning the line endpoints, due to occlusion, errors in the semantic map, different sensor coverage, etc. A measure for line-to-line distances that can handle partially occluded lines is therefore needed. Hence, the length of the lines is not considered and line matching is based only on the line directions and the distance between two characteristic points, one point on each line. The line matching calculations are performed in two steps described below: determination of the two characteristic points and computation of the distance measure to find the best matches.

7.4.1 Characteristic Points

In this section it is described how the characteristic points on the two compared lines are determined. For Lg the line midpoint, Pg, is used. To cope with the possible errors described above, the point Pa on L^i_a that is closest to Pg is selected as the best candidate to be used in our line distance measure.

To calculate Pa, let en be the orthogonal line to Lia that intersects Lg in

Pg, see Figure 7.5. The intersection between en and Lia is denoted as φ where

Page 118: Semantic Mapping using Virtual Sensors and Fusion of Aerial …135976/FULLTEXT01.pdf · 2009-01-28 · Persson, Martin (2008). Semantic Mapping using Virtual Sensors and Fusion of

118 CHAPTER 7. LOCAL SEGMENTATION OF AERIAL IMAGES

Figure 7.5: Selection of characteristic points for the computation of the distance measurebetween two lines. The figure shows the line Lg (ground level wall candidate) with itsmidpoint Pg , the line Li

a (aerial image wall candidate), and the normal to Lia, en. To the

left, Pa = φ since φ is on Lia. To the right, Pa is the endpoint of Li

a since φ is not on Lia.

φ = en × Lia (using homogeneous coordinates). The intersection φ may be

outside the line segment Lia, see right part of Figure 7.5. It should therefore be

checked if φ is within the endpoints and if it is, set Pa = φ. If φ is not withinthe endpoints, then Pa is set to the closest endpoint on Li

a.

Pa =

{φ ∀φ ∈ [mmin, mmax]Φ ∀φ /∈ [mmin, mmax]

(7.1)

where

Φ =

{Li

a1if ‖Li

a1− φ‖ < ‖Li

a2− φ‖

Lia2

if ‖Lia1− φ‖ > ‖Li

a2− φ‖

(7.2)

and Lia1

and Lia2

are the endpoints of Lia, ‖ · ‖ is the Euclidean distance, and

mmin =

[min(Li

ax1, Li

ax2)

min(Liay1

, Liay2

)

](7.3)

mmax =

[max(Li

ax1, Li

ax2)

max(Liay1

, Liay2

)

]. (7.4)

Liaxj

and Liayj

denote the x- and y-coordinate of the endpoints of Lia.
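As a small illustration of Equations 7.1-7.4, the sketch below computes the characteristic point for a segment given by its two endpoints. It is a plain orthogonal projection clamped to the segment, not the thesis implementation, and the names are illustrative.

```python
import numpy as np

def characteristic_point(P_g, a1, a2):
    """Return P_a on the aerial segment (a1, a2) that is closest to P_g.

    phi is the intersection of the normal through P_g with the supporting
    line of the segment; if phi falls outside the endpoints, the closest
    endpoint is returned instead (Equations 7.1 and 7.2).
    """
    P_g, a1, a2 = (np.asarray(v, dtype=float) for v in (P_g, a1, a2))
    d = a2 - a1
    t = np.dot(P_g - a1, d) / np.dot(d, d)   # position of phi along the segment
    phi = a1 + t * d
    if 0.0 <= t <= 1.0:                      # phi within [m_min, m_max]
        return phi
    return a1 if np.linalg.norm(a1 - phi) < np.linalg.norm(a2 - phi) else a2
```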

7.4.2 Distance Measure

The calculation of the distance measure is inspired by [Guerrero and Sagüés, 2003], which describes geometric line matching in images for stereo matching. The complexity in these calculations has been reduced by exclusion of the line lengths, which also results in fewer parameters that need to be determined. Matching is performed using $L_g$'s midpoint $P_g$, the closest point $P_a$ on $L_a^i$ and the line directions $\theta_g$ and $\theta_a$. First, a difference vector is calculated as

$$
r_\Delta = \left[ P_{g_x} - P_{a_x},\; P_{g_y} - P_{a_y},\; \theta_g - \theta_a \right]^T.
\tag{7.5}
$$

Second, the similarity is measured as the Mahalanobis distance


$$
d = r_\Delta^T R^{-1} r_\Delta
\tag{7.6}
$$

where the diagonal covariance matrix $R$ is defined as

$$
R =
\begin{bmatrix}
\sigma_{R_x}^2 & 0 & 0 \\
0 & \sigma_{R_y}^2 & 0 \\
0 & 0 & \sigma_{R_\theta}^2
\end{bmatrix}
\tag{7.7}
$$

with $\sigma_{R_x}$, $\sigma_{R_y}$, and $\sigma_{R_\theta}$ being the expected standard deviations of the errors between the ground-based and aerial-based wall estimates. Only the relation between the parameters in Equation 7.7 influences the line matching. The important relation is $\sigma_{R_\theta}^2 / \sigma_{R_x}^2$, and usually $\sigma_{R_x}^2 = \sigma_{R_y}^2$ for symmetry reasons. Note that the distance measure is not strictly a metric in a mathematical sense, due to the non-symmetric method for selecting characteristic points.
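A compact sketch of this distance measure is given below, using the parameter values of Section 7.6.2 as defaults. The function name, and the assumption that both angles are expressed in a common range, are illustrative.

```python
import numpy as np

def match_distance(P_g, theta_g, P_a, theta_a,
                   sigma_x=1.0, sigma_y=1.0, sigma_theta=0.2):
    """Mahalanobis distance (Equation 7.6) between a ground-level wall
    estimate and an aerial-image line candidate."""
    # Difference vector r_delta from Equation 7.5.
    r = np.array([P_g[0] - P_a[0], P_g[1] - P_a[1], theta_g - theta_a])
    # Inverse of the diagonal covariance matrix R from Equation 7.7.
    R_inv = np.diag([1.0 / sigma_x**2, 1.0 / sigma_y**2, 1.0 / sigma_theta**2])
    return float(r @ R_inv @ r)
```

The best aerial-image candidate for a given ground-level line is then simply the line with the smallest distance $d$.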

7.5 Local Segmentation of Aerial Images

This section describes how local segmentation of the colour aerial image is performed. Generally, segmentation methods can be divided into two groups: edge-based and similarity-based [Gonzales and Woods, 2002]. In our case these approaches are combined by first performing edge based segmentation for detection of closed areas and then colour segmentation based on a small training area to confirm the area's homogeneity. Figure 7.6 gives a short description of the sequence that is performed for each line $L_g$. The process is stopped either when a region has been found or when all lines in $L_a$ that are close enough to the present line in $L_g$ to be considered have been checked. The "close enough" criterion can be implemented using the Euclidean distance between the characteristic points $P_g$ and $P_a$ defined in Section 7.4.1. However, in the current implementation this criterion was not activated during the experiment, in order to be able to study whether additional regions were found.

7.5.1 Edge Controlled Segmentation

Based on the edge image Ie constructed from the aerial image, a closed area that is limited by edges is searched for. Since there might be gaps in the edges, small gaps need to be found and filled [Mueller et al., 2004]. Morphological operations are used to first dilate the edge image in order to close gaps and then search for a closed area on the side of the matched line that is opposite to the mobile robot. When this area has been found it is dilated in order to compensate for the previous dilation of the edge image. This procedure is further described by Figures 7.7 and 7.8.



1. Sort the set of lines La based on d from Equation 7.6 in increasing order and set i = 0.

2. Set i = i + 1.

3. Define a start area Astart on the side of $L_a^i$ that is opposite to the robot.

4. Check if Astart includes edge points (parts of edges in Ie). If yes, return to step 2. This check ensures that a region has a minimum width and depth.

5. Perform edge controlled segmentation, see Section 7.5.1.

6. Perform homogeneity test, see Section 7.5.2.

Figure 7.6: Description of the local segmentation process.

1. Initialize a starting area, Astart

2. Dilate the edge map

3. Find the closed area Asmall that includes the part of Astart that is free from edge pixels

4. Calculate Afinal as the dilation of Asmall

Figure 7.7: Edge-based algorithm for finding closed areas and filling in small gaps in the edges.
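A rough sketch of the algorithm in Figure 7.7 is given below, assuming the binary edge image Ie and the start area Astart are boolean arrays of equal size. Connected-component labelling is used here as a simple stand-in for the closed-area search; the structuring-element size follows Section 7.6.2, everything else is an assumption.

```python
import numpy as np
from scipy import ndimage

def edge_controlled_segmentation(Ie, A_start, selem_size=3):
    """Grow the closed, edge-free area Afinal that contains Astart."""
    selem = np.ones((selem_size, selem_size), dtype=bool)
    dilated_edges = ndimage.binary_dilation(Ie, structure=selem)  # close small gaps
    free = ~dilated_edges                                         # pixels not blocked by edges
    labels, _ = ndimage.label(free)                               # connected edge-free regions
    seed_labels = np.unique(labels[A_start & free])               # regions touched by Astart
    seed_labels = seed_labels[seed_labels > 0]
    if seed_labels.size == 0:
        return np.zeros_like(Ie, dtype=bool)                      # start area blocked by edges
    A_small = np.isin(labels, seed_labels)
    # Dilate back to compensate for the initial dilation of the edge image.
    return ndimage.binary_dilation(A_small, structure=selem)
```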

7.5.2 Homogeneity Test

The initial starting area Astart is used as a training sample for a colour model and the rest of the region is evaluated based on this colour model. This means that the colour model does not gradually adapt to the growing region, but instead requires homogeneity over the complete region that is under investigation. Regions that gradually change colour or intensity, such as curved roofs, might then be partly rejected.

There are different approaches to represent colour models. One approach that is popular for colour segmentation is a Gaussian Mixture Model (GMM). Like Dahlkamp et al. [Dahlkamp et al., 2006] we tested both a GMM and a model described by the mean and the covariance matrix in RGB colour space. The mean/covariance model was selected since it is faster and it was noted that it performs approximately as well as the GMM in our case. A limit Olim is calculated for each model so that 95% of the training sample pixels (i.e. pixels in Astart) have a Mahalanobis distance smaller than Olim. Olim is then used as the separator limit between pixels belonging to the class and the pixels that do not belong to the class.

Figure 7.8: Illustration of edge controlled segmentation. a) shows a small part of Ie and Astart. In b) Ie has been dilated and in c) Asmall has been found. Finally, d) shows Afinal as the dilation of Asmall.
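A minimal sketch of fitting such a mean/covariance colour model and deriving the limit Olim is shown below. The training pixels are assumed to be given as an (N, 3) RGB array, and squared Mahalanobis distances are used purely as an implementation convenience (thresholding is equivalent).

```python
import numpy as np

def fit_colour_model(train_pixels, inlier_fraction=0.95):
    """Fit a mean/covariance model in RGB space and the limit Olim such that
    95% of the training pixels fall below it."""
    mean = train_pixels.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(train_pixels, rowvar=False))
    diff = train_pixels - mean
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared Mahalanobis distances
    o_lim = np.quantile(d2, inlier_fraction)            # separator limit Olim
    return mean, cov_inv, o_lim

def accepted_fraction(region_pixels, mean, cov_inv, o_lim):
    """Fraction of region pixels that the colour model accepts (homogeneity test)."""
    diff = region_pixels - mean
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return np.mean(d2 <= o_lim)
```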

7.5.3 Alternative Methods

Above, a two step segmentation method to detect homogeneous regions surrounded by edges was presented. There exist a number of different segmentation methods that could have been applied instead. Two such methods are discussed in the following. The conclusions presented below are based on preliminary tests performed on the aerial image used in our experiment. For these tests, the parameters used in the respective algorithms were tuned manually.

The first method tested is a graph-based image segmentation (GBIS) [Felzenszwalb and Huttenlocher, 2004]. GBIS can adapt to the texture and can be set to reject small areas and therefore ignore small-sized disturbances such as shadows from chimneys. For this reason GBIS tends to produce very homogeneous results. A drawback is that GBIS has a tendency to leak and continue to grow outside areas that humans would consider to be closed. Therefore, GBIS does not seem to be an option to replace both steps in our two step method, but it is an alternative to the homogeneity test. In conjunction with the edge controlled segmentation it turns out that GBIS produces similar segmentation results to the mean/covariance model.

The second method tested is a modified flood fill algorithm. The algorithm takes starting pixels from Astart and performs region growing limited by colour difference to the starting pixels and local gradient information. Let $C$ be the mean value vector (RGB) of the starting pixels, $P_i$ any pixel that has been selected to be inside the region and $P_n$ a neighbouring pixel (4-connected with $P_i$) to be tested to see whether it should be included in the region. For each $P_n$ a local value $g_{loc}$ is calculated as

$$
g_{loc} = e^{-\sum_{j=r,g,b} (P_n(j) - C(j))^2 / \sigma_{col}^2} \; e^{-\sum_{j=r,g,b} (P_n(j) - P_i(j))^2 / \sigma_{grad}^2}.
\tag{7.8}
$$


The value of $g_{loc}$ is then compared to a threshold to see if $P_n$ should be included in the region or not. Due to the use of the local gradient this algorithm performs as well as the mean/covariance model, both as a replacement for the two steps and when it is used only for the homogeneity check. This modified flood fill algorithm can also leak, like GBIS, but only to areas with colours similar to Astart, since $C$ only depends on the starting pixels.
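A small sketch of the local value in Equation 7.8 is given below; the values of sigma_col, sigma_grad and the acceptance threshold are not specified in the text and are therefore placeholders.

```python
import numpy as np

def g_loc(P_n, P_i, C, sigma_col=30.0, sigma_grad=10.0):
    """Score a candidate pixel P_n against the seed colour C and the already
    accepted neighbour P_i (all RGB triplets), following Equation 7.8."""
    P_n, P_i, C = (np.asarray(v, dtype=float) for v in (P_n, P_i, C))
    colour_term = np.exp(-np.sum((P_n - C) ** 2) / sigma_col ** 2)
    gradient_term = np.exp(-np.sum((P_n - P_i) ** 2) / sigma_grad ** 2)
    return colour_term * gradient_term

# The pixel is added to the region when g_loc exceeds a chosen threshold,
# e.g. g_loc(...) > 0.5, and region growing then continues from it.
```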

7.6 Experiments

7.6.1 Data Collection

Data were collected with the mobile robot Tjorven, equipped with differential GPS, a horizontally mounted laser range scanner, cameras and odometry (Section 2.2 gives more details). The robot is equipped with two different types of cameras, an ordinary camera mounted on a PT-head and an omni-directional camera. Here, the omni-directional camera is used. From each omni-image eight planar images (every 45°), with a horizontal and vertical field-of-view of 56°, were computed. These planar images are the input to the virtual sensor. The images were taken approximately every 1.5 m along the robot trajectory and were stored together with the corresponding robot pose. The trajectory of the mobile robot is shown in Figure 7.2. Since the ground where the robot was driven during the experiment is mainly flat, inertial sensors were not needed. This can be confirmed by visual inspection of the resulting occupancy map in Figure 7.9.

Figure 7.9: Occupancy map used to build the semantic map presented in Figure 7.1.



7.6.2 Tests of Local Segmentation

The occupancy map shown in Figure 7.9 was used for the experiment. This map was built from data measured by the laser range scanner (with 180 degrees field of view) and position data obtained from fusion of odometry and DGPS, see Appendix B.3. The grid cell size was 0.5 m, the range of the data was limited to 40 m and the map was built using the known poses and a standard Bayes update equation as described in [Thrun et al., 1998]. Even though this 2D map works well in our experiments (with exception of the hedge/building mix-up described in Section 5.4.2), one should note that a fixed horizontally mounted 2D laser is not the optimal choice of sensor configuration for detection of building outlines. Alternative methods suitable for capturing large objects in outdoor environments are 3D laser [Surmann et al., 2001], a vertically mounted laser range scanner [Früh and Zakhor, 2004] or (motion) stereo vision [Huber and Graefe, 1994].

The occupied cells in the map (marked in black in Figure 7.9) were labelled by the virtual sensor, giving the probabilistic semantic map presented in Figure 7.1. The probabilistic semantic map contains two classes: buildings (values above 0.5) and nonbuildings (values below 0.5). From this semantic map the grid cells with a high probability of being a building³ (above 0.9) were extracted and converted to the lines $L_g^M$ presented in Figure 7.2. Matching of these lines with the lines extracted from the aerial image $L_a^N$ was then performed (see Figure 7.3). Finally, based on the best line matches, segmentation was performed as described in Section 7.5, where each ground-level line $L_g$ can lead to one extracted region.

³The limit 0.9 was chosen with respect to the probabilities used in the process of building the probabilistic semantic map, see Chapter 5. With this limit at least two positive building readings are needed for a single cell to be used in $L_g^M$.

The three parameters in R (Equation 7.7) were set to σRx = 1 m, σRy = 1 m, and σRθ = 0.2 rad. The first two parameters reflect a possible error of 2 pixels between the robot position and the aerial image for the given resolution, and the third parameter allows, for example, each endpoint of a 10 pixel long line to be shifted one pixel (parallel edges in the aerial image do not always result in parallel lines, see roof outline in Figure 7.3). In the tests described in the following paragraph, it will be shown that the matching result is not sensitive to small changes of these parameters. In the experiment, the start area Astart (Figure 7.6) was an 8×8 pixel square, equivalent to 4×4 m, and a square structuring element of size 3×3 was used for the dilations described in Section 7.5.1.

Two different types of test have been performed. The parameters for these tests are defined in Table 7.1. Tests 1-3 represent the nominal cases where the collected data are used as they are. These tests intend to show the influence of a changed relation between σRx, σRy and σRθ by varying σRθ. In Test 2 σRθ is decreased by a factor of 2 and in Test 3 σRθ is increased by a factor of 2. In Tests 4 and 5 additional uncertainty (in addition to the uncertainty already present in $L_g^M$ and $L_a^N$) was introduced. This uncertainty is in the form of Gaussian noise added to the midpoints (σx and σy) and directions (σθ) of $L_g^M$, and the result is evaluated in a Monte Carlo simulation [Metropolis and Ulam, 1949] with 20 runs.

Test σx [m] σy [m] σθ [rad] σRθ [rad] Nrun

1 0 0 0 0.2 1

2 0 0 0 0.1 1

3 0 0 0 0.4 1

4 1 1 0.1 0.2 20

5 2 2 0.2 0.2 20

Table 7.1: Parameters used in the local segmentation tests.

7.6.3 Result of Local Segmentation

The local segmentation has a limited range and the ground truth area can be beyond this range without affecting the resulting segmentation, e.g. by including new buildings that are not seen by the robot. A traditional quality measure such as the true positive rate is therefore not suitable for these tests, since it depends on the size of the ground truth area. Instead, the positive predictive value, PPV or precision, was used as the quality measure. PPV is calculated as

$$
PPV = \frac{TP}{TP + FP}
\tag{7.9}
$$

where TP is the number of true positives and FP is the number of false positives.

The results of Test 1 show a high positive predictive value of 96.5%, see Table 7.2. The resulting segmentation is presented in Figure 7.10, where the building regions are found along the robot trajectory. Three deviations from an ideal result can be noted. At a and b tree tops were obstructing the wall edges in the aerial image and therefore the area opposite to these walls was not detected as a building, and a gap between two regions appears at c due to a wall visible in the aerial image. Finally, a false area, to the left of b, originates from an error in the semantic map where a low hedge in front of a building was marked as building because the building was the dominating object in the camera view.

The results of Tests 1-3 are very similar, indicating that the algorithm in this case was not particularly sensitive to the changes in σRθ. In Tests 4 and 5 the scenario of Test 1 was repeated using a Monte Carlo simulation with introduced pose uncertainty. These results are presented in Table 7.2. One can note that the difference between the nominal case Test 1 and Test 4 is very small. In Test 5, where the additional uncertainties are higher, the positive predictive value decreased slightly. Based on these results the approach for local segmentation of the aerial image was found to be very robust.

Figure 7.10: The result of the local segmentation of the aerial image, based solely on the few wall estimates shown in Figure 7.2. The ground truth building outlines are drawn in black. The labels a, b and c mark the deviations discussed in the text.

Test PPV [%]

1 96.5

2 97.0

3 96.5

4 96.8 ± 0.2

5 95.9 ± 1.7

Table 7.2: Results for the tests defined in Table 7.1. The results of Tests 4 and 5 are presented with the corresponding standard deviation computed from the 20 Monte Carlo simulation runs.

7.7 Summary and Conclusions

This chapter discusses how aerial images can be used to extend the observation range of a mobile robot. A virtual sensor for building detection on a mobile robot is used to build a ground level probabilistic semantic map. This map is used to link semantic information to a process for building detection in aerial images. The approach addresses two difficulties simultaneously: 1) buildings are hard to detect in aerial images without elevation data and 2) the limitation of the sensors of mobile robots. Concerning the first difficulty the results show a high classification rate and it can therefore be concluded that the semantic information can be used to compensate for the absence of elevation data in aerial image segmentation. The benefit from the extended range of the robot's view can clearly be noted in the presented example. Even though the roof structure in the example is quite complicated, the outline of large building parts can be extracted although the mobile robot has only seen a minor part of the surrounding walls.

There are a few issues that should be noted:

• It turns out that a complete building outline is seldom segmented due to factors such as different roof materials, different roof inclinations and additions on the roof.

• It is important to check several lines from the aerial image since more edges than expected could have been extracted. For example, roofs can have extensions in other colours, and not only roofs and ground can be seen in the aerial image. When the nadir view is not perfect, walls can appear in the image in addition to the roof outline. Such a wall will produce two edges in the aerial image, one where ground and wall meet and one where wall and roof meet.

• The presented solution performs a local segmentation of the aerial image after each performed matching of a ground-level line with lines in the aerial image. An alternative solution would be to first segment the whole aerial image and then confirm or reject the regions as the mobile robot finds new wall estimates.

An extension to local segmentation, elaborated in the next chapter, is to use the building estimates as training areas for further colour segmentation in order to make a global search for buildings within the aerial image. This global search can also utilize the robot's knowledge of already traversed areas to recognise other potentially driveable paths.


Chapter 8

Global Segmentation of Aerial Images

8.1 Introduction

Aerial images contain information that can be used to extend the range of mobile robot sensors. This was shown in the previous chapter where segmentation of an aerial image to detect building areas near to the path of a mobile robot (directed by the walls found by the robot) was performed. An extension that will increase the view further is to use the building estimates as training areas for colour segmentation in order to make a global search for buildings within the entire aerial image. Based on the robot's knowledge of already traversed areas, it is also possible to recognise other driveable areas.

In this chapter the approach from Chapter 7 is extended. The extension includes global segmentation of buildings in the aerial image, the introduction of a new semantic class for ground (that is potentially driveable by the robot) and the introduction of the concept and framework of the predictive map. The aim of global segmentation is to build a map that predicts regions such as driveable ground and buildings. To perform global segmentation colour models are used. These colour models are acquired from the aerial image, directed by information from the local segmentation (Chapter 7) and by information collected with a mobile robot. The purpose of the global segmentation is to detect building outlines and driveable paths faster than the mobile robot can explore the area by itself. Using this method, the robot can estimate the outline of found buildings and "see" around one or several corners without actually having visited these areas by itself. As for the local segmentation, this method does not assume a perfectly up-to-date aerial image; buildings may exist although they are missing in the aerial image, and vice versa. It is therefore possible to use globally available geo-referenced images from sources such as Google Earth and Microsoft Virtual Earth, even though up-to-date imagery is preferable.




Figure 8.1: Flow chart of the process for calculating the predictive map (PM).

8.1.1 Outline and Overview

In this chapter the range of the robot's view is increased by prediction of the surrounding regions using a process that involves segmentation of aerial images. Trained colour models are used in global segmentation of the entire aerial image. The purpose is to build a map that predicts different types of areas, e.g., ground and buildings, and it is therefore called the predictive map (PM). When the PM includes both driveable ground and obstacles such as buildings, it can serve as an input to a path planning algorithm.

The global segmentation of an aerial image using colour models captures all pixels with the same property as the training sample. For example, if the buildings that were detected by local segmentation are used as training samples, buildings with roofs in similar colours as these buildings will be detected. To complement the detection of buildings an additional class, ground, is utilized in the experiments of this chapter.

To calculate the predictive map incrementally two main steps are performed: 1) the aerial image is segmented when a new colour model (CM) is available and 2) the predictive map is recalculated using the result from the latest segmentation. Figure 8.1 shows a flow chart of the updating process.

In Section 8.2 previous work related to mobile robots using aerial images and on detection of driveable ground is described. The segmentation process is presented in Section 8.3. This process includes the extraction of training samples, calculation of a colour model, and segmentation based on the colour model. The results from the segmentation process are stored in class layers, one layer per class. The predictive map and its calculation from the class layers are described in Section 8.4. In Section 8.5 the combination of the local and global information, the results obtained in the previous chapter and the predictive map respectively, is presented and discussed. The experiments performed and the obtained results are presented in Section 8.6. Finally, the chapter is concluded in Section 8.7 with a summary and a discussion.

8.2 Related Work

In Chapter 6, automatic building detection in aerial images was discussed. In Section 7.2 work related to aerial images and mobile robots relevant for the process described in Chapter 7 was presented. In this section these works are complemented with an overview of works on mobile robots using aerial images and works on detection of driveable ground.

Information from aerial images can be used to support long range navigation of autonomous vehicles. Silver et al. discuss a method to produce cost maps from aerial surveys [Silver et al., 2006]. The aerial surveys give heterogeneous data (e.g. data recorded with different sampling density) that are registered and used in classification of the ground surface. The features used are image based, both from visible light and near-infrared light, elevation based (rasterized elevation data available from multiple sources) and point cloud based (from airborne laser range scanners).

In a similar example heterogeneous data from maps and aerial surveys are used to construct a world model with semantic labels [Scrapper et al., 2003]. This model was compared with data from sensors mounted on a ground vehicle. The system uses the semantic labelling to focus a search for particular features in the vehicle neighbourhood, allowing a fast scene interpretation by targeting the sensor data processing to regions that are predicted to be most interesting.

Manually extracted training samples have been used to segment a satellite image for use in mobile robot localization and navigation [Dogruer et al., 2007]. HSI colour space was chosen in order to reduce the sensitivity to shades in the classification. Evaluation with a mobile robot has not yet been performed.

In this chapter a class for ground is used, but for the mobile robot it would be of benefit to know whether the ground is driveable or not. There is a possibility to extract information about the traversability of ground based on measurements performed by an unmanned vehicle. Models of driveable ground can be extracted in different ways. Laser range scanners are popular and efficient when it comes to observing the ground close to the vehicle, but for long ranges it seems that vision is more often used. In the works described below, the intention has been to find areas of driveable ground not only as opposed to obstacles, but with the purpose of obtaining a forecast of the conditions ahead of the vehicle using vision to extrapolate driveable areas.

The area traversed by a mobile robot has been used as the indicator of obstacle free ground [Ulrich and Nourbakhsh, 2000]. When the robot has driven over a specific area it knows that this area is driveable. A forward looking camera registers the area that the robot will pass and models in HSI colour space are calculated from image histograms. The system extrapolates the information forward and uses this to separate free ground from obstacles.

One can also assume that when a vehicle is already on a road, a forward looking camera has the road in view (the part of the image showing the ground closest to the vehicle) [Song et al., 2006]. Song et al. used this to extract information for a motion planning system used on ill-structured roads during the DARPA Grand Challenge.

Gaussian Mixture Models (GMM) are popular for colour segmentation. GMM were used in detection of driveable areas in desert terrain [Dahlkamp et al., 2006] for the DARPA Grand Challenge. An algorithm that extends the range of a laser range scanner by the use of vision was developed. The laser scans the area close to the vehicle and the obstacle free area is mapped to the camera image where it is used for training the GMM. The rest of the image can then be classified as driveable or nondriveable areas. It turned out that the authors could reduce the complexity of the GMM to only one model, described by the mean and the covariance matrix, with only a slight decrease of performance.

AdaBoost was used by Guo et al. to train a system to detect roads in difficult situations [Guo et al., 2006]. They used two types of weak classifiers, one based on colour samples that are compared to colour histograms and the other on rectangular features (reminiscent of Haar basis functions) introduced for face detection [Viola and Jones, 2004]. The final system detects roads in front of a vehicle and handles both shadowed regions and water pools.

A recent example of forecasting terrain traversability is given by Karlsen and Witus. They used vision and extracted texture features based on standard deviation and entropy [Karlsen and Witus, 2007]. The features observed during training were clustered using fuzzy c-means. The system is able to forecast vehicle-terrain interaction in upcoming terrain.

Based on the above referenced work it can be concluded that several systems for finding driveable terrain exist. It can therefore be assumed that driveable terrain could also be identified with the omni-directional camera onboard our mobile robot, and used as a class complementary to the ground class currently used.

8.3 Segmentation

The segmentation of the aerial image is based on colour models. In the example used in this chapter, models are calculated for the two classes building and ground. Figure 8.1 shows a flow chart of the updating process. In this section the first part, up to the image segmentation, is explained. The algorithm is adapted to work also in an on-line situation. When a new sample belonging to class cl is available, a new colour model CM is calculated. Based on the quality of CM, a measure p, 0 ≤ p ≤ 1, should be estimated. The last step is to classify the pixels in the aerial image with the trained colour model. In summary the algorithm is as follows:

1. Take a sample (a number of pixels from an area that is expected to belong to class cl).

2. Train a colour model in RGB colour space.

3. Estimate p.

4. Perform a classification using the colour model.

In the following paragraphs the training samples (step 1), the colour models (step 2), and the classification (step 4) are described. The parameter p should reflect the certainty of the classification performed with the colour model and could also be influenced by the probabilities of the connected regions in the probabilistic semantic map. Estimation of the parameter p is still an unsolved issue and left for future work. In our experiments p = 0.7 was used.

8.3.1 Training Samples

The system needs to define training samples of the classes that shall be segmented. Two classes are of interest here, buildings and ground. The method presented in the previous chapter was used to find local estimates of buildings. Therefore, to define the colour models for the building class, the building estimates found by local segmentation are used as training areas.

To extract colour models that represent the different ground areas, the occupancy grid map and the edge version of the aerial image, Ie, are combined. The free cells in the occupancy grid map define the regions in Ie that represent ground. The combination of Ie and the occupancy grid map can be done either under the assumption that the navigation is precise, giving a perfect registration, or by reduction of the area of free cells with the estimated size of the navigation error. Only the latter was used in our work: assuming that the DGPS gave rather accurate positions and that an accurate heading angle could be calculated from the trajectory, the area of free cells in the occupancy grid map was reduced by morphological erosion with a square structuring element of size 5×5 pixels to compensate for errors up to 2 pixels (1 m) in all directions, resulting in a ground region. An example of the combination of the ground region with the edges in Ie is shown in Figure 8.2. Next, edge controlled segmentation of the ground region is performed, as described in Section 7.5.1, to find the individual ground areas with boundaries defined by Ie. The largest areas¹ found in the edge controlled segmentation point out samples in the aerial image that are used to train mean/covariance models, the same type of colour models as in Section 7.5.2.

Figure 8.2: The combined binary image of free points (reduced using morphological erosion with a square structuring element of size 5×5 pixels) and edges in Ie.

¹The limit was set to 50 pixels (12.5 m²) in order to avoid small areas that could represent movable objects such as cars and small trucks.
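A simplified sketch of this ground-sample extraction is shown below, assuming the free cells of the occupancy grid map and the edge image Ie are boolean arrays registered to the aerial image; the numeric values follow the text, everything else is illustrative.

```python
import numpy as np
from scipy import ndimage

def ground_training_areas(free_cells, Ie, erosion_size=5, min_area=50):
    """Label the ground areas that can serve as colour-model training samples."""
    selem = np.ones((erosion_size, erosion_size), dtype=bool)
    # Erode the free space to absorb small registration errors (up to 2 pixels).
    reduced_free = ndimage.binary_erosion(free_cells, structure=selem)
    ground = reduced_free & ~Ie                 # ground region bounded by edges in Ie
    labels, n = ndimage.label(ground)           # individual ground areas
    # Keep only areas of at least 50 pixels (12.5 m^2) to avoid movable
    # objects such as cars and small trucks.
    sizes = ndimage.sum(ground, labels, index=np.arange(1, n + 1))
    keep = np.flatnonzero(sizes >= min_area) + 1
    return np.where(np.isin(labels, keep), labels, 0)
```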

The free space found in the occupancy grid map can be considered to be driveable ground assuming that there are no negative obstacles or other features which cannot be sensed with the horizontally mounted 2D laser scanner and which prevent the robot from driving safely. However, since it cannot be guaranteed that the free space in the occupancy grid map in fact corresponds to a driveable area, the new class is called ground.

The result from the local building segmentation (an example was presented in Figure 7.10) and the ground information from the occupancy grid map are referred to as the local information, since it results from direct observation by the mobile robot.

8.3.2 Colour Models and Classification

Both the colour models and the classification algorithm follow the same ideas as the procedure for the homogeneity test used in the local segmentation presented in Section 7.5.2. The training samples are described in a mean/covariance model in RGB space and the segmentation is based on the Mahalanobis distance to classify the aerial image:

1. Set a limit Olim based on how many outliers are expected in the model (here Olim represents 95% of the training sample pixels).

2. Calculate the Mahalanobis distances between the colour model and the pixels in the sample. Set dlim to the distance that represents Olim.

3. Use the Mahalanobis distance in the segmentation of the aerial image, with dlim as the separator between pixels belonging to cl and the pixels that do not.
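Step 3 could look as in the short sketch below, which applies a trained mean/covariance model (as in the sketch in Section 7.5.2) to every pixel of the aerial image; the names and the vectorized form are illustrative, and d_lim is assumed to be expressed as a squared distance, consistent with the fitting sketch.

```python
import numpy as np

def segment_aerial_image(aerial_rgb, mean, cov_inv, d_lim):
    """Boolean mask of aerial-image pixels assigned to the trained class cl."""
    pixels = aerial_rgb.reshape(-1, 3).astype(float)
    diff = pixels - mean
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared Mahalanobis distances
    return (d2 <= d_lim).reshape(aerial_rgb.shape[:2])
```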

8.4 The Predictive Map

This section describes the layout of the predictive map (PM) and the method used to calculate incremental updates of it.

The PM is designed to handle multiclass problems and updating this map can be performed incrementally. The PM is a grid map of the same size as the aerial image that is segmented. For each of the n classes, a separate layer li, with i ∈ {1, . . . , n}, is used to store the accumulated segmentation results. These layers also have the same size as the aerial image. The colour models used to segment the aerial image define a subset of a class through a binary classifier.

To calculate the predictive map incrementally two main steps are performed:

1. When the aerial image has been segmented with a new colour model, the layer of that class is updated.

2. The predictive map is recalculated using the result from the updated layers.

From the segmentation of the aerial image a temporary layer for class cl is obtained. The old layer of class cl, lcl, is fused with the temporary layer using a max function. Alternative methods to fuse the layers, such as a Bayesian method, should be evaluated when a method to calculate the parameter p has been decided.

8.4.1 Calculating the PM

The predictive map is based on voting from separate layers $l_i$ for the $n$ classes, one layer for each class. The voting is performed on the layers cell by cell using IF-THEN rules biased with $c_{ij}$:

$$
\text{IF } l_i^{xy} > l_j^{xy} + c_{ij} \;\; \forall\, j \neq i \quad \text{THEN } pm^{xy} = class_i
\tag{8.1}
$$

where $l_i^{xy}$ denotes cell $(x, y)$ in layer $i$ and $pm^{xy}$ denotes cell $(x, y)$ in the PM. If the condition cannot be fulfilled due to conflicting information, $pm^{xy}$ is set to unknown. To evaluate the similarity between cells, buffer zones are introduced in the voting process. The buffer zones are collected in the off-diagonal elements of a matrix $C$ where $c_{ij} \geq 0$, $i \neq j$, $i \in \{1, 2, \ldots, n\}$, $j \in \{1, 2, \ldots, n\}$. Introducing the buffer zones makes it possible to adjust the sensitivity of the voting individually for all classes. If $c_{ij} = 0$ the rules in Equation 8.1 turn into ordinary voting where the largest value wins and ties give unknown.

During the experiments presented in this chapter two layers were used ($n = 2$), one building layer and one ground layer, and $C$ was set to

$$
C =
\begin{bmatrix}
- & 0.1 \\
0.1 & -
\end{bmatrix}
\tag{8.2}
$$

(the values of the diagonal elements are not used).

All in all, the PM contains information about $n + 2$ categories. First there are the $n$ different classes, then the unknown cells due to ambiguous class values and finally the unexplored cells that represent the remaining pixels, which cannot be explained by any of the trained colour models.
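The cell-wise voting of Equation 8.1, together with the max-fusion of new segmentation results into the class layers, could be sketched as follows. The layer representation as an (n, H, W) array and the sentinel values for unknown and unexplored cells are assumptions made for the example.

```python
import numpy as np

UNKNOWN, UNEXPLORED = -1, -2

def update_layer(layers, cls, segmentation_mask, p):
    """Fuse a new segmentation result into the layer of class cls (max function)."""
    layers[cls] = np.maximum(layers[cls], p * segmentation_mask)

def predictive_map(layers, C):
    """Vote cell by cell between the n class layers (Equation 8.1)."""
    n = layers.shape[0]
    pm = np.full(layers.shape[1:], UNEXPLORED, dtype=int)
    explored = layers.max(axis=0) > 0          # explained by at least one colour model
    pm[explored] = UNKNOWN                     # default: ambiguous class values (tie)
    for i in range(n):
        wins = explored.copy()
        for j in range(n):
            if j != i:
                # Class i must exceed every other layer by its buffer zone c_ij.
                wins &= layers[i] > layers[j] + C[i, j]
        pm[wins] = i
    return pm

# Example with a building and a ground layer and the C matrix of Equation 8.2:
# C = np.array([[0.0, 0.1], [0.1, 0.0]]); pm = predictive_map(layers, C)
```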

8.5 Combination of Local and Global Segmentation

The approaches described above and in the previous chapter result in two sets of information. The first is the local information that has been confirmed by the mobile robot and the second is stored in the PM. Where these sets overlap they can be fused into one final estimate. Since the local information has been confirmed by the mobile robot it is reasonable to let the local information have precedence over the PM by giving it a higher probability p. Fusion of the PM and the local information is carried out using the method described in Section 8.4.1.

8.6 Experiments

8.6.1 Experiment Set-Up

The experiment reported in this chapter makes use of the same set-up as was described in Section 7.6.1. The additional information used is the ground estimation obtained from the occupancy grid map as described in Section 8.3.1.

8.6.2 Result of Global Segmentation

The result of the global segmentation is shown in Figures 8.3 and 8.4. Visual inspection of the result illustrates the potential of the approach. The PM based on ground colour models from the regions in Figure 8.2 and building colour models from the regions in Figure 7.10 is presented in Figures 8.3(a) (cells classified as ground and buildings) and 8.3(b) (unexplored and unknown cells).

Compared with the aerial image in Figure 7.2 the result is promising. One can now follow the outline of the main building, and most of the paths, including paved paths, roads and beaten tracks, have been found. The main problem experienced during the work is caused by shadowed ground areas that look very similar to dark roofs, resulting in the major part of the unknown cells.

If areas representing the unknown cells have already been classified by the mobile robot, as in Figures 7.9 and 7.10, that result has precedence over the PM (see discussion in Section 8.5). The final result is obtained when the PM is combined with the local information. For these pixels p is set to 0.9 and another update of the PM (using the method described in Section 8.4) is performed, resulting in the map shown in Figure 8.4.

A formal evaluation of the ground class is hard to perform. Ground truth for buildings can be manually extracted from the aerial image, but it is hard to specify in detail the area that belongs to ground. Based on the ground truth of buildings and an approximation of the ground truth of ground as the non-building cells, statistics of the result are presented in Table 8.1. In the table all values in the right column, where the results from the combined PM and local information are shown, are better than those in the middle column (only PM).

As in Chapter 7, the positive predictive value, PPV or precision, has been used as the quality measure. PPV is calculated as

$$
PPV = \frac{TP}{TP + FP}
\tag{8.3}
$$

where TP is the number of true positives and FP is the number of false positives. Since the PPV depends on the actual presence of the different classes in the aerial image, normalized values are also presented. The normalized values are calculated as

$$
PPV_{norm} = \frac{TP_{cl}}{TP_{cl} + FP_{cl}\,\dfrac{GT_{cl}}{NGT_{cl}}}
\tag{8.4}
$$

where $TP_{cl}$ and $FP_{cl}$ are the numbers of true and false positives of class $cl$ respectively, $GT_{cl}$ is the number of ground truth cells of class $cl$ and $NGT_{cl}$ (not ground truth) is the difference between the total number of cells in the PM and $GT_{cl}$. The area covered by buildings is smaller than the ground area, giving an increase in the normalized PPV for buildings and a decrease for ground, compared to the nominal PPV.
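As a small illustration of Equations 8.3 and 8.4 (the counts themselves come from comparing the PM with the manually extracted ground truth; the helper functions below are only illustrative):

```python
def ppv(tp, fp):
    """Positive predictive value (precision), Equation 8.3."""
    return tp / (tp + fp)

def ppv_norm(tp, fp, gt, total_cells):
    """Normalized PPV, Equation 8.4: false positives are scaled by the ratio
    of ground-truth cells to non-ground-truth cells of the class."""
    ngt = total_cells - gt
    return tp / (tp + fp * gt / ngt)
```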



Descriptions PM (Fig. 8.3) [%] PM + local (Fig. 8.4) [%]

PPV buildings (norm) 66.6 (88.6) 73.0 (91.3)

PPV ground (norm) 96.8 (88.6) 97.3 (90.4)

Building cells 12.3 13.8

Ground cells 21.7 25.8

Unclassified cells 55.5 52.4

Unknown cells (ties) 10.5 8.1

Table 8.1: Results of the evaluation of the predictive map, PM, displayed in Figure 8.3 and of the fusion of the PM and the local information in Figure 8.4. The last four rows show the actual proportions of the cells.



(a) Ground (gray) and building (black) estimates. The white cells are unexplored or unknown.

(b) Ties or unknown cells (black), not classified cells (gray), and classified cells (white).

Figure 8.3: The result of the global segmentation of the aerial image (see Section 8.4) using both ground and building models.



(a) Ground (gray) and building (black) estimates. The white cells are unexplored or unknown.

(b) Ties or unknown cells (black), not classified cells (gray), and classified cells (white).

Figure 8.4: The PM combined with the local information (see Section 8.5).



8.7 Summary and Conclusions

This chapter presented a method to segment aerial images with the purpose of extending the observation range of a mobile robot. The inputs to the method are information about ground taken from an occupancy grid map and building samples obtained from the local segmentation described in the previous chapter. The benefit from the extended range of the robot's view can clearly be noted in the presented example:

• The outline of the main building and additional building parts have been found.

• Most of the paths and roads have been found.

• The PM clearly shows the regions where more information needs to be collected in order to build a complete map.

In the local segmentation step it was noted that it can be hard to extract a complete building outline due to factors such as different roof materials, different roof inclinations and additions on the roof, specifically when the robot has only seen a small portion of the building outline. The global segmentation is a powerful extension here. Even though the roof structure in the example is quite complicated, the outline of a large building could be extracted based on the limited view of the mobile robot, which had only seen a minor part of the surrounding walls.

The introduction of ground as a new class confirms the potential of using information from aerial images for planning tasks. Good estimates of where ground can be expected have been achieved and it is believed that planning algorithms can take advantage of this information to improve navigation in unknown environments.

8.7.1 Discussion

Oh et al. assumed that probable paths were known in a map and used this information to bias a robot motion model towards areas with higher probability of robot presence [Oh et al., 2004]. Using the approach suggested in this chapter these areas could be automatically found from aerial images.

Son et al. derived building structures from blueprints and correlated the information with satellite images [Son et al., 2007]. Dogruer et al. used manually extracted training samples to segment a satellite image for use in mobile robot localization and navigation [Dogruer et al., 2007]. Both these works are examples that could make use of our approach, where the needed information, instead of being extracted from blueprints and manual samples, would be collected automatically by a mobile robot.



With the presented method, changes in the environment compared to an aerial image that is not perfectly up-to-date are handled to a certain degree. Assume that a building, present in the aerial image, has been removed after the image was taken. It may therefore be classified as a building in the PM if it had a roof colour similar to a building already detected by the mobile robot. When the robot approaches the area where the building was situated, the building will not be detected. If the mobile robot classifies the area as ground, the PM will turn into unknown (of course depending on cij and p), not only for that specific area but also globally, with the exception of areas where local information exists.

What about the other way around? Assume that a new building is erected and this is not yet reflected in the aerial image. If the wall matching indicates an edge as a wall this can, of course, introduce errors. However, there are several cases where it would not be a problem. When the area is cluttered, e.g., a forest, several close edges will be found and no segmentation is therefore performed. The same result is obtained if the building is erected in a smooth area, for example an open field, since there are no edges to be found. The result of these cases is that the building will only be present in the probabilistic semantic map in the form of a possible wall.

The uncertainties in the robot pose and association problems between the probabilistic semantic map and the aerial image have to be handled. The presented work requires that the robot information can be associated (registered) with the aerial image. This was solved by using global positioning provided by GPS. An alternative method for registration is to use multi-line matching.

Multi-line matching, in comparison to the single line matching used here, can relax the need for accurate localization of the mobile robot. An example of successful matching between ground readings and an aerial image for localization is given in [Früh and Zakhor, 2004] and for matching of building outlines in [Beveridge and Riseman, 1997] and [Zhang et al., 2005].


Part IV

Conclusions


Chapter 9

Conclusions

This chapter summarizes and concludes the thesis. First, the achievements of the work are summarized. Second, the limitations of the suggested methods are discussed and finally some important directions for future work are given.

9.1 What has been achieved?

The research presented in this thesis deals with semantic mapping of outdoor environments for unmanned ground vehicles. The semantic information is acquired mainly by a vision-based virtual sensor, though laser-based information is also used. Semantic information from the virtual sensor is then combined with an ordinary occupancy grid map to construct a probabilistic semantic map. This map is used in turn as the link to information extracted from aerial images about both major obstacles in the form of buildings and ground that is potentially traversable by the vehicle.

Semantic Information Semantic information is typically used in the mobile robotics community in the context of human robot interaction (HRI). In HRI it is obvious that the semantics of a scene are crucial for successful operations. In order to be compatible with humans, the robots have to transform their sensor readings and relate them to human spatial concepts. Even though HRI is in the focus of attention, there are a number of other applications where semantics can play an important role, for example:

• semantic mapping [Wolf and Sukhatme, 2007, Calisi et al., 2007, Ross et al., 2006],

• execution monitoring to find problems in the execution of a plan [Bouguerra et al., 2007],

• localization through connection of human spatial concepts to particular locations [Galindo et al., 2005, Mozos et al., 2007],




• path following in conjunction with aerial images [Oh et al., 2004],

• improving 3D models of indoor environments [Nüchter et al., 2003, Weingarten and Siegwart, 2006, Nüchter et al., 2005], and

• model validation by forming the link between measurements and ground truth data [Wulf et al., 2007].

This list of applications shows the importance of semantic concepts in mobile robotics. Our work falls into the semantic mapping category and provides tools and algorithmic approaches that can be used also in the other application domains listed above.

Semantic Information Extraction To extract semantic information the concept of a virtual sensor based on visual input was developed. This virtual sensor is a generic approach that utilizes machine learning to adapt to given concepts. The virtual sensor makes use of a set of features extracted from gray scale images. The AdaBoost algorithm is used to select features and learn a classifier. The experiments showed that virtual sensors can be learned for several classes of objects using the same feature set. Virtual sensors for buildings, windows, trucks and nature were evaluated and the virtual sensor for buildings was used for building a probabilistic semantic map. The feature set can be extended to further improve the classifier and to handle more concepts.

The presented experiments demonstrated that the selected feature set handles variations in seasons and is robust towards the choice of camera. The suggested method using machine learning and generic image features makes it possible to extend virtual sensors to a range of other important human spatial concepts.

Probabilistic Semantic Mapping In Chapter 5, a method to build probabilistic semantic maps based on semantic and occupancy information is presented. A virtual sensor for pointing out buildings along a mobile robot's track is used together with information from an ordinary occupancy grid map. The method handles the wide field-of-view of the planar camera (56°), which may contain different sized objects that can belong to different classes. Despite the large uncertainty about the location of the classified object in the image, a very accurate semantic map is produced.

From the experiments described in Chapter 5, several benefits of using the virtual sensor with its good generalisation properties are noted. The approach was found to be very robust even though

• the training set was quite small,

• different resolutions were used in the training phase and in the map building phase, and



• different cameras were used in the training phase and in the map building phase.

Aerial Image Segmentation The results from the probabilistic semantic map are used as the link to building outlines in aerial images. Both local and global segmentation of the aerial image are performed in order to detect buildings and ground. Fusion with ground information, which is found from an occupancy grid map, then results in a local information map and a predictive map (see the description of the relations between the semantic maps below).

With this approach two difficulties are addressed simultaneously: 1) buildings are hard to detect in aerial images without elevation data and 2) the limitation in range of the sensors onboard the mobile robots. Concerning the first difficulty the results show a high classification rate and we can therefore conclude that the ground-level semantic information can be used to compensate for the absence of elevation data in aerial image segmentation. Second, the information gained from the aerial image segmentation results in an extended range of the robot's view into areas that have not been visited by the robot. The benefits from this extended range include

• detection of building outlines and areas that could be obstacles,

• detection of paths and roads that are potentially driveable and therefore important from a planning perspective, and

• clear indication of where more information needs to be collected in order to complete the maps.

Relation Between the Semantic Maps Three different types of grid maps containing semantic information have been defined in this thesis:

The probabilistic semantic map is a grid map where cell values in the interval [0, 1] represent the probability that the cell belongs to a particular class. The probabilistic semantic map is based on occupancy information and it is only the outline of objects that are represented. The probabilistic semantic map is presented in Chapter 5.

The local information map is a grid map where occupied cells represent regions belonging to a semantic class. By contrast with the probabilistic semantic map, the cells in the local information map represent the estimated area of objects and not only their partial outlines. All classified object regions are spatially connected to the input information used in the creation of the map. The local information map is described in Chapter 7.



The predictive map, PM, is a global semantic grid map that predicts the presence of different classes in an entire aerial image. The classified regions are therefore not necessarily spatially connected with the input information, but rather found via the colour models utilized in the segmentation process. The PM is described in Chapter 8.

The probabilistic semantic map is only based on ground level information collected by the mobile robot, while the local information map shows objects of particular classes close to the robot's path, based also on information extracted from aerial images. The PM predicts the surroundings of the mobile robot in a much larger area. The quality of these predictions depends on how homogeneous the aerial image is, i.e., if the colour models manage to separate one class from the other classes. The PM is based on classified regions in the local information map, which in turn is based on information from the probabilistic semantic map. The local information map therefore has precedence over the PM in the sense that in the case of conflicting entries, the local information map is treated as being more credible.
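As an illustration of this precedence rule, the following sketch fuses the two maps cell by cell. The array representation, the use of NaN for cells not covered by the local information map, and the function name are assumptions made for the example and do not mirror the actual implementation.

```python
import numpy as np

def fuse_maps(local_map, predictive_map):
    """Combine the local information map and the predictive map (PM).

    Both maps are assumed to be equally sized 2D arrays holding class
    probabilities for one class (e.g. building). Cells that the local
    information map does not cover are assumed to be marked with NaN.
    Where both maps have an entry, the local information map wins,
    since it is treated as more credible than the PM.
    """
    fused = predictive_map.copy()
    covered = ~np.isnan(local_map)          # cells observed near the robot path
    fused[covered] = local_map[covered]     # local information overrides the PM
    return fused

# Toy example: a 2x3 area where the local map only covers the left column.
pm = np.array([[0.2, 0.7, 0.9],
               [0.1, 0.6, 0.8]])
local = np.array([[0.95, np.nan, np.nan],
                  [0.05, np.nan, np.nan]])
print(fuse_maps(local, pm))
```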

9.2 Limitations

The key objectives of the work performed were outdoor semantic mapping and to show how semantic information can be used in conjunction with aerial imagery to extend the robot's understanding of the surrounding world. The presented system has deliberately been divided into three modules with the purpose of making parts of the system exchangeable. In this way three objectives are addressed. First of all, the system can be adapted to utilize other sensor modalities. Second, the system can be easily improved through improvement of the individual modules. Third, the modules may be used individually for situations when only parts of the system are needed. With this said, the limitations of the system can be discussed and it is my belief that these shortcomings can be handled by future research and work.

Starting with the virtual sensor, the main limitation is that the classification is performed on the entire image. Even though the results from using this virtual sensor in the probabilistic semantic map were promising, improved and more robust results can be expected if the virtual sensor were extended so that it could define which parts of an image belong to the specified class. Methods to overcome this issue include windowing techniques, where windows of varying size divide the image into sub-images that are individually classified. By this the location of the object in the image could be estimated more accurately. The virtual sensor has shown resolution independence within a certain range, which indicates that it can be used on sub-images.
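A rough sketch of such a windowing scheme is given below; the classifier callable, the window sizes and the step length are hypothetical placeholders standing in for a virtual sensor applied to sub-images.

```python
import numpy as np

def localise_class(image, classify_subimage, window_sizes=(64, 128), step=32):
    """Accumulate per-pixel evidence for a class by classifying sub-images.

    classify_subimage(patch) is assumed to return 1 if the patch is judged
    to belong to the class (e.g. building) and 0 otherwise. Each positive
    window votes for all pixels it covers; dividing by the coverage count
    gives a crude per-pixel estimate of where the class is located.
    """
    h, w = image.shape[:2]
    votes = np.zeros((h, w))
    coverage = np.zeros((h, w))
    for size in window_sizes:
        for y in range(0, h - size + 1, step):
            for x in range(0, w - size + 1, step):
                patch = image[y:y + size, x:x + size]
                coverage[y:y + size, x:x + size] += 1
                if classify_subimage(patch):
                    votes[y:y + size, x:x + size] += 1
    return np.divide(votes, coverage, out=np.zeros_like(votes), where=coverage > 0)
```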

The probabilistic semantic map currently handles two classes. One can foresee that in the future a need for more classes will arise. It should be possible to handle more classes by building one semantic map for each class of interest, utilizing the virtual sensors for the respective classes, and letting these maps represent separate layers. The different maps can then be fused into one single semantic map, e.g., by the use of the same fusing technique as for the PM.
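One conceivable fusion of such per-class layers is sketched below, assuming that each layer stores class probabilities per cell and that each cell receives the label of the strongest layer, or remains unknown below a threshold; the class names and the threshold are illustrative only.

```python
import numpy as np

def fuse_class_layers(layers, unknown_threshold=0.5):
    """Fuse per-class probability layers into one labelled semantic map.

    layers: dict mapping a class name to a 2D array of probabilities.
    Returns a 2D array of class labels, with 'unknown' where no class
    reaches the threshold.
    """
    names = list(layers)
    stack = np.stack([layers[name] for name in names])   # shape: (classes, h, w)
    best = stack.argmax(axis=0)
    best_prob = stack.max(axis=0)
    labels = np.array(names, dtype=object)[best]
    labels[best_prob < unknown_threshold] = 'unknown'
    return labels

layers = {'building': np.array([[0.9, 0.2], [0.4, 0.1]]),
          'nature':   np.array([[0.1, 0.7], [0.3, 0.2]])}
print(fuse_class_layers(layers))
```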

In the evaluation of the probabilistic semantic map it was noted that the occupancy grid map, where the data have been collected in a horizontal plane close to the ground, is not optimal for finding building outlines. Therefore, to build accurate maps that include objects of a certain vertical scale, it is recommended that the sensor readings should be able to reflect objects within that scale. Alternative solutions include the use of 3D lasers, e.g., vertically mounted laser range scanners, or vision-only solutions using stereo cameras or motion stereo.

The derived semantic maps are limited to two dimensions, i.e., they are 2D maps with the 3D-world information projected onto a layer at ground level. Situations where the 2D information is not enough to describe the environment may therefore occur, for instance when information from several overlapping spatial layers is of interest. Consider a situation where the mobile robot detects driveable ground under trees, while the aerial image only shows the trees, since only the top layer is visible from above. If the robot detects that objects such as tree tops may obstruct the view of the ground in the aerial image, training samples for colour models of ground cannot be taken in these areas. If the forest is considered to be dense at tree top level, colour models of trees can be extracted. One way to represent this type of overlapping information is to introduce layered maps.

9.3 Future Work

Based on the limitations of the system presented and the suggestions mentioned earlier in the respective chapters, the most promising directions for future work are discussed for each module as follows.

Virtual Sensor The introduction of a method to refine the virtual sensor so that it can point out which parts of an image are mainly responsible for the classification could be used to further improve the separation of different classes in the probabilistic semantic map. To refine the virtual sensor, classification of sub-images could be performed, exploiting the robustness of the virtual sensor to changes in resolution. An example of a sub-image technique is described in [Morita et al., 2005] where small squares of the upper half of images are classified as tree, building and uniform regions.

Probabilistic Semantic Mapping A natural extension of the work on the probabilistic semantic map is to introduce other object classes and refined ground classification. An example would be drivable areas that can be detected using the onboard sensor system.

Aerial Image Segmentation The segmentation to produce the predictive map is pixel based. It is therefore likely that post-processing of the PM, e.g., with filters taking neighbouring cells into account, could improve the results. Post-processing with Hidden Markov Models and Markov Random Fields has been used in similar applications [Mozos et al., 2006, Wolf and Sukhatme, 2007].
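As a much simpler illustration of such neighbourhood-based post-processing (not the HMM or MRF models cited above), a plain 3 × 3 majority vote over class labels could be applied to the PM; the sketch below is only an assumed example of this kind of filter.

```python
import numpy as np

def majority_filter(label_map, n_iter=1):
    """Smooth a grid of integer class labels with a 3x3 majority vote.

    Each cell is replaced by the most frequent label among itself and its
    eight neighbours, which removes isolated misclassified pixels.
    """
    h, w = label_map.shape
    result = label_map.copy()
    for _ in range(n_iter):
        padded = np.pad(result, 1, mode='edge')
        new = result.copy()
        for y in range(h):
            for x in range(w):
                window = padded[y:y + 3, x:x + 3].ravel()
                values, counts = np.unique(window, return_counts=True)
                new[y, x] = values[counts.argmax()]
        result = new
    return result
```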

It is further expected that shadow detection, which merges shadowed areas with corresponding areas in the sun, can reduce the number of false positive building pixels from the segmentation and decrease the unknown areas caused by ties.

Multi-line matching should be evaluated as an alternative method for registration. Multi-line matching can relax the demands on accurate global navigation of the mobile robot. An example of successful matching between ground readings and an aerial image for localization is given in [Früh and Zakhor, 2004] and for matching of building outlines in [Beveridge and Riseman, 1997].

The accuracy of the PM can probably be further improved by using a measure of the colour model quality to assign a value to the parameter p (the quality parameter for the layers in the PM), see Chapter 8. Also the probabilities from the semantic map from which the ground wall estimates are extracted and the certainty of the virtual sensor could be used in the calculation of p. When the parameter p is in use, alternatives to the max-function used to update the individual layers should be investigated.

The use of automatic systems in different forms will become more and more common in the future. The area of mobile robotics is expected to grow fast, and the trends in the automotive industry aim at more automatic functions and will probably result in driverless cars. If these types of systems can understand human concepts, their usability will be greatly improved, and it is therefore expected that utilization and extraction of semantic information will play an important role in the future.

Aerial images are nowadays globally available via the Internet, and several types of systems can benefit from the rich information they contain. Utilization of overhead information in the form of aerial images brings, in combination with semantic knowledge, a new dimension to mobile robot mapping. The examples given in this thesis show the potential of semantic maps that can be used for planning and exploration.

In this sense, the ideas and work presented in this thesis take us one step further towards systems that use multimodal inputs and transform their internal representations into human spatial concepts.


Part V

Appendices


Appendix A

Notation and Parameters

A.1 Abbreviations

Abbreviations used in the thesis are explained in the following.

AdaBoost Adaptive Boosting

BIRON Bielefeld Robot Companion

BOC Bayes Optimal Classifier

CCD Charge-Coupled Device (electronic light sensor)

CM Colour Model

DGPS Differential GPS

DEM Digital Elevation Model

ESOM Ensemble of Self-Organizing Maps

GBIS Graph-Based Image Segmentation

GIS Geographical Information System

GMM Gaussian Mixture Model

GPS Global Positioning System

HMM Hidden Markov Model

HRI Human-Robot Interaction

HSI Hue, Saturation, Intensity (colour space)

ICP Iterative Closest Point

IMU Inertial Measurement Unit

INS Integrated Navigation System

IR Infra-Red

JPEG Joint Photographic Experts Group (still image compression standard)

LIDAR Light Detection and Ranging


MCL Monte Carlo Localization

MDC Minimum Distance Classifier

MRF Markov Random Fields

PCA Principal Components Analysis

PM Predictive Map

PPV Positive Predictive Value

PT Pan-Tilt

RFCH Receptive Field Cooccurrence Histograms

RGB Red, Green, Blue (colour space)

SAR Synthetic Aperture Radar

SIFT Scale-Invariant Feature Transform

SLAM Simultaneous Localization And Mapping

SLR Single-Lens Reflex (camera)

SSH Spatial Semantic Hierarchy

SVM Support Vector Machine

TIFF Tagged Image File Format

UAV Unmanned Aerial Vehicle

VS Virtual Sensor

WGS World Geodetic System, WGS84 is the latest revision

A.2 Parameters

Below is a list of parameters and their notation used in equations etc. throughout the thesis. The values of the parameters are presented where appropriate.

α_i       covering angle of object i in a sector
β_dev     corner tolerance [deg]                                      ±20
Δ         planar camera field-of-view [deg]                           56
Φ_TP      true positive rate
Φ_TN      true negative rate
Φ_FP      false positive rate
Φ_FN      false negative rate
θ         sector opening angle [deg]                                  30–56
θ⃗         edge orientations
A_start   start area [pixels]                                         8 × 8
C_45      number of corners with direction of 90n + 45 degrees, n ∈ {1, . . . , 4}
d         Mahalanobis distance
D_t       distribution that weights training samples in AdaBoost
f_x       feature number x
h_t       the best weak classifier from iteration t in AdaBoost
H         the strong classifier (AdaBoost)
H⃗_x       histogram with x bins
I_e       edge image
L_VS      VS maximum range [m]                                        50
n         number of objects in a view
N         number of instances
N_pv      number of planar views                                      8
N_run     number of Monte Carlo runs
P(build | VS = build)    building probability given that the VS indicates building           > 0.5
P(build | VS = ¬build)   1 − nonbuilding probability given that the VS indicates nonbuilding < 0.5
P_cell    value of a cell in a grid map
s         sensor reading
t         iteration (AdaBoost)
T         max number of iterations (AdaBoost)                         30
T_s       number of sensor readings


Appendix B

Implementation Details

B.1 Line Extraction

This section gives a short overview of the line extraction implementation used to find straight line segments in a binary image. This binary image may be the result of an edge detection operation but can also originate from other sources.

For line extraction, Matlab functions implemented by Peter Kovesi [Kovesi, 2000] have been used. These functions are implemented in m-files described in the following:

edgelink.m This function links edge points in a binary image into chains. If an edge diverges at a junction, the function tracks one of the branches. Broken branches can be remerged by mergeseg.m.

lineseg.m This function forms straight line segments from the edge list calculated by edgelink.m. The function breaks down each array of edge points in the edge list to straight lines that fulfill the tolerance TOL. An edge merging phase is performed to connect lines that may have been separated in the edge linking phase.

mergeseg.m Function used by lineseg.m. The function scans through the list of line segments to check if any segments can be merged. Segments are merged if the orientation difference is less than ANGTOL and if the ends of the segments are within LINKRAD of each other.

The parameter setting for the line extraction, as used in the experiments, is described in Table B.1.


Name      Value      Description
TOL       2 pixels   Maximum deviation from a straight line before a segment is broken in two
ANGTOL    0.05 rad   Angle tolerance used when attempting to merge line segments
LINKRAD   2 pixels   Maximum distance between end points of line segments for segments to be eligible for linking

Table B.1: Parameters used for line extraction.
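For illustration, the splitting step performed by lineseg.m can be sketched as follows. This is a re-implementation in outline form of the described behaviour (recursive splitting at the point of maximum deviation whenever it exceeds TOL), not the Matlab code that was actually used, and the function name is chosen for the example.

```python
import numpy as np

def split_into_segments(points, tol=2.0):
    """Split a chain of edge points into straight segments.

    points: (N, 2) array of (x, y) coordinates along one linked edge.
    A chain is split at the point with the largest perpendicular distance
    to the line through the chain's end points, whenever that distance
    exceeds tol (the TOL parameter, in pixels). Returns a list of
    (start_point, end_point) pairs.
    """
    points = np.asarray(points, dtype=float)
    p0, p1 = points[0], points[-1]
    direction = p1 - p0
    length = np.hypot(*direction)
    if len(points) <= 2 or length == 0.0:
        return [(p0, p1)]
    # Perpendicular distance of every point to the line p0-p1.
    dist = np.abs(direction[0] * (points[:, 1] - p0[1]) -
                  direction[1] * (points[:, 0] - p0[0])) / length
    i = int(dist.argmax())
    if dist[i] <= tol:
        return [(p0, p1)]
    # Split at the worst point and process both halves recursively.
    return (split_into_segments(points[:i + 1], tol) +
            split_into_segments(points[i:], tol))
```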

B.2 Geodetic Coordinate Transformation

The GPS receiver outputs position data in the form of longitude and latitude coordinates represented in WGS84. WGS84 is the world geodetic system dating from 1984. It is the reference system used by GPS and uses a reference ellipsoid representing the earth.

The WGS84 coordinates are appropriate for a system that covers the whole earth. In smaller areas a metric system is preferable. The aerial images are registered in local coordinate systems, closely connected to the Swedish coordinate system RT90 2.5 gon V 0:-15. The WGS position data therefore have to be transformed. For this, a transformation function from SWEREF 99 to RT90 has been used. WGS 84 and SWEREF 99 are, in principle, interchangeable¹. The difference is in the order of 0.1 m, but these systems are slowly diverging due to the motion of the European plate.

The transformation makes use of Gauss conformal projection according to [Lantmäteriet, 2005]. The parameters used by the transformation are listed in Table B.2.
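For readers who want to reproduce a similar projection, the parameters in Table B.2 can be fed to a standard map projection library. The sketch below uses the pyproj library, which was not used in the thesis, and the example coordinates are arbitrary; it should give results close to, but not necessarily identical with, the Matlab implementation of Krüger's formulas.

```python
from pyproj import Proj

# Transverse Mercator projection set up from the parameters in Table B.2
# (RT90 2.5 gon V 0:-15). Note that proj uses x for easting and y for
# northing, whereas Table B.2 follows the Swedish convention with x as
# northing, so false easting/northing are assigned accordingly.
rt90 = Proj(proj='tmerc', ellps='GRS80',
            lat_0=0.0,
            lon_0=15.806284529,     # 15 deg 48' 22".624306 East of Greenwich
            k=1.00000561024,        # scale on the central meridian
            x_0=1500064.274,        # false easting  (y0 in Table B.2)
            y_0=-667.711,           # false northing (x0 in Table B.2)
            units='m')

# Project an arbitrary SWEREF 99 position (longitude, latitude in degrees).
easting, northing = rt90(15.25, 59.27)
print(northing, easting)
```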

B.3 Localization

The robot was localised using DGPS and odometry. Since the trajectory is close to buildings, the positions from the DGPS suffer from multipath signals resulting in errors from a few meters up to 100 m. When these errors occur intermittently they can be filtered out using a motion model of the mobile robot. The robot can only move with a certain velocity and therefore many of the erroneous positions can be removed. The main problem occurs when the multipath signals result in a constant shift of the position. If the start of the shift is not detected it is no longer possible to remove the false positions.

¹National Land Survey of Sweden, www.lantmateriet.se


Parameter                      Value
Type of projection             Transverse Mercator (Gauss-Krüger)
Reference ellipsoid            GRS 80
Semi-major axis (a)            6378137 m
Inverse flattening (1/f)       298.257222101
Central meridian               15°48′22″.624306 East of Greenwich
Latitude of origin             0°
Scale on central meridian      1.00000561024
False Northing, x0             −667.711 m
False Easting, y0              1500064.274 m

Table B.2: Parameters used for transformation from SWEREF 99 to RT90.

The pose data used in the experiments of this thesis have been manually calibrated in the following way. Using only five parameters, the odometry data have been tuned to fit with the DGPS positions. As has been discussed above, odometry suffers from drift. However, it is possible to achieve accurate positions by calibration of the odometry as long as the properties of the ground surface are constant and the trajectory length is limited. The five parameters that were used in the calibration are:

• Start position in X (north)

• Start position in Y (east)

• Initial heading

• Length scale

• Turn scale

The first three parameters are used to set the position and orientation of the robot at the first GPS fix. The last two parameters handle the wheel sizes (length scale) and the difference in left and right wheel radii (turn scale). The length scale is used as a factor multiplied with the odometry increments before integration. The turn scale parameter is multiplied with the current velocity and added to the heading increment before integration. For the performed experiments this gives good pose estimates of the robot.
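A sketch of how the five parameters enter the dead-reckoning is given below; the variable names and the exact form of the odometry increments are assumptions based on the description above, not the code used in the experiments.

```python
import math

def integrate_odometry(increments, x0, y0, heading0, length_scale, turn_scale):
    """Integrate odometry increments with the five calibration parameters.

    increments: list of (distance, heading_increment, velocity) tuples.
    The distance increments are multiplied with the length scale; the
    turn scale times the current velocity is added to each heading
    increment, compensating for differing left/right wheel radii.
    Returns the list of (x, y, heading) poses, with x pointing north
    and y pointing east.
    """
    x, y, heading = x0, y0, heading0
    poses = [(x, y, heading)]
    for distance, dheading, velocity in increments:
        heading += dheading + turn_scale * velocity
        d = length_scale * distance
        x += d * math.cos(heading)
        y += d * math.sin(heading)
        poses.append((x, y, heading))
    return poses
```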

The robot heading is calculated from the obtained trajectory. The calibrated pose data are linearly interpolated to obtain the pose for each image. This is not a limitation since the images are taken either during forward motion or when the robot does not move.


Bibliography

Iyad Abuhadrous, Samer Ammoun, Fawzi Nashashibi, François Goulette, and Claude Laurgeau. Digitizing and 3D modeling of urban environments and roads using vehicle-borne laser scanner system. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 76–81, Sendai, Japan, September 28 – October 2 2004.

Peter Allen, Ioannis Stamos, Atanas Gueorguiev, Ethan Gold, and Paul Blaer. AVENUE: Automated site modeling in urban environments. In Proceedings of the 3rd Conference on Digital Imaging and Modeling, pages 357–364, Quebec City, Canada, May 2001.

Drago Anguelov, Rahul Biswas, Daphne Koller, Benson Limketkai, Scott Sanner, and Sebastian Thrun. Learning hierarchical object maps of non-stationary environments with mobile robots. In Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 10–17, Edmonton, Canada, August 2002.

Drago Anguelov, Daphne Koller, Evan Parker, and Sebastian Thrun. Detecting and modeling doors with mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), volume 4, pages 3777–3784, New Orleans, LA, USA, April 26 – May 1 2004.

Ashtech. G12 GPS OEM board & sensor reference manual. Ashtech, September 2000.

Patrick Beeson, Nicholas K. Jong, and Benjamin Kuipers. Towards autonomous topological place detection using the extended Voronoi graph. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 4384–4390, Barcelona, Spain, April 18–22 2005.

Paul J. Besl and Neil D. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, February 1992. ISSN 0162-8828.


J. Ross Beveridge and Edward M. Riseman. How easy is matching 2D line models using local search? IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6):564–579, 1997.

Johan Bos, Ewan Klein, and Tetsushi Oka. Meaningful conversation with a mobile robot. In EACL '03: Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics, pages 71–74, Budapest, Hungary, 2003. Association for Computational Linguistics. ISBN 1-111-56789-0.

Abdelbaki Bouguerra, Lars Karlsson, and Alessandro Saffiotti. Semantic knowledge-based execution monitoring for mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 3693–3698, Rome, Italy, April 10 2007.

Pär Buschka and Alessandro Saffiotti. A virtual sensor for room detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 637–642, Lausanne, CH, 2002.

Pär Buschka. An Investigation of Hybrid Maps for Mobile Robots. PhD thesis, Dept. of Technology, Örebro University, Sweden, February 2006.

Daniele Calisi, Alessandro Farinelli, Giorgio Grisetti, Luca Iocchi, Daniele Nardi, Stefano Pellegrini, Diego Tipaldi, and Vittorio Amos Ziparo. Contextualization in mobile robots. In Proceedings of the ICRA 2007 Workshop on Semantic Information in Robotics, Rome, Italy, April 10 2007.

John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, November 1986.

Luiz Alberto Cardoso. Computer aided recognition of man-made structures in aerial photographs. Master's thesis, Naval Postgraduate School, Monterey, California, 1999. URL http://www.cs.nps.navy.mil/people/faculty/rowe/cardosothesis.htm.

Robert W. Carroll. Detecting building changes through imagery and automatic feature processing. In URISA 2002 GIS & CAMA Conference Proceedings, Reno, Nevada, USA, April 7–10 2002.

Cheng Chen and Han Wang. Large-scale loop-closing with pictorial matching. In Proceedings of the 2006 IEEE International Conference on Robotics and Automation (ICRA), pages 1194–1199, Orlando, Florida, May 2006.

Heng-Da Cheng, Xihua Jiang, Y. Sun, and Jingli Wang. Color image segmentation: advances and prospects. Pattern Recognition, 34(12):2259–2281, 2001.


Kok Seng Chong and Lindsay Kleeman. Sonar based map building for a mobile robot. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1700–1705, Albuquerque, New Mexico, USA, April 1997.

Matthieu Cord, Michel Jordan, Jean-Pierre Cocquerez, and Nicolas Paparoditis. Automatic extraction and modelling of urban buildings from high resolution aerial images. In Proceedings of ISPRS Automatic Extraction of GIS Objects from Digital Imagery, pages 187–192, Munich, Germany, September 1999.

Crossbow. DMU User’s manual. Crossbow Technology, Inc., 2001.

Hendrik Dahlkamp, Adrian Kaehler, David Stavens, Sebastian Thrun, and Gary Bradski. Self-supervised monocular road detection in desert terrain. In Proceedings of Robotics: Science and Systems, Philadelphia, USA, August 16–19 2006.

Frank Dellaert and David Bruemmer. Semantic SLAM for collaborative cognitive workspaces. In AAAI Fall Symposium Series 2004: Workshop on The Interaction of Cognitive Science and Robotics: From Interfaces to Intelligence, 2004.

Can Ulas Dogruer, Bugra Koku, and Melik Dolen. A novel soft-computing technique to segment satellite images for mobile robot localization and navigation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2077–2082, San Diego, CA, USA, October 29 – November 2 2007.

Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley, New York, second edition, 2001.

Staffan Ekvall, Patric Jensfelt, and Danica Kragic. Integrating active mobile robot object recognition and SLAM in natural environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5792–5797, Beijing, China, October 9–15 2006.

Ahmed El-Rabbany. Introduction to GPS: the Global Positioning System. Artech House Publishers, Boston, London, 2002.

Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.

Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM (Association for Computing Machinery), 24(6):381–395, June 1981.


Jordi Freixenet, Xavier Munoz, David Raba, Joan Marti, and Xavier Cufi. Yet another survey on image segmentation: Region and boundary information integration. In European Conference on Computer Vision, volume 3, pages 408–422, Copenhagen, Denmark, May 2002.

Christian Freksa, Reinhard Moratz, and Thomas Barkowsky. Schematic maps for robot navigation. In C. Freksa, W. Brauer, C. Habel, and K. Wender, editors, Spatial Cognition II: Integrating Abstract Theories, Empirical Studies, Formal Methods, and Practical Applications, pages 100–114. Berlin: Springer, 2000.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Stephen Friedman, Hanna Pasula, and Dieter Fox. Voronoi random fields: Extracting the topological structure of indoor environments via place labeling. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 2109–2114, Hyderabad, India, January 6–12 2007.

Christian Früh and Avideh Zakhor. An automated method for large-scale, ground-based city model acquisition. International Journal of Computer Vision, 60(1):5–24, 2004.

Christian Früh and Avideh Zakhor. Constructing 3D city models by merging aerial and ground views. IEEE Computer Graphics and Applications, 23(6):52–61, Nov/Dec 2003.

Christian Früh and Avideh Zakhor. Data processing algorithms for generating textured 3D building facade meshes from laser scans and camera images. In Proceedings of the First Symposium on 3D Data Processing, Visualization and Transmission, pages 834–847, Padua, Italy, June 2002.

Christian Früh and Avideh Zakhor. Fast 3D model generation in urban environments. In International Conference on Multisensor Fusion and Integration for Intelligent Systems, pages 165–170, Baden-Baden, Germany, August 2001.

Cipriano Galindo, Alessandro Saffiotti, Silvia Coradeschi, Pär Buschka, Juan-Antonio Fernández-Madrigal, and Javier González. Multi-hierarchical semantic maps for mobile robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3492–3497, Edmonton, Alberta, Canada, 2005.

Cipriano Galindo, Juan-Antonio Fernández-Madrigal, Javier González, and Alessandro Saffiotti. Using semantic information for improving efficiency of robot task planning. In Proceedings of the ICRA 2007 Workshop on Semantic Information in Robotics, Rome, Italy, April 10 2007.


M. García-Alegre, Angela Ribeiro, Lia García-Pérez, Rene Martínez, Domingo Guinea, and Ana Pozo-Ruz. Autonomous robot in agriculture tasks. In 3rd European Conference on Precision Agriculture, pages 25–30, Montpellier, 2001.

Atanas Georgiev and Peter K. Allen. Localization methods for a mobile robot in urban environments. IEEE Transactions on Robotics, 20(5):851–864, 2004.

Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing. Prentice-Hall, 2002. ISBN 0-201-50803-6.

Jose J. Guerrero and Carlos Sagüés. Robust line matching and estimate of homographies simultaneously. In Pattern Recognition and Image Analysis: First Iberian Conference, IbPRIA 2003, pages 297–307, Puerto de Andratx, Mallorca, Spain, June 2003. ISBN 3-540-40217-9.

Yanlin Guo, Harpreet Sawhney, Rakesh Kumar, and Steve Hsu. Learning-based building outline detection from multiple aerial images. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 545–552, Los Alamitos, CA, USA, 2001.

Ying Guo, Vadim Gerasimov, and Geoff Poulton. Vision-based drivable surface detection in autonomous ground vehicles. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3273–3278, Beijing, China, October 9–15 2006.

Richard I. Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.

Andrew Howard, Denis F. Wolf, and Gaurav S. Sukhatme. Towards 3D mapping in large urban environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1258–1263, Sendai, Japan, August 2004.

Johannes Huber and Volker Graefe. Motion stereo for mobile robots. IEEE Transactions on Industrial Electronics, 41(4):378–383, 1994.

Patric Jensfelt. Approaches to Mobile Robot Localization in Indoor Environments. PhD thesis, Royal Institute of Technology, Stockholm, Sweden, 2001.

Seungdo Jeong, Jounghoon Lim, Il Hong Suh, and Byung-Uk Choi. Vision-based semantic-map building and localization. In Bogdan Gabrys, Robert J. Howlett, and Lakhmi C. Jain, editors, KES (1), volume 4251 of Lecture Notes in Computer Science, pages 559–568. Springer, 2006. ISBN 3-540-46535-9.


Robert E. Karlsen and Gary Witus. Terrain understanding for robot navigation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 895–900, San Diego, CA, USA, October 2007.

Yong-Shik Kim, Bong Keun Kim, Kohtaro Ohba, and Akihisa Ohya. A data integration for localization with different-type sensors. In The 13th International Conference on Advanced Robotics (ICAR), pages 773–778, Jeju, Korea, August 21–24 2007.

Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1137–1145, Montreal, Canada, 1995.

Peter D. Kovesi. MATLAB and Octave functions for computer vision and image processing. School of Computer Science & Software Engineering, The University of Western Australia, 2000. Last checked November 2007. Available from: <http://www.csse.uwa.edu.au/~pk/research/matlabfns/>.

Benjamin Kuipers. The spatial semantic hierarchy. Artificial Intelligence, 119:191–233, May 2000.

KVH. C100 Compass Engine, Technical manual. KVH Industries, Inc., 1998.

John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289, Williamstown, MA, USA, June 28 – July 1 2001. ISBN 1-55860-778-1.

Lantmäteriet. Gauss conformal projection (Transverse Mercator), Krüger's formulas. <http://www.lantmateriet.se>, August 31 2005.

Stephen Levitt and Farzin Aghdasi. Fuzzy representation and grouping in building detection. In Proceedings 2000 International Conference on Image Processing, volume 3, pages 324–327, Vancouver, BC, Canada, September 10–13 2000.

Yi Li and Linda G. Shapiro. Consistent line clusters for building recognition in CBIR. In Proceedings of the International Conference on Pattern Recognition, volume 3, pages 952–957, Quebec City, Quebec, Canada, August 2002.

Benson Limketkai, Lin Liao, and Dieter Fox. Relational object maps for mobile robots. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1471–1476, Edinburgh, Scotland, UK, July 30 – August 5 2005.


Chung-An Lin. Perception of 3-D Objects from an Intensity Image Using Simple Geometric Models. PhD thesis, Faculty of the Graduate School, University of Southern California, USA, December 1996.

Jovanka Malobabic, Herve Le Borgne, Noel Murphy, and Noel O'Connor. Detecting the presence of large buildings in natural images. In Proceedings of the 4th International Workshop on Content-Based Multimedia Indexing, Riga, Latvia, June 21–23 2005.

Helmut Mayer. Automatic object extraction from aerial imagery – a survey focusing on buildings. Computer Vision and Image Understanding, 74(2):138–149, May 1999.

Nicholas Metropolis and Stanislaw Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, September 1949.

Hans P. Moravec and Alberto Elfes. High resolution maps from wide angle sonar. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 116–121, St. Louis, Missouri, USA, 1985.

Hideo Morita, Michael Hild, Jun Miura, and Yoshiaki Shirai. View-based localization in outdoor environments based on support vector learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3083–3088, Edmonton, Alberta, Canada, 2005. IEEE.

Hideo Morita, Michael Hild, Jun Miura, and Yoshiaki Shirai. Panoramic view-based navigation in outdoor environments based on support vector learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2302–2307, Beijing, China, October 9–15 2006. IEEE.

Óscar Martínez Mozos. Supervised learning of places from range data using AdaBoost. Master's thesis, University of Freiburg, Freiburg, Germany, 2004.

Óscar Martínez Mozos, Cyrill Stachniss, and Wolfram Burgard. Supervised learning of places from range data using AdaBoost. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1742–1747, Barcelona, Spain, April 2005.

Óscar Martínez Mozos, Axel Rottmann, Rudolph Triebel, Patric Jensfelt, and Wolfram Burgard. Semantic labeling of places using information extracted from laser and vision sensor data. In Proceedings of the IROS 2006 Workshop: From Sensors to Human Spatial Concepts, pages 33–39, Beijing, China, October 10 2006.


Óscar Martínez Mozos, Patric Jensfelt, Hendrik Zender, Geert-Jan M. Kruijff, and Wolfram Burgard. From labels to semantics: An integrated system for conceptual spatial representations of indoor environments for mobile robots. In Proceedings of the ICRA 2007 Workshop on Semantic Information in Robotics, Rome, Italy, April 10 2007.

Marina Mueller, Karl Segl, and Hermann Kaufmann. Edge- and region-based segmentation technique for the extraction of large, man-made objects in high-resolution satellite imagery. Pattern Recognition, 37:1621–1628, 2004.

Ramakant Nevatia, Chungan Lin, and Andres Huertas. A system for building detection from aerial images. In A. Gruen, E. P. Baltsavias, and O. Henricsson, editors, Automatic Extraction of Man-Made Objects from Aerial and Space Images, pages 77–86. Birkhäuser Verlag, 1997.

Curtis W. Nielsen, Bob Ricks, Michael A. Goodrich, David Bruemmer, Doug Few, and Miles Walton. Snapshots for semantic maps. In Proceedings of the 2004 IEEE Conference on Systems, Man, and Cybernetics, volume 3, pages 2853–2858, The Hague, The Netherlands, October 10–13 2004.

Andreas Nüchter, Hartmut Surmann, Kai Lingemann, and Joachim Hertzberg. Semantic scene analysis of scanned 3D indoor environments. In Proceedings of the 8th International Fall Workshop Vision, Modeling, and Visualization 2003, pages 215–222, Munich, Germany, November 2003. IOS Press. ISBN 3-89838-048-3.

Andreas Nüchter, Oliver Wulf, Kai Lingemann, Joachim Hertzberg, Bernardo Wagner, and Hartmut Surmann. 3D mapping with semantic knowledge. In Proceedings of the RoboCup International Symposium 2005, pages 335–346, Osaka, Japan, July 2005.

Sang Min Oh, Sarah Tariq, Bruce N. Walker, and Frank Dellaert. Map-based priors for localization. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2179–2184, Sendai, Japan, 2004.

Kazunori Ohno, Takashi Tsubouchi, Bunji Shigematsu, and Shin'ichi Yuta. Differential GPS and odometry-based outdoor navigation of a mobile robot. Advanced Robotics, 18(6):611–635, 2004.

Martin Persson, Mats Sandvall, and Tom Duckett. Automatic building detection from aerial images for mobile robot mapping. In Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), pages 273–278, Espoo, Finland, June 2005. ISBN 0-7803-9355-4.


Fiora Pirri. Indoor environment classification and perceptual matching. In Didier Dubois, Christopher A. Welty, and Mary-Anne Williams, editors, Principles of Knowledge Representation and Reasoning: Proceedings of the Ninth International Conference (KR2004), pages 73–84, Whistler, Canada, June 2–5 2004. AAAI Press. ISBN 1-57735-199-1.

Fabio T. Ramos, Ben Upcroft, Suresh Kumar, and Hugh F. Durrant-Whyte. Recognising and segmenting objects in natural environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5866–5871, Beijing, China, October 9–15 2006. IEEE.

Robert J. Ross, Christian Mandel, John A. Bateman, Shi Hui, and Udo Frese. Towards stratified spatial modeling for communication & navigation. In Proceedings of the IROS 2006 Workshop: From Sensors to Human Spatial Concepts, pages 41–46, Beijing, China, October 10 2006.

Axel Rottmann, Óscar Martínez Mozos, Cyrill Stachniss, and Wolfram Burgard. Place classification of indoor environments with mobile robots using boosting. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pages 1306–1311, Pittsburgh, PA, USA, 2005.

Michel Roux and Henri Maître. Three-dimensional description of dense urban areas using maps and aerial images. In A. Gruen and H. Li, editors, Automatic Extraction of Man-Made Objects from Aerial and Space Images (II), pages 311–322. Birkhäuser Verlag, 1997.

Mark A. Ruzon and Carlo Tomasi. Color edge detection with the compass operator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 160–166, Ft. Collins, CO, USA, June 23–25 1999.

Robert E. Schapire. A brief introduction to boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 1401–1406, Stockholm, Sweden, July 31 – August 6 1999.

Klaus-Jürgen Schilling and Thomas Vögtle. An approach for the extraction of settlement areas. In A. Gruen and H. Li, editors, Automatic Extraction of Man-Made Objects from Aerial and Space Images (II), pages 333–342. Birkhäuser Verlag, 1997.

Chris Scrapper, Ayako Takeuchi, Tommy Chang, Tsai Hong, and Michael Shneier. Using a priori data for prediction and object recognition in an autonomous mobile vehicle. In Grant R. Gerhart, Charles M. Shoemaker, and Douglas W. Gage, editors, Unmanned Ground Vehicle Technology V, Proceedings of the SPIE, Volume 5083, pages 414–418, September 2003. doi: 10.1117/12.485917.


David Silver, Boris Sofman, Nicolas Vandapel, J. Andrew Bagnell, and Anthony Stentz. Experimental analysis of overhead data processing to support long range navigation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2443–2450, Beijing, China, October 9–15 2006.

Marjorie Skubic, Pascal Matsakis, George Chronis, and James Keller. Generating multi-level linguistic spatial descriptions from range sensor readings using the histogram of forces. Autonomous Robots, 14(1):51–69, January 2003.

Ulf Söderman, Simon Ahlberg, Åsa Persson, and Magnus Elmqvist. Towards rapid 3D modelling of urban areas. In Proceedings of the Second Swedish-American Workshop on Modeling and Simulation (SAWMAS-2004), Cocoa Beach, FL, USA, February 2004.

Kil-ho Son, Ji-hwan Woo, and In-So Kweon. Automatic 3D urban modeling from satellite image. In The 13th International Conference on Advanced Robotics (ICAR), pages 936–940, Jeju, Korea, August 21–24 2007.

Dezhen Song, Hyun Nam Lee, Jingang Yi, and Anthony Levandowski. Vision-based motion planning for an autonomous motorcycle on ill-structured road. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3279–3286, Beijing, China, October 9–15 2006.

Thorsten Spexard, Shuyin Li, Britta Wrede, Jannik Fritsch, Gerhard Sagerer, Olaf Booij, Zoran Zivkovic, Bas Terwijn, and Ben Kröse. BIRON, where are you? – Enabling a robot to learn new places in a real home environment by integrating spoken dialog and visual localization. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 934–940, Beijing, China, October 9–15 2006. IEEE.

Cyrill Stachniss. Exploration and Mapping with Mobile Robots. PhD thesis, University of Freiburg, Department of Computer Science, Freiburg, Germany, April 2006.

Cyrill Stachniss, Óscar Martínez Mozos, and Wolfram Burgard. Speeding-up multi-robot exploration by considering semantic place information. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1692–1697, Orlando, FL, USA, May 2006.

Hartmut Surmann, Kai Lingemann, Andreas Nüchter, and Joachim Hertzberg. A 3D laser range finder for autonomous mobile robots. In Proceedings of the 32nd International Symposium on Robotics (ISR), pages 153–158, April 19–21 2001.


The MathWorks. Matlab 7.0, including Image Processing Toolbox 5.0. <http://www.mathworks.com>.

Christian Theobalt, Johan Bos, Tim Chapman, Arturo Espinosa-Romero, Mark Fraser, Gillian Hayes, Ewan Klein, Tetsushi Oka, and Richard Reeve. Talking to Godot: Dialogue with a mobile robot. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1338–1343, Lausanne, Switzerland, September 30 – October 4 2002.

Sebastian Thrun. Robotic mapping: A survey. In G. Lakemeyer and B. Nebel, editors, Exploring Artificial Intelligence in the New Millennium. Morgan Kaufmann, 2002.

Sebastian Thrun. Learning metric-topological maps for indoor mobile robot navigation. Artificial Intelligence, 99(1):21–71, 1998.

Sebastian Thrun, Arno Bücken, Wolfram Burgard, Dieter Fox, Thorsten Fröhlinghaus, Daniel Henning, Thomas Hofmann, Michael Krell, and Timo Schmidt. Map learning and high-speed navigation in RHINO. In David Kortenkamp, R. Peter Bonasso, and Robin Murphy, editors, Artificial intelligence and mobile robots: case studies of successful robot systems, pages 21–52. AAAI Press / The MIT Press, 1998.

Kinh Tieu and Paul Viola. Boosting image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 228–235, Hilton Head Island, South Carolina, USA, June 2000.

Ricardo A. Téllez and Cecilio Angulo. Acquisition of meaning through distributed robot control. In Proceedings of the ICRA 2007 Workshop on Semantic Information in Robotics, Rome, Italy, April 10 2007.

Elin A. Topp and Henrik I. Christensen. Topological modelling for human augmented mapping. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2257–2263, Beijing, China, October 9–15 2006.

Antonio Torralba, Kevin P. Murphy, William T. Freeman, and Mark A. Rubin. Context-based vision system for place and object recognition. In IEEE International Conference on Computer Vision (ICCV), pages 273–280, Nice, France, October 13–16 2003.

André Treptow, Andreas Masselli, and Andreas Zell. Real-time object tracking for soccer-robots without color information. In European Conference on Mobile Robotics (ECMR), pages 33–38, Radziejowice, Poland, 2003.

Rudolph Triebel, Patrick Pfaff, and Wolfram Burgard. Multi-level surface maps for outdoor terrain mapping and loop closing. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2276–2282, Beijing, China, October 9–15 2006.

Florence Tupin and Michel Roux. Detection of building outlines based on the fusion of SAR and optical features. ISPRS Journal of Photogrammetry & Remote Sensing, 58:71–82, 2003.

Iwan Ulrich and Illah Nourbakhsh. Appearance-based obstacle detection with monocular color vision. In AAAI National Conference on Artificial Intelligence, pages 866–871, Austin, TX, USA, July 30 – August 3 2000.

US Army. Map reading and land navigation. Online version of field manual No. 3-25.26, US Army, July 2001. Last checked February 2008. Available from: <http://www.globalsecurity.org/military/library/policy/army/fm/3-25-26/index.html>.

Dominique v. Zwynsvoorde, Thierry Siméon, and Rachid Alami. Incremental topological modeling using local Voronoi-like graphs. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 897–902, Takamatsu, Japan, October 30 – November 5 2000.

Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

Jan Weingarten and Roland Siegwart. 3D SLAM using planar segments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3062–3067, Beijing, China, October 9–15 2006. IEEE.

Denis F. Wolf and Gaurav S. Sukhatme. Semantic mapping using mobile robots. Accepted for publication in the IEEE Transactions on Robotics, 2007.

Oliver Wulf, Andreas Nüchter, Joachim Hertzberg, and Bernardo Wagner. Ground truth evaluation of large urban 6D SLAM. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 650–657, San Diego, CA, USA, October 29 – November 2 2007.

Rongrui Xiao, Chris Lesher, and Bob Wilson. Building detection and localization using a fusion of interferometric synthetic aperture radar and multispectral image. In Proceedings of the ARPA Image Understanding Workshop, pages 583–588, Monterey, CA, USA, November 20–23 1998.

Aiwu Zhang, Shaoxing Hu, Xin Jin, and Weidong Sun. A method of merging aerial images and ground laser scans. In Proceedings of the 2005 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 222–225, Seoul, Korea, July 25–29 2005.


Publications in the series Örebro Studies in Technology

1. Bergsten, Pontus (2001) Observers and Controllers for Takagi – Sugeno Fuzzy Systems. Doctoral Dissertation.

2. Iliev, Boyko (2002) Minimum-time Sliding Mode Control of Robot Manipulators. Licentiate Thesis.

3. Spännar, Jan (2002) Grey box modelling for temperature estimation. Licentiate Thesis.

4. Persson, Martin (2002) A simulation environment for visual servoing. Licentiate Thesis.

5. Boustedt, Katarina (2002) Flip Chip for High Volume and Low Cost – Materials and Production Technology. Licentiate Thesis.

6. Biel, Lena (2002) Modeling of Perceptual Systems – A Sensor Fusion Model with Active Perception. Licentiate Thesis.

7. Otterskog, Magnus (2002) Produktionstest av mobiltelefonantenner i mod-växlande kammare. Licentiate Thesis.

8. Tolt, Gustav (2003) Fuzzy-Similarity-Based Low-level Image Processing. Licentiate Thesis.

9. Loutfi, Amy (2003) Communicating Perceptions: Grounding Symbols to Artificial Olfactory Signals. Licentiate Thesis.

10. Iliev, Boyko (2004) Minimum-time Sliding Mode Control of Robot Manipulators. Doctoral Dissertation.

11. Pettersson, Ola (2004) Model-Free Execution Monitoring in Behavior-Based Mobile Robotics. Doctoral Dissertation.

12. Överstam, Henrik (2004) The Interdependence of Plastic Behaviour and Final Properties of Steel Wire, Analysed by the Finite Element Method. Doctoral Dissertation.

13. Jennergren, Lars (2004) Flexible Assembly of Ready-to-eat Meals. Licentiate Thesis.

14. Jun, Li (2004) Towards Online Learning of Reactive Behaviors in Mobile Robotics. Licentiate Thesis.

15. Lindquist, Malin (2004) Electronic Tongue for Water Quality Assessment. Licentiate Thesis.

16. Wasik, Zbigniew (2005) A Behavior-Based Control System for Mobile Manipulation. Doctoral Dissertation.


17. Berntsson, Tomas (2005) Replacement of Lead Baths with Environment Friendly Alternative Heat Treatment Processes in Steel Wire Production. Licentiate Thesis.

18. Tolt, Gustav (2005) Fuzzy Similarity-based Image Processing. Doctoral Dissertation.

19. Munkevik, Per (2005) ”Artificial sensory evaluation – appearance-based analysis of ready meals”. Licentiate Thesis.

20. Buschka, Pär (2005) An Investigation of Hybrid Maps for Mobile Robots. Doctoral Dissertation.

21. Loutfi, Amy (2006) Odour Recognition using Electronic Noses in Robotic and Intelligent Systems. Doctoral Dissertation.

22. Gillström, Peter (2006) Alternatives to Pickling; Preparation of Carbon and Low Alloyed Steel Wire Rod. Doctoral Dissertation.

23. Li, Jun (2006) Learning Reactive Behaviors with Constructive Neural Networks in Mobile Robotics. Doctoral Dissertation.

24. Otterskog, Magnus (2006) Propagation Environment Modeling Using Scattered Field Chamber. Doctoral Dissertation.

25. Lindquist, Malin (2007) Electronic Tongue for Water Quality Assessment. Doctoral Dissertation.

26. Cielniak, Grzegorz (2007) People Tracking by Mobile Robots using Thermal and Colour Vision. Doctoral Dissertation.

27. Boustedt, Katarina (2007) Flip Chip for High Frequency Applications – Materials Aspects. Doctoral Dissertation.

28. Soron, Mikael (2007) Robot System for Flexible 3D Friction Stir Welding. Doctoral Dissertation.

29. Larsson, Sören (2008) An industrial robot as carrier of a laser profile scanner. – Motion control, data capturing and path planning. Doctoral Dissertation.

30. Persson, Martin (2008) Semantic Mapping Using Virtual Sensors and Fusion of Aerial Images with Sensor Data from a Ground Vehicle. Doctoral Dissertation.