
Department of Informatics

Yves Albers-Schoenberg

Micro Aerial Vehicle Localization using Textured 3D City Models

Master Thesis

Robotics and Perception Lab
University of Zurich

Supervision
Dr. Andras Majdik
Prof. Dr. Davide Scaramuzza

November 2013


    Contents

Abstract

1 Introduction
  1.1 Goal
  1.2 Motivation
  1.3 Autonomous Flight in Urban Environments
      1.3.1 Above-Rooftop Flight
      1.3.2 Street-Level Flight
  1.4 Legal Framework
  1.5 Literature Review

2 Textured 3D City Models
  2.1 3D Cadastral Models
  2.2 The Google Street View API
  2.3 Generating Perspective Cutouts
  2.4 Refining Geotags by Using Cadastral 3D Models
  2.5 Backprojecting Street View Images and Depth Map Rendering

3 Vision-based Global Positioning
  3.1 Preprocessing
  3.2 Air-ground Algorithm
  3.3 EPnP and Ransac
  3.4 Vision-based Positioning

4 Experimental Setup
  4.1 Platform
  4.2 Test Area

5 Results and Discussion
  5.1 Visual Inspection
  5.2 Uncertainty Quantification
  5.3 Virtual Views and Iterative Refinements
  5.4 GPS Comparison

6 Conclusion and Outlook

A Appendix
  A.1 OpenCV EPnP + Ransac
  A.2 ICRA Submission


    Abstract

This thesis presents a proof-of-concept of a purely vision-based global positioning system for a Micro Aerial Vehicle (MAV) acting in an urban environment. The overall goal is to contribute to the advance of autonomously acting aerial service robots in city-like areas. It is shown that the increasing availability of textured 3D city models can be used to localize a MAV in cases where satellite-based GPS is not, or only partially, available. Textured urban scenes are created by overlaying Google Street View images on a georeferenced cadastral 3D city model of Zurich. For a particular MAV image, the most similar Street View image is identified by means of an image search algorithm, and the global camera position of the MAV is derived from it. An extensive test dataset containing aerial recordings of a 2 km long trajectory in the city of Zurich is used to verify and evaluate the proposed approach. It is concluded that the suggested vision-based positioning algorithm can be used as a complement or an alternative to satellite-based GPS, with comparable results in terms of localization accuracy. Finally, suggestions are presented on how to improve the introduced vision-based positioning approach and implement it in a future real-life application. Results of this thesis have been used in the ICRA submission "Micro Air Vehicle Localization and Position Tracking from Textured 3D Cadastral Models" 1.

1 International Conference on Robotics and Automation (ICRA 2014), Micro Air Vehicle Localization and Position Tracking from Textured 3D Cadastral Models (under review), Andras L. Majdik, Damiano Verda, Yves Albers-Schoenberg, Davide Scaramuzza.



    Chapter 1

    Introduction

This chapter describes the goal of the underlying master thesis and gives an overview of autonomous flight of Micro Aerial Vehicles (MAVs) in urban environments. Challenges are highlighted and a motivation for the suggested vision-based positioning approach is provided. Moreover, a literature review is conducted summarizing the current state of the art.

    1.1 Goal

The goal of this master thesis is to design a vision-based global positioning system for a Micro Aerial Vehicle (MAV) acting in urban environments. The main idea is to conduct absolute positioning of the MAV using a textured 3D city model together with a monocular on-board camera. Basically, the MAV is localized with respect to the surrounding buildings, which are perceived by means of a camera. By introducing this novel positioning technique, it is shown that the increasing availability of textured 3D city models can successfully be used as an alternative to satellite-based global positioning systems in cases where these are not, or only partially, available. This thesis is an extension of the author's semester thesis, which resulted in the publication MAV Localization using Google Street View Data [21]. Rather than implementing a real-time system, this thesis aims to give a proof-of-concept of the underlying principles of vision-based positioning using textured 3D models for MAVs acting in urban environments.

    1.2 Motivation

With the rapid advance of low-cost Micro Aerial Vehicles, new applications such as airborne goods delivery 1, inspection 2, traffic surveillance or first-aid delivery in case of accidents start to emerge. Moreover, it is conceivable that tomorrow's small-sized aerial service robots will increasingly carry out tasks autonomously, i.e. without any direct human intervention. Accurate localization is indispensable for any autonomously acting robot and is a prerequisite for the successful completion of tasks in a real-life environment.

1 As envisaged by the US company Matternet: http://matternet.us.
2 As envisaged by the US company Skycatch: http://skycatch.com.



Figure 1.1: On the left: there is no direct line of sight for the GPS signal to the red satellites due to an urban canyon. On the right: the GPS signals are reflected by the surrounding buildings.

Satellite-based global positioning systems like GPS, Glonass, Galileo or Compass work based on the principle of triangulation and have become the state of the art for global outdoor positioning, forming a crucial component of many modern technological systems. Everyday applications like smart-phones, driving assistance or fleet tracking heavily rely on the availability of satellite-based signals for positioning. While standard consumer-grade GPS receivers have a typical accuracy between 3 and 15 meters 95% of the time, augmentation techniques like differential GPS (DGPS) or Wide Area Augmentation Systems (WAAS) to support aircraft navigation can reach a typical accuracy of 1 to 3 meters [23]. The accuracy and reliability of a standard GPS sensing device fundamentally depend on the number of visible satellites which are in the line of sight of the receiver. In urban areas, the availability of satellite-based GPS signals is often reduced compared to unobstructed terrain, or even completely absent in case of restricted sky view. So-called urban canyons tend to shadow the GPS signals, and building facades reflect the signals, violating the underlying triangulation assumption that signals travel along a direct line of sight between the satellite and the receiver. Several approaches have been suggested in the literature to deal with these drawbacks, such as using additional ground stations or fusing the GPS measurements with data from Inertial Measurement Units (IMUs) for dead-reckoning. This thesis aims to provide a vision-based alternative to satellite-based global positioning in urban environments by taking advantage of 3D city models together with geotagged image databases such as Google Street View 3 or Flickr 4. The motivation is to develop novel approaches for MAV positioning, paving the way for tomorrow's aerial robotics applications in urban environments.

3 https://maps.google.ch/
4 http://www.flickr.com/


    1.3 Autonomous Flight in Urban Environments

In the context of this work, the term autonomous flight in urban environments refers to the capability of a MAV to independently, i.e. without any human piloting, execute the following directive:

Fly from Address A to Address B

This capability is a basic requirement for any autonomously acting aerial robot fulfilling tasks in city-like environments. Fig. 1.2 shows a simplified reference control scheme of an autonomous robot in the style of [31]. As framed by the red dashed line, an autonomously flying MAV will carry out all four major building blocks of navigation, namely localization, path planning, motion control and perception, in an automated way.

[Figure 1.2: block diagram with the components Mission Commands (Start: Address A, End: Address B), Localization, Path Planning, Motion Control, Perception and Safety, interacting with the Real World Environment; the autonomous-flight loop is framed by a red dashed line.]

Figure 1.2: The mission commands fly from Address A to Address B are given by the operator. To execute the mission commands, the MAV needs to iteratively carry out the following steps until the goal is reached: firstly, localize and determine the current position; secondly, plan the next step (path) to reach the target; thirdly, generate motor commands to execute the planned path and interact with the environment; fourthly, extract information from the environment to get an update on the current state.

Localization and Map Building: This work focuses on the localization step in the above control scheme. It is explicitly assumed that the MAV has access to a given map of the environment, i.e. the 3D city model, and does not need to simultaneously localize and map the environment (SLAM). SLAM systems like [17] have been successfully applied to localize MAVs in indoor environments where no map is available [7].

Hereafter, the term global localization refers to the positioning of the robot with respect to a global coordinate system such as the World Geodetic System 1984 (WGS84). Besides positioning (e.g. the determination of latitude and longitude), localization usually also includes information on the robot's attitude (i.e. yaw, roll and pitch).

Path-Planning: As defined in [31], path-planning involves identifying a trajectory that will cause the robot to reach the goal location when executed. This is a strategic problem-solving competence that requires the robot to plan how to achieve its long-term goals. Path-planning usually involves the determination of intermediate way-points between the current position and the goal. Even though path planning is a long-term process, the planned path can change when new information on the environment becomes available or the mission control commands are changed. A crucial competence of any autonomous robot acting in human environments is the capability of short-term obstacle avoidance. Especially in urban areas, where the robot's workspace is shared with pedestrians, cars and public transport, a robust obstacle avoidance system is a basic prerequisite for any safe robot operation.

Motion Control: Motion control is the process of generating suitable motor commands so that the robot executes the planned path. In the case of a quadrocopter, motion control regulates the rotary speed of the four rotors to move the MAV to the desired position and attitude. Generally, one differentiates between open-loop control, where the robot's position is not fed back to the kinematic controller to regulate the velocity or the position, and closed-loop control, where the robot's system state (velocity, position) is fed back as an input to the kinematic controller. The most widely used closed-loop control mechanism is a Proportional Integral Derivative (PID) controller, which minimizes the error between a measured system variable and its desired set-point.
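As an illustration of the closed-loop principle, the following minimal sketch implements a textbook discrete-time PID controller. It is a generic example with made-up gains, not the controller of the MAV platform used in this thesis.

```python
class PID:
    """Discrete-time PID controller: u = Kp*e + Ki*integral(e) + Kd*de/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: regulate the altitude towards 5 m with hypothetical gains.
controller = PID(kp=0.8, ki=0.1, kd=0.3)
thrust_correction = controller.update(setpoint=5.0, measurement=4.2, dt=0.02)
```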

Perception: Perception refers to the process of information extraction from the robot's environment. During sensing, raw data is collected depending on the robot's specific sensor configuration. Various types of sensors are used in robotics, such as laser scanners or ultrasonic sensors for range sensing, IMUs for attitude estimation or cameras for positioning and motion detection. Generally, one differentiates between active sensors, which release energy and measure the environmental response to that energy, and passive sensors, which detect ambient energy changes without releasing energy to the environment. Moreover, one differentiates between exteroceptive sensors that measure environmental properties such as the temperature and interoceptive sensors that measure the robot's internal state such as the actuator positions. A detailed overview of different sensing technologies can be found in [31]. The meaningful interpretation of raw sensor data is referred to as information extraction and is a key process in the perception phase. In this work, the main sensor used is a monocular camera producing a continuous image stream.

Safety: Reliable safety measures are a core requirement for any autonomous mobile system acting in a real-world environment. Especially in urban areas, where the robot shares its workspace with human beings, well-tested safety measures such as obstacle avoidance are crucial for any robotic application.

Based on the context-specific requirements for the above functions, two scenarios for autonomous flight in urban environments are defined and explained below.


Figures 1.3 and 1.4: Above-rooftop flight.

Figures 1.5 and 1.6: Street-level flight.

    1.3.1 Above-Rooftop Flight

In this scenario, the MAVs fly above the buildings as illustrated in Figures 1.3 and 1.4. Depending on the city-specific urban structure (e.g. a mega-city in an emerging country with skyscrapers vs. an ancient city with historic buildings in Western Europe), a minimum flying height will be defined such that the MAV is always flying above the rooftops of the buildings. The main advantage of this scenario is the absence of obstacles in the form of man-made structures and humans. Therefore, trajectory planning is drastically simplified, resulting in a faster and safer system. Moreover, MAV localization can be robustly carried out based on satellite-based GPS, as no buildings obstruct the direct line of sight to the satellites. Recent research has dealt with and largely solved GPS-based autonomous flight. Low-cost autopilots such as the PX4 5 can be used together with open-source software such as QGroundControl 6 or Paparazzi 7 to control a GPS-based flight mission. To demonstrate the practice and the limitations of this approach, an autonomous test flight has been conducted using a Parrot AR Drone 2.0 together with QGroundControl. A video presentation summarizing the results of this flight is attached to this thesis. It is clearly shown that GPS way-point following works well in principle. However, it is also demonstrated that the position accuracy, i.e. the MAV's ability to follow the designated path, is not precise enough to use GPS-based flight in the street-level flight scenario described below.

5 https://pixhawk.ethz.ch/px4/en/start
6 http://qgroundcontrol.org/
7 http://paparazzi.enac.fr

    1.3.2 Street-Level Flight

In this scenario, the MAV flies at street level, i.e. between the building facades, as illustrated in Figures 1.5 and 1.6. Depending on the city-specific characteristics and the local obstacle scenario (e.g. streets with cars, public transport, pedestrians), there will be a minimum flying height of approximately 4-5 meters for safety reasons. The city-specific positions of overhead contact wires and crossovers will moreover determine an acceptable range for a safe flying altitude. The main challenges associated with autonomous flight in this scenario are obstacle avoidance and trajectory planning. Flying from address A to address B requires a path planning strategy which takes into account the local scene structure and the prevailing traffic situation. However, accurate positioning also becomes more challenging than in the above-rooftop scenario, as the satellite-based GPS signals can be shadowed by the surrounding buildings, as illustrated in Figure 1.1.

In a realistic application, the two scenarios are likely to be combined. Take-off/landing and short-distance flights will be carried out at street level, while long-distance flights could be conducted above the rooftops. This work aims to contribute to solving the problem of localizing the MAV in the outlined street-level flight scenario.

    1.4 Legal Framework

This section provides a brief overview of the legal environment concerning the operation of MAVs in urban areas 8. In this context there are two main legal aspects to be considered: a) the rules governing the operation of unmanned aerial vehicles (UAVs) and b) the protection of data and the private sphere of individuals.

    a) The rules regulating the operation of UAVs

The operation of UAVs is governed by the ordinance on special categories of aircraft (Ordinance) issued by the Federal Department of the Environment, Transport, Energy and Communications (DETEC) 9. The Ordinance distinguishes between UAVs weighing more than 30 kilograms and those weighing up to 30 kilograms. The most significant of the Ordinance's rules governing UAVs weighing up to 30 kilograms, which are of relevance for our purposes, are the following:

According to art. 14 of the Ordinance, the operation of UAVs with a total weight of up to 30 kilograms does not require an authorization of the Swiss Federal Office of Civil Aviation (FOCA);

Constant and direct eye contact with the UAV has to be maintained at all times (art. 17 para. 1, Ordinance);

Autonomous operation of UAVs (through cameras or GPS) within the eye-contact area of the pilot is allowed, provided that the pilot is always in a position to intervene; otherwise the authorization of FOCA is required;

8 This section has been composed together with a legal counselor.
9 Ordinance on special categories of aircraft, issued and subsequently amended by the Federal Department of the Environment, Transport, Energy and Communications on 24 November 1994 (SR 748.941).


It is prohibited to operate UAVs a) within 5 kilometres of a civil or military aerodrome and b) at more than 150 metres above the ground within control areas (art. 17 para. 2, Ordinance);

The operation of UAVs weighing more than 0.5 kilograms requires liability insurance of at least CHF 1 million (art. 20 para. 2, Ordinance);

    Specific rules governing military areas must be observed.

b) The protection of data and the private sphere of individuals

The use of MAVs to process personal data in urban areas might trigger the application of the Federal Act on Data Protection (FADP). The scope of the law is very wide and includes the processing 10 of personal data 11 carried out by federal authorities, private organisations and private individuals. It is important to note that the FADP does not apply to personal data processed by a natural person exclusively for personal use and not disclosed to third parties (art. 2 FADP). Moreover, there is no breach of privacy if the data subject has made the data generally accessible and has not expressly prohibited its processing (art. 12 para. 3 FADP). A breach of privacy is justified by the consent of the injured party or by an overriding private or public interest, such as the processing of personal data for the purposes of research, planning and statistics (provided the results of the work are anonymised) (art. 13 FADP). Whenever an activity of data processing within the meaning of the FADP is involved, a number of duties and obligations apply, namely: registration of the data processing systems, protection against unauthorised access to data, information duties, the duty to ensure the correctness of data, etc.

In this context, it is worth mentioning the recent judgement of the Swiss Federal Supreme Court in the Google Street View case 12. In order to comply with the FADP, the Supreme Court required Google to further develop its anonymisation software and to perform swift ad-hoc anonymisation upon request 13.

10 Art. 3 (g) FADP: Processing: any processing of data, irrespective of the means used, in particular the collection, storage, use, modification, communication, archiving and deletion of data.
11 Art. 3 (a) FADP: Personal data (data): all information relating to an identified or identifiable person.
12 BGE 138 II 346.
13 For further information on the Google Street View case: http://www.edoeb.admin.ch/datenschutz/00683/00690/00694/01109/index.html?lang=en

1.5 Literature Review

In recent years, several research papers have addressed the development of autonomous Unmanned Ground Vehicles (UGVs), leading to striking new technologies like self-driving cars. These can map and react in highly uncertain street environments while partially [6] or completely [34] neglecting GPS. In the coming years, a similar burst in the development of autonomously acting Micro Aerial Vehicles is expected. Several recent papers have addressed visual localization and navigation in indoor environments using low-cost MAVs [5, 37], or, as in [40], tackle the problem of safely navigating a MAV through a corridor using optical flow. Most of these approaches are based on Simultaneous Localization and Mapping (SLAM) systems such as [17] using a monocular camera. Other approaches rely on stereo vision or laser odometry as described in [1]. Several papers have addressed vision-based localization in city environments. In [36] the authors present a method for estimating the geospatial trajectory of a moving camera with unknown intrinsic parameters. A similar approach is discussed in [14], which aims to localize a mobile camera device by performing a database search using a wide-baseline matching algorithm. [10] introduces a SIFT-based approach [20] to detect buildings in mobile imagery. In [39] the authors propose an image-based localization system using GPS-tagged images. The camera position of the query view is therein triangulated with respect to the most similar database image. Note that most of these approaches address the localization of ground-level imagery with respect to geo-referenced ground-level image databases. However, this thesis explicitly focuses on vision-based aerial localization for MAVs. An interesting paper addressing vision-based MAV localization in urban canyons is given by [15], based on optical flow. Moreover, probably the most similar work to the approach presented in this thesis is [38], in which the authors make use of metric, geo-referenced visual landmarks based on images taken by a consumer camera on the ground to localize the MAV. However, in contrast, the approach presented in this thesis is completely based on publicly available 3D city models and image databases. A short literature overview on textured 3D models is presented at the beginning of the next chapter.


    Chapter 2

    Textured 3D City Models

A basic requirement for the vision-based global positioning approach discussed in this work is the availability of a textured 3D city model. Such a model has two important aspects: firstly, it contains the geo-referenced 3D geometry of the city in a global coordinate system, and secondly, it contains the imagery of the city environment attached to the 3D geometry. In other words, such a model is a virtual reproduction of a city's spatial appearance. Essentially, what is needed in this thesis is a model which allows linking every pixel of the model texture to its specific 3D point in a global reference frame. An image database for which every 3D point can be deduced will hereafter be called a 3D referenced image database. The large-scale 3D reconstruction of entire cities is a dynamic research area. Traditionally, the creation of such photorealistic city models has been based on the fusion of aerial oblique imagery with Light Detection and Ranging (LiDAR) data, as described in [22, 12], or on stereo vision techniques. Other approaches use street-level imagery in combination with LiDAR and GPS for positioning, as described in [11, 29], or use only a single video camera, as described in [28]. More recent approaches rely on user-generated geo-tagged image databases such as Flickr 1 or Picasa 2, as described in [2]. Given the results presented in these papers, it may be concluded that large-scale textured 3D city models will become more and more available in the near future. At present, Google Street View 3 is the most advanced commercially available large-scale city image database in terms of geographic coverage. Even though the current version of the Google Street View API does not support the extraction of 3D data, i.e. it does not provide an official interface to recover depth information, the coming generations of Google Street View are likely to be available in full 3D, as outlined in [3]. To anticipate this development, in this work an alternative approach is applied to create textured 3D city scenes by overlaying 3D cadastral models with Google Street View imagery, as described in this chapter.

1 http://www.flickr.com/
2 https://picasaweb.google.com/
3 https://www.google.com/maps


    2.1 3D Cadastral Models

Accurate 3D city models based on administrative cadastral measurements are becoming increasingly available to the public all over the world. In Switzerland, the municipal authorities of Basel 4, Bern 5 and Zurich 6 provide access to their cadastral 3D data. The city model of Zurich used in this work was acquired from the urban administration and claims to have an average lateral position error of ±10 cm and an average error in height of ±50 cm. The city model is referenced in the Swiss coordinate system CH1903, which is described in detail in [26]. An online conversion calculator between CH1903 and WGS84 can be found under 7. Please note that this model does not contain any texture information. As specified in [35], the model is available in several current computer-aided design (CAD) file formats and comes in three different Levels of Detail (LODs).

Digital Terrain Model (LOD 0): The digital terrain model is available as a Triangulated Irregular Network (TIN) or in the format of interpolated contour lines, cf. Fig. 2.1(a).

3D Block Model (LOD 1): The 3D block model represents the buildings and their height in the form of blocks (prisms), cf. Fig. 2.1(b).

3D Rooftop Model (LOD 2): The 3D rooftop model represents the facades and the rooftops of the buildings in more detail and also models walls and bridges, cf. Fig. 2.1(c).

(a) Terrain Model (b) Block Model (c) Rooftop Model

Figure 2.1: The figures show the different Levels of Detail (LODs) in which the cadastral 3D model is available. The images in this figure belong to the city of Zurich.

In this work, the LOD 2 model is used to get the highest level of accuracy available. However, as shown in Fig. 2.2, the LOD 2 model is a simplification of reality. Balconies (shown in yellow), windows (shown in green) and special structures (shown in red) are usually not modelled. It is evident that the discrepancy between the simplified cadastral city model and the Street View images introduces an error when the images are backprojected onto the model facades, as described in chapter 2.5.

4 http://www.gva-bs.ch/produkte_3d-stadtmodelle.cfm
5 http://www.geobern.ch/3d_home.asp
6 http://www.stadt-zuerich.ch/ted/de/index/geoz/3d_stadtmodell.html
7 http://www.swisstopo.admin.ch/internet/swisstopo/de/home/apps/calc/navref.html
Figure 2.2: 3D cadastral model (left) compared to the corresponding Street View images (right). It is shown that facade details such as balconies (yellow), windows (green) and special geometries (red) are not modelled.


    2.2 The Google Street View API

Google offers an application programming interface (API) to access the Street View database 8. There is a static (i.e. non-interactive) interface which allows the user to specify the location (i.e. latitude and longitude), the attitude (i.e. yaw and pitch) and the image size to download a certain perspective image via a browser. For example, to display Figure 2.4(a) the following URL was accessed:

http://maps.googleapis.com/maps/api/streetview?size=640x320&heading=270&pitch=0&location=47.376645,8.548712&sensor=false
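Such a static API call is easily scripted. The minimal sketch below simply saves the returned JPEG to disk, using the parameter values from the example URL above; note that current versions of the API additionally require an API key, which is omitted here.

```python
# Minimal sketch: fetch one perspective Street View image via the static API.
# Parameters mirror the example URL above; newer API versions also expect a
# "key" parameter, which is omitted in this illustration.
import requests

params = {
    "size": "640x320",
    "heading": 270,
    "pitch": 0,
    "location": "47.376645,8.548712",
    "sensor": "false",
}
response = requests.get("http://maps.googleapis.com/maps/api/streetview", params=params)
with open("streetview_cutout.jpg", "wb") as f:
    f.write(response.content)
```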

Moreover, a dynamic API based on JavaScript is available which provides panoramic 360-degree views. A detailed description of the two APIs can be found in the Google Developer online reference. The Street View panoramas are provided in the form of an equirectangular projection which contains 360 degrees of horizontal view (a full wrap-around) and 180 degrees of vertical view (from straight up to straight down) 9. An example of such an equirectangular panoramic image is shown in Figure 2.4(c). For this work, a Python 10 script is used to access the dynamic Google Street View API and download the panoramas for the test area described in the experimental setup, cf. chapter 4. The Street View panoramas are stored as a set of tiles which must be downloaded separately and stitched together to obtain a panoramic image. Different sets of tiles are available depending on the specific zoom level z_zoom. In this thesis, the panoramas have been used at z_zoom = 3, which results in a panoramic image size of P_width x P_height = 3328 x 1664 pixels and consists of 6.5 x 3.5 tiles, each having a resolution of 512 x 512 pixels [27]. The maximum possible zoom level offered is z_zoom = 4, which results in a panoramic image size of P_width x P_height = 6656 x 3328 pixels. However, not all panoramas are available at this zoom level. One particular advantage of the dynamic API is that it allows for large-scale downloads of panoramic images without any restriction. In contrast, if the static API is used, the user can quickly get blocked when downloading excessively, i.e. when too many download requests for subsequent images are issued. To get perspective images with the dynamic API, the user must therefore generate perspective cutouts of the panoramic images himself, as described in chapter 2.3. The basic functional setup of the Street View script is described in Figure 2.3.

8 https://developers.google.com/maps/documentation/streetview/
9 https://developers.google.com/maps/documentation/javascript/streetview?hl=en
10 http://www.python.org


FUNCTION: download panoramas

DESCRIPTION: Download Street View panoramas

INPUT

- A text file containing a list L_download of WGS84-referenced GPS coordinates (latitude, longitude) gps_1, ..., gps_j, ..., gps_m derived from Google Maps, for which the closest (in terms of Euclidean distance) available panoramic image should be downloaded.
- The panoramic zoom level z_zoom defining the panoramic image size P_height x P_width.

OUTPUT

- A folder containing a set I_panos of panoramic images p_1, ..., p_j, ..., p_M, one for every GPS coordinate gps_j ∈ L_download.
- A list L_geo containing the geotags geo_1, ..., geo_j, ..., geo_m of the downloaded panoramic images. Every geotag is given by the latitude, longitude, yaw, roll and pitch of the panoramic camera position.

FUNCTIONAL REQUIREMENTS

- For every GPS coordinate gps_j ∈ L_download, download the tiles which together make up the closest panoramic image. Stitch the tiles together and save the panoramic image p_j in I_panos.
- For every gps_j ∈ L_download, get the geotag geo_j of the closest panoramic image and save it in L_geo.

Figure 2.3: Functional setup of the Street View script used to download Street View panoramas.
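As an illustration of the stitching requirement in the box above, the following sketch assembles already-downloaded tiles into one panorama. The tile size (512 x 512 pixels) and the panorama size at zoom level 3 (3328 x 1664 pixels, i.e. 6.5 x 3.5 tiles) are taken from the text; the tile file-name pattern and the preceding download step are assumptions of this example.

```python
# Stitch already-downloaded 512x512 tiles into one equirectangular panorama
# (zoom level 3: 3328 x 1664 pixels, i.e. 6.5 x 3.5 tiles). The file-name
# pattern is an assumption of this illustration.
from PIL import Image

TILE_SIZE = 512
PANO_WIDTH, PANO_HEIGHT = 3328, 1664      # zoom level z_zoom = 3, cf. [27]

def stitch_panorama(tile_path_pattern="pano_tile_x{x}_y{y}.jpg"):
    pano = Image.new("RGB", (PANO_WIDTH, PANO_HEIGHT))
    n_cols = -(-PANO_WIDTH // TILE_SIZE)   # 7 tile columns (last one partly cropped)
    n_rows = -(-PANO_HEIGHT // TILE_SIZE)  # 4 tile rows (last one partly cropped)
    for y in range(n_rows):
        for x in range(n_cols):
            tile = Image.open(tile_path_pattern.format(x=x, y=y))
            pano.paste(tile, (x * TILE_SIZE, y * TILE_SIZE))  # excess is clipped at the border
    return pano

# Example usage: pano = stitch_panorama(); pano.save("panorama.jpg")
```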


(a), (b) Perspective Street View images accessed via the static Google Street View API using the internet browser.

(c) A panoramic Street View image (equirectangular projection) stitched together using the dynamic Street View API. The yaw spans from 0 to 360 degrees (x-axis along the image width), whereas the pitch extends from 0 to 180 degrees (y-axis along the image height).

(d) fov = 60 degrees (e) fov = 90 degrees (f) fov = 120 degrees

(g) pitch = -45 degrees (h) pitch = 0 degrees (i) pitch = 45 degrees

(j) yaw = -45 degrees (k) yaw = 0 degrees (l) yaw = 45 degrees

Figure 2.4: Figures (d)-(l) show different perspective cutouts from the panoramic image in (c) using different cutout parameters, as described in chapter 2.3.


    2.3 Generating Perspective Cutouts

As shown later in chapter 3.2, a perspective cutout of the Street View panoramas, i.e. an image which meets the underlying assumptions of a perspective camera model as described in [13], needs to be generated. This is done following the procedure outlined in [27]. The functional setup of the cutout function is described in Figure 2.5.

FUNCTION: perspective cutout

DESCRIPTION: Generate a perspective cutout of a panoramic Street View image

INPUT

- Panoramic Street View image p_j.
- Panoramic image size P_size given by the image width P_width and the image height P_height.
- Desired image size C_size of the perspective cutout given by C_width and C_height.
- Horizontal field of view hfov for the desired cutout.
- Image center for the desired perspective cutout, specified by yaw and pitch of the panoramic projection.

OUTPUT

- Perspective view c_k according to the input specifications.

FUNCTIONAL REQUIREMENTS

- Transform the equirectangular projection to a perspective view.

Figure 2.5: Functional setup of the perspective cutout function.

Based on the input parameters, the internal camera matrix K_street for the generated perspective cutout can be calculated as follows:

\[
c_x = C_\mathrm{height}/2, \qquad c_y = C_\mathrm{width}/2 \tag{2.1}
\]

where c_x and c_y represent the optical camera center. The camera focal lengths f_x, f_y are given by:

\[
f_y = f_x = \frac{C_\mathrm{width}}{2\,\tan(\mathrm{hfov}/2)} \tag{2.2}
\]

The Street View camera matrix is hence given by:

\[
K_\mathrm{street} =
\begin{pmatrix}
f_x & 0 & c_x \\
0 & f_y & c_y \\
0 & 0 & 1
\end{pmatrix} \tag{2.3}
\]
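The equations above translate directly into code. The sketch below computes K_street and performs the equirectangular-to-perspective remapping implied by it; the zero-yaw direction, the pitch sign and the placement of the principal point at the image center are assumptions of this illustration and may need adjusting to the actual panorama layout.

```python
# Illustrative sketch: camera matrix of Eq. (2.2)-(2.3) and the resulting
# equirectangular-to-perspective remapping. The zero-yaw direction, the pitch
# sign and the placement of the principal point (here at the image center,
# cf. Eq. (2.1) up to the row/column naming) are assumptions of this example.
import cv2
import numpy as np

def street_view_camera_matrix(C_width, C_height, hfov_deg):
    fx = fy = C_width / (2.0 * np.tan(np.radians(hfov_deg) / 2.0))   # Eq. (2.2)
    cx, cy = C_width / 2.0, C_height / 2.0                           # principal point
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])                               # Eq. (2.3)

def perspective_cutout(pano, C_width, C_height, hfov_deg, yaw_deg, pitch_deg):
    """Sample a pinhole view with the given optical-axis yaw/pitch from an
    equirectangular panorama (360 deg x 180 deg, 'straight up' at the top row)."""
    P_height, P_width = pano.shape[:2]
    K_inv = np.linalg.inv(street_view_camera_matrix(C_width, C_height, hfov_deg))
    # Unit ray direction for every cutout pixel in the cutout camera frame
    # (x right, y down, z forward).
    u, v = np.meshgrid(np.arange(C_width), np.arange(C_height))
    rays = np.stack([u, v, np.ones_like(u)], axis=-1) @ K_inv.T
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Rotate the rays by the requested yaw (about the vertical axis) and pitch.
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)   # positive pitch looks up
    R_yaw = np.array([[np.cos(yaw), 0, np.sin(yaw)], [0, 1, 0], [-np.sin(yaw), 0, np.cos(yaw)]])
    R_pitch = np.array([[1, 0, 0], [0, np.cos(pitch), -np.sin(pitch)], [0, np.sin(pitch), np.cos(pitch)]])
    rays = rays @ (R_yaw @ R_pitch).T
    # Ray direction -> panorama pixel (longitude along the width, latitude along the height).
    lon = np.arctan2(rays[..., 0], rays[..., 2])
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))
    map_x = ((lon / (2 * np.pi) + 0.5) * P_width).astype(np.float32)
    map_y = ((lat / np.pi + 0.5) * P_height).astype(np.float32)
    return cv2.remap(pano, map_x, map_y, cv2.INTER_LINEAR)
```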


2.4 Refining Geotags by Using Cadastral 3D Models

The provided geotags geo_j ∈ L_geo (cf. Figure 2.3) for the Google Street View imagery are not exact. As shown in [32], where 1400 images were used for an analysis, the average error of the camera positions is 3.7 meters and the average error of the camera orientation is 1.9 degrees. In the same work, an algorithm is proposed to improve the precision of the Street View image poses. This algorithm uses the described cadastral 3D city model of Zurich to detect the outlines of the buildings by rendering out 3D panorama views as illustrated in Fig. 2.6(a)-(b). Accordingly, the outlines of the buildings are also computed for the Street View panoramas using the image segmentation technique described in [19]. Finally, the refined pose is computed by an iterative optimization, namely by minimizing the offset between the segmented outlines from the Street View panoramas and the outlines of the panorama view rendered from the 3D cadastral model. For this work, the described refinement algorithm was applied to correct the Google Street View geotags used in the experimental setup, cf. chapter 4. Fig. 2.6 shows the difference when overlaying rendered panoramas of the cadastral 3D model before and after applying the correction algorithm. It is clearly evident that the match quality, i.e. the accuracy when overlaying the 3D city model with Street View images, drastically increases after the application of the described refinement algorithm.

(a) Rendered building outlines before correction. (b) Rendered building outlines after correction. (c) Overlay of panorama and rendered outlines before correction. (d) Overlay of panorama and rendered outlines after correction.

Figure 2.6: Figure (a) shows the rendered building outlines based on the original geotag of a panoramic Street View image. Figure (b) shows the rendered building outlines based on the refined geotag. Figure (c) overlays the panoramic image with the outlines based on the original geotag. Figure (d) overlays the panoramic image with the outlines based on the refined geotag. It is clearly shown that the overlay in Figure (d) is much more precise than in Figure (c).

Note that the refinement algorithm was run by the authors of [32], as the code had not been published at the time of writing. A functional setup of the refinement algorithm is, however, provided in Figure 2.7.


FUNCTION: refine geo tags

DESCRIPTION: Refine the panoramic geotags using the 3D cadastral model

INPUT

- A text file containing a list L_geo of Street View geotags (latitude, longitude, yaw, pitch, roll) geo_1, ..., geo_j, ..., geo_m derived from the function download panoramas (cf. Figure 2.3), which describe the Street View camera locations for a set of panoramic images p_1, ..., p_j, ..., p_M.
- A set I_panos of panoramic images p_1, ..., p_j, ..., p_M.
- The 3D cadastral model for the locations in L_geo.

OUTPUT

- A list L_refined containing the refined panoramic camera locations xyz_1, ..., xyz_j, ..., xyz_m referenced in the 3D model coordinate frame CH1903.
- For every original geotag geo_j related to the panoramic image p_j, the refined external camera matrix RT_j, given by the refined rotation matrix R_j, which describes the rotation of the Street View camera with respect to the model origin, and the translation vector T_j, which describes the translation with respect to the model origin. Note that xyz_j = inv(R_j) T_j.

FUNCTIONAL REQUIREMENTS

- Segment building outlines in the panoramic images p_1, ..., p_j, ..., p_M.
- Render panoramic building outlines from the 3D cadastral model for the Street View camera locations in L_geo.
- Overlay the segmented building outlines with the panoramic renderings and measure the offset.
- Iteratively refine the Street View camera locations by running an optimization to minimize the offset.

Figure 2.7: Functional setup of the refinement algorithm proposed by [32] to correct the panoramic geotags of the Street View images.
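Since the refinement code of [32] was not public at the time of writing, the following sketch only illustrates the optimization loop of Figure 2.7. The functions render_outlines and segment_outlines are placeholders for the cadastral-model renderer and the segmentation of [19], and the chamfer-style cost and Nelder-Mead optimizer are assumptions of this illustration, not the method of [32].

```python
# Illustrative sketch of the pose-refinement loop of Figure 2.7.
# render_outlines(pose) and segment_outlines(panorama) are placeholders:
# the first would render building outlines from the cadastral model for a
# candidate camera pose, the second would segment outlines in the panorama.
import numpy as np
from scipy.optimize import minimize
from scipy.ndimage import distance_transform_edt

def outline_offset(rendered_mask, segmented_mask):
    """Chamfer-style offset: mean distance of rendered outline pixels to the
    nearest segmented outline pixel (an assumed, illustrative cost)."""
    dist_to_segmented = distance_transform_edt(~segmented_mask)
    return dist_to_segmented[rendered_mask].mean()

def refine_geotag(initial_pose, panorama, render_outlines, segment_outlines):
    """initial_pose = (x, y, z, yaw, pitch, roll) in the model frame."""
    segmented = segment_outlines(panorama)

    def cost(pose):
        return outline_offset(render_outlines(pose), segmented)

    result = minimize(cost, np.asarray(initial_pose, dtype=float),
                      method="Nelder-Mead")   # derivative-free: the cost is not smooth
    return result.x
```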


2.5 Backprojecting Street View Images and Depth Map Rendering

A given perspective cutout of a downloaded Street View panorama can be backprojected onto the 3D cadastral model, taking into account the refined position, as illustrated in Fig. 2.8(a)-(d). This is done with the open-source 3D modelling software Blender 11. Some sample files showing textured 3D model scenes are attached to this thesis. Note that the quality of the backprojection largely depends on the accuracy of the refined position estimates (i.e. the refined geotags) of the Street View camera and on the modelling accuracy of the 3D cadastral model. The main goal of the backprojection is to assign the texture, i.e. the Street View images, to the corresponding 3D geometries in the cadastral model. An alternative approach to map the 2D pixel coordinates of the Street View cutouts to their global 3D coordinates in the city model is to add the Street View camera perspective to the 3D model and subsequently render out the global 3D coordinates for all the pixels. This process is illustrated in Figure 2.8(e)-(g).

(a) Street View camera view (b) Extracted UV map (c) Backprojected Street View images (d) Textured scene

(e) Z-coordinate map: every pixel is mapped to its global Z-coordinate. (f) X-coordinate map: every pixel is mapped to its global X-coordinate. (g) Y-coordinate map: every pixel is mapped to its global Y-coordinate.

Figure 2.8: Figures (e)-(g) illustrate the rendered out global 3D model coordinates in the style of a heat map.

11 http://www.blender.org


Moreover, the functional setup for rendering out 3D coordinates for the pixels in the Street View images is described in Figure 2.9. Note that the 3D coordinates for the pixels can be rendered out either in the global coordinate system or in the local camera coordinate system. If the global reference frame is used, every pixel in the Street View image can be directly linked to its absolute global coordinates in the city model reference system. Alternatively, the depth values can be rendered out for every pixel and then be converted to the local camera coordinate frame. Remember that the global 3D coordinates are referenced in the Swiss coordinate system CH1903, as outlined in chapter 2.1.

FUNCTION: get 3D coordinates

DESCRIPTION: Render out the global 3D coordinates and/or depth for the Street View pixels

INPUT

- The Street View camera location RT_k specifying the external camera parameters of a specific perspective cutout c_k, where R_k is the rotation matrix of the Street View camera with respect to the model origin and T_k is the translation vector with respect to the model origin.
- The internal camera parameters K_street of the perspective cutout as given by 2.5.
- The cadastral 3D model which contains the location RT_k.

OUTPUT

- For every pixel p_kuv in c_k, the global 3D coordinates to which the pixel corresponds, i.e. X_global(p_kuv), Y_global(p_kuv), Z_global(p_kuv). p_kuv stands for the pixel in cutout k, row u and column v.
- Alternatively, for every pixel p_kuv in c_k, the depth D(p_kuv) which corresponds to the pixel. If desired, the 3D coordinates in the local camera frame can also be extracted, i.e. X_local(p_kuv), Y_local(p_kuv), Z_local(p_kuv).

FUNCTIONAL REQUIREMENTS

- Create a perspective camera in the 3D model according to the external parameters RT_k and the internal parameters K_street.
- Render out coordinate paths, i.e. save the corresponding 3D model coordinates X_global(p_kuv), Y_global(p_kuv), Z_global(p_kuv) for every pixel p_kuv in the image plane of the perspective cutout c_k.

Figure 2.9: This figure outlines the process of linking the Street View cutout pixels to their corresponding 3D coordinates.
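The alternative route mentioned above, rendering depth and converting it to camera or model coordinates, is plain pinhole geometry. The following minimal sketch performs that conversion with numpy; it assumes the extrinsic convention used later in chapter 3.3 (R_k and T_k map global coordinates into the camera frame) and is not the Blender-based pipeline actually used.

```python
# Minimal sketch: convert a rendered per-pixel depth map (depth = Z in the
# camera frame) into local camera coordinates and then into global model
# coordinates. Assumes the convention X_cam = R_k @ X_global + T_k, i.e. T_k
# is the model origin expressed in camera coordinates (cf. chapter 3.3).
import numpy as np

def depth_to_coordinates(depth, K, R, T):
    """depth: (H, W) array; K: 3x3 intrinsics; R: 3x3 rotation; T: 3-vector."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Local camera coordinates: X_cam = depth * K^-1 [u, v, 1]^T.
    rays = pixels @ np.linalg.inv(K).T
    X_local = rays * depth.reshape(-1, 1)
    # Global model coordinates: X_global = R^T (X_cam - T).
    X_global = (X_local - T) @ R      # row-wise (X - T) @ R equals R.T @ (X - T)
    return X_local.reshape(H, W, 3), X_global.reshape(H, W, 3)
```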


    Chapter 3

Vision-based Global Positioning

This chapter presents the vision-based global positioning approach. The functional requirements are derived and the main steps are explained in detail.

The underlying idea of the vision-based global positioning approach is straightforward and is illustrated in Figure 3.1. First (a), in the preprocessing phase, a 3D referenced image database containing perspective Street View cutouts is generated. In this context, 3D referenced means that we can link every pixel in the image database to the corresponding global 3D point which resulted in the 2D image projection. Important steps include: downloading the Street View panoramas, creating perspective cutouts, refining the Street View geotags and finally rendering out the 3D path of every cutout. Second, the MAV image (b) which we want to localize is searched in the Street View cutout database. This is done using the so-called air-ground algorithm, which outputs 2D-2D match points that link corresponding feature points between the MAV and the Street View image (c). Third, the resulting 2D-2D matches can be converted into 2D-3D matches, which link the MAV image feature points to their global 3D counterparts, i.e. the 3D points which result in the projection of the 2D feature points. This is done with the help of the 3D referenced image database established in the preprocessing phase. Finally, a so-called PnP algorithm can be used to estimate the MAV's external camera parameters (d), which describe the global location and attitude of the MAV with respect to the global reference frame (e).

    3.1 Preprocessing

As described before, during the preprocessing phase a 3D referenced image database is created. In a realistic setting, this process should be done offline, as it is computationally expensive. Note that the procedure described at the end of this section in Figure 3.2 is rather complicated, as we do not have direct access to a textured 3D city model; we therefore need to manually link the texture (i.e. the Street View images) to the 3D cadastral model. However, as outlined in the beginning, it is conceivable that Google Street View will be available in full 3D in the near future. The following procedure could then be carried out in a much more direct way.

(a) 3D referenced image database (b) MAV image (c) MAV image - Street View cutout feature correspondences (d) PnP position estimation (e) Global localization

Figure 3.1: Main steps of vision-based localization.

Note that the specific cutout parameters mentioned in Figure 3.2, i.e. field of view, yaw and pitch, will largely depend on the MAV camera used in a real flight scenario. Generally, the field of view of the Street View cutouts should be as high as possible to make use of the biggest possible overlap between the MAV image and the Street View cutout. Cutouts with different yaw parameters should be stored in the image database to ensure a 360-degree coverage of the flying area. Similarly, different pitch cutouts can be used to ensure vertical scene coverage, and different fields of view can ensure the coverage of different zoom levels. Of course, the more cutouts are stored in the image database, the longer it takes to retrieve the correct match with the air-ground algorithm described in the next section. The specific parameters used in this thesis are described in the experimental setup, cf. chapter 4.
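For illustration, the coverage considerations above translate into a simple enumeration of cutout parameters per panorama; the values below are examples only, and the parameters actually used are given in chapter 4.

```python
# Example enumeration of cutout parameters per panorama (illustrative values;
# the parameters actually used in the experiments are given in chapter 4).
from itertools import product

yaws = range(0, 360, 45)      # eight headings for 360-degree coverage
pitches = (-10, 0, 20)        # vertical scene coverage
hfovs = (90, 120)             # different zoom levels via the field of view

cutout_parameters = list(product(yaws, pitches, hfovs))
print(len(cutout_parameters), "cutouts per panorama")   # 8 * 3 * 2 = 48
```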


FUNCTION: Preprocessing

DESCRIPTION: Steps required to generate a 3D referenced image database.

INPUT

- Flight area gps_1, ..., gps_j, ..., gps_m ∈ A_flight where the MAV will operate, given as a list of WGS84-referenced GPS coordinates.

OUTPUT

- Geo-referenced image database I_cutout containing N perspective cutouts c_1, ..., c_k, ..., c_N along the flying route as described in Figure 2.3.
- Internal camera matrix K_street specifying the focal lengths f_x, f_y and the optical center c_x, c_y of the perspective Street View cutouts.
- A mapping which links every pixel p_kuv of cutout c_k ∈ I_cutout to its global 3D model coordinates X_global(p_kuv), Y_global(p_kuv), Z_global(p_kuv), where p_kuv stands for the pixel in cutout k in row u and column v, cf. function get 3D coordinates in Figure 2.9.

FUNCTIONAL REQUIREMENTS

- Download the Street View panoramas for every GPS coordinate gps_j ∈ A_flight and store them in a panorama image database I_panos using the function download panoramas, cf. Figure 2.3.
- Process the panoramas p_1, ..., p_j, ..., p_M ∈ I_panos and generate the perspective cutouts c_1, ..., c_k, ..., c_N ∈ I_cutout with the function perspective cutout, cf. Figure 2.5.
- Refine the GPS coordinates gps_1, ..., gps_j, ..., gps_m ∈ A_flight using the algorithm of Figure 2.7 and store the refined positions referenced in the 3D model coordinate frame as xyz_1, ..., xyz_j, ..., xyz_m ∈ A_ref.
- Based on A_ref and the perspective cutout inputs, derive the rotation matrix R_k and the translation vector T_k specifying the external camera parameters RT_k of the cutouts c_1, ..., c_k, ..., c_N in the 3D model coordinate frame.
- Calculate the internal camera parameters K_street based on the perspective cutout inputs, cf. chapter 2.3.
- Create a mapping between the pixels p_kuv and their global 3D model coordinates by rendering out the X_global, Y_global, Z_global path from the 3D model for every cutout c_k ∈ I_cutout using RT_k and K_street, cf. Figure 2.8.

Figure 3.2: Steps required to prepare the 3D referenced image database which is used in the vision-based positioning algorithm.


    3.2 Air-ground Algorithm

The air-ground algorithm was introduced in [21] and partially resulted from the author's semester thesis 1. The main goal of this algorithm is to find the most similar Street View image for a given MAV image by finding corresponding feature points. In the said thesis, it is shown that state-of-the-art image search techniques usually fail to robustly identify correct feature matches between street-level aerial images recorded by a MAV and perspective Google Street View images. The reasons for this are significant viewpoint changes between the two images, image noise, environmental changes and different illuminations. The air-ground algorithm introduces a novel technique to simulate artificial views according to the air-ground geometry of the system and hence manages to significantly increase the number of matched feature points. Moreover, a state-of-the-art outlier rejection technique using virtual line descriptors (KVLD) [4] is used to reduce the number of wrong correspondences. Please refer to the cited papers for details on the air-ground algorithm. Figure 3.3 shows an example of the matches found by the air-ground algorithm between the MAV image on the left side and its corresponding Street View image on the right side. The green lines illustrate the corresponding feature points, whereas the magenta lines describe the virtual lines as described in [4].

Figure 3.3: Match points found between the MAV image (left) and the Street View cutout (right) with the air-ground algorithm. Note that there are still some outliers.

Note that the output of the original air-ground algorithm is essentially a set of 2D-2D image correspondences between the MAV image and the most similar Street View image. As described in [21], by identifying the most similar Street View image, one can localize the MAV in the sense of a topological map. However, no metric localization, i.e. the exact global position in a metric map, can be derived based solely on the 2D-2D correspondences. As described in this thesis, the 2D-3D correspondences between the MAV image coordinates of the feature points and their 3D coordinates referenced in a global coordinate frame can be established using the cadastral 3D city model. Based on these correspondences, the global position of the MAV can be inferred as shown in the next section. The functional setup of the air-ground algorithm is illustrated in Figure 3.4.

    1Micro Aerial Vehicle Localization using Google Street View
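The following minimal Python sketch illustrates the 2D-to-3D lookup just described. It assumes that the preprocessing step of Figure 3.2 stored, for every cutout, an array xyz_map holding the rendered (Xglobal, Yglobal, Zglobal) value of each pixel; the array name and layout are illustrative and not the thesis' actual data structure.

```python
import numpy as np

def lookup_global_points(xyz_map, u_street):
    """Map 2D Street View pixel coordinates to global 3D model coordinates.

    xyz_map  : H x W x 3 array; entry [v, u] holds the (X, Y, Z) value rendered
               from the cadastral model for this cutout (cf. Figure 2.9).
    u_street : N x 2 array of (u, v) pixel coordinates in OpenCV convention
               (u = column, v = row), as returned by the air-ground algorithm.
    """
    u = np.round(u_street[:, 0]).astype(int)
    v = np.round(u_street[:, 1]).astype(int)
    return xyz_map[v, u, :]          # N x 3 array of global coordinates

# The returned rows pair one-to-one with the rows of uMAV, yielding the
# 2D-3D correspondences consumed by EPnP + Ransac in Section 3.3.
```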


    FUNCTION: air-ground algorithm

DESCRIPTION: Find the visually most similar Street View image for a given MAV image.

    INPUT

A geotagged image database Icutout containing a set of perspective Street View cutouts c1, ..., ck, ..., cN.

MAV image dj for which we want to identify the most similar Street View image in the set Icutout.

    OUTPUT

The most similar Street View cutout cj which corresponds to the MAV image dj, i.e. the Street View cutout with the highest number of corresponding feature points.

A list uMAV containing Nmatches x 2 entries, where the first column refers to the u coordinate and the second column to the v coordinate of a corresponding feature point in the MAV image plane. The image pixel coordinate system is the standard used by OpenCV a.

A list uSTREET containing Nmatches x 2 entries, where the first column refers to the u coordinate and the second column to the v coordinate of a corresponding feature point in the Street View image plane.

Note that the feature point in the first row of uMAV corresponds to the feature point in the first row of uSTREET, and so on. Nmatches stands for the total number of feature correspondences found between the two images.

    FUNCTIONAL REQUIREMENTS

Generate artificial views of the images to be compared by means of an affine transformation (a rough sketch is given after this figure).

    Identify salient feature points in the artificial views.

Backproject the feature points of the artificial views to the original images.

Find corresponding feature points by means of an approximate nearest neighbor search.

    ahttp://docs.opencv.org/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html

Figure 3.4: Functional setup of the air-ground algorithm. Please refer to [21] for details.
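As a rough illustration of the first functional requirement (generating artificial views), the sketch below synthesizes an ASIFT-style view with a plain affine warp. This is only an approximation for illustration; the actual air-ground algorithm of [21] simulates views according to the air-ground geometry, and the parameter values here are arbitrary.

```python
import cv2

def simulate_affine_view(img, tilt=2.0, phi_deg=30.0):
    """Synthesize one artificial view: in-plane rotation by phi_deg followed
    by an anisotropic scaling of 1/tilt that mimics an out-of-plane tilt."""
    h, w = img.shape[:2]
    R = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), phi_deg, 1.0)
    rotated = cv2.warpAffine(img, R, (w, h))
    return cv2.resize(rotated, (w, max(1, int(round(h / tilt)))),
                      interpolation=cv2.INTER_AREA)
```

Features detected in such warped views are backprojected to the original image by inverting the known affine transformation before the nearest-neighbour matching step.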


    3.3 EPnP and Ransac

The goal of this section is to calculate the MAV's external camera parameters, which are given by the 3 x 3 rotation matrix RMAV and the 3 x 1 translation vector TMAV, or alternatively by the 3 x 4 matrix RTMAV, as follows:

$$
R_{MAV} = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix}, \quad
T_{MAV} = \begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix}, \quad
RT_{MAV} = \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{pmatrix}
\tag{3.1}
$$

Basically, the camera's external parameters define the camera's heading and location in the world reference frame. In other words, they define the coordinate transformation from the global 3D coordinate frame to the camera's local 3D coordinate frame. Note that TMAV specifies the position of the origin of the global coordinate system expressed in the coordinates of the local camera-centred coordinate system [13]. The global camera position XMAV in the world reference frame is given by:

$$
X_{MAV} = -R_{MAV}^{-1} T_{MAV}
\tag{3.2}
$$
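Equation 3.2 can be checked with a one-line numpy helper; for a proper rotation matrix the inverse equals the transpose.

```python
import numpy as np

def camera_position(R_mav, T_mav):
    """Camera centre in the world frame (Equation 3.2)."""
    return -R_mav.T @ T_mav          # R^-1 = R^T for rotation matrices

# Example: with an identity rotation and T_MAV = (0, 0, 5), the camera
# sits at (0, 0, -5) in the world frame.
print(camera_position(np.eye(3), np.array([0.0, 0.0, 5.0])))
```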

Several approaches have been proposed in the literature to estimate the external camera parameters based on 3D points and their 2D projections by a perspective camera. In [8], the term perspective-n-point (PnP) problem was introduced and different solutions were described to retrieve the absolute camera pose given n 3D-2D correspondences. The authors in [18] addressed the PnP problem for the minimal case where n equals 3 points and introduced a novel parametrization to compute the absolute camera position and orientation. In this thesis, the Efficient Perspective-n-Point Camera Pose Estimation (EPnP) algorithm [9] is used to estimate the MAV camera position and orientation with respect to the global reference frame. In their paper, the authors present a novel technique to determine the position and orientation of a camera given its intrinsic parameters and a set of n correspondences between 3D points and their 2D projections. The advantage of EPnP with respect to other state-of-the-art non-iterative PnP techniques is its much lower computational complexity, which grows only linearly with the number of supplied points. Moreover, EPnP has proven to be more robust than other non-iterative techniques with respect to noise in the 2D locations. An alternative to non-iterative approaches are iterative techniques, which optimize the pose estimate by minimizing a specific criterion. These techniques have been shown to achieve very high accuracy if the optimization is properly initialized and successfully converges to a stable solution. However, convergence is not guaranteed and iterative techniques are computationally much more expensive than non-iterative ones. Moreover, the authors of [9] showed that EPnP achieves almost the same accuracy as state-of-the-art iterative techniques. To summarize, EPnP was used in this thesis because of its speed, robustness to noise and simple implementation. Note that any other PnP technique could be used at this point to estimate the external camera parameters RMAV and TMAV of the MAV. The minimal number of correspondences required by EPnP is n = 4.


Given that the output of our air-ground matching algorithm may still contain wrong correspondences, so-called outliers, and that the model-generated 3D coordinates may depart from the real 3D coordinates, the EPnP algorithm is applied together with a RANSAC scheme [8] to discard the outliers. RANSAC is a randomized iterative algorithm which is used to estimate model parameters (in our case the rotation matrix RMAV and the translation vector TMAV describing the MAV camera position) in the presence of outliers. The main idea of RANSAC is straightforward and can best be explained using the example of fitting a straight line, as shown in Figure 3.5.

Figure 3.5: Left: sample data for fitting a line, containing outlier points. Right: line fitted by applying Ransac. Images taken from Wikipedia.

First, Ransac randomly selects two points from the sample data and fits a line. Second, the number of inlier points, i.e. the points which are close enough to the fitted line according to a certain threshold, is determined. This procedure is repeated a certain number of times and the model parameters with the highest number of inliers are selected as the best ones. Ransac works robustly as long as the outlier percentage of the model data is below 50 percent. As specified in [21], the number of iterations N needed to select at least one random sample set free of outliers with a given confidence level p, e.g. p = 0.95, can be computed as:

$$
N = \frac{\log(1-p)}{\log\bigl(1-(1-y)^s\bigr)}
\tag{3.3}
$$

where y is the outlier ratio of the underlying model data and s is the minimal number of data points needed to estimate the model. In the case of the line example, s = 2; in the case of EPnP, the minimal set is s = 4.
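As a numeric illustration of Equation 3.3 (the outlier ratio y = 0.3 below is an arbitrary example value, not a measured one):

```python
import math

def ransac_iterations(p, y, s):
    """Number of random samples needed to draw at least one outlier-free
    set of size s with confidence p, given outlier ratio y (Equation 3.3)."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - (1.0 - y) ** s))

print(ransac_iterations(p=0.95, y=0.3, s=4))   # 11 samples
print(ransac_iterations(p=0.95, y=0.3, s=8))   # 51 samples
```

The required number of samples grows quickly with the sample size s, which is one reason why Ransac is commonly run with the minimal set.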


The procedure used by Ransac in combination with EPnP to test whether a given correspondence is an inlier is as follows. First, Ransac randomly selects s points from the 3D-2D correspondences and supplies them to EPnP, which calculates RMAV and TMAV. The remaining 3D points are then reprojected onto the 2D image plane based on RMAV and TMAV according to the following equation:

$$
z_c \begin{pmatrix} x_{repr} \\ y_{repr} \\ 1 \end{pmatrix}
= K_{MAV}\, RT_{MAV} \begin{pmatrix} x_{global} \\ y_{global} \\ z_{global} \\ 1 \end{pmatrix}
\tag{3.4}
$$

where KMAV is the 3 x 3 internal camera calibration matrix of the MAV (cf. Chapter 4). Finally, Ransac calculates the so-called reprojection error, which is given by:

$$
e_{reprojection} = \sqrt{(x_{repr} - x_{orig})^2 + (y_{repr} - y_{orig})^2}
\tag{3.5}
$$

where xorig and yorig are the original 2D image coordinates of the reprojected 3D-2D match point. If the reprojection error is below a certain pixel threshold t, e.g. t = 2 pixels, the match point is considered to be an inlier. The described procedure is repeated several times, i.e. RMAV and TMAV are calculated based on several sample sets, and the model parameters which result in the highest number of inliers are chosen. RMAV and TMAV are then recalculated based on all the inlier points. The function SolvePnP from the computer vision library OpenCV 2 has been used to implement the described procedure. Appendix A.1 shows a straightforward example of how to implement EPnP + Ransac in OpenCV; a condensed sketch is also given below. The functional setup of EPnP + Ransac is described in Figure 3.6.

    2http://docs.opencv.org/master/modules/calib3d/doc/
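The sketch below shows how this EPnP + Ransac step can be set up with OpenCV's solvePnPRansac in Python; the thesis' own implementation is listed in Appendix A.1, and the exact argument names may differ between OpenCV versions.

```python
import cv2
import numpy as np

def estimate_mav_pose(X_global, u_MAV, K_MAV):
    """X_global: N x 3 model points, u_MAV: N x 2 image points, K_MAV: 3 x 3."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        X_global.astype(np.float64),
        u_MAV.astype(np.float64),
        K_MAV, None,                     # no distortion: images are undistorted
        reprojectionError=2.0,           # threshold t = 2 pixels
        confidence=0.95,                 # confidence level p = 0.95
        flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R_MAV, _ = cv2.Rodrigues(rvec)       # rotation vector -> rotation matrix
    X_MAV = -R_MAV.T @ tvec              # global camera position, Equation 3.2
    return R_MAV, tvec, X_MAV, inliers
```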


    FUNCTION: EPnP Ransac

DESCRIPTION: Estimate the external camera parameters of the MAV camera.

    INPUT

A set of corresponding 3D-2D points, i.e. a set of 3D points Xglobal and their 2D projections Xcamera in the MAV's camera frame.

The internal camera parameters KMAV of the MAV camera.

The Ransac parameters, i.e. the allowed reprojection error threshold reprthresh in pixels, the confidence level pconfidence, and the number of matches s supplied to EPnP, which must be at least s = 4.

    OUTPUT

The external camera parameters RMAV and TMAV describing the MAV camera position with respect to the global reference frame.

    FUNCTIONAL REQUIREMENTS

Randomly select a subset of s 3D-2D match points.

Calculate RMAV and TMAV based on EPnP.

    Reproject 3D points and calculate the reprojection error ereprojection.

Consider a 3D point to be an inlier if ereprojection < reprthresh; otherwise consider the match to be an outlier.

Repeat this procedure according to the confidence level pconfidence and Equation 3.3.

Take the iteration which resulted in the highest number of inliers and recalculate the final RMAV and TMAV based on these inliers using EPnP.

    Figure 3.6: Functional setup of EPnP + Ransac. Please refer to [9] for details.


    3.4 Vision-based Positioning

Based on the previous steps, the vision-based positioning algorithm can now be easily formulated. Algorithm 1 shows the basic setup of the system. Note that in a realistic application, not the entire 3D referenced image database will be searched. Based on a so-called position prior pprior, it makes sense to narrow down the search space as much as possible and hence speed up the whole algorithm. Such a position prior can be given by the latest satellite-based GPS estimate, IMU-based dead-reckoning or the previous vision-based estimate. Moreover, if a magnetometer is available, the heading measurements could also be used to reduce the search space.

    Algorithm 1: Vision-based Positioning

Data: 3D referenced image database Icutout
Result: Global MAV position XMAV

1   initialization (preprocessing, cf. Figure 3.2);
2   for every MAV image dj do
3       if a position prior pprior is available then
4           reduce the search space to Ireduced, a subset of Icutout;
5           set Isearch = Ireduced;
6       if no position prior is available then
7           search the whole database: Isearch = Icutout;
8       run the air-ground algorithm (cf. Figure 3.4):
9           [cj, uSTREET, uMAV] = air-ground(dj, Isearch);
10          search for the most similar Street View cutout cj in Isearch;
11          determine the 2D-2D feature correspondences uSTREET, uMAV;
12      if the number of found correspondences Nmatches is bigger than the minimum threshold Tmin then
13          get the global 3D coordinates (cf. Figure 2.9):
14              convert the 2D feature coordinates uMAV, by using uSTREET, to their corresponding 3D coordinates Xglobal;
15          run EPnP + Ransac (cf. Figure 3.6):
16              [RMAV, TMAV] = EPnP Ransac(uMAV, Xglobal, KMAV);
17          calculate the global MAV position (cf. Equation 3.2):
18              XMAV = -RMAV^(-1) TMAV;
19      else
20          go to the next MAV image dj+1;

Note that for EPnP to work, we need at least s = 4 corresponding feature points to estimate the external camera parameters. Therefore, Nmatches must be at least four. In practice, however, the larger the number of feature correspondences, the better the accuracy of EPnP. It therefore makes sense to set Tmin higher than s = 4.
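A minimal sketch of the search-space reduction used in Algorithm 1, assuming the refined cutout positions from Aref are available as an N x 3 array; the variable names are illustrative.

```python
import numpy as np

def reduce_search_space(p_prior, cutout_positions, k=8):
    """Indices of the k Street View cutouts closest to the prior position
    p_prior (a 3-vector in the model coordinate frame)."""
    d = np.linalg.norm(cutout_positions - p_prior, axis=1)
    return np.argsort(d)[:k]
```

Chapter 4 uses k = 8, i.e. every MAV image is matched against the eight nearest Street View cutouts only.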


    Chapter 4

    Experimental Setup

This chapter describes the experimental setup which is used to test the vision-based positioning system. Two datasets are used to verify the performance of the introduced approach: firstly, the big dataset, which was already presented in [21], and secondly, the small dataset, which was newly recorded and also contains GPS data for comparison.

    4.1 Platform

The quadrocopter used in this work is a commercially available Parrot AR.Drone 2.0. It is equipped with two cameras: one is looking forwards and has a diagonal field of view fovMAV of 92 degrees, the other is looking downwards. Moreover, the AR.Drone contains on-board sensing technology such as an Inertial Measurement Unit (IMU), an ultrasound ground altitude sensor and a 3-axis magnetometer. The drone can be controlled remotely over Wi-Fi via a ground station or smartphone. An in-depth discussion of the technology used in the AR.Drone is given in [25]. For the recordings in this work, only the front-looking camera has been used. As shown in Figure 4.1, the camera is heavily affected by radial and tangential distortion.

    (a) Original MAV image (b) Undistorted MAV image

Figure 4.1: On the left: distorted MAV image; on the right: undistorted MAV image.

The recorded imagery was therefore undistorted using the OpenCV library as outlined in [33]. The OpenCV drone distortion parameters and the camera


    matrix in this work are given by:

$$
D_{MAV} = \begin{pmatrix} 0.513836 & 0.277329 & 0.000371 & 0.000054 & 0.0 \end{pmatrix}
\tag{4.1}
$$

$$
K_{MAV} = \begin{pmatrix} 558.265829 & 0.0 & 328.999406 \\ 0.0 & 558.605079 & 178.924958 \\ 0.0 & 0.0 & 1.0 \end{pmatrix}
\tag{4.2}
$$
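For reference, undistorting a raw drone frame with these values is a single OpenCV call; the file name below is a placeholder and the coefficients are copied from Equations 4.1 and 4.2 as printed above.

```python
import cv2
import numpy as np

K_MAV = np.array([[558.265829, 0.0, 328.999406],
                  [0.0, 558.605079, 178.924958],
                  [0.0, 0.0, 1.0]])
D_MAV = np.array([0.513836, 0.277329, 0.000371, 0.000054, 0.0])

frame = cv2.imread('mav_frame.png')          # placeholder file name
undistorted = cv2.undistort(frame, K_MAV, D_MAV)
```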

According to the procedure described in Chapter 2.3, the perspective Street View cutouts used in this work are chosen to have a horizontal field of view hfov of 120 degrees and have the same image size Cwidth x Cheight as the drone images, i.e. 640 x 360 pixels. The Street View camera matrix is hence

    derived as follows:

    Optical center expressed in pixel coordinates (cx, cy):

$$
c_x = C_{height}/2 = 360/2 = 180, \qquad c_y = C_{width}/2 = 640/2 = 320
\tag{4.3}
$$

    Camera focal lengths (fx, fy):

$$
f_y = f_x = \frac{C_{width}}{2\tan(hfov/2)} = \frac{640}{2\tan(60^\circ)} = 184.7521
\tag{4.4}
$$

    The Street View camera matrix is hence given by:

$$
K_{STREET} = \begin{pmatrix} 184.7521 & 0 & 180 \\ 0 & 184.7521 & 320 \\ 0 & 0 & 1 \end{pmatrix}
\tag{4.5}
$$
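The focal length in Equation 4.4 follows directly from the pinhole model; a quick check:

```python
import math

def focal_from_hfov(hfov_deg, width_px):
    """Focal length in pixels for a pinhole camera with the given horizontal
    field of view and image width (Equation 4.4)."""
    return width_px / (2.0 * math.tan(math.radians(hfov_deg) / 2.0))

print(focal_from_hfov(120, 640))   # 184.752... pixels
```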

    4.2 Test Area

To test the vision-based positioning approach, two datasets were recorded by piloting the AR.Drone manually through the streets of Zurich filming the building facades, i.e. the front-looking camera was turned by 90 degrees with respect to the flying direction. The first dataset, called ETH Zurich big, covers a trajectory of roughly 2 kilometres in the neighbourhood of the ETH Zurich and has already been used in [21]. Figure 4.2 (a) shows the map of the recorded flying route together with some sample images. The average flying altitude is roughly 7 meters. This dataset has been recorded using the ROS 1 ardrone autonomy software package. The drone was controlled with a wireless joystick and the images were streamed down to a MacBook Pro over a wireless connection. In total, the ETH Zurich big dataset contains 40599 images. For computational reasons, the dataset has been sub-sampled using every 10th image, resulting in a total of 4059 images used to test the vision-based global positioning system. The trajectory of the big dataset corresponds to 113 Street View panoramic images which are roughly 10 - 15 meters apart from each other. As we were flying with

    1http://www.ros.org


the MAV camera facing the building facades, i.e. turned by 90 degrees with respect to the driving direction of the Street View car, only 113 perspective cutouts are stored in the image database Icutout. In other words, the yaw parameter in the function perspective cutout was always set to 90 degrees. In a more realistic scenario, every panoramic image would be sampled according to different yaw parameters, e.g. yaw = [0, 90, 180, 270], as we cannot explicitly assume to know in which direction the MAV is looking if it is flying completely autonomously. However, to test the vision-based positioning approach, the above setting seems reasonable in terms of computational resources.

Besides the side-looking dataset, another dataset for the same trajectory was recorded with a front-looking camera facing the flying direction. This front-looking dataset contains only 22363 images but covers the same area. The reduced number of images for the front-looking dataset is due to the much easier manual control when flying in the viewing direction of the camera and hence a higher flying speed. However, the front-looking dataset has not been used in this work.

The second dataset, called ETH Zurich small, is a subpath of the path gathered in the big dataset (cf. blue line in Figure 4.2 (a)) and has been recorded together with satellite-based GPS in order to have a comparison to the proposed vision-based approach. For every recorded frame, this dataset also contains the recorded GPS coordinates, i.e. latitude, longitude and altitude according to WGS84, from where the image has been taken. The temporal synchronization of the GPS tags and the image frames is done at the software level using the open source package cvdrone 2, which combines the OpenCV image library with the AR.Drone 2.0 API.

To calculate the vision-based position estimates, every 10th image of the dataset ETH Zurich big is processed using Algorithm 1 outlined in Section 3.4. Every MAV image is compared to the eight nearest Street View images in order to find the correct match, i.e. the Street View image which corresponds to the observed scene. By comparing every MAV image only to the eight nearest Street View images, the search space and hence the computational requirements of the air-ground algorithm described in Section 3.2 are drastically reduced. This corresponds to the so-called position prior pprior described in Algorithm 1. In a real flight scenario, it is realistic to have a position prior of at least 100 meters accuracy based on other on-board sensors such as satellite-based GPS or an IMU used for dead-reckoning. It therefore seems reasonable to only compare the MAV images to the nearest Street View images instead of searching the whole Street View database. However, the proposed approach also works for a bigger search space, as demonstrated in [21], at the expense of an increased computational complexity.

The following list summarizes the parameters used to achieve the results presented in the next chapter:

Number of Street View cutouts stored in Icutout: Ncutout = 113.

Number of processed MAV images used to test the vision-based positioning approach: NMAV = 4059.

    Image size of the cutouts Cwidth x Cheight = 640 x 360 pixels.

    2https://github.com/puku0x/cvdrone


Size of MAV images Dwidth x Dheight = 640 x 360 pixels.

Horizontal field of view hfov = 120 degrees, yaw = 90 degrees and pitch = 20 degrees, cf. Figure 2.5.

Position prior pprior is equal to the eight nearest Street View cutouts. Hence, the length of the reduced image database Isearch is equal to Nreduced = 8, cf. Algorithm 1.

The number of feature points s used by Ransac + EPnP: (1) one trial with the minimal set s = 4 and (2) one trial with a non-minimal set of s = 8, cf. Figure 3.6.

Ransac parameters: reprojection error threshold t = 2 pixels and confidence level p = 0.95, cf. Figure 3.6.

Note at this point that different Ransac parameters (reprojection error threshold and confidence level) have been tested on the big dataset. The setting presented above achieves the best results in terms of accuracy and robustness. If the inlier criteria are made stricter, i.e. if the allowed reprojection error is decreased and/or the confidence level is increased, the accuracy of some estimates may slightly increase. However, the number of successful estimates (estimates which result in more than s = 8 inliers) decreases drastically when using more stringent Ransac parameters.


(a) Recorded Test Area: The red line describes the MAV flying path of the ETH Zurich big dataset. The blue-white line designates the ETH Zurich small dataset, which is a subset of the big dataset and was recorded together with satellite-based GPS.

    (b) (c) (d)

Figure 4.2: (a) shows an aerial map of the recorded datasets. Figures (b)-(d) show MAV example images (top row) together with corresponding Street View images (bottom row). Note that there are significant differences in terms of viewpoint, illumination, scale and environmental setting between the MAV images and the corresponding Street View images, which makes a correct classification highly challenging.


    Chapter 5

    Results and Discussion

This chapter presents the results of the outlined vision-based global positioning approach applied to the two recorded datasets. The fundamental problem when evaluating the accuracy of the vision-based position estimates is that no ground truth is available, meaning that we do not know the real flying trajectory. We therefore need to find alternative ways to evaluate the precision of the proposed approach. Firstly, the position estimates for each camera frame are plotted in the 3D model and assessed qualitatively. Secondly, the uncertainty related to the position estimates is evaluated by using a Monte Carlo simulation to calculate the covariances. Thirdly, a visual verification of the position estimates is conducted by rendering out virtual-views from the textured 3D model. The rendered virtual-views are then compared to the original MAV images. Ideally, the compared images should show the same scene. Moreover, it is shown how virtual-views can be used to further refine the vision-based position estimates. Finally, the vision-based position estimates are compared to the recorded satellite-based GPS measurements from the ETH Zurich small dataset in combination with a visual odometry tracking approach.

    5.1 Visual-Inspection

Figure 5.2 shows the top view of the ETH Zurich big dataset. The red dots represent the vision-based global position estimates for each camera frame for which a corresponding Street View image was found according to Algorithm 1. It can be seen that almost the whole flying route is covered by position estimates. Some of the streets are covered very densely with position estimates, meaning that many correct Street View correspondences are found, while other areas are rather sparsely covered. There are four areas, designated with the numbers 1-4 in Figure 5.2 (a), where no position estimates are available. The reason why no or not enough Street View correspondences are found in areas 1, 2 and 3 of Figure 5.2 (a) is vegetation which occludes the buildings, as illustrated in Figure 5.1. The reason why no Street View correspondences are found for area 4 is not so obvious. Possible reasons could be the relatively high flying speed in that area, resulting in motion-blurred images and a reduced number of MAV frames per Street View image to be matched. Other possible reasons for


the absence of vision-based position estimates are extreme viewpoint changes or temporal changes in the environment (e.g. refurbishment or reconstruction of buildings) that occurred between the Street View image recordings in the summer of 2009 and the recording of the aerial dataset in the winter of 2012.

    (a) MAV image from area 1 (b) MAV image from area 2 (c) MAV image from area 3

Figure 5.1: No Street View matches were found for these areas (cf. Figure 5.2) as the buildings are occluded by vegetation.

By close examination of Figures 5.2 (a) and (b), the reader will realize that the vision-based position estimates, i.e. the red dots, are not exactly the same for the two plots. The reason for this is that in Figure 5.2 (a) the minimal set of s = 4 match points was used to run EPnP and Ransac as described in Chapter 3.3, whereas in Figure 5.2 (b) a non-minimal set of s = 8 match points was used. To illustrate the difference between the two approaches, Figure 5.3 shows some close-ups of the whole map. The first row shows close-ups when using the minimal set, whereas the second row shows close-ups when using the non-minimal set. By closely comparing the first two rows, one can conclude that the position estimates derived by using the non-minimal set are more plausible than those derived by using only the minimal set. This is illustrated by the fact that the non-minimal position estimates tend to be organized along a smooth trajectory, which is realistic with respect to the real MAV flying path. Stated differently, the position estimates derived by using the minimal set tend to jump around more than those of the non-minimal set, i.e. they are more widely spread and less spatially consistent. The reason for the more robust results in the non-minimal case is that the position estimates derived by EPnP are less affected by outliers and degenerate point configurations. However, in both approaches, the minimal and the non-minimal, a few extreme outliers occur which are clearly not along the flying path, as highlighted by the yellow boxes in Figure 5.3. One possible cause for these outliers are wrong point correspondences between the Street View images and the MAV images given by the air-ground algorithm described in Chapter 3.2. Another potential explanation are inaccurate 3D point coordinates supplied to EPnP, resulting from inaccuracies when overlaying the Street View images with the cadastral city model as illustrated in Chapter 2.5. The last row of Figure 5.3 shows the same close-ups after filtering the non-minimal estimates (second row) based on the standard deviations calculated in the next section. It is shown that the outliers (yellow boxes) can be successfully discarded by limiting the allowed standard deviations. The next section explains step by step how to obtain a measure for the uncertainty of the derived vision-based position estimates based on a Monte Carlo approach.


(a) Top View Vision-based Position Estimates: EPnP + Ransac. For this plot, Ransac uses a minimal set of s = 4 points for the EPnP algorithm.

(b) Top View Vision-based Position Estimates: EPnP + Ransac. For this plot, Ransac was required to use a non-minimal set of s = 8 match points for the EPnP algorithm, cf. Chapter 3.3. The visual difference between using s = 4 points or s = 8 points for EPnP + Ransac is shown in more detail in Figure 5.3.

Figure 5.2: The red dots represent the vision-based position estimates, whereas the black line in Figure (b) illustrates the flight path. The numbers 1-4 in Figure (a) show areas where no matches were found, cf. Figure 5.1.


Figure 5.3: Close-ups of Figures 5.2 (a) and (b). First row: vision-based position estimates (red points) using a minimal set of s = 4 points for EPnP and Ransac. Second row: vision-based estimates using a non-minimal set of s = 8 points for EPnP and Ransac. By comparing the first and the second row, one can see that the estimates illustrated in the second row for the non-minimal set tend to be more aligned along the street. However, in both cases there are some clearly wrong estimates (highlighted in yellow) which are not along the flying path. To get rid of those estimates, it is suggested to filter the vision-based estimates based on the standard deviations, as shown in the last row. Please refer to Chapter 5.2 for more details.


    5.2 Uncertainty Quantification

To quantify the uncertainties related to the vision-based global positioning estimates with respect to the underlying data, a Monte Carlo