AN ABSTRACT OF THE DISSERTATION OF
Carrie Melinda Rebhuhn for the degree of Doctor of Philosophy in Robotics and
Mechanical Engineering presented on June 22, 2017.
Title: Adaptive Multiagent Traffic Management for Autonomous Robotic Systems
Abstract approved:
Kagan Tumer Geoff Hollinger
There is growing commercial interest in the use of unmanned aerial vehicles (UAVs) in
urban environments, specifically for package delivery applications. However, the size,
complexity, and sheer number of expected UAVs make conventional air traffic manage-
ment that relies on human air traffic controllers infeasible. To enable UAVs to safely
and efficiently operate in congested environments, it is essential to develop autonomous
UAV management strategies.
We introduce a dynamic hierarchical traffic control model that reacts to traffic con-
ditions instantaneously to reduce congestion in the airspace. An obstacle-filled airspace
lends itself to modeling as a graph structure similar to a road network. We introduce
“controller agents”, which set costs across the airspace. These agents control traffic sim-
ilarly to adaptive metering lights in highway traffic. UAVs then plan their paths based
on the costs (e.g., conflicts or delays) they see for traversing particular parts of the
airspace. This provides a decentralized method for reducing traffic in an airspace.
Our hierarchical structure allows us to separate the traffic reduction problem from
the individual robot navigation problem. Each robot does not explicitly coordinate with
others in the airspace. Instead, robots execute their own individual internal cost-based
planner to travel between locations. We then use neuro-evolution to provide incentives
to these cost-based planners to reduce traffic in the environment.
Traffic quality can be expressed in several different ways. We first evaluate our
traffic reduction policies in terms of ‘conflicts’, which characterize situations where
an aircraft comes too close to another for safety in a physical space. We then examine
traffic in terms of the amount of ‘delay’ that all agents incur, which assumes that there
is a structure to ensure only a safe number of UAVs occupy the same area. Finally, we
look at the total travel time a UAV can expect from the moment it enters the
airspace until it reaches its destination.
To facilitate an exploration of the UTM problem without waiting for a full simulation
of UAVs running with A∗, we develop an abstraction of the UTM domain that preserves
the core UTM problem. We then investigate performance under differing levels of traffic,
as well as two different agent structures. Our results show similar performance for both
agent definitions, with delay reduction of up to 68% in high traffic cases.
With a fast version of the UTM problem, we explore the effect of redefining the
control structure such that links, or edges of the UTM graph, set costs individually.
This shifts the control paradigm toward controlling directional travel rather than areas
in the space, as was the case with sector agents used in previous approaches. Due to
our graph structure, we find that there are far more control elements in the link agent
approach than in the sector agent approach. We identify a tradeoff: link agents give
finer control, but the coordination problem for sector agents is easier because there
are fewer of them. This indicates that a more distributed link-based setup can yield
better performance if we address the challenges of multiagent coordination.
However, the UAV traffic management domain presents a uniquely difficult coordination
problem; each agent’s action can affect the perceived value of every other agent’s actions.
This means there is substantial noise in the system, as another agent's action
can strongly affect the reward an agent receives.
We reduce the amount of multiagent noise by reducing the number of agents that are
capable of learning. Some agents, based on the topology and traffic profile of the
graph, have more ability to influence traffic than others; we call this metric impactfulness.
We use this metric to improve learning by removing less impactful agents from the
learning process, yielding a more stationary system in which the impactful agents can
learn.
The contributions of this work are to:
• Introduce a cost-based traffic management approach that is platform-agnostic and
fast to implement.
• Develop a multiagent approach to setting costs in this traffic management system
that is adaptive to traffic conditions and learns long-term effects of management
decisions.
• Create an abstraction of UAV traffic that captures key physical attributes, creating
a fast and flexible simulation method.
• Quantify agent contributions to system performance by experimenting with single
agent learning, single agent exclusion, and a sliding number of agents learning in
the system.
© Copyright by Carrie Melinda Rebhuhn, June 22, 2017
All Rights Reserved
Adaptive Multiagent Traffic Management for Autonomous Robotic Systems
by
Carrie Melinda Rebhuhn
A DISSERTATION
submitted to
Oregon State University
in partial fulfillment of
the requirements for the
degree of
Doctor of Philosophy
Presented June 22, 2017
Commencement June 2017
Doctor of Philosophy dissertation of Carrie Melinda Rebhuhn presented on June 22, 2017.
APPROVED:
Co-Major Professor, representing Robotics
Co-Major Professor, representing Mechanical Engineering
Head of the School of Mechanical, Industrial, and Manufacturing Engineering
Dean of the Graduate School
I understand that my dissertation will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my dissertation to any reader upon request.
Carrie Melinda Rebhuhn, Author
TABLE OF CONTENTS
1 Introduction 1
2 Background 6
2.1 Robotic Traffic Management . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Collision Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Robotic Routing and Scheduling . . . . . . . . . . . . . . . . . . . 8
2.2 Traffic Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Autonomous Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Path Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Multi-UAV Path Planning . . . . . . . . . . . . . . . . . . . . . . 13
2.3.3 Learned Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Learning and Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Reinforcement Learning and Q-Learning . . . . . . . . . . . . . . 15
2.4.2 Evolutionary Algorithms and Neural Networks . . . . . . . . . . . 16
2.4.3 Cooperative Coevolution . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Multiagent Coordination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.1 Multiagent Air Traffic Control . . . . . . . . . . . . . . . . . . . . 18
2.5.2 UAV Swarms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.3 Scaling with Large State or Action Descriptions . . . . . . . . . . 21
2.5.4 Difference Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 UAV Traffic Modeling 23
3.1 Airspace Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Physical Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.2 Random Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Traffic Generation and Elimination . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Conflict . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.2 Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.3 Travel Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Physical Space Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6 Control Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6.1 Hierarchical Path Planning . . . . . . . . . . . . . . . . . . . . . . 33
3.6.2 Impactfulness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7 UTM Agent Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 Learning UAV Traffic Management Agents 40
4.1 Agent Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.1 Sector-Based Learning Agents . . . . . . . . . . . . . . . . . . . . 40
4.1.2 Link-Based Learning Agents . . . . . . . . . . . . . . . . . . . . . 42
4.1.3 Neuro-Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Experimental Results for Learning Hierarchical UTM . . . . . . . . . . . . 43
4.2.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5 Learned Abstract Policies Applied to Real Robots 50
5.1 Neural Network Controller Evolution . . . . . . . . . . . . . . . . . . . . . . 50
5.2 ROS UTM Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3 Gazebo Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 Real Routing in a Maze . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6 Fast UTM Learning 62
6.1 Abstraction of the UTM Coordination Problem . . . . . . . . . . . . . . . . 62
6.2 Experimental Results for Varying UTM Traffic Profiles . . . . . . . . . . . 63
6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7 Scalable UTM Learning 69
7.1 The Multiagent Noise Problem . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.2 Learning with Subsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.2.1 Include-One . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.2.2 Exclude-One . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.2.3 Sequential Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8 Conclusion 82
9 Future Work 85
9.1 Airspace Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.2 Physical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.3 Heterogeneous Planners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.4 Tolerance Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Bibliography 87
LIST OF FIGURES
1.1 An example of how a robot traffic management system would reduce down-
stream traffic. Robots (UAVs in this case) start from the location indi-
cated by the UAV figure, with a goal indicated by the triangle. Planned
paths are marked in a dashed line. Figure (a) shows robots that use a
distance-optimal metric. Figure (b) shows how an agent could modify this
metric to reduce downstream congestion. Map data © 2016 Google. . . . . 2
3.1 (a) The UAV traffic management problem. The airspace is divided into
discrete sectors with a single UTM agent controlling the cost of travel
through each sector. UAVs traveling from start (triangle) to goal (circle)
locations plan and execute paths to reduce their individual travel cost
without explicitly coordinating with other UAVs in the airspace. This
can lead to potentially dangerous conflict incidents with significant cas-
cading effects. The goal of the sector agents is to learn appropriate costing
strategies to alleviate bottlenecks in the traffic and minimize congestion
in the airspace. (b) Graph Gh of sector agent connections used in the
experiments. Agent policies assign edge costs based on the current sector
state s, and the discretized cardinal direction followed on the edge. Dif-
ferent colors on this graph refer to different sector memberships of areas
of the space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Two examples of conflict situations. Figure 3.2a shows three UAVs coming
into proximity with one another simultaneously. With our linear conflict
count, the fitness penalty would be 3. With our quadratic count, the fitness
penalty would increase by 9 due to the squared term. In Figure 3.2b there
are two areas of conflict with only two UAVs in conflict. Under a linear
count, the fitness penalty would increase by 4, and under a quadratic count
the fitness penalty would increase by 8. The situation in Figure 3.2a is
preferable under the linear conflict count method, and the situation in
Figure 3.2b is preferable under the quadratic method. . . . . . . . . . . . 29
3.3 Impactfulness metric for test graph topology with 11 nodes, 38 edges.
This demonstrates a symmetrical graph with a direct route between two
nodes, with many optional alternative routes through the grid graph at
the bottom right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Impactfulness metric for test graph topology with 25 nodes, 96 edges.
This represents two symmetrical subgraphs connected by links between
nodes {2,6,13,20,24}. The connecting links can have no impact on the
paths planned throughout the graph. . . . . . . . . . . . . . . . . . . . . . 38
3.5 Impactfulness metric for test graph topology with 20 nodes. This graph
was generated using the random graph generation algorithm as defined
by Algorithm 3. This shows a fairly even level of impactfulness across the
map, although some nodes are not able to change more than one path.
This graph indicates that most agents have at least some power to change
paths generated in the system. . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 The structure of the neural network for sector agents. There are four
inputs, one for the traffic count in each direction, and there are four
outputs, one to denote the cost the sector will assign to each direction. . . 42
4.2 Initial planned paths for 27 UAVs in the airspace travelling from start
(solid green triangle) to goal (solid green circle) locations. It can be seen
that high levels of congestion may be expected in some thoroughfare re-
gions of the airspace, such as the bottom right area where there is a group
of close fixes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Change in congestion observed over one run of the linear conflict evalua-
tion trials. The overlaid heat map shows the number of conflicts experi-
enced in an area for (a) the best evaluation trial for agents with random
initialized sector travel costs and (b) the best evaluation trial for agents
with evolved sector travel costs after 100 epochs. The overall number of
conflicts has been reduced, with some high-congestion regions removed completely. 45
4.4 Change in congestion observed over one run of the quadratic conflict eval-
uation trials. The overlaid heat map shows the number of conflicts expe-
rienced in an area for (a) the best evaluation trial for agents with random
initialized sector travel costs and (b) the best evaluation trial for agents
with evolved sector travel costs after 100 epochs. As with the linear trials,
the overall number of conflicts has been reduced. . . . . . . . . . . . . . . 46
4.5 Comparison of conflict occurrence over epochs using fixed costs versus
evolved costs with two different fitness functions. The linear performance
metric reduces conflicts substantially better than the quadratic method,
although both outperform the fixed-cost method. . . . . . . . . . . . . . . 47
5.1 The high-level map used for training the neural network policies for the
Pioneer experiments. Shown for reference are barriers (square brackets)
that would create this high-level topology. For training, all traffic moves
between N1 and N4, requiring controller agents to make non-distance-
optimal routes look attractive to some traffic. . . . . . . . . . . . . . . 51
5.2 Learning performance using a graph with 4 nodes and 10 edges. 10 runs are
shown for statistical significance. Agents are able to completely eliminate
delay in the system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 System architecture of the ROS experiments. The UTM agents publish
a cost graph over which the robots plan their high level paths. The low
level planning and navigation is handled by the standard ROS navigation
packages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 A simulated Gazebo environment with two obstacles. This map was cre-
ated by driving a Pioneer around a real environment with two square
obstacles, although multiple robots are introduced in simulation. This
graph has four nodes, each at a corner of the graph, and the arrows de-
note current target points of the Pioneers. . . . . . . . . . . . . . . . . . . 55
5.5 A Pioneer-3DX robot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.6 The map generated from running SLAM across a room. This was shared
among all the robots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.7 Robots moving between waypoints in the space using the RTM frame-
work. This map was generated by driving a robot around a physical
obstacle maze. Robots then used the amcl package to localize within
this maze. The robots then autonomously moved throughout the maze,
planning their routes considering the high-level RTM graph. . . . . . . . . 58
5.8 RQT graph for three pioneers in a Gazebo simulation. Pioneers com-
municate their position in the map with respect to which agent controls
their motion (the /membership topic). This traffic information is then
processed by the utm agent, which is a group of evolved neural networks.
This information is then used to update the /utm graph topic, which is
the high-level cost map set by agents in the system. This is then used to
update the appearance of walls for each robot topic according to its A∗
optimal plan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.1 The sector agent (a) and link agent (b) problem formulation. Sector
agents control groups of edges rather than a single edge. We divide these
groups of edges into four cardinal directions, and determine traffic and
costs based on this discretization. . . . . . . . . . . . . . . . . . . . . . . . 63
6.2 Sector and link agent performance on a map with 20 nodes, 102 edges.
We tested three different generation rates for learning epochs of length
100s. The delay in the low-traffic scenario (ρ = 20 UAVs/50s) is zero in
most of the graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3 Sector and link agent performance on a map with 20 nodes, 102 edges.
We tested three different generation rates for learning epochs of length 200s. 65
6.4 Sector agent performance on a map with 100 nodes, 580 edges. We tested
three different generation rates for learning epochs of length 100s. . . . . . 65
6.5 Sector agent performance on a map with 100 nodes, 580 edges. We tested
three different generation rates for learning epochs of length 200s. . . . . . 66
7.1 Learning with all agents, learning using only the best individual agent,
and learning excluding the worst individual agent. . . . . . . . . . . . . . 71
7.2 The maximum performance of a single agent graphed against that agent’s
blocking impactfulness. The linear trend line shows that an agent’s block-
ing impactfulness has a negative correlation with increasing average travel
time, indicating agents that can block traffic tend to have better perfor-
mance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.3 The maximum performance of a single agent graphed against that agent’s
attracting impactfulness. The linear trend line shows that an agent’s
attracting impactfulness has a positive correlation with increasing average
travel time, indicating agents that can attract traffic tend to have worse
performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.4 Learning with 3 randomly-selected agents learning sequentially in the sys-
tem. At step 100 the second agent is activated, and the first agent is de-
activated. At step 200 the third agent is activated and the second agent
is deactivated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.5 Sequential activation of three agents in the system. We see a slight de-
crease in performance after the third agent is added into the system. . . . 76
7.6 Sequential activation of three agents in the system. We see that in this
case, there is largely no improvement when more than one agent learns in
the system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.7 Testing with different numbers of agents, by impactfulness. We see an
increase in total travel time as more agents are added as the multiagent
learning problem becomes harder. . . . . . . . . . . . . . . . . . . . . . . . 78
LIST OF TABLES
6.1 Comparison of performance for different agent control, graphs, traffic gen-
eration rates, and epoch length (T). Percentage improvement is calculated
between the delay experienced in the first and last learning epoch. . . . . 67
LIST OF ALGORITHMS
1 A∗ (G, s, h) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 CCEA(agents,domain) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 GenerateRandomAirspace(nodes) . . . . . . . . . . . . . . . . . . . . . . 26
4 Simulate(agents) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5 ShouldGenerateUAV(t) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Chapter 1: Introduction
Commercial use of unmanned aerial vehicles (UAVs) will drastically increase the traffic
management burden in urban airspaces. Recent proposals to update legislation regard-
ing UAV operation in the US airspace [36] have accelerated this discussion, and NASA
NextGen has the goal of enabling safe low-altitude UAV flight within the next 5 years
[52]. Current methods, which rely on human controllers to manage the airspace, become
prohibitively costly and unsafe due to constraints on human multitasking and commu-
nication. If UAVs enter the airspace on a massive scale, it is clear that an automated
approach is necessary.
Existing work in air traffic control has focused on the flow of airplanes through fixed
geographical points and has typically used multiagent system methods to model the
distributed system of traffic controllers [34]. A multiagent reinforcement learning frame-
work was used in [7] to reduce congestion by setting the required separation between
aircraft, ordering ground delays or ordering reroutes. A hierarchical system of manage-
ment was considered in [20], with airlines negotiating for airspace to minimize chances
of congestion at the higher strategic level, and at the lower tactical level, air traffic
controllers could reroute aircraft by introducing no-go zones called “shocks”.
While some aspects of these approaches can be applied to UAV traffic management,
many new issues arise in this new domain that are not addressed by existing traffic
management strategies. Most notably, the national airspace is a large and relatively
obstacle-free environment, whereas this is not always the case in an urban setting with
UAVs operating in close proximity to one another. If the flow of UAVs is not well-
managed, congestion in cluttered environments may exceed a UAV’s ability to plan
around other UAV trajectories, potentially escalating to larger cascading effects through-
out the system. However, it is impractical to model the behaviors of all the actors in the
system. This is further complicated by the diverse range of planning software that may
be used across the platforms present in the airspace, especially if they are competing
for throughput and do not explicitly coordinate with foreign platforms to avoid conflict
situations.
Figure 1.1: An example of how a robot traffic management system would reduce downstream traffic. Robots (UAVs in this case) start from the location indicated by the UAV figure, with a goal indicated by the triangle. Planned paths are marked in a dashed line. Figure (a) shows robots that use a distance-optimal metric. Figure (b) shows how an agent could modify this metric to reduce downstream congestion. Map data © 2016 Google.
We define the UAV traffic management (UTM) problem, and model UAV traffic trav-
eling across an airspace. We model individual UAVs as points planning across the space
using A∗ path planners. UAVs first plan a path across an abstract graph-representation
of the space, and then plan over the lower-level obstacle map to travel through the space.
We then present several ways of quantifying the performance of a UAV traffic system.
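As an illustrative sketch (not the dissertation's implementation), the high-level stage of this hierarchical planner can be written as A∗ over an abstract airspace graph whose edge costs are set by the controller agents. The four-node graph, cost values, and zero heuristic below are hypothetical.

```python
import heapq

def a_star(graph, costs, start, goal, h):
    """A* over an abstract airspace graph.

    graph: dict mapping node -> list of neighbor nodes
    costs: dict mapping (node, neighbor) -> agent-set traversal cost
    h:     admissible heuristic h(node) estimating cost-to-go
    Returns the cheapest node sequence from start to goal, or None.
    """
    frontier = [(h(start), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nbr in graph[node]:
            g2 = g + costs[(node, nbr)]
            if g2 < best_g.get(nbr, float("inf")):
                best_g[nbr] = g2
                heapq.heappush(frontier, (g2 + h(nbr), g2, nbr, path + [nbr]))
    return None

# Hypothetical four-node airspace; raising costs[(2, 3)] diverts traffic via node 1.
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
costs = {(0, 1): 1.0, (0, 2): 1.0, (1, 3): 1.0, (2, 3): 5.0}
print(a_star(graph, costs, 0, 3, lambda n: 0.0))  # -> [0, 1, 3]
```

In the full system, each edge of the high-level plan then expands into a low-level A∗ query over the obstacle map.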
This motivates the use of a high-level UTM learning framework that can map the
salient features of the UAV airspace to control actions that can mitigate congestion,
such as artificially increasing the cost of flying through particular areas. In this work,
we propose a UTM strategy employing a decentralized system of neuro-evolutionary
learning agents that manage the flow of UAV traffic through their designated sectors of
the airspace.
We introduce “controller agents”, whose actions are to set costs across the airspace.
The agents control the cost map shown to UAVs in the system, and dynamically adjust
the costs based on the current traffic profile. In a simple example, if many UAVs were
currently traveling through the agent’s area, the agent will increase its perceived costs
to incentivize UAVs to take alternative routes.
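A minimal sketch of this incentive mechanism follows; the linear form and the congestion weight are assumptions for illustration only, since the learned policies replace any such hand-tuned rule.

```python
def perceived_cost(base_cost, traffic_count, congestion_weight=2.0):
    """Hypothetical linear costing rule: the more UAVs currently in an
    agent's area, the higher the cost it advertises, nudging newly
    planned paths toward less congested routes."""
    return base_cost + congestion_weight * traffic_count

print(perceived_cost(1.0, 0))  # empty area: 1.0
print(perceived_cost(1.0, 5))  # congested area: 11.0
```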
We first define control based on current air traffic management approaches,
with “sectors” of airspace under the control of one agent. We then identify cardinal
directions within this sector that define the sector’s control policy. Each sector agent
learns a policy that assigns the cost of travel through the sector according to the current
number of UAVs in the airspace in a particular direction. We perform policy evaluation
by assessing the total number of UAV conflicts in the airspace during a single learning
epoch.
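A sector agent's policy can be sketched as a small feedforward network with four inputs (the traffic count in each cardinal direction) and four outputs (the advertised cost per direction). The hidden-layer size and the random, un-evolved weights below are illustrative assumptions; in this work the weights are set by neuro-evolution.

```python
import math
import random

class SectorAgentPolicy:
    """Sketch of a sector agent's costing policy: per-direction traffic
    counts (N, E, S, W) in, per-direction advertised costs out.
    Weights are random here for illustration; neuro-evolution would
    search over them."""

    def __init__(self, n_hidden=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(4)]

    def costs(self, traffic_counts):
        hidden = [math.tanh(sum(w * x for w, x in zip(row, traffic_counts)))
                  for row in self.w1]
        # Sigmoid output keeps each advertised cost in (0, 1).
        return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
                for row in self.w2]

policy = SectorAgentPolicy()
print(policy.costs([3, 0, 1, 0]))  # four costs, one per cardinal direction
```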
In tests on a simulated urban environment, the team of sector agents was able
to learn appropriate costing strategies to reduce conflicts across the entire airspace.
This was despite using only local information regarding the number of UAVs in their
sector and the intended heading of each UAV. Comparison to a uniform costing strategy
showed on average a 16.4% reduction of conflicts in the airspace.
This approach is dynamic, distributed, human-independent, and can learn to reduce
long-term congestion propagation. It is dynamic because agents observe the current
traffic conditions and respond with costs appropriate to them. We provide distributed
control by allowing agents to control divided areas
of the airspace, rather than having a centralized controller. This permits arbitrarily
large scalability in our implementation. The traffic control is human-independent as
well, as it is able to develop a policy to set costs rather than rely on a human controller
as is done in conventional air traffic control. Finally, we are able to reduce long-term
congestion propagation because agents obtain rewards from long sequences of observed
states, in which cascading congestion effects play a part. The agents aim to minimize
this cumulative group congestion metric, rather than minimizing a local instantaneous
congestion.
It also accommodates a wide variety of UAV traffic because agents do not control
individual UAV trajectories. Instead, we model our approach after dynamic tolling or
metering lights, and impose incentives or penalties to control the flow of traffic. By
changing the observed cost of routes presented to the UAVs, we escape the computational
complexity of dictating each path in the airspace, and accommodate a variety of path
optimization algorithms. This is essential to a real-world implementation, where we
cannot rely on homogeneity in the vehicles or planners moving through the airspace.
This setup divides the airspace into sectors with cardinal directions, and introduces
artificial structure to our control problem. Such a discretization of control can lead to
difficulty in incentivizing at the granularity needed to provide a good control structure.
Therefore we introduce a second agent architecture that defines agent control over links
between sectors, where agents set costs for each transition between two sectors. This
removes the discretization into four cardinal directions, and allows a single agent to
control traffic flow in a particular direction between two points in the airspace. These
two different definitions give us agents with different challenges and benefits. The sector
agents have access to more information, but sacrifice granularity of control. Link agents
have the ability to more finely adjust the flow of UAVs across the airspace, but operate
on less information. Link agents must also cope with greater coordination requirements;
there can be many more links than sectors, as links represent the area of contact between
sectors. This creates a much larger multiagent coordination problem.
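The difference in the size of the two coordination problems is easy to see on a toy graph (a hypothetical 3×3 grid, not one of the maps used in the experiments): one sector agent per node versus one link agent per directed edge.

```python
# Hypothetical 3x3 grid airspace: one sector agent per node, one link
# agent per directed edge between adjacent sectors.
nodes = list(range(9))
edges = [(a, b) for a in nodes for b in nodes
         if abs(a % 3 - b % 3) + abs(a // 3 - b // 3) == 1]
print(len(nodes), "sector agents vs", len(edges), "link agents")
# -> 9 sector agents vs 24 link agents
```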
In this work we also explore the use of our UTM system on physical robots. In this
case we replace the simulation of A∗ across a given map with a ROS system operating
across a real obstacle map built with laser scan data. We present a communication
structure that allows UTM agents to take in traffic data, output costs, and then enforce
a hierarchical path planning from the robots while still taking advantage of pre-built
ROS navigation nodes.
In order to investigate the effects of many different traffic profiles, as well as other
agent structures, we needed to develop a fast simulation that excluded low-level planning.
We introduce an abstraction that maintains the key UTM problem characteristics to
quickly evaluate these two approaches. We take the hierarchical traffic model and strip
the low-level A∗ planning component from it. We replace this with a capacity for each
link that restricts travel of UAVs, and delay any UAV that tries to travel across an at-
capacity link. We then attempt to minimize delay in the system using neuro-evolution.
This models propagation of delays, should a UAV system have regulated safe numbers
of UAVs in the airspace. This could also model the propagation of stalls in the airspace
caused by conflict-avoidance maneuvers executed in the physical space.
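One step of this capacity-based abstraction can be sketched as follows. This is a simplification under stated assumptions: UAVs queue per link, and a UAV blocked by an at-capacity link accrues one time step of delay; the full simulator also advances UAVs along links and retires them at their goals.

```python
from collections import deque

def step_links(queue_by_link, capacities, occupancy):
    """Admit queued UAVs onto each link up to its capacity; every UAV
    still waiting accrues one step of delay. Returns total delay this step."""
    delay = 0
    for link, queue in queue_by_link.items():
        while queue and occupancy[link] < capacities[link]:
            queue.popleft()          # UAV enters the link
            occupancy[link] += 1
        delay += len(queue)          # everyone still queued waits one step
    return delay

queues = {"AB": deque(["uav1", "uav2", "uav3"])}
print(step_links(queues, {"AB": 2}, {"AB": 0}))  # -> 1 (one UAV delayed)
```

The neuro-evolutionary agents then set link costs so that planned paths avoid links whose queues would otherwise accumulate delay.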
Using this abstract model, we scale our system to handle up to 400 UAVs, and over
100 controller agents. We demonstrate delay reduction of up to 68% in high-traffic
scenarios. We also find that, despite having a much larger coordination problem and
using no reward shaping techniques, link agents perform comparably with sector agents.
The contributions of this work are to:
• Introduce a cost-based traffic management approach that is platform-agnostic and
fast to implement.
• Develop a multiagent approach to setting costs in this traffic management system
that is adaptive to traffic conditions and learns long-term effects of management
decisions.
• Create an abstraction of UAV traffic that captures key physical attributes, creating
a fast and flexible simulation method.
• Quantify agent contributions to system performance by experimenting with single
agent learning, single agent exclusion, and a sliding number of agents learning in
the system.
Chapter 2: Background
UAV traffic management combines several fields. In Section 2.1 we will overview
related work on the state of the art in multi-robot motion
and routing. At a higher level, there is also the problem of separating the paths of these
robots so that they do not come into physical proximity with one another, which relates
to the problem of traffic control. We discuss traditional methods of traffic modeling
and control in Section 2.2. Our work focuses primarily on this problem, which is to
manage the traffic before a collision event ever occurs. We also overview the problem
of autonomous navigation in Section 2.3, which is essential to understand in modeling a
traffic system with autonomously-moving robots. We then give background on adaptive
and learning approaches in Section 2.4, which is what we use to develop methods of man-
aging this traffic. Finally, we overview work tackling the unique challenges of multiagent
coordination in Section 2.5.
2.1 Robotic Traffic Management
For many robotic systems, a major focus is safely avoiding collisions, which is largely
a geometric problem of avoiding the regions of space that an aircraft or a set of aircraft
will occupy. In Section 2.1.1 we will discuss some of these approaches. In Section 2.1.2 we
will discuss the related robotic routing or scheduling problem, where robots are ordered
to move to certain locations at certain times by a centralized system.
2.1.1 Collision Avoidance
There are two main approaches to developing algorithms to ensure safety in the airspace:
the first is to control the density of aircraft passing through different sectors
in order to balance the load on regional air traffic controllers, and the second is to
develop optimal rerouting procedures in order to avoid predicted separation losses
between aircraft. These problems are coupled: as the density of aircraft
moving through a sector decreases, the probability of conflicts occurring (and therefore the
difficulty of solving this problem) also decreases [14]. These approaches are called air
traffic flow management (ATFM) and conflict-avoidance respectively. In this work we
focus on the problem of conflict-avoidance.
Conflict-avoidance is currently performed at ground stations while aircraft are in flight.
While many approaches to ATFM focus primarily on the airspace as an abstract schedul-
ing problem, the problem of conflict-avoidance focuses instead on preventing temporal
and spatial conflicts from occurring within an airspace by re-planning conflicted routes,
not necessarily restricting this re-planning to passing through fixes. This approach fo-
cuses on safety at the route-planning level rather than the scheduling level, and agents
have many more options for maneuvers to avoid conflicts. Optimality plays a major role
as well; aircraft must also consider the cost of maneuvers taken to ensure the safety of
the aircraft [12].
One approach is to treat this as a pairwise problem, where one aircraft will conflict
with another in a given time horizon and a reroute or velocity change must occur in order
to avoid this conflict [63]. This unfortunately does not explicitly tackle the problem
of ‘cascading’ conflicts: new conflicts generated by the rerouting of two planes.
Approaches which address this include plotting courses for avoidance maneuvers using
mixed integer programming [26, 17], distributed search strategies [72, 77], optimal control
[89], game theory [45], and negotiation [78].
These approaches work well when the conflict between aircraft is well-defined and has
no uncertainty. However, in real flight there can be differences between the announced
flight plans and the actual trajectory taken. The FAA separation restrictions accommodate
this uncertainty; however, as the airspace becomes more crowded, it will
become impossible to attain the desired throughput while maintaining the current safety
restrictions. In order to maintain safety in the airspace while allowing a higher through-
put, it is necessary to automatically identify which types of planes need higher safety
allowances, and to predict ways these planes may deviate from their courses that may
not be communicated to controllers in a timely manner. The information on air traffic
necessary to make these more sophisticated decisions should be available to each plane
through the (mandatory by 2020) ADS-B implementation, and the (optional) TIS-B
subscriptions [35].
ADS-B stands for Automatic Dependent Surveillance - Broadcast, a system that uses
satellite navigation and periodically broadcasts an aircraft's location, enabling it to be tracked by
regulatory entities [1]. TIS-B stands for Traffic Information Service - Broadcast, and is
a situational awareness tool for pilots to see real time positions of nearby aircraft [80].
In our work, we rely on the ability to observe aircraft in the space in order to deter-
mine control strategies, and these technologies would make this observability a feasible
assumption.
2.1.2 Robotic Routing and Scheduling
Previous research has explored congestion as a centralized controller scheduling and
routing problem. Qiu et al. give a comprehensive survey of centralized scheduling
methods for automated vehicles [68]. A centralized controller tells every robot in the
system when it can travel, and where it can travel. In order to compute a solution, these
methods require full state information and are often slow and computationally complex.
Dynamic routing methods [87, 21] manage the time-window of each robot through time
expanded graphs without needing full state information [40]. However, these methods
become computationally expensive when applied to a fast-changing system.
Other methods for traffic routing have focused on vehicle negotiation to resolve con-
flicts in congested areas. Large numbers of aircraft can use these schemes to resolve
local conflicts, resulting in improved system performance [93]. Some research has
developed peer-to-peer collision avoidance protocols. For example, a software system called
AgentFly uses an agent-based distributed negotiation approach to the air traffic routing
problem [64]. This technique works well in domains with standardized communication
and vehicle abilities. However, UAVs are diverse in both hardware and software. The
method we propose in this work accommodates this diversity while still leaving open the
possibility of future progression towards standardization.
Unlike previous research, we use an incentivized routing technique instead of a central
controller or vehicle negotiation scheme. Our proposed UTM system does not assign
paths to each robot nor does it require the robots to act in a cooperative manner. Our
algorithm assumes each UAV is using a cost-based planner and so manipulates the cost
of travel through the airspace to incentivize the robots to avoid congested areas.
2.2 Traffic Modeling
Traffic models are useful in that they allow researchers to experiment with modifying
traffic incentives without actually performing experiments on full-scale traffic. There
are two main types of traffic: vehicular traffic, which relies on modeling human reaction
and control for embodied vehicles, and network traffic, which assumes an algorithm is
controlling the movement of traffic through the system. Our traffic type falls into the
intersection of these two systems; we assume a set of embodied vehicles following an
algorithmic control method.
Hoogendoorn and Bovy divide vehicular traffic models into four broad categories based
on level of modeling detail: submicroscopic, microscopic, mesoscopic, and macroscopic.
Each of these approaches takes a varying level of abstraction of the motions of traffic
throughout a system. Submicroscopic models focus on detailed physical models of each
vehicle moving in the system. Microscopic models focus on general models of individual
vehicles, and make heuristic rules for their interactions in the space. In contrast, meso-
scopic and macroscopic models do not model individual vehicles; they focus on modeling
the traffic density based on fluid physics analogs [107], or statistical data [61].
Under the classification structure offered by Hoogendoorn and Bovy [47], we follow a
microscopic traffic model with deterministic driver modeling. This means that we model
each UAV as a point moving in the space. However, instead of modeling the motion of
each traffic entity based on following-distance or other human-based metrics, we make
the assumption that autonomous UAVs will follow cost-optimal paths.
In contrast, much of the work in traffic research at the microscopic modeling level
seeks to identify the behavior of individual drivers in the system. Toledo et al. model
the driver actions at stop lights, and how this affects the traffic flow between these lights
[92]. Miyajima et al. fit two kinds of models to real samples of driver behavior; an
optimal velocity model and a Gaussian mixture model [60].
We assume that each of our ‘drivers’ is in fact a rational robot following a cost-optimal
algorithm. This is a departure from the major methods of traffic modeling, and
provides an algorithmic rather than an instantaneous mathematical description of
following behavior. Because our model targets a package delivery application, it
encapsulates a somewhat higher level of reasoning and a significantly lower level of
physical influence than traditional traffic modeling.
Our approach of assuming algorithmic behavior of each element, rather than fitting
models to human behavior, may actually more closely resemble network traffic modeling.
UAV traffic fits somewhere in the middle; autonomous and rational entities move through
the space, but have the potential to interact physically with one another. Shortest-path
algorithms are popular for network routing; however, previous work in a routing domain
has shown that sending objects along a longer path may benefit the system as a whole
by alleviating downstream congestion [109].
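A toy example, with purely illustrative numbers, shows why greedy shortest-path routing can be worse for the system than diverting some traffic onto a longer route:

```python
def total_travel_time(n_short, n_total, short_free=10.0, per_uav=2.0, long_cost=20.0):
    """System-wide travel time when n_short of n_total vehicles take a short
    route whose cost grows linearly with load, and the rest take a longer,
    uncongested route. All constants are illustrative."""
    short_cost = short_free + per_uav * n_short   # congestion effect
    return n_short * short_cost + (n_total - n_short) * long_cost

# Greedy shortest-path routing sends all 10 vehicles down the short route,
# but the system does better when some are diverted onto the long route.
greedy = total_travel_time(10, 10)                          # 10 * 30 = 300
best = min(total_travel_time(k, 10) for k in range(11))
assert best < greedy
```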
2.3 Autonomous Navigation
A navigation task is the process of determining a path from a start point to a goal
point that has sufficient safety and path-efficiency. Autonomous navigation has existed
in the literature from the earliest stages of robotics, although the inputs that robots are
now able to handle have increased substantially in complexity. Path planning research
has focused mainly on abstract navigation problems across maps, and we will discuss
algorithms for this in detail in Section 2.3.1. Of recent interest is the problem of applying
this path planning to multiple robots, which we will discuss in Section 2.3.2. Finally, we will
discuss approaches toward navigating in an unfamiliar environment in Section 2.3.3.
2.3.1 Path Planning
A planning algorithm locates a path that a robot can travel through a space. This may
be expressed in terms of pixel values or waypoints. Autonomous systems rely heavily
on path planning algorithms to move safely and efficiently through the space. There are
a wide variety of algorithms to accomplish this, which can focus on maximizing time
efficiency, path cost efficiency, or risk mitigation. Path planners can be deterministic
or probabilistic in the method with which they search through the space, resulting in
varying path cost optimality.
An optimal planning algorithm minimizes the cost across the planned path. A com-
mon optimal planning algorithm is the A∗ search algorithm. A∗ was introduced in 1968
by Hart et al. [44], and marked an improvement over Dijkstra’s algorithm [32] by intro-
ducing a heuristic to truncate the search, making it more time-efficient. This heuristic
estimates the cost of a proposed path through the space. Optimality guarantees are
maintained as long as the A∗ heuristic is admissible, meaning that it does not overesti-
mate the cost of a path. Given the trivial heuristic of always estimating a path cost of
zero, A∗ reduces back to Dijkstra’s algorithm.
Algorithm 1 A∗(G, s, h)
 1: for each vertex u in V do
 2:     d[u] := f[u] := infinity                  ▷ initialize vertex u
 3:     color[u] := WHITE
 4:     p[u] := u
 5: end for
 6: color[s] := GRAY
 7: d[s] := 0
 8: f[s] := h(s)
 9: INSERT(Q, s)                                  ▷ discover vertex s
10: while Q ≠ ∅ do
11:     u := EXTRACT-MIN(Q)                       ▷ examine vertex u
12:     for each vertex v in Adj[u] do
13:         if w(u, v) + d[u] < d[v] then
14:             d[v] := w(u, v) + d[u]            ▷ edge (u, v) relaxed
15:             f[v] := d[v] + h(v)
16:             p[v] := u
17:             if color[v] = WHITE then
18:                 color[v] := GRAY
19:                 INSERT(Q, v)                  ▷ discover vertex v
20:             else if color[v] = BLACK then
21:                 color[v] := GRAY
22:                 INSERT(Q, v)
23:             end if
24:         else
25:             ▷ edge (u, v) not relaxed
26:         end if
27:     end for
28:     color[u] := BLACK                         ▷ finish vertex u
29: end while
In this work, we use the Boost Graph Library version of A∗, given in Algorithm 1,
with a Euclidean heuristic, meaning that we estimate the straight-line distance from a
given point in the space to the goal. This heuristic will never overestimate the
path cost, because any route through our graph is at least as long as the straight-line distance.
We change the notion of optimality from distance, the traditional approach, to
predicted traffic reduction. We do this by adding a traffic-based value to the cost of
traversing parts of the airspace; because this only increases edge costs, the Euclidean
heuristic remains admissible. The value used to modify the perceived distance is
approximated through a learning process.
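A minimal sketch of this idea, assuming a simple waypoint-graph representation rather than our actual Boost Graph Library implementation (all names here are illustrative): the learned traffic penalty is added to the edge weights, while the heuristic remains the plain Euclidean distance.

```python
import heapq
import math

def astar(nodes, edges, traffic_cost, start, goal):
    """A* over a waypoint graph with a Euclidean heuristic.

    nodes: {id: (x, y)}; edges: {id: [(neighbor, base_length), ...]};
    traffic_cost: {(u, v): learned penalty >= 0 added to that edge}.
    Adding penalties to edge weights (never to the heuristic) keeps the
    Euclidean heuristic admissible.
    """
    def h(u):
        (x1, y1), (x2, y2) = nodes[u], nodes[goal]
        return math.hypot(x2 - x1, y2 - y1)

    dist = {start: 0.0}
    parent = {start: None}
    frontier = [(h(start), start)]
    while frontier:
        _, u = heapq.heappop(frontier)
        if u == goal:                      # reconstruct the path
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1], dist[goal]
        for v, length in edges.get(u, []):
            w = length + traffic_cost.get((u, v), 0.0)
            if dist[u] + w < dist.get(v, float("inf")):
                dist[v] = dist[u] + w
                parent[v] = u
                heapq.heappush(frontier, (dist[v] + h(v), v))
    return None, float("inf")
```

With no traffic penalties this returns the geometric shortest path; raising the learned penalty on a congested edge diverts the plan onto a longer but less-loaded route.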
There have been many extensions to A∗ tailoring it specifically to the problem of con-
trolling UAVs. Tseng et al. examined the use of A∗ for civilian UAV flight, and adapted
it to consider the strength of signal available from base stations across the map [94].
Genetic algorithms have also been explored as a method of UAV path planning. Allaire
et al. explored using a genetic algorithm for UAV path planning across a map using a
Field Programmable Gate Array (FPGA) to parallelize genetic algorithm computation
[9]. Qu et al. compared the use of a genetic algorithm and a heuristic path
planner for planning across a battlefield [69]. They looked at balancing the path cost
with the threat level of a given path.
Risk-optimization is an important planning consideration for many UAV applications.
Ant colony optimization has been used to plan across spaces with multiple different types
of threat, such as no-fly zones, hazardous weather areas, and low-altitude control zones
[110]. Ozalp and Sahingoz looked at planning using parallelized genetic algorithms to
find paths that minimize a scalarized cost metric that models threat zones around radar
[62].
A hierarchical A∗ method was introduced by Wang et al. to handle high-level navigation
across large maps, focusing specifically on using OpenStreetMap [100, 43]. Viewing
a large map hierarchically allowed autonomous navigation systems to plan across
much larger spaces, reducing the combinatorial cost of applying A∗ across the entire,
more detailed map. In our work, we use a similar hierarchical A∗ structure, although
our intent is to apply our traffic control algorithm rather than mitigate computational
costs.
A∗ is by no means the only path planning algorithm. Other algorithms offer even
faster solutions, with bounded optimality. Rapidly Exploring Random Trees (RRT) was
developed as a fast way of planning across the space by randomly selecting points and
identifying connections between them [55]. RRT has many extensions, including CC-
RRT∗ , which extends RRT∗ by approximating risks of constraint violation[57]. RRT
has also been used in conjunction with a gaussian process in a UAV planning application
to explore in an unknown environment [105].
We base our learning algorithm around modifying the cost map given to a cost-
based planner. Though there are many alternatives in path planning, we have chosen
the A∗ algorithm for planning low-level motions as well as high-level navigation through
the space. We use A∗ because it is a widely known cost-based planner, has optimality
guarantees, and a fast implementation is available through the Boost Graph Library.
Future work could explore the use of other algorithms for planning across the space, but
this is out of the scope of our work here.
2.3.2 Multi-UAV Path Planning
The planning problem becomes much more resource-intensive when applied to multiple
UAVs. The problem of multi-UAV path planning is similar to the problem of collision-
avoidance, but it looks at a longer time-horizon instead of investigating an imminent
conflict event. Joint multi-UAV path planning relies entirely on centralized communi-
cation and coordination, the complexity of which grows nonlinearly with the number of
UAVs participating in the system. Research has therefore focused on ways to increase
the speed of these joint planning algorithms, or on applying them to small numbers of
UAVs.
RRT has been investigated as one solution for generating multiple trajectories for
UAVs in the system, considering kinematic restrictions on UAVs so that they did not
interact in the space [53]. Non-conflicting paths based on differential geometry have also
been proposed, which can be mathematically constrained to deconflict with one another
[73]. Though these approaches provide valid non-conflicting paths, the presented
experiments involved at most three UAVs interacting in the space.
Multi-UAV path planning has also addressed the application of convoy protection,
which constrains UAVs moving through the space to be spatially near a convoy
at all points in their plan [33]. In another military-based application, UAVs performing a
surveillance task used a decentralized planning method combined with an auction-based
system in order to cooperate in the same space [108]. UAVs have also coordinated for
a tracking task using mixed-integer linear programming, although deconflicting paths
were not considered in this case [8].
Singh et al. tackled an informative path planning problem with multiple robots
using their algorithm eMIP, that used spatial decomposition and branch and bound
techniques combined with sequential allocation [76]. Stranders et al. mitigated the
complexity of a multi-robot planning problem by introducing an adjustable look-ahead
parameter to restrict plan length while coordinating a mobile sensor system [84]. Learning
approaches have also been used in group informative path planning tasks. Zhang et al.
introduced the cooperative and geometric learning algorithm (CGLA) to allow UAVs to
learn informative paths across the space using a Q-learning approach.
2.3.3 Learned Navigation
Deterministic path planning algorithms are not the only method of navigation through
a space. In cases where the environment is entirely unknown, an iterative approach can
be useful in navigating through a domain. Machine learning approaches can
be applied to navigation tasks to process state information that is ambiguous or
lacks a model connecting it to the desired movement actions.
Navigation tasks can be unclear, particularly when human interaction is involved.
The Fun Robotic Outdoor Guide (FROG) project has investigated the effects
of this by building a social cost function into its reinforcement learning method of
navigation [65]. This can iteratively shape the robot’s movements around people by
observing their reactions to the robot.
The field of visual navigation has been the subject of recent study. The collection of
large image databases such as ImageNet has provided a wealth of data for research
into enabling tasks such as autonomous driving [30]. Off-road robot navigation has
used machine learning on driving image sets from ImageNet to learn paths through
the space [67].
Deep learning has gained popularity for handling large amounts of image data, and
has been used for navigation tasks as well. Deep learning was applied in a multi-agent
setting in order to perform collision-avoidance based on observed images surrounding the
robot [56]. A cognitive mapper and planner (CMP) was also developed using a neural
architecture to learn to connect first-person views to a sequence of actions [42]. This
could then be used to find a desired goal state.
Similar to the task of image recognition, natural language navigation
instructions can provide unclear direction, and must be processed with machine
learning techniques; this approach is termed parser learning. Chen and Mooney connect
natural language commands to a formalized plan for navigating through an environment
using a lexicon learning algorithm [24]. Matuszek extends this work by removing the
need for an a priori obstacle map [58].
Learning routes across a space is an important field of research. It is particularly
important in unknown domains, or domains for which there is no map. The complexity
of our work lies not in the map being unknown, but in the actions of the UAVs traversing
the map being largely unknown. Instead of focusing on UAV navigation as the problem
to learn, we look at the routing of many UAVs. This way our learning algorithm can
handle the unknown portion, rather than focusing on navigating across a known map.
2.4 Learning and Optimization
In this section we will go over learning and optimization approaches. First in Section
2.4.1 we will cover reinforcement learning and Q-learning, which form some of the earliest
approaches to learning. Next, in Section 2.4.2 we will cover evolutionary algorithms,
which are a population-based method of iterative learning. Finally, in Section 2.4.3 we
will cover coevolution, which is the multiagent method used in this work.
2.4.1 Reinforcement Learning and Q-Learning
Reinforcement learning is an iterative method of learning to map situations and actions
to a reward signal [86]. Reinforcement learning is model free, which means that it up-
dates the quality of its actions iteratively through exploration and feedback rather than
querying a predictive model. This can provide substantial benefits when a model is
unavailable.
The general method of performing reinforcement learning is for an autonomous entity,
or agent, to sense its state, query its policy, take an action, receive a reward for this
action, and then update the values of its policy. An agent can occasionally intentionally
take an off-policy action for purposes of exploring the state space. There are two basic
methods of doing this: ε-greedy [86] and soft-max [18]. The strategy behind ε-greedy
action selection is to take a random action with probability ε, regardless of the estimated
quality of the actions. Soft-max, in contrast, selects the action based on a soft-max
activation function, where each action is chosen with the probability:
P(y = j | x) = e^{x^T w_j} / Σ_{k=1}^{K} e^{x^T w_k}    (2.1)
where x^T w_j yields the current valuation of state x given action j. This results in
high-valued actions being selected with higher probability and low-valued actions being
selected with lower probability.
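The two selection strategies can be sketched as follows, given a list of current action valuations (a minimal illustration, not the implementation used in this work):

```python
import math
import random

def softmax_action(values):
    """Choose an action index with probability given by Eq. (2.1):
    P(j) = exp(value_j) / sum_k exp(value_k)."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    r, cum = random.random(), 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return i
    return len(values) - 1  # guard against floating-point round-off

def epsilon_greedy(values, eps=0.1):
    """With probability eps take a random action; otherwise act greedily."""
    if random.random() < eps:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])
```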
Q-learning is a method of incorporating delayed feedback into a value function [102].
It operates on a temporal difference update, which looks ahead at potential future states
that the agent could occupy. The update method is given by:
Q(s_t, a_t) ← Q(s_t, a_t) + α_t · (r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t))    (2.2)
where Q(s_t, a_t) is the current evaluation of the quality of action a_t at state s_t, α_t is
the learning rate, and γ is the discount factor on future value. Q-learning has been shown to
converge to the optimal policy for the single-agent case in a static domain [101], but this
does not hold for the multiagent case due to the domain’s more dynamic nature.
A common implementation of Q-learning is to use a Q-table, and to discretize states
and actions to match elements in this Q-table [51]. States with continuous representa-
tions may not fit into this tabular approach, and approaches such as tile coding [74] and
function approximation [59, 81] have been used.
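A dictionary-based Q-table update following Eq. (2.2) can be sketched as below; the default α and γ values are illustrative:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step per Eq. (2.2).

    Q maps (state, action) pairs to values; `actions` lists the actions
    available in s_next. Unvisited entries default to zero.
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q[(s, a)]
```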
Millan et al. introduced an incremental topology preserving map (ITPM) to serve
as the function approximator for continuous Q-learning [59]. Smart and Kaelbling use
locally weighted regression to approximate the Q-table [81]. Neural networks have also
provided a method of function approximation for continuous Q-learning, as used by
Tesauro for his work in temporal-difference backgammon [91]. Q-learning is not the only
method that can be used to modify neural networks; in the next section we will discuss
evolutionary algorithms that adapt populations of neural networks.
2.4.2 Evolutionary Algorithms and Neural Networks
Unlike standard reinforcement learning, evolutionary algorithms are based around having
populations of policies, rather than adapting an individual policy [15]. This mimics an
evolutionary process from nature, whereby population members produce new slightly
mutated population members, and then the members will compete with each other and
the fittest will survive. Genetic algorithms are a specific kind of evolutionary algorithm
that includes crossover, or recombinations of existing policies in the space [39]. This is
combined with mutation in order to produce child solutions. Mutation is the process by
which some elements of a policy in the population are changed.
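A minimal mutation-and-selection loop illustrates the idea, assuming real-valued genomes, Gaussian mutation, and truncation selection (crossover is omitted for brevity):

```python
import random

def evolve(fitness, pop_size=10, genome_len=5, generations=50, sigma=0.1):
    """Minimal (mu + mu) evolutionary algorithm sketch.

    Each generation, every member produces a slightly mutated child;
    parents and children then compete, and the fittest half survive.
    All parameters are illustrative.
    """
    pop = [[random.uniform(-1, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        children = [[gene + random.gauss(0.0, sigma) for gene in parent]
                    for parent in pop]
        combined = pop + children
        combined.sort(key=fitness, reverse=True)  # fittest first
        pop = combined[:pop_size]                 # survival of the fittest
    return pop[0]
```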
Neuro-evolution consists of an evolutionary algorithm that has neural networks as its
population members. Neural networks are function approximators that can be used to
model continuous state-action control policies to arbitrary accuracy while only requiring
a coarse approximation of the system state [48]. There are many recent implementations
of neuro-evolutionary algorithms. Notably, the NEAT framework provides an adaptive
way to evolve network topologies as well as improving the network weights [82].
This allows a population of networks to adapt their structure in order to accommodate
the complexity of a given problem. Whiteson et al. present FS-NEAT, an algorithm for
automatic feature selection that extends the NEAT framework to select important inputs
[103].
Evolutionary algorithms have also been used for multi-objective optimization. The
Strength Pareto Evolutionary Algorithm 2 (SPEA2) provides a method of using evo-
lutionary algorithms with multiple objectives [112, 19]. There are many objectives to
consider in the traffic management problem, such as safety, travel time, cost of fuel,
and other real-world considerations. However, we leave the investigation of the traffic
management problem as a multiobjective problem as a topic of future work.
2.4.3 Cooperative Coevolution
Cooperative coevolution refers to a set of agents in a system that learn together using an
evolutionary algorithm. Typically these systems learn together toward a common goal,
which is reflected in the fitness values given to these agents. As an extension of standard
evolutionary algorithms, cooperative coevolutionary algorithms (CCEAs) have been
shown to perform well in cooperative
multiagent domains [37, 66]. Pseudocode for a cooperative coevolutionary algorithm is
given by Algorithm 2.
Routing for a UAV team was evolved using cooperative coevolution in combination
with game theory [104]. Zheng et al. also coevolved coordinated UAV paths [111]. UAV
routing was also investigated using cooperative coevolution in conjunction with an ant
colony algorithm [85].
In this work we use cooperative coevolution for UAV traffic management, but we
do not focus on coevolving individual UAV routes. Instead we focus on evolving a
decentralized policy for incentivizing travel through parts of the airspace. UAV path
planning is left to deterministic path planners, as discussed in Section 2.3.1.
Algorithm 2 CCEA(agents, domain)
 1: Initialize k neural networks for every agent.
 2: for epoch = 1 → total epochs do
 3:     Mutate population members to create k new population members for each agent.
 4:     for i = 1 → 2k do
 5:         Set ith population member as the active policy for every agent ∈ agents.
 6:         team fitness = domain.Simulate(agents)
 7:         Assign every active population member their team fitness score.
 8:     end for
 9:     Drop k population members with the lowest fitness for each agent.
10: end for
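A sketch of Algorithm 2 in code, substituting short real-valued vectors for the neural-network population members; the population size, mutation noise, and simulate interface are illustrative stand-ins:

```python
import random

def ccea(n_agents, simulate, k=5, epochs=20, sigma=0.1):
    """Cooperative coevolution following Algorithm 2.

    Each agent keeps a population of k policies (here, short real-valued
    vectors standing in for neural networks). Every epoch, mutation doubles
    each population to 2k; teams pair same-indexed members across agents,
    each team's fitness is shared by its members, and the k lowest-fitness
    members of every population are dropped.
    """
    pops = [[[random.uniform(-1, 1)] for _ in range(k)]
            for _ in range(n_agents)]
    for _ in range(epochs):
        for pop in pops:  # mutate to create k new members per agent
            pop.extend([[g + random.gauss(0.0, sigma) for g in p]
                        for p in pop[:k]])
        scores = [[0.0] * (2 * k) for _ in range(n_agents)]
        for i in range(2 * k):
            team = [pop[i] for pop in pops]   # ith member is active
            fit = simulate(team)              # team fitness
            for a in range(n_agents):
                scores[a][i] = fit            # shared by active members
        for a in range(n_agents):  # drop the k lowest-fitness members
            ranked = sorted(range(2 * k), key=lambda i: scores[a][i],
                            reverse=True)
            pops[a] = [pops[a][i] for i in ranked[:k]]
    return [pop[0] for pop in pops]
```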
2.5 Multiagent Coordination
The canonical multiagent learning problem developed from two-player games. In
the previous section we discussed methods of modeling agent behavior. The
majority of these methods were applied to two-player domains. However, the field of
multiagent learning is increasingly focusing on large multiagent systems, such as network
routing and sensor combination [88, 99]. In this section we will discuss the challenging
properties of scaling in multiagent systems. Unlike in the last section, we will also con-
sider cooperative multiagent domains, as these tend to grow much larger in size than
competitive domains.
2.5.1 Multiagent Air Traffic Control
Recent work in multiagent systems has focused on developing effective routing for com-
mercial air traffic across the national airspace. In air traffic, delays and congestion can
cascade throughout the system, and are difficult to mitigate and control. Using rein-
forcement learning agents to manage air traffic through geographical fixes, Agogino and
Tumer [7] were able to reduce airspace congestion by over 90% when compared to cur-
rent traffic control methods. Cooperative coevolutionary algorithms provided similar
demonstrations of the efficacy of a multiagent learning approach [106, 27]. The setup
presented in [7] has agents control the separation between planes in the airspace. These
works provide decentralized control of an airspace, but do not consider movement around
obstacles as in our approach.
In this work, we apply a similar network of routing agents to control the flow of UAV
traffic. However, we consider the UAV domain where platforms are not restricted to
fly through particular fixes in the environment. Furthermore, our routing algorithm is
based on the assumption that we have no direct control over the path planning aspect
of the UAVs.
The problem of handling UAVs in the airspace is also emerging as a distinct field.
Aubert et al. [13] proposed a framework for UAVs to communicate potential flight plans
to a centralized air traffic management system that determines whether to accept or
reject the plan. Recent work by Digani et al. has also focused on the use of UAVs in
an industrial setting using ensembles [31]. They also use probabilistic path planning to
prevent traffic jams within a warehouse [31].
Controlling the social aspect of UAVs is also relevant in recent work. Foina et al. [38]
investigate a method for individual airspace owners to allow or disallow flights through
their airspace, mitigating annoyance by property owners. While not directly related to
a study of social acceptance of UAVs in the airspace, our approach addresses the issue
of density of UAVs in the airspace. The safety and potential noise issues associated with
a dense swarm of UAVs could be a significant limiting factor in long-term acceptance of
UAV delivery.
2.5.2 UAV Swarms
UAV swarming behavior has become a popular topic of research. The main problem
with practical UAV swarming behavior is positional awareness of other elements of the
UAV swarm. Several centralized methods of communication have been used for early
demonstrations, such as Vicon motion capture [49, 71, 98]. However, there is an increased
focus on the use of UAV swarms in outdoor environments without the need for complex
systems to be in place. We will focus primarily on these applications, as outdoor use is
the desired application for our work.
Swarms as mobile networking platforms have been a particularly active topic of re-
search, termed flying ad-hoc networks, or ‘FANETs’ [16]. FANETs are distinct from
MANETs (mobile ad-hoc networks) and VANETs (vehicular ad-hoc networks) because
of their higher degree of mobility, owing to the generally higher speed
and maneuverability that UAV platforms offer. This also means that typical distances
in a FANET problem will be larger than with a VANET or MANET.
Broadly, swarming behavior for maintaining network connectivity can be divided into
two categories: networks formed with the intention of providing informational relays
between multiple ground stations, and networks formed with the intention of maintaining
connectivity for multi-UAV coordination purposes. Sivakumar et al. investigated UAV
swarms as a wireless communication relay solution [79]. This was proposed with the
intention of UAVs autonomously detecting signal sources and self-organizing to connect
points of communication. Physical UAV experiments toward wireless UAV networking
were conducted by the SUAAVE project using a small UAV platform with a target
payload under 100g [23].
Toward the goal of maintaining network connectivity within a self-assembling team,
Teacy et al. focused on the problem of maintaining connectivity between elements of
a swarm by incorporating this restriction into the multi-UAV planning algorithm used
by the swarm elements [90]. Hsieh et al. proposed a reactive controller designed for
communication link maintenance, which incorporates radio connectivity into a position
controller’s input [50]. The physical communication aspects of swarming behavior are
also under research. Chlestil et al. investigated sensors that would enable networking
between swarms of UAVs, and found that Free Space Optics (FSOs) provided the most
feasible solution for fast transfer of data between swarming UAVs [25].
UAVs first gained popularity in the military sector, with large, complex drones.
With the miniaturization of UAVs and an increase in focus on coordination of many
less-complex elements, military applications of swarming UAVs have been proposed.
Typical operation of a UAV requires more than one human per UAV, but with swarming
algorithms the control of several UAVs simultaneously could become feasible for a single
human. Swarm control for up to 50,000 UAVs using the pheromone control algorithm
ADAPTIV was investigated for expanding the ability of a single soldier to control large
swarms of UAVs [11]. Lamont et al. also investigated military mission planning for
swarms of UAVs using a swarm path planning algorithm, considering multiple mission
objectives using an evolutionary algorithm [54]. Multiagent swarming behavior has also
been investigated with the intent to automatically identify targets using computer vision
and multiple cooperating UAVs [29].
2.5.3 Scaling with Large State or Action Descriptions
A common challenge in multiagent systems is known as the ‘curse of dimensionality’.
This refers to the fact that as the number of agents in a system increases, the size of the
problem formulation may grow nonlinearly.
Agent modeling suffers particularly when a set of agents must coordinate a joint ac-
tion or consider state information of other agents. The number of joint actions and
observations increases exponentially with the number of agents. Monte Carlo Tree
Search (MCTS) has been used in conjunction with Bayesian reinforcement learning in
large POMDPs [10]. ‘Large’ in this context, however, consisted of four agents and 15
states. It is not clear whether this approach would scale to hundreds of agents or states.
Large state spaces, in addition to joint action spaces, can slow learning performance.
When a problem is framed as a multiagent Markov decision process (MMDP), the state
space becomes exponential in the number of agents. The use of generalized learning
automata has been explored in order to reduce this dimensionality.
2.5.4 Difference Rewards
Large multiagent systems have many challenges when performing a coordination task. A
common approach to coordinating agents is to give the same score to the entire system
for a task that they are completing. This gives some indication of how well the task is
being completed as a result of their joint effort, but does not directly assess how each
of them contributed. The amount of the reward that they are not responsible for is
considered ‘noise’.
A major focus in multiagent learning is reducing the effects of this ‘noise’ while
ensuring a reward for working toward a group objective. Noise reduction has been
approached by using difference rewards with counterfactuals to suppress the effects of
other agents’ actions [2, 3, 4, 6, 7, 95, 96]. Difference rewards reduce the noise in the
system by subtracting off the impact of other agents in the reward. This reward works
by comparing the evaluation of the system to a system in which an agent is not there,
or replaced by some counterfactual which refers to an alternative solution that the agent
may have taken [4]. Mathematically this is expressed as:
D_i(z) = G(z) − G(z_{-i} + c)    (2.3)

where G(z) is the full-system reward and G(z_{-i} + c) is the reward for a hypothetical
system in which agent i was removed and its impact was replaced by a counterfactual c.
The difference reward approach to reward shaping has been shown to improve the
speed of convergence and performance in many multiagent problems, such as the El
Farol Bar Problem and Multiagent Gridworld [5, 97]. The difference reward analytically
decomposes a system-level reward via a trade-off between alignment and sensitivity.
Alignment refers to the amount of information carried in the reward signal that indicates
how an agent’s actions impact the system as a whole. Sensitivity measures the reward
signal information that is directly impacted by an agent [5].
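The difference reward of Eq. (2.3) can be sketched on a toy congestion problem. The system reward below (negative sum of squared route loads) and the removal counterfactual are illustrative choices for demonstration, not the UTM reward used later in this work.

```python
# Sketch of the difference reward D_i(z) = G(z) - G(z_{-i} + c) from Eq. (2.3),
# applied to an illustrative congestion game: each agent picks a route, and
# the global reward penalizes squared route loads.
from collections import Counter

def G(actions):
    """Global reward: negative sum of squared route loads (illustrative)."""
    loads = Counter(actions)
    return -sum(n * n for n in loads.values())

def difference_reward(actions, i, counterfactual=None):
    """D_i = G(z) - G(z_{-i} + c); c=None means agent i is simply removed."""
    z_minus_i = [a for j, a in enumerate(actions) if j != i]
    if counterfactual is not None:
        z_minus_i.append(counterfactual)
    return G(actions) - G(z_minus_i)

actions = ["A", "A", "B"]           # two agents congest route A
d0 = difference_reward(actions, 0)  # agent on the congested route
d2 = difference_reward(actions, 2)  # agent on the uncongested route
```

Here the agent contributing to congestion on route A receives a lower (more negative) difference reward than the agent alone on route B, so each agent's signal reflects its own contribution rather than the joint noise.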
Chapter 3: UAV Traffic Modeling
Consider an airspace containing multiple UAVs traveling between various start and goal
locations. Each UAV plans its trajectory through the airspace according to a cost-based
path planner aimed at minimizing travel cost. Furthermore, apart from low-level collision
avoidance procedures, each UAV acts independently and does not coordinate with any
other UAV in the airspace. Our goal in this work is to monitor and reduce congestion
in the airspace by employing high-level UTM agents that manage the cost of traveling
through sectors of the airspace.
We use an iterative simulation in which UAVs travel across a graph using A∗ search.
The graph consists of randomly-placed nodes, which represent sector centers. Edges
are then constructed randomly between these nodes to make a planar graph. These edges
represent the boundaries between the sectors. We construct our graph to have random
planar edges, so that it can represent a connection map between physical areas of space.
A schematic of the problem setup is shown in Fig. 3.1a. The airspace is divided
into sectors that are each controlled by a single UAV Traffic Management (UTM) agent,
which controls traffic by assigning costs to travel through its portion of the airspace.
UAVs flying from start to goal locations in the world must travel through a series of
sectors and are subject to the cost of travel between each sector. The sector agents act
as a decentralized system of traffic controllers that each individually learn to apply the
appropriate costs in their respective sectors to minimize overall congestion in the entire
airspace. The obstacle map shown in Fig. 3.1a is a section of San Francisco’s South of
Market area, with data gathered from a building height map [22].
Given the potentially large number of UAVs in each sector, it is impractical to specif-
ically account for the individual states and policies of all UAVs. This motivates the use
of a learning algorithm for developing costing strategies that map a reduced state space
to the travel costs applied in the sector. In this work, we formulate the control policy
of each sector agent as a single-layer neural network where the weights of the neural
networks are learned using an evolutionary algorithm.
Each UAV generates an initial path to the goal by querying the initial cost to travel
Figure 3.1: (a) The UAV traffic management problem. The airspace is divided into discrete sectors with a single UTM agent controlling the cost of travel through each sector. UAVs traveling from start (triangle) to goal (circle) locations plan and execute paths to reduce their individual travel cost without explicitly coordinating with other UAVs in the airspace. This can lead to potentially dangerous conflict incidents with significant cascading effects. The goal of the sector agents is to learn appropriate costing strategies to alleviate bottlenecks in the traffic and minimize congestion in the airspace. (b) Graph Gh of sector agent connections used in the experiments. Agent policies assign edge costs based on the current sector state s, and the discretized cardinal direction followed on the edge. Different colors on this graph refer to different sector memberships of areas of the space.
along all edges of the graph and performing a high-level A∗ search to determine the
lowest-cost sequence of nodes to travel through to reach its goal location. The UAVs then
independently calculate their trajectories through the space with this information. Once
a UAV reaches its goal location, it disappears from the airspace. This approach for
UAV traffic control manipulates the costs perceived by the UAVs rather than directly
coordinating their trajectories, and therefore most of the computational cost can be
distributed across the entire set of UAVs rather than at a single point.
3.1 Airspace Construction
There were two ways in which we generated maps: generating a map from a human
analysis of a pixel-level physical map, and a random map generation algorithm. We
wanted to examine both of these methods because each had a distinct advantage. The
human-created map has a clearly-explainable logic behind control, and represents a real
physical space. However, construction of the airspace by a human is labor-intensive and
subject to bias. The randomly-created map allowed us to generate any complexity of
planar map with little time cost. However, the random maps do not directly map to an
obstacle-filled space, and do not necessarily reflect a distinct physical system. In this
section, we will explain the method behind constructing each of these maps.
3.1.1 Physical Map
We first looked at constructing an airspace across a physical obstacle map drawn from
buildings in a real city. We looked up height maps for San Francisco, and identified a
downtown area in the South of Market (SoMa) district that we would use to generate our
obstacle map. We then applied a threshold to the height map data to obtain a binary
obstacle map.
We partitioned the space by visually identifying intersection points in the space. We
determined 15 suitable intersection points, which would define 15 different agents across
the space. We then used a Voronoi partitioning in order to divide the space into sectors of
control for the agents. Traversal across each pixel in the 256x256-pixel map took one
simulated step for each UAV.
This method depended highly on human factors; for one, the map area was hand-
selected from a larger map of San Francisco. This was not only labor-intensive, but
involved selecting an area that was biased toward a map that would provide a human-
identifiable traffic scenario. Additionally, the points were human-selected rather than
randomly-selected, which biases the scenario toward what ‘looks like’ intersections to a
human observer. Because it is not certain that intersections are a
good analogy for this problem, and we wanted a less labor-intensive way of generating
maps, we also explored a random map generation method. We will discuss this in detail
in the next section.
3.1.2 Random Map
We wanted to have a testing airspace that could be arbitrarily complex, and independent
of human design. To do this, we developed an algorithm to generate a random map for us
to test. We made the additional restriction that the airspace be planar, so that it could
be a valid abstraction of a 2-dimensional pixel-level obstacle map, as in the physical map
airspace definition. Though 3-dimensional travel is a potential option for UAVs, we
restricted ourselves to the planar case so that the random maps would remain a valid
abstraction of the human-generated height map method explained in the previous section.
Algorithm 3 GenerateRandomAirspace(nodes)
1: N = |nodes|
2: Initialize graph G with nodes
3: possible_connections = all (N choose 2) node pairs
4: randomize(possible_connections)
5: for connection ∈ possible_connections do
6:   if connection does not overlap any edge ∈ G then
7:     Add connection to edge list in G
8:   end if
9: end for
10: Return G as high-level airspace graph
Though we selected a planar map generation method here, this is not strictly neces-
sary for our control algorithms. We merely desired this to connect more firmly with the
physical space that we were abstracting. If we permitted non-planar graph generation,
we could model a space with three dimensions. This would effectively enable UAVs to
pass above or below one another in the space. Because this does not change the nature of
the learning algorithms that we apply in this work, we do not address here the possibility
of having more than two dimensions of movement in the airspace.
3.2 Traffic Generation and Elimination
In this section, we will go over the different parameters regarding the generation and
elimination of traffic in the system. This section deals with how traffic originates in the
system, how its endpoint is defined, and what happens to the traffic when it has reached
this endpoint.
Deterministic traffic (ρ): Traffic was generated at every point in the system, at a pre-
set interval. This generation rate parameter, ρ, determined how frequently traffic was
generated, with a lower value of ρ corresponding to a higher rate of traffic generation.
The units of ρ are timesteps per generated UAV.
Multiple generation: Under the deterministic method of traffic generation, a node
had the potential of generating more than one traffic element at the specified interval.
In later experiments, this was increased from one to several in order to show the effects
of a high amount of traffic entering the system all at once.
Probabilistic traffic: Traffic was generated probabilistically at each step, with prob-
ability pgen. This differs from deterministic generation because there could be asyn-
chronous introduction of new traffic by each node in the system.
Static destinations: Destinations were chosen from a list of pre-defined destinations.
Traffic would be sent to these destinations as it was generated, cycling through this
destination list. This provided even traffic flow across the entire map, and allowed us to
reduce randomness in the system if desired.
Random destinations: Destinations were selected randomly at the time of generation.
A UAV could select any node in the airspace, apart from the node at which it had just
appeared.
Removed at destination: When a UAV reached its destination, it could be removed
from the simulation, and would no longer contribute to traffic in the system. This would
correspond to a package delivery application where a UAV landed and was stored or
recharged at its destination.
Rerouted at destination: Alternatively, a UAV could reach its destination and be
rerouted to a different node immediately. This models a physical system where a UAV
makes several sequential deliveries in the airspace.
3.3 Performance Metrics
There are various ways to report traffic; we report on three different metrics. The first,
conflict, describes a physical proximity of more than one UAV in the space (Section 3.3.1).
The next metric, delay, assumes enforcement of a certain restriction of travel across the
airspace, and counts the number of times a UAV must wait in order to accommodate this
enforcement (Section 3.3.2). The final metric, travel time, assumes the same restriction
of travel, but aims to measure the average time that a UAV takes to traverse the system
(Section 3.3.3).
3.3.1 Conflict
We would like to train the agents to reduce congestion across the airspace. However,
there are many definitions of ‘congestion’ in traffic research. We draw one definition from
the field of air traffic management. In air traffic management, ‘conflict’ events indicate
that two or more aircraft have come within unsafe proximity to one another.
As the UAVs execute their planned trajectories, conflicts occur when two or more
UAVs pass within some distance, dconflict, of one another. The number of conflicts
detected in the entire airspace over a single learning epoch is then used to calculate the
team fitness evaluation. There are two different ways of viewing the severity of a conflict
event. One way of viewing conflict is to look at the number of UAVs that come into
close proximity with one another. This is our linear conflict penalty. Another way is to
penalize the severity of the conflict, or number of UAVs involved in a conflict instant.
We do this by adding a squared term to calculate the penalty for co-located conflicts,
which we call our quadratic conflict penalty. The differences between these two methods
are shown in Figure 3.2.
This approach models concerns in air traffic management, but does not model possible
propagation of physical delays caused by conflict-avoidance maneuvers. We consider
it bad for a UAV to come too close, but assume that lower-level conflict-avoidance
maneuvers will handle these interactions if there are sufficiently few of them. Later, in
Section 3.3.2, we will introduce a performance metric that models the physical delay
propagation caused by UAVs moving in the system, but the results presented in this
section focus only on conflict as the reported metric.
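The two penalties can be sketched as follows. Grouping co-located UAVs via connected components of the proximity graph is one illustrative reading of a shared conflict event; the function names are our own.

```python
# Sketch of the linear and quadratic conflict penalties: UAVs within
# d_conflict of one another are grouped into conflict clusters (connected
# components of the proximity graph); the linear penalty sums cluster
# sizes, the quadratic penalty sums squared cluster sizes.
from math import hypot

def conflict_clusters(positions, d_conflict):
    """Connected components of the proximity graph, via union-find."""
    parent = list(range(len(positions)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            (x1, y1), (x2, y2) = positions[i], positions[j]
            if hypot(x1 - x2, y1 - y2) <= d_conflict:
                parent[find(i)] = find(j)   # merge conflicting UAVs
    clusters = {}
    for i in range(len(positions)):
        clusters.setdefault(find(i), []).append(i)
    return [c for c in clusters.values() if len(c) > 1]

def conflict_penalties(positions, d_conflict):
    """Return (linear, quadratic) penalties for one timestep."""
    sizes = [len(c) for c in conflict_clusters(positions, d_conflict)]
    return sum(sizes), sum(s * s for s in sizes)
```

With three co-located UAVs this yields penalties of (3, 9), and with two separated pairs (4, 8), matching the two scenarios of Figure 3.2.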
Figure 3.2: Two examples of conflict situations. Figure 3.2a shows three UAVs coming into proximity with one another simultaneously. With our linear conflict count, the fitness penalty would be 3. With our quadratic count, the fitness penalty would increase by 9 due to the squared term. In Figure 3.2b there are two areas of conflict with only two UAVs in conflict. Under a linear count, the fitness penalty would increase by 4, and under a quadratic count the fitness penalty would increase by 8. The situation in Figure 3.2a is preferable under the linear conflict count method, and the situation in Figure 3.2b is preferable under the quadratic method.
3.3.2 Delay
Removing our pixel-level simulator also removes our ability to count ‘conflict’ penalties
as they occur in the space. Instead of counting this metric, we limit the number of UAVs
that may travel along an edge at once. The objective is to reduce the cumulative delay
experienced by all UAVs in the system. In our abstracted problem, we arbitrarily select
the capacity of each edge i to be a constant κi = 2. This edge capacity would be a
calculable feature of the physical space in a real-world implementation. When an edge
has ζi ≥ κi UAVs traveling across it, it is unsafe to add any more UAVs to that edge.
The UAVs that are ready to enter that edge must wait until the current traffic on the
edge drops below capacity before they may continue.
The number of UAVs that travel across an edge is a function of the UAVs’ starting
locations, ending locations, generation rate, and edge costs along the map. Since UAVs
are using cost-based planners to find the cheapest path, and agents may only apply an
edge cost that is greater than zero, UAVs will plan cycle-free paths. This prevents trivial
solutions where UAVs develop ‘holding patterns’ that do not incur delay but drastically
increase the time that it takes for the UAV to get to its destination.
The cumulative delay in the system is the sum of the number of UAVs waiting to
move to the next edge in their path at each timestep. The number of UAVs in the
system, as well as the paths they take, affects the quantity of delay. The costs assigned
to the edges of the graph by the learning agents directly affect the planned UAV paths.
Thus, the goal of the multiagent UTM team is to learn costing strategies that minimize
the cumulative delay.
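The cumulative-delay metric above can be sketched as a small capacity-constrained simulation. The one-timestep-per-edge traversal and the admission rule below are illustrative simplifications; only the accounting (summing the number of waiting UAVs over timesteps) follows the metric as defined.

```python
# Sketch of the cumulative-delay metric: UAVs follow fixed edge paths, and
# each edge admits at most kappa UAVs per timestep. A UAV waits when its
# next edge is at capacity; cumulative delay is the total number of
# waiting UAVs summed over all timesteps.

def cumulative_delay(paths, kappa=2, max_steps=1000):
    """paths: one list of edges per UAV, e.g. [("A","B"), ("B","C")]."""
    progress = [0] * len(paths)            # index of next edge per UAV
    delay = 0
    for _ in range(max_steps):
        if all(progress[i] >= len(p) for i, p in enumerate(paths)):
            break                          # every UAV has arrived
        load = {}                          # UAVs admitted to each edge now
        for i, path in enumerate(paths):
            if progress[i] >= len(path):
                continue                   # already arrived
            edge = path[progress[i]]
            if load.get(edge, 0) < kappa:
                load[edge] = load.get(edge, 0) + 1
                progress[i] += 1           # traverse edge this step
            else:
                delay += 1                 # wait: edge at capacity
    return delay
```

For example, three UAVs contending for one edge with κ = 2 accrue one unit of delay: two are admitted immediately and the third waits a single timestep.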
In this problem, each agent independently learns a control policy for assigning costs
to incentivize UAV planning behavior given the current traffic profile. Training uses the
global system evaluation signal, which is the cumulative delay in the system over a single
learning epoch. The optimal team policy Π∗ is given by:
Π∗ = argminΠ
G (Π, ζ) (3.1)
where G (·, ·) is the total system delay, and ζ is the traffic.
3.3.3 Travel Time
Though conflict and delay counting provide two useful metrics for accounting for traffic,
they don’t necessarily measure potential learned behavioral problems in the system. Both
of these promote taking diverse paths in the system, which is desirable for managing
traffic, but they do not promote taking short paths in the system. This can be managed
in some respects because the A∗ paths that the UAVs select are guaranteed to include
the destination node. However, if replanning is allowed sufficiently often, there is a
possibility that a UAV could end up in an infinite cycle.
The total travel time metric allows us to measure the total time an average UAV
exists in the system. There are three modes a UAV can occupy in the system:
Waiting to launch (W): A UAV is generated at a node, but must wait for the first
link in its path to become open.
In transit (T ): A UAV is currently traveling on a link.
Arrived (A): A UAV has arrived at its destination, and exits the system.
Given the function t(x), which returns the time since introduction in the system until
the time of exit from the system (or the time of simulation end), our average total travel
time is given by:
t(W) + t(T ) + t(A)
|W|+ |T |+ |A|(3.2)
We consider UAVs waiting to launch onto a link as having entered the system, though
they do not contribute to the link capacity. This is why t(W) is included in the count
of UAV time. This metric effectively represents the average wait time for any UAV
generated into the system, including the grounded time of those that wait initially.
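Eq. (3.2) can be sketched directly: every UAV that has entered the system contributes its elapsed time, whether waiting, in transit, or arrived. The record fields below are illustrative names.

```python
# Sketch of the average total travel time of Eq. (3.2). Each UAV record
# carries its entry time; an arrived UAV also carries its exit time, while
# waiting and in-transit UAVs are clocked up to the current time t_now.

def average_travel_time(uavs, t_now):
    """uavs: list of dicts with 'enter' (entry time) and optional 'exit'."""
    if not uavs:
        return 0.0
    total = sum(u.get("exit", t_now) - u["enter"] for u in uavs)
    return total / len(uavs)
```

Because waiting UAVs are counted from the moment they are generated, the metric includes the grounded time of UAVs that have not yet launched, as described above.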
3.4 Physical Space Assumptions
There are many ways to model an airspace, and we explore the effects of learning while
modeling it with varying assumptions. In this section, we go over some key physical
assumptions that we make when modeling, as well as identify assumptions that we vary
between modeling methods.
Collision Modeling: In this work we do not model collisions between UAVs in a
physical space, even for our pixel-based map navigation. There is a large body of research
on how to avoid collisions in a one-on-one setting, and we therefore assume that a physical
implementation of this algorithm would have onboard collision avoidance built into each
UAV's path-planning algorithm. Furthermore, we assume that this would have little
impact on the total path length taken by the UAV, and we therefore neglect to model
this interaction in our pixel-level simulations.
Safe Operating Capacities: As a feature of the environment, the airspace will have a
certain number of UAVs that it can reliably allow through a space at once. This feature,
which we term capacity, would rely on many real-world elements; weather, flight speed,
sophistication of UAV navigation algorithms, and obstacle topology would all contribute
to this safe operating parameter. In this work, because we are largely working on random
maps, we use a flat value for safe operating capacity on each link. However, if this were
applied to a real-world domain, this value would need to be determined either analytically
or empirically.
Launch Waiting: There are two approaches we take to introducing UAVs into the
system. One mode is having an immediate launch, where links have no ability to prevent
UAVs from being generated into the system, disregarding the current traffic status of a
link. In this way, UAVs are generated directly on a link, and immediately contribute
to the traffic in the system. The second mode is where UAVs wait to launch until their
desired first link is open. UAVs will exist in a state where they do not actively contribute
to the traffic in that moment, but they will be the next in line to enter the system. This
prevents links from exceeding their designated capacity. This effectively enforces safety
requirements in the system.
3.5 Simulation
The simulation loop is given in Algorithm 4. First, each fix determines whether to
generate a UAV (Line 3). This decision differs based on whether the fix uses deterministic
or probabilistic traffic generation, as described in Section 3.2. It then generates a UAV
into the system (Line 4). This UAV has
a destination set either randomly or from a static list, as also described in Section 3.2.
Each agent will then observe their own state (Line 8), and from this state, they will
compute their action output based on this state (Line 9). At this point, they modify the
graph available to all agents to reflect the new costs across the high-level graph.
Every UAV in the system will then take this newly-changed graph into account, and
plan its high-level path across the airspace (Line 13). Each UAV then plans a pixel-level
path across the space that has the same sector ordering as given in their high-level path
(Line 14). The UAVs then execute their paths (Line 15).
After all UAVs have moved in the system, conflicts are detected by identifying any
UAVs that have come within a distance dconflict of one another (Line 17). This is
then counted up for the episode for later use in the fitness evaluation. After conflicts
have been identified, any UAV that has made it to its destination is then either removed
from the system or rerouted (Line 19), according to the methods explained in Section
3.2. Once this is complete, the number of conflicts is either summed across all time
steps or squared and summed (Line 22).
3.6 Control Methodology
In this section, we discuss our method of hierarchical path planning, with two different
map levels for A∗ planning. We then analyze how effective this control methodology can
be depending on the topology of a set of graphs.
3.6.1 Hierarchical Path Planning
In this work, UAVs use two levels of path planners; a sector-level planner and a pixel-
level planner. The role of the sector-level planner was to find the lowest-cost sequence of
sectors to visit according to the costs assigned by the sector agents (this cost-assignment
process is explained in detail in Section 4.1.1). The A∗ search was performed over the
spatial connection graph, Gh, generated by identifying the neighboring Voronoi sectors
shown in Fig. 3.1b. This graph did not consider obstacles across the map, and it was
assumed that a clear path existed from points in one sector to the edge of another.
The pixel-level planner represented the connectivity of the unit cell graph with the
obstacle map, Gobs. An 8-connected graph was constructed with unit traversal costs in
Algorithm 4 Simulate(agents)
1: while t < T do
2:   for each f ∈ fixes do
3:     if ShouldGenerateUAV(t) then
4:       generate a UAV
5:     end if
6:   end for
7:   for each m ∈ agents do
8:     m.s ← countUAVs()
9:     m.a ← getActions(m.s)
10:  end for
11:  Turn agent actions into edge costs and update graph Gh
12:  for each n ∈ UAVs do
13:    n.path_high = sectorPath(Gh, m.a)
14:    n.path_low = traversalPath(n.path_high, Gobs)
15:    n.x = executePath(n.path_low)
16:  end for
17:  conflicts += UAVs within a distance dconflict of others
18:  if any UAVs have reached goal then
19:    Remove respective UAVs from airspace
20:  end if
21: end while
22: F = f(conflicts)
23: return F
Algorithm 5 ShouldGenerateUAV(t)
if probabilistic and random < pgen then
  return true
else if deterministic and t % ρ = 0 then
  return true
else
  return false
end if
any direction. The connectivity of the graph was further restricted by the high-level path.
The pixel-level path was restricted to only those sectors present in the high-level plan.
Furthermore, sectors could only be traversed in a particular order and the connectivity
of the graph reflected this fact. For example, if a high-level plan specified a path visiting
the sequence of sectors {1, 5}, there would be connections between cells along the border
of sectors 1 and 5, but only in the direction of sector 5. Backward traversal into sectors
in the high-level map was not permitted.
Path failure occurred when it was impossible to satisfy the high-level plan using a
low-level path. This could occur when the path to the specified sector was blocked by
an object. In the case that no path was available that satisfied the high-level planner,
the UAV did not move in the airspace. Multiple occasions for replanning were available,
because high-level edge costs changed each time a UAV changed sector membership, so
a UAV had the chance to generate a different high-level plan if a particular one was
infeasible.
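The sector-ordering restriction on the low-level planner can be sketched as follows. For brevity this uses plain Dijkstra on a 4-connected grid rather than 8-connected A∗, and the tiny sector map and function names are illustrative; the key point is that movement is only allowed within planned sectors and forward along the planned sector sequence.

```python
# Sketch of the low-level planner restricted by a high-level sector plan:
# a cell may only be entered if its sector appears in the plan, and never
# by moving backward in the plan's sector order.
import heapq

def low_level_path(sector_of, plan, start, goal):
    """sector_of[r][c]: sector id per grid cell; plan: ordered sector ids."""
    order = {s: i for i, s in enumerate(plan)}
    if sector_of[start[0]][start[1]] not in order:
        return None                        # start outside the high-level plan
    rows, cols = len(sector_of), len(sector_of[0])
    dist, prev = {start: 0}, {}
    heap = [(0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break
        if d > dist[(r, c)]:
            continue
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            s_from, s_to = sector_of[r][c], sector_of[nr][nc]
            if s_to not in order:
                continue                   # sector not in high-level plan
            if order[s_to] < order[s_from]:
                continue                   # no backward sector traversal
            nd = d + 1
            if nd < dist.get((nr, nc), float("inf")):
                dist[(nr, nc)] = nd
                prev[(nr, nc)] = (r, c)
                heapq.heappush(heap, (nd, (nr, nc)))
    if goal not in dist:
        return None                        # path failure: must replan high-level
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]
```

Returning `None` corresponds to the path-failure case above, where no low-level path satisfies the high-level plan and the UAV must wait for a chance to replan.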
3.6.2 Impactfulness
In order to influence traffic, a node must be able to pull traffic toward itself or push it
away. This depends on three things: the graph topology, the traffic profile, and the costs
on the other edges. Here we present an impactfulness metric, which identifies the amount
that an agent can influence paths in the space. By identifying how much agents within
the system can impact generated paths, we can verify that we have an effective control
methodology for this system. If none of the paths can be impacted, our multiagent
system will not be able to take meaningful actions to influence traffic.
We experiment with two measures of impactfulness; blocking impactfulness, and at-
tracting impactfulness. Blocking impactfulness refers to the degree to which an edge can
impact traffic if it is ‘blocked’, or has a disproportionately high cost of travel. Attracting
impactfulness refers to the degree to which an edge can pull traffic toward itself if it
offers essentially free cost of travel.
Assume that each link on the graph has a unit cost. There is an A∗ optimal path
calculated from each node to each other node in the graph. We call this set of optimal
paths assuming unit costs U∗. For each edge i in the graph, we then raise the cost to
an arbitrarily high amount. We then calculate the optimal paths for each edge in the
graph, and call this H∗_i. For each edge i we also lower the cost to 0, then calculate the
optimal paths for each edge in the graph, and call this L∗_i.
There is a ‘blocking impactfulness’ metric for each edge, Bi, which is based on the
difference between the optimal paths with unit costs and the optimal paths with a high
replacement cost, divided by the number of possible paths.
B_i = |U∗ ∩ H∗_i| / |U∗|    (3.3)
This replacement of a link’s cost with a high value is given on the left figure in Figures
3.3-3.5 and termed ‘high replacement’. This represents the number of paths that would
be impacted if a link charged an exorbitant amount to travel through it.
There is also a ‘attracting impactfulness’ metric Ai for each edge, which is based on
the difference between unit cost optimal paths and the lower replacement edge cost:
A_i = |U∗ ∩ L∗_i| / |U∗|    (3.4)
This replacement of a link’s cost with a low value is given on the right figure in Figures
3.3-3.5, and termed ‘low replacement’. This represents the number of paths that would
be impacted if a link allowed free travel through it.
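The computation behind Eqs. (3.3) and (3.4) can be sketched as follows. Optimal paths are found for every ordered node pair under unit costs (U∗), then again with edge i made very expensive (H∗_i) or free (L∗_i), and the metric counts how many pair-paths coincide. Dijkstra with deterministic tie-breaking is an illustrative implementation choice, as are the cost values used.

```python
# Sketch of the impactfulness metrics: compare all-pairs optimal paths
# under unit costs with those under a high (blocking) or zero (attracting)
# replacement cost on one edge, per |U* ∩ X*_i| / |U*|.
import heapq

def shortest_path(adj, costs, src, dst):
    """Deterministic Dijkstra; adj: {node: [nbrs]}, costs: {(u, v): cost}."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v in sorted(adj[u]):           # sorted => deterministic ties
            nd = d + costs[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path, n = [dst], dst
    while n != src:
        n = prev[n]
        path.append(n)
    return tuple(path[::-1])

def all_pairs(adj, costs):
    """Optimal path for every ordered node pair (the set U*, H*_i, or L*_i)."""
    return {(s, t): shortest_path(adj, costs, s, t)
            for s in adj for t in adj if s != t}

def impactfulness(adj, unit_costs, edge, replacement):
    """Fraction of pair-paths unchanged when `edge` takes `replacement` cost."""
    u_star = all_pairs(adj, unit_costs)
    mod = dict(unit_costs)
    (a, b) = edge
    mod[(a, b)] = mod[(b, a)] = replacement  # replace both directions
    x_star = all_pairs(adj, mod)
    same = sum(1 for k in u_star if u_star[k] == x_star[k])
    return same / len(u_star)
```

On a small graph with a shortcut edge, blocking that edge with a high cost changes only the pair-paths that used it, while freeing an edge already on the shortest paths changes nothing.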
In some cases, the agents are not able to influence traffic. When a single agent
primarily influences the traffic, we expect local rewards or global rewards to perform
well. When no agents have influence over the traffic, or they have little influence, we
don’t expect learning to produce any improvement.
We analyzed three main graphs to see the effect that a learning system could have
on the graph. Figure 3.3 shows a hand-designed graph topology with 11 nodes, arranged
in a way that allows for varying numbers of hops between nodes. If traffic is
generated only traveling back and forth between node 1 and node 2, there is a clear
shortest-path option between them. The cost can then be adjusted by agents in the
system in order to incentivize or disincentivize longer route alternatives, moving through
the ‘suburb’ nodes in the grid in the bottom right. We see that when a blocking im-
pactfulness is calculated for this graph, the ‘suburb’ links have the maximum number of
impacted paths, with the maximum being 10 and the minimum being 2 impacted paths.
We see conversely that when the 'attracting' impactfulness is calculated, the long direct-connection link impacts many more paths, with a maximum of 10 paths impacted and a minimum of 4. This indicates that the long link can have a greater effect on the system by attracting UAVs toward it rather than by blocking UAVs from passing through it.
Figure 3.3: Impactfulness metric for test graph topology with 11 nodes, 38 edges. This demonstrates a symmetrical graph with a direct route between two nodes, with many optional alternative routes through the grid graph at the bottom right.
Figure 3.4 shows the impactfulness calculated for a graph topology with 25 nodes.
This graph was designed to show paths where there was an area with no alternative
routes, which is the middle multi-hop connection between the two grid-style graphs. In
both graphs, the connecting links between nodes {2,6,13,20,24} have an impactfulness of
zero. This means that a learning agent would have no power in the system, and could set
its values arbitrarily high or arbitrarily low and still not impact the traffic flow through
the system. This is because if a UAV must travel from one side of the graph to the other
side of the graph, it has no alternative but to follow the connecting path between them.
Additionally, if a UAV must travel within one of the subgraphs, there is no permissible
incentive that would make it cost-optimal to travel through the connecting links.
We also analyzed the impactfulness of a map that we generated randomly using Algorithm 3 from Section 3.1.2. Figure 3.5 shows a map generated with this algorithm. For either a high or a low cost replacement, we see that most links can
Figure 3.4: Impactfulness metric for test graph topology with 25 nodes, 96 edges. This represents two symmetrical subgraphs connected by links connecting nodes {2,6,13,20,24}. The connecting links can have no impact on the paths planned throughout the graph.
have some impact on the paths in the system. This validates our use of a random map-generation technique, as it indicates that maps generated with this method will likely respond to control by agents setting costs across the map. The impactfulness metric thus indicates the relative ability of different agents to influence traffic based on their position in the graph.
3.7 UTM Agent Control
In this section we discussed how we can model UTM traffic as a system of robots all using
A∗ to plan their routes. We also discussed how a UTM agent can plan a hierarchical
route through the space, using A∗ to plan its route at the sector level rather than only
across the pixel level. This setup was in preparation for the introduction of agents to
control the system. In this work we use agents to manipulate the high-level cost map
across which UAVs plan.
This approach is unique in that it does not directly dictate paths for UAVs to take in the system. Instead, it changes the apparent optimality of paths. Agents learn
Figure 3.5: Impactfulness metric for test graph topology with 20 nodes. This graph was generated using the random graph generation algorithm defined by Algorithm 3. It shows a fairly even level of impactfulness across the map, although some nodes are not able to change more than one path. This indicates that most agents have at least some power to change paths generated in the system.
costing strategies based on current traffic conditions, and get performance metrics as
feedback, as discussed in Section 3.3. This enables them to adapt their policies to mini-
mize traffic. Because we change only the appearance of optimal paths, our computational
complexity scales with the size of the high-level graph, not with the amount of traffic
that goes through it.
In Section 3.6.2 we test the possibility of altering optimal paths in the system by
adjusting the cost map. This is the same thing that our agents do; based on a current
traffic profile, they adjust the apparent optimality of paths in the system. Our investigation into impactfulness verified that this control approach would be effective for several graphs, as the appearance of optimality can be controlled for many edges of the graph.
In the next chapter, we will go over the exact method of our UTM agent construction,
including two different approaches to a UTM agent state and action space.
Chapter 4: Learning UAV Traffic Management Agents
Given the high-level graph structure that we have outlined in the previous section, we
now must control UAV movement between the sectors to reduce traffic. In this section we
introduce a conceptual framework to connect learning UAV traffic management agents
into the planning process of UAVs. We then define an agent that controls the apparent
cost of UAV movement entering a sector in Section 4.1.1, and we define an agent based
on edges of the graph in Section 4.1.2. Finally, we present simulated results and a brief
discussion in Sections 4.2.1 and 4.2.2, respectively.
4.1 Agent Definition
We define two types of agents in our system. One groups control around the nodes of the graph, and the other treats individual edges as control elements. This section explains each agent definition, and how each controls and learns in the system. Then, we explain how neuro-evolution is implemented for both agent definitions.
The sector agent formulation groups edge traversal costing strategies together based
on their connected nodes. This may simplify agent coordination, but it also imposes
potentially artificial structure onto the problem. We explore the impact of using link
agents that control costs only along a single directed edge. This simplifies the agent
structure and the individual agent’s control strategy. However, the multiagent coordi-
nation aspect becomes much more difficult as there are many more agents in the system
with a reduced perception of the state space.
4.1.1 Sector-Based Learning Agents
Sector agents are defined by nodes on a graph. They control the cost of traversal of all edges incident to the node and pointed toward it. Sector agents assign the cost of travel across these edges, discretized by direction, such that each sector assigns a cost of travel for each cardinal direction.
Given the potentially large number of UAVs in each sector, it is impractical to specif-
ically account for the individual states and policies of all UAVs. We therefore define a
sector state as the count of UAVs traveling in a particular direction. This allows our
approach to scale independently of the amount of traffic that passes through it. In this
work, we formulate the control policy of each sector agent as a single-layer neural network whose weights are learned using an evolutionary algorithm; Section 4.1.3 describes the neuro-evolutionary algorithm used to develop these control policies.
Sector agents assign costs to the sector-level graph that UAVs use to plan their sector traversal order. In the abstraction of the domain to the graph representation, sector agents are defined as the nodes of the graph and have control over the cost of traversal of all outgoing edges. The state of each sector agent is a discretized representation of all potentially incoming UAVs: agents consider UAVs that are currently managed by neighboring sectors to be in their state. To maintain generality across all agents, regardless of the number of neighbors, incoming UAVs are discretized and summed according to the bearing of each neighbor with respect to the four cardinal directions.
The control policy of each sector agent is modeled by a single layer neural network.
For the neural network controller of agent i, the four-dimensional input state is defined
as,
s^i_sector = [ζ^i_N, ζ^i_E, ζ^i_S, ζ^i_W], (4.1)
where ζ represents the traffic on a particular edge. Note that sectors only count traffic
traveling into their node, so this directional aspect is important. The output action is
also a four-dimensional vector specifying the cost of travel for UAVs on outgoing edges,
again discretized into each of the cardinal directions,
a^i_sector = [c^i_N, c^i_E, c^i_S, c^i_W], (4.2)
Figure 4.1: The structure of the neural network for sector agents. There are four inputs, one for the traffic count in each direction, and four outputs, one to denote the cost the sector will assign to each direction.
where c represents the cost attributed to a particular edge, specifically those traveling into the node representing the sector. The neural network therefore has the structure shown in Figure 4.1.
This cost is assigned simultaneously for all agents, and UAVs are permitted to change
their high-level plan at any point during the simulation.
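As a concrete illustration, the sector policy of Equations (4.1) and (4.2) can be sketched as a single-layer network. The sigmoid squashing here is our assumption for keeping costs bounded in [0, 1] (the thesis fixes that range but does not specify the activation), and the random weights are stand-ins for evolved ones:

```python
import math
import random

def sector_policy(weights, bias, traffic):
    # traffic = [zeta_N, zeta_E, zeta_S, zeta_W]: directional UAV counts (Eq. 4.1)
    # returns  [c_N, c_E, c_S, c_W]: directional travel costs in (0, 1) (Eq. 4.2)
    costs = []
    for j in range(4):
        activation = bias[j] + sum(weights[i][j] * traffic[i] for i in range(4))
        costs.append(1.0 / (1.0 + math.exp(-activation)))  # sigmoid bounds the cost
    return costs

rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(4)]
b = [rng.gauss(0, 1) for _ in range(4)]
costs = sector_policy(w, b, [3, 0, 1, 0])  # 3 UAVs inbound from the north, 1 from the south
```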
4.1.2 Link-Based Learning Agents
Link agents control the cost of travel along each directed edge of the graph. Since there
is an agent per link, there is no need to discretize edges into cardinal directions for this
agent formulation. Thus, the state space for the link agent is much simpler than for the
sector agent and simply consists of the current traffic traveling along the designated edge
in the link agent’s direction of control. This is ζi for link i, such that,
s^i_link = ζ_i. (4.3)
Furthermore, the agent’s action is to assign a single cost to its designated link,
a^i_link = c^i_link, (4.4)
where c^i_link is the cost assigned to a particular link i. This is broadcast to the UAVs as the cost of traversal on each edge of the abstract graph.
4.1.3 Neuro-Evolution
Neuro-evolution is suitable for the UTM problem considered in this work, since it allows a simplification of the system state to the local information available to each agent: a count of UAVs traveling into the sector.
The agent takes this state information, which is given as a count of UAVs traveling
toward the sector in the case of sector agents (explained further in Section 4.1.1) or on
an individual sector transition (as explained in Section 4.1.2). A forward pass through
the agent’s neural network maps the given state to a cost that will be applied to each of
the directed edges in the graph. The edge costs are broadcast to UAVs in the airspace
that then recompute their flight trajectories to reduce their individual cost of travel.
The weights of the neural network controllers of each sector agent are trained using
a cooperative coevolutionary algorithm (CCEA). This allows multiple populations to
evolve in parallel such that each population develops a policy for a single sector agent
in the UTM system. In the following experiments, there are M coevolving populations
of NN controllers, one for each sector agent and each with a population size of k. For
mutation of the NNs, 50% of the network weights are chosen at random and values drawn
from a normal distribution (zero mean and unit variance) are added to the weights.
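The mutation step described above can be sketched as follows (illustrative Python; population bookkeeping, team sampling, and selection are omitted):

```python
import random

def mutate(weights, rng, frac=0.5):
    # pick a random 50% of the network weights and add zero-mean,
    # unit-variance Gaussian noise to each chosen weight
    child = list(weights)
    for i in rng.sample(range(len(child)), k=int(len(child) * frac)):
        child[i] += rng.gauss(0.0, 1.0)
    return child

rng = random.Random(1)
parent = [0.0] * 20
child = mutate(parent, rng)
changed = sum(p != c for p, c in zip(parent, child))  # exactly half the weights move
```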
4.2 Experimental Results for Learning Hierarchical UTM
We tested our UTM learning algorithm on a simulated urban airspace derived from a
256×256 unit cell map of San Francisco, which is shown in Fig. 3.1a. We discretized the simulation into 1 s timesteps, with each learning epoch defined as an evaluation of each random team in the population over 100 simulated timesteps.
We divided the tested airspace into 15 Voronoi partitions with each sector controlled
Figure 4.2: Initial planned paths for 27 UAVs in the airspace travelling from start (solid green triangle) to goal (solid green circle) locations. It can be seen that high levels of congestion may be expected in some thoroughfare regions of the airspace, such as the bottom right area, where there is a group of close fixes.
by a single agent in the UTM system. Voronoi partitioning was hand-selected to place
sector agents in locations where congestion was expected due to the obstacle layout of
the region. The airspace sectors are shown in Fig. 3.1a by the dashed lines.
At the start of learning, each sector agent initializes with a population of k = 10 ran-
dom NN control policies. As described in Equation (4.1), UAV headings are discretized
to the nearest cardinal direction such that at each timestep and in each sector, the total
north-, south-, east- and west-bound UAV traffic is tallied to form the input state vector
of the NN policy. The output action of each agent, that is, the costs to travel in each
direction as shown in Equation (4.2), were restricted to lie within c = [0, 1].
At each timestep and at each start location, or fix, a new UAV is generated with probability p_gen = 5%; each new UAV receives a randomly selected goal fix and executes an A∗ plan over
the sector-level graph and over the pixel-level graph. Figure 4.2 shows an example of
planned paths over the pixel-level graph. This value was selected to produce about 100
UAVs in the airspace in a single run, demonstrating the effectiveness of the algorithm
in a high-traffic situation. Once a UAV reaches its goal location, it disappears from
the airspace. Note that in this implementation the fixes that generate UAVs across the
airspace are not necessarily colocated with sector centers.
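The stochastic traffic generation described above can be sketched as follows (illustrative Python; the fix locations and random stream are stand-ins):

```python
import random

def spawn_uavs(fixes, p_gen, rng):
    # each timestep, every fix generates a UAV with probability p_gen and
    # assigns it a randomly selected goal fix different from its origin
    new_uavs = []
    for fix in fixes:
        if rng.random() < p_gen:
            goal = rng.choice([f for f in fixes if f != fix])
            new_uavs.append({"start": fix, "goal": goal})
    return new_uavs

rng = random.Random(7)
fixes = list(range(10))
traffic = [u for _ in range(100) for u in spawn_uavs(fixes, 0.05, rng)]
# expected volume: 10 fixes x 100 steps x 0.05 = roughly 50 UAVs over the run
```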
Figure 4.3: Change in congestion observed over one run of the linear conflict evaluation trials. The overlaid heat map shows the number of conflicts experienced in an area for (a) the best evaluation trial for agents with randomly initialized sector travel costs and (b) the best evaluation trial for agents with evolved sector travel costs after 100 epochs. The overall number of conflicts has been reduced, with some high congestion regions removed completely.
In the following experiments, each UAV traveled at a constant speed of 1 cell per
timestep and had a conflict radius dconflict = 2 unit cells.
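One plausible reading of this conflict definition is a pairwise proximity check at each timestep (the thesis does not spell out its exact bookkeeping); a minimal sketch:

```python
def count_conflicts(positions, d_conflict=2.0):
    # counts unordered pairs of UAVs within d_conflict unit cells of each other
    conflicts = 0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            if dx * dx + dy * dy <= d_conflict * d_conflict:
                conflicts += 1
    return conflicts

# two UAVs one cell apart are in conflict; the third is well clear
n = count_conflicts([(0, 0), (1, 0), (10, 10)])  # -> 1
```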
4.2.1 Results
We performed three sets of simulation experiments. The first tested the proposed UTM learning algorithm trained using the total number of conflicts in the airspace (linear conflict evaluation). The second set of trials tested the same algorithm, but with policies evaluated according to the sum of squared collocated conflicts in the airspace (quadratic conflict evaluation). As a baseline, we performed a set of
simulations using uniform costs for travel between the sectors. This meant that the
UAVs took the path that traveled through the smallest number of sectors. Each set of
experiments contained 20 runs of 100 learning epochs. The algorithm was implemented
Figure 4.4: Change in congestion observed over one run of the quadratic conflict evaluation trials. The overlaid heat map shows the number of conflicts experienced in an area for (a) the best evaluation trial for agents with randomly initialized sector travel costs and (b) the best evaluation trial for agents with evolved sector travel costs after 100 epochs. As with the linear trials, the overall number of conflicts has been reduced.
in C++ with the high-level and low-level planners using the A∗ search algorithm provided
by the Boost Graph Library [75]. Planners re-planned whenever the cost map generated
by the agents changed.
Figure 4.3a shows the congestion observed in the first learning epoch of one linear
trial. Sector agents initially assign travel costs randomly, so areas of high congestion occur. Figure 4.3b shows the performance at the end of evolution.
Comparing Fig. 4.3a and Fig. 4.3b we can see a reduction in the number of conflicts.
In a realistic UAV routing scenario, the number of close calls doesn’t necessarily
matter as much as the severity of the congestion in the system. For example, if two
UAVs get within a dangerous distance of one another, a backup safety mechanism may
be able to divert one and avoid a collision. However, if many UAVs get within dangerous distance of one another, the backup safeties must then deal with the cascading effects of the collision-avoidance maneuvers. This could lead to a failure of the lower-level collision-avoidance systems if the conflict density is too high. To address this, we also tested
Figure 4.5: Comparison of conflict occurrence over epochs using fixed costs versus evolved costs with two different fitness functions. The linear performance metric reduces conflicts substantially better than the quadratic method, although both outperform the fixed-cost method.
a quadratic fitness evaluation function that summed the squared number of conflicts for
each cell, which correlates to a pixel value on the map. This attempts to decrease the
likelihood that there will be large numbers of conflicts that the UAV’s low-level planner
cannot handle. Comparing Fig. 4.4a and Fig. 4.4b, we see a larger reduction in the severity of congestion across the map; that is, there are fewer conflicts in each hotspot.
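The difference between the two evaluation functions can be made concrete: both see the same total number of conflicts, but the quadratic evaluation penalizes collocated conflicts much more heavily (the cell keys and counts below are illustrative):

```python
def linear_fitness(conflicts_per_cell):
    # linear conflict evaluation: total conflicts across all map cells
    return sum(conflicts_per_cell.values())

def quadratic_fitness(conflicts_per_cell):
    # quadratic conflict evaluation: sum of squared collocated conflicts,
    # so a dense hotspot costs more than the same conflicts spread out
    return sum(c * c for c in conflicts_per_cell.values())

spread = {(0, 0): 2, (5, 5): 2, (9, 9): 2}   # six conflicts spread across cells
hotspot = {(0, 0): 6}                        # the same six conflicts in one cell
lin_spread, lin_hot = linear_fitness(spread), linear_fitness(hotspot)          # 6, 6
quad_spread, quad_hot = quadratic_fitness(spread), quadratic_fitness(hotspot)  # 12, 36
```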
We also quantitatively compare the reduction of conflicts using evolution with the
linear and quadratic fitness evaluations to the conflicts arising in the fixed-cost map.
We see in Fig. 4.5 that the number of conflicts using the fixed-cost map is higher than
the conflicts using the learned costs after evolution has been performed for 100 epochs.
The number of conflicts in the baseline comparison algorithm remains high, with an
overall average of 1023 conflicts, while the neuro-evolutionary system is able to reduce
the number of conflicts seen in the system. Note that the evolution process does not monotonically decrease the occurrence of conflicts, because there is randomness in the population generation as well as in the goal selection of the UAVs. However, evolution still significantly decreases the occurrence of conflicts in the system, particularly compared with the fixed-cost method. Relative to that baseline, we see an average of 856 conflicts at the end of learning with the linear fitness evaluation, which represents a 16.4% reduction in total conflicts after 100 epochs. Learning with the quadratic fitness evaluation does not outperform the linear fitness evaluation in terms of total number of conflicts; however, this is to be expected, since total conflicts is not what its fitness function uses to determine survival. Nonetheless, we see an average performance of 950 conflicts after 100 epochs, giving us a 7.1% improvement over the fixed-cost method. Our heat map results additionally indicate that these conflicts are less severe than those present when learning does not consider the co-location of conflict events (Figure 4.4).
Because of the probabilistic generation process, the number of UAVs in the system varied during the simulation, but it was generally close to 100. If the generation rate is lowered,
it is possible to achieve zero conflicts in the system without learning intervention, with
the tradeoff of severely reduced throughput. If the generation rate is increased, it is also
possible to saturate the system with UAVs such that a learning policy cannot effectively
reduce the number of conflicts. However, these extremes would not demonstrate realistic
scenarios, and would not show the improvements that the algorithm offers. We explore
the scalability of this approach in the next section.
4.2.2 Discussion
The results presented in this work show that a distributed UTM system can learn appro-
priate costs to apply to UAVs traveling in the airspace to incentivize implicit cooperation
between the UAV planners. Using our method, we achieved a 16.4% reduction in conflicts compared to the baseline method with uniform sector costs. We also showed a
reduction in severity of conflict occurrences using a quadratic fitness function. The
agents were able to take in directional sector congestion information and appropriately
weight the cost of travel to promote safety in the system. It is also worth noting here
that agents do not require a model of the available airspace or knowledge of the obstacles in the sector; the number of conflicts experienced in the sector was sufficient information by which to evaluate the performance of the current policy.
This structure gives a method by which neural networks may be trained to manage conflict occurrences throughout an obstacle-filled area. Because we introduced random traffic throughout the run, the policies generalize to a variety of different traffic scenarios. The computation required by an evolved policy is simply the time to calculate the output of a neural network; the UAVs then independently calculate their trajectories through the space with this information.
The ability to manipulate high-level planners allows us to reduce occurrences of
potentially dangerous congestion in the system. UAVs in the real world can handle en-
counters with other UAVs using low-level collision avoidance procedures, but by reducing
the congestion in the airspace at a high level we can permit safer travel by avoiding many
of these conflicts before they occur.
This work forms an important step toward the formulation and exploration of the
UAV Traffic Management (UTM) problem. Handling the challenges associated with
free flight in an obstacle-filled environment will become critical as UAVs become more
prolific in the airspace. Shortest-path algorithms for UAV routing may lead to unsafe
UAV density. Intelligent high-level coordination must occur in conjunction with low-level
collision-avoidance in order to ensure safety in the airspace.
Chapter 5: Learned Abstract Policies Applied to Real Robots
Our approach has the benefit that it is platform-agnostic, meaning it can be used by any robot navigating the space that implements a cost-based planner.
Our routing method is specifically designed to route robots through a physical space. We
wish to show that our evolved policies will provide incentives for robots to take diversified
paths, but still get to their destinations. In this section, we demonstrate an end-to-end
application of our multiagent system, demonstrating its real-world feasibility.
We set up a traffic management system using a ROS framework to interface with Pioneer-3DX ground robots. We trained neural network agents to reduce traffic in our high-level simulator, and saved the converged neural network weights. We then wrote a ROS node that takes in the number of robots in a particular area of the space and feeds it to the respective agent, which adjusts a broadcast cost map. We tested this in a high-fidelity simulator, as discussed in Section 5.3. We then applied this ROS framework on robots in a physical maze, which we present in Section 5.4.
5.1 Neural Network Controller Evolution
Our research into traffic routing through a simulated airspace suggests that we can reduce
traffic with many robots in the system. However, we can also develop policies for smaller
numbers of robots. In order to train the neural networks, we generated four robots
in the system, and had them randomly select destinations. When they reached their
destination, they rerouted to another node in the simulation. This caused continuous
traffic for the duration of the simulation. For training the neural networks, we used a
delay-based performance metric.
Robots provide the underlying dynamics of the system on which our controller agents
act. Because our approach focuses on platform-agnostic controller agents, and we are
routing on a 2-dimensional plane, we substitute Pioneers for UAVs in our Gazebo experiments. UAVs may have a different speed or reaction time, but the substitution of Pioneers for UAVs does not change the core density management problem. Toward this,
Figure 5.1: The high-level map used for training the neural network policies for the Pioneer experiments. Shown for reference are barriers (square brackets) that would create this high-level topology. For training, all traffic moves between N1 and N4, requiring controller agents to make non-distance-optimal routes look attractive to some traffic.
we set up a physical room with obstacles and four points of interest (Figure 5.4). This graph structure has 5 physical edges: 4 along the edges of the room, and a fifth forming a connection between the two obstacles in the room. This corresponds to 10 directed links. One Pioneer moved between these points of interest in order to build a map of the room, which was then copied to the other robots.
The map was used to create 10 link agents in simulation, with a graph topology identical to the physical system. Neural networks were then evolved for 200 epochs on this small system. Figure 5.2 shows the reduction in delay over 100 epochs of training for 10 statistical runs. One difference between this simulation and the scalability simulations was that traffic was kept in the system continuously rather than removed and regenerated randomly: robots were assigned new destination points when they arrived at their targets. Additionally, the capacity of each link in the system was set to 1, as the corridors in the physical space could only safely accommodate one Pioneer at a time.
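The delay metric under capacity-1 links can be sketched as follows (a hypothetical helper; the thesis's exact queueing bookkeeping may differ):

```python
def step_delays(requests, capacities):
    # requests: list of (robot, link) pairs for one timestep; robots whose
    # requested link is already at capacity wait, and each waiting robot
    # contributes one unit to the system delay for this timestep
    load, delay = {}, 0
    for robot, link in requests:
        if load.get(link, 0) < capacities.get(link, 1):
            load[link] = load.get(link, 0) + 1  # robot enters the link
        else:
            delay += 1                          # link full: robot waits
    return delay

# four robots request two capacity-1 corridors; one robot per corridor must wait
d = step_delays([("r1", "A"), ("r2", "A"), ("r3", "B"), ("r4", "B")],
                {"A": 1, "B": 1})  # -> 2
```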
Figure 5.2: Learning performance using a graph with 4 nodes and 10 edges. 10 runs are shown for statistical significance. Agents are able to completely eliminate delay in the system.
Figure 5.3: System architecture of the ROS experiments. The UTM agents publish a cost graph over which the robots plan their high-level paths. The low-level planning and navigation is handled by the standard ROS navigation packages.
5.2 ROS UTM Architecture
We use the Robot Operating System (ROS) architecture to add our UTM algorithm onto
a real robot platform [70]. ROS offers a communication protocol wherein computational entities called nodes communicate with each other by publishing strictly-typed messages to named topics. Nodes subscribe to the topics published by other nodes, passing information through each computational step.
This framework allows for a large degree of abstraction away from fundamental hardware programming. Because of this level of abstraction, as well as a high level of community support, the ROS ecosystem offers a wide variety of highly modular packages implementing standard algorithms. Additionally, ROS packages are available for many standard hardware elements: the basic controllers for the Pioneer-3DX ground robots used in our experiments, as well as for the sensors used, are available without additional programming.
5.3 Gazebo Experiments
We tailored this approach to the coordination of robots in a physical space. The target of our evolved control policies is to incentivize robots to take diversified paths but still
get to their destinations. In our scalability tests, we demonstrate that we can learn
to reduce delay in a large system with many robots moving around. However, we must also demonstrate that this can apply to a smaller physical system with imperfect information. Toward this end, we use Gazebo, a high-fidelity physics simulator.
Figure 5.3 describes the system architecture. The UTM agents, a set of neural networks, take in information about the positions of the robots in the system and define the costs on the high-level connection map. Robots then compute A∗ paths with their high-level planners across the UTM cost graph, and use the result to plan across the obstacle map. This is also shown in the RQT graph in Figure 5.8.
Our implementation creates artificial walls between the spaces to enforce travel through the optimal link path. We do this by first having the robot calculate its optimal high-level path; a node then takes this high-level path and modifies the walls perceived by the robot. This enables us to use pre-built waypoint commands in ROS without directly modifying them.
This domain is extremely constrained for the set of Pioneers. There are many points at which the Pioneers can get 'stuck' because they are too big to physically pass each other in the space. The simulation on which we train our agents is more permissive of traffic through the space, because there are two directional links connecting each point, as shown in Figure 5.1. However, we are still able to learn an incentive strategy that reduces the congestion in the space. A simulation closer to the true constraints of the test problem would likely produce better learned results.
5.4 Real Routing in a Maze
We used Pioneers running rocon [83] to demonstrate the physical end-to-end application of this work. Pioneers received an RTM graph object that allowed them to plan shortest paths across the space, and they communicated their positions on the 'airspace graph' back to the RTM system.
Map Design: We use a directional graph, which means the corridors need to accommodate two robots at once; we therefore designed the map with corridors at least 1 m wide. We constructed a map for the robots (Figure 5.6) that would result in
Figure 5.4: A simulated Gazebo environment with two obstacles. This map was created by driving a Pioneer around a real environment with two square obstacles, although multiple robots are introduced in simulation. This graph has four nodes, each at a corner of the graph, and the arrows denote the current target points of the Pioneers.
the same topology that was trained on.
Traffic Design: In order to demonstrate the impact of our algorithm, we set all traffic to move from the top-left node to the bottom-right node. This makes the direct route distance-optimal, but if all robots take this path there will be congestion. The high-level graph is shown in Figure 5.1.
For this section, we will manage ground robot traffic rather than UAV traffic. There-
fore here we will refer to our traffic management multiagent system as robot traffic
management (RTM), rather than UAV traffic management (UTM), which is used for
the other sections of this work.
Figure 5.5: A Pioneer-3DX robot.

Figure 5.6 shows the SLAM map that was generated by driving a robot through the
environment; this map was shared across all of the Pioneer robots. We used the amcl [41]
localization package to feed into the ROS navigation stack. This allowed us to localize
from any point in the room. We first moved the robots a short distance to allow them
to localize within the saved map. Nodes were located at the corners of the map. These
were given as destination points for the robots to route to. Robots would autonomously
sense when they had reached a waypoint, and would choose a new waypoint at random
from one of the other nodes.
Robots coordinated using a central computer as the base of communications. Each
robot was initialized using a script that began robot networking and connected to the
hub. To initialize the planning, a (unit) cost graph was published to all robots by the
RTM. An initial waypoint was then sent to the each robot, which defined which link it
belonged to. Membership on a link was defined by calculating which sector center was
closest to the originating point, and then which sector center was closest to the final
point. This created an online Voronoi partitioning of the space to define the graph. The
traffic on this link was communicated back to the RTM agent node in ROS. This allowed
the RTM agent node to adjust the cost map and re-broadcast it to all of the robots.
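The online Voronoi membership rule described above can be sketched as follows: a robot's current directed link is taken to be (nearest sector center to its origin, nearest sector center to its goal). The function names and coordinates are illustrative assumptions, not the actual ROS node interface.

```python
import math

def nearest_center(point, centers):
    """Index of the sector center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: math.dist(point, centers[i]))

def link_membership(origin, goal, centers):
    """Directed link (from_sector, to_sector) for a robot moving origin -> goal."""
    return (nearest_center(origin, centers), nearest_center(goal, centers))

# Four sector centers at the corners of a 10x10 room, as in the 4-node map.
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
link = link_membership(origin=(1, 2), goal=(9, 9), centers=centers)
```

Counting robots per returned link pair yields the per-link traffic that is reported back to the RTM agent node.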
The robots ran an A∗ planning algorithm across the 4-node 10-link high-level map.
Figure 5.6: The map generated from running SLAM across a room. This was shared among all the robots.

They used this algorithm to plan a trajectory through the space as a recommended ordering of sectors.
Artificial ‘walls’ were then constructed in their map to force the low-level path-planning in
ROS to find a path that adhered to the sector ordering recommended by the A∗ planning
algorithm. After walling off the sectors that were not in the given sector ordering, the
robots then ran Dijkstra’s algorithm to plan a path across the nearby space. Robots
moving along paths planned under this architecture are shown in Figure 5.7.
Our setup allowed the robots to observe cost maps with higher costs on occupied
links, which impacted their path planning decisions. However, hardware testing also
demonstrated the necessity to combine this approach with accurate low-level sense-and-
avoid systems.
5.5 Discussion
In this section we have demonstrated an end-to-end application of our algorithm on real
robots. This demonstration is important to this work because it shows the feasibility
of our approach on a real system. The neural network, trained prior to the application
on a simulated traffic profile, can be deployed by constructing a ROS node that takes
in traffic information and uses the trained network to assign weights.

Figure 5.7: Robots moving between waypoints in the space using the RTM framework. This map was generated by driving a robot around a physical obstacle maze. Robots then used the amcl package to localize within this maze. The robots then autonomously moved throughout the maze, planning their routes considering the high-level RTM graph.

Figure 5.8: RQT graph for three Pioneers in a Gazebo simulation. Pioneers communicate their position in the map with respect to which agent controls their motion (the /membership topic). This traffic information is then processed by the utm agent, which is a group of evolved neural networks. This information is then used to update the /utm graph topic, which is the high-level cost map set by agents in the system. This is then used to update the appearance of walls for each robot's topic according to its A∗-optimal plan.
We have also demonstrated one way of implementing a hierarchical control structure
in a real robot system. When a robot plans its path across the high-level map, we
can introduce artificial walls around the edges of sectors that do not fall into the high-
level path plan. This is a small modification to our navigation method, and allows us
to still take advantage of many of the navigation functions available through the ROS
community.
We also demonstrate that the communication requirements between robots are practical.
Robots must be aware of and communicate their location to UTM agents. This
is not an undue restriction, particularly because of the assumption that in an aircraft-
based system, all elements would be equipped with ADS-B technology [1]. This enables
accurate calculation of a traffic profile, which serves as an input to our neural networks.
Self-reporting of position based on localization in a map additionally means that there are
no external structures, such as Vicon, required for monitoring robot movement through
the space. This puts outdoor experiments within the realm of possibility.
We additionally show that we can manage traffic even on a small scale with
this approach. Our Gazebo results show traffic routing for 4 robots under an ideal sensing
environment. Gazebo allowed us to test the predicted motion of a ROS infrastructure
without using a physical system. Gazebo also gave us insight into the structure of our
simulated network; in many cases, robots would interact in the space while
trying to pass one another. Because these robots were considered to be on separate
‘links’, they were not combined into the same traffic count. In the Gazebo simulation,
this frequently meant that robots would get stuck in too-small corridors while attempting
to pass one another. One way to solve this is to modify the graph so that it has
only one directed link through a space, or to apply the same topology to a space that
is large enough to accommodate robots passing one another.
Our real robot experiments showed this control structure working for 3 robots mov-
ing through the space. In this case, we had sufficiently large passages to allow movement
of robots past one another, as would be allowed in the high-level graph topology. Addi-
tionally, we observed that UTM agents were able to take feedback from localized robots
moving in the physical space, and compute a recommended cost map. The only down-
side of these experiments was that many of them were truncated due to sensor error.
Sensors were placed high on the robot, and therefore did not sense the main body of
other robots.
Chapter 6: Fast UTM Learning
We develop an abstraction of the UTM problem that preserves the core learning elements
for agents controlling costs on a sector-level graph (Section 6.1). We then present exper-
imental results for extensive testing of the UTM learning system under different traffic
conditions, as well as different agent formulations, in Section 6.2. Finally, we discuss the
implications of these results and some trends that we observe in Section 6.3.
6.1 Abstraction of the UTM Coordination Problem
In the problem we consider here, we remove the pixel-level planning across the obstacle
map. We instead introduce an abstraction of these dynamics by having a ‘capacity’
parameter for each edge on the graph, as illustrated in Figure 6.1. This abstracts away
the dynamics of traversing across the map because we assume that conflicts are a function
of the density of the UAVs in a particular area. We preserve the core routing problem,
with a significant reduction in simulation time.
We represent the airspace using a graph, where UAVs generate at (and travel to)
nodes of the graph, and graph edges represent their physical path through the airspace.
We construct a planar graph with random connections between these nodes and edges.
Nodes have a physical xy-location in the world, and the time that it takes to travel across
the edge is the Euclidean distance between the connected nodes. A UAV occupies some
amount of space while it is traveling across an edge, and this contributes to the traffic
ζi on that edge.
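The capacity abstraction above can be sketched as a minimal queueing model: each edge admits at most `capacity` UAVs at once, extra arrivals wait in a queue, and each timestep spent waiting accrues one unit of delay. This is an illustrative stand-in under stated assumptions, not the dissertation's exact simulator.

```python
from collections import deque

class Edge:
    def __init__(self, capacity, travel_time):
        self.capacity = capacity
        self.travel_time = travel_time
        self.on_edge = []        # remaining travel times of admitted UAVs
        self.queue = deque()     # UAV ids waiting to enter the edge
        self.delay = 0           # total timesteps spent waiting so far

    def arrive(self, uav_id):
        self.queue.append(uav_id)

    def step(self):
        # advance UAVs already on the edge; finished ones leave
        self.on_edge = [t - 1 for t in self.on_edge if t - 1 > 0]
        # admit queued UAVs while capacity allows
        while self.queue and len(self.on_edge) < self.capacity:
            self.queue.popleft()
            self.on_edge.append(self.travel_time)
        # every UAV still queued accrues one timestep of delay
        self.delay += len(self.queue)

edge = Edge(capacity=2, travel_time=5)
for uav in range(3):
    edge.arrive(uav)
edge.step()   # two UAVs admitted, one waits and accrues delay
```

Because no pixel-level motion is simulated, one call to `step()` per edge per timestep replaces the entire low-level traversal computation.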
In this work we refer to agents controlling the cost of travel on edges surrounding a
node as sector agents. Agents that control the cost of travel across an individual directed
edge are link agents. Figure 6.1 shows the two multiagent formulations.
Figure 6.1: The sector agent (a) and link agent (b) problem formulations. Sector agents control groups of edges rather than a single edge; we divide these groups into four cardinal directions and determine traffic and costs based on this discretization. Link agents each assign the cost Ci = π(si) of a single directed link, where si is the number of UAVs on the link. In both cases, UAVs waiting in a queue to enter a sector or link count toward its capacity and state, and UAVs plan and update their paths to minimize total traversal cost.
6.2 Experimental Results for Varying UTM Traffic Profiles
There are two definitions of agents that we test: sector agents (Section 4.1.1) and link
agents (Section 4.1.2). Simulation timesteps were discretized into 1s time intervals with
each learning epoch defined as an evaluation of each random team in the population over
100 or 200 simulated timesteps. The algorithm was implemented in C++ with the UAVs
using the A∗ search algorithm provided by the Boost Graph Library. We repeated each
experiment 10 times for statistical significance.
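The UAV planning step can be sketched as follows: each UAV runs A∗ over the high-level graph, where an edge's cost is its Euclidean length scaled by the weight the controlling agent currently assigns. The graph, weights, and function names here are illustrative; the dissertation's implementation used the Boost Graph Library in C++.

```python
import heapq, math

def a_star(nodes, edges, weights, start, goal):
    """nodes: id -> (x, y); edges: id -> list of neighbor ids;
    weights: (u, v) -> agent-set cost multiplier (assumed >= 1)."""
    def h(n):  # heuristic: straight-line distance to the goal
        return math.dist(nodes[n], nodes[goal])
    frontier = [(h(start), 0.0, start, [start])]
    best = {}  # best known cost-to-reach per node
    while frontier:
        f, g, n, path = heapq.heappop(frontier)
        if n == goal:
            return path, g
        if n in best and best[n] <= g:
            continue
        best[n] = g
        for m in edges[n]:
            step = math.dist(nodes[n], nodes[m]) * weights[(n, m)]
            heapq.heappush(frontier, (g + step + h(m), g + step, m, path + [m]))
    return None, math.inf

nodes = {0: (0, 0), 1: (1, 0), 2: (1, 1)}
edges = {0: [1, 2], 1: [2], 2: []}
# the agent on edge (0, 2) prices it heavily, so the UAV detours via node 1
weights = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 5.0}
path, cost = a_star(nodes, edges, weights, start=0, goal=2)
```

Raising a single agent's weight reroutes traffic away from its edge without any explicit coordination between UAVs, which is the mechanism the experiments below rely on.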
Figure 6.2 shows the delay reduction in a system with 20 nodes, which corresponds
to 102 edges. We tested both sector agents and link agents in this domain, with
generation rates of 20 UAVs every {10, 20, 50}s. This means that in the high-traffic
scenario we could have as many as 400 UAVs generated. Both agent formulations were
able to reduce the delay across the system as learning progressed. As the generation
rate increases, we see a superlinear increase in the number of delays accumulating in
the system. In a system with lighter traffic (such as one that generates new UAVs
every 50s), the delay can be reduced to zero. In scenarios with heavier traffic,
both the sector agents and link agents produced significant delay reductions. However,
the minimum delay still converged to a nonzero value, suggesting an upper bound on
the system throughput with respect to the traffic density.
Figure 6.2: Sector and link agent performance on a map with 20 nodes, 102 edges. We tested three different generation rates for learning epochs of length 100s. The low-traffic scenario (ρ = 20 UAVs/50s) is at zero for most of the graph.

The number of timesteps in each learning epoch also plays a role in the delay
performance. When the number of simulation steps doubles, there is a nonlinear
increase in the delay, because unresolved delay rolls over and continues to cause
congestion in the system. The results in Figure 6.3 demonstrate that the sector and
link agents are capable of learning policies that reduce system delay to zero for a
UAV generation rate of 20 new UAVs every 50s. However, traffic congestion compounds
for both the {10, 20}s generation rates.
Table 6.1 shows the percentage delay reduction across several different maps and
generation parameters. Percentage improvement compares the first epoch’s system delay
performance to the last epoch’s performance. The improvement varies greatly with the
length of the learning epoch and the generation rate. These results also include analysis
for a sector agent learning system with 100 nodes and 580 edges to compare the effect of
a finer discretization of the sector space versus applying a similar number of link agents
along graph edges. The traffic generation rate in the 100 node graph was higher for
two of the three tested cases (100 UAVs every {10, 20, 50} s) and the reduced percentage
improvement for those traffic profiles reflects this. However, comparing the 100 UAVs
per 50s case to the link agent scenario with 20 UAVs per 10s, which both generate
the same total number of UAVs, it is clear that the link agent formulation provides a
greater improvement in the overall delay reduction. The link agents improve system delay
by {67.78%, 41.47%} for the {100, 200}s learning epochs, respectively, while the sector
agent formulation provides a delay reduction of {53.37%, 24.42%}.

Figure 6.3: Sector and link agent performance on a map with 20 nodes, 102 edges. We tested three different generation rates for learning epochs of length 200s.

Figure 6.4: Sector agent performance on a map with 100 nodes, 580 edges. We tested three different generation rates for learning epochs of length 100s.

Figure 6.5: Sector agent performance on a map with 100 nodes, 580 edges. We tested three different generation rates for learning epochs of length 200s.

The results presented in this work show that a distributed UTM system can learn
appropriate costs to apply to UAVs traveling in the airspace to incentivize implicit co-
operation between the UAV planners. Using our method, we achieved a 68% and 67%
reduction in delays for a high-traffic scenario using sector and link agents respectively.
The agents were able to take in directional sector congestion information and appropri-
ately weight the cost of travel to promote safety in the system.
6.3 Discussion
In order to do more extensive testing of our UTM system, we had to develop a control
structure that did not require motion across a low-level map to perform simulation.
Because the UAV motion was divided hierarchically, we abstracted the lower-level motion
away by restricting edge capacities rather than counting up spatially close interactions
between UAVs. This allowed us to massively reduce computation, as A∗ no longer needed
to be computed by agents for the lower-level graph.
Table 6.1: Comparison of performance for different agent control, graphs, traffic generation rates, and epoch lengths (T). Percentage improvement is calculated between the delay experienced in the first and last learning epoch.

Agent   Graph                 ρ             T     Improvement
Sector  20 Nodes, 102 Edges   20 UAVs/10s   100s  68.93%
                                            200s  41.18%
                              20 UAVs/20s   100s  81.39%
                                            200s  61.71%
                              20 UAVs/50s   100s  100%
                                            200s  87.28%
Sector  100 Nodes, 580 Edges  100 UAVs/10s  100s  22.16%
                                            200s  12.11%
                              100 UAVs/20s  100s  29.95%
                                            200s  21.80%
                              100 UAVs/50s  100s  53.37%
                                            200s  24.42%
Link    20 Nodes, 102 Edges   20 UAVs/10s   100s  67.78%
                                            200s  41.47%
                              20 UAVs/20s   100s  92.27%
                                            200s  53.12%
                              20 UAVs/50s   100s  100%
                                            200s  75.44%
This fast simulation enabled us to investigate the effects of two different parameters:
the traffic profile and the agent structure. We were able to substantially increase the
tested graph size by generating a random graph with up to 100 different nodes and 580
edges.
In terms of the percentage of traffic reduction relative to the initial traffic, a
lower generation rate typically yielded a better percentage reduction. In two cases
(the sector and link formulations on the 20-node graph with UAV generation every 50
seconds), all delay in the system was eliminated. However, looking at the absolute
reduction of delay rather than the percentage, we see that the more traffic exists in
the system, the more impact our algorithm has in reducing it.
This set of experiments suggests that our approach to learning traffic reduction
applies in cases of high traffic as well as low. Our agents can learn to reduce
traffic even in systems where no routing solution can eliminate
delay entirely. This scalability to handle increased amounts of traffic without becoming
computationally intractable was one of our original design requirements.
In addition to comparing different traffic profiles, we also compare different agent
formulations. Sector agents essentially control collections of edges, discretizing costing
strategies into different directions. This had a natural parallel to the area-based sector
distinctions in traditional air traffic management. However, we hypothesized that we
could gain finer control over the policies of agents if we allowed each edge to learn. This
would enable more specific policies to be developed independently.
Contrary to our hypothesis, link agents performed only comparably to sector agents, and
in some cases underperformed them. We identified key differences between the two
formulations. The first was that the link agents had a single traffic input and a single
traffic output, whereas the sector agents had four inputs and four outputs, one for each
cardinal direction. This lower-dimensional state loses some information about other UAVs
traveling near the link agent. However, the greater issue appeared to be the disparity
in the number of agents in each case; for a random map, the number of links tended to be
four to five times the number of sectors on the map.
The introduction of link agents makes this a much more complex multiagent coordi-
nation problem. There are many more agents, with much less state information available
to them. Using link agents should allow us a finer degree of control over traffic routing,
but this benefit is overshadowed by the noisiness in this large multiagent problem. The
fact that we obtained comparable performance with five times more agents indicates that
if this noise were to be mitigated, we could get better performance.
Chapter 7: Scalable UTM Learning
In the previous abstract experiments, we found similar performance with different agent
definitions, even though there was a significant increase in the number of learning agents
present. We expected to have a finer degree of control using the link agents, and thus
obtain better performance overall when more agents were present in the system. Because
this was not the case, we suspect that multiagent noise played a part in this.
7.1 The Multiagent Noise Problem
Large learning multiagent systems have many challenges when performing a coordination
task. A common approach to coordinating agents is to give the same score to the entire
system for a task that they are completing. This gives some indication of how well the
task is being completed as a result of their joint effort, but does not directly assess how
each of them contributed. The amount of the reward that they are not responsible for
is considered ‘noise’.
A major focus in multiagent learning is reducing the effects of this ‘noise’ while
ensuring a reward for working toward a group objective. Noise reduction has been
approached by using difference rewards with counterfactuals to suppress the effects of
other agents’ actions [2, 3, 4, 6, 7, 95, 96]. Difference rewards reduce the noise in the
system by subtracting off the impact of other agents in the reward.
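The difference-reward idea can be sketched with a toy global objective: an agent's reward is the global score minus the score recomputed with that agent replaced by a counterfactual (here, contributing nothing). The objective function below is an illustrative stand-in, not the dissertation's actual reward.

```python
def global_reward(contributions):
    """Toy global objective: negative congestion, modeled as the squared
    sum of agent contributions on a shared resource."""
    return -sum(contributions) ** 2

def difference_reward(contributions, i, counterfactual=0.0):
    """D_i = G(z) - G(z with agent i's contribution counterfactually replaced)."""
    modified = list(contributions)
    modified[i] = counterfactual
    return global_reward(contributions) - global_reward(modified)

z = [1.0, 2.0, 3.0]
d0 = difference_reward(z, 0)   # credit assigned to agent 0 alone
```

Because the contributions of the other agents appear in both terms, their effect largely cancels, which is how the difference reward suppresses multiagent noise.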
Another kind of noise which has recently been explored in large multiagent systems
is exploratory noise. This noise is caused by agents taking off-policy moves in order to
explore their environment. This can be ignored as an infrequent environmental noise in
many cases and learned around, but this off-policy selection can introduce a constant
source of noise in a large system of learning agents. CLEAN (Coordinated Learning
without Exploratory Action Noise) was developed in a similar manner to difference
rewards, in order to suppress noise from agents taking their exploratory actions [46].
This approach uses the difference reward framework in that it subtracts off the impact
of other agents on the global reward. However, CLEAN specifies that the global rewards
be calculated based on the agents following their policy-specified action and a potential
counterfactual action that they could have taken.
Larger multiagent systems have many problems, but sometimes the large number of
agents can be an advantage. D’Ambrosio and Stanley point out the problem of reinven-
tion, which refers to the fact that many agents in the system must re-learn policies that
may be similar or identical to other policies developed in the system [28]. They suggest
approaching this by using an algorithm that represents the pattern of policies, which can
train much larger teams by extracting roles from these policy patterns.
7.2 Learning with Subsets
One way of reducing the problem of noise in a highly coupled multiagent system is to
reduce the variability of actions in the multiagent system. CLEAN accomplishes this by
making agents take their policy-optimal actions rather than exploratory actions. We take
this one step further; we look at cases where only subsets of agents are actually learning
in the system, and other agents take a base action. In this section, we experimentally
identify which agent is the best single agent to perform in the system alone, and we
identify which agent is the worst agent in the system by removing agents individually
and comparing global performance.
7.2.1 Include-One
To test whether noise could be reduced by excluding agents, we performed several tests
isolating a single agent. This agent would learn while the actions of the other agents
were kept static, outputting the baseline value of 1. The agent was allowed to learn for
100 epochs, and the maximum value obtained was recorded.
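The include-one procedure can be sketched as the following loop: each agent is activated alone (the others output the flat baseline of 1), trained for some epochs, and the best performer is recorded. The crude hill-climbing `for` loop is an illustrative stand-in for neuro-evolution, and the toy objective is an assumption; the real evaluation ran the traffic simulator.

```python
def include_one(n_agents, evaluate, epochs=100):
    """evaluate(costs) -> system performance (higher is better)."""
    best_agent, best_score = None, float("-inf")
    for i in range(n_agents):          # one experiment per agent
        score = float("-inf")
        cost = 1.0
        for _ in range(epochs):        # hill-climbing stand-in for learning
            for candidate in (cost * 0.9, cost, cost * 1.1):
                costs = [1.0] * n_agents   # inactive agents output 1
                costs[i] = candidate       # only agent i deviates
                s = evaluate(costs)
                if s > score:
                    score, cost = s, candidate
        if score > best_score:
            best_agent, best_score = i, score
    return best_agent, best_score

# Toy objective in which agent 2's cost matters most.
toy = lambda c: -abs(c[2] - 0.5) - 0.01 * sum(c)
agent, score = include_one(n_agents=4, evaluate=toy, epochs=20)
```

In the dissertation's setting this loop corresponds to the 38 separate single-link experiments described below.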
Figure 7.1 shows the performance of all agents learning at once compared to learning
with a single agent (which we discuss in this section), as well as learning excluding a
single agent (which we discuss in Section 7.2.2). The traffic conditions for this domain
were a generation rate of ρ = 5 UAVs/s, on an 11-node, 38-edge domain with the
graph topology shown in Figure 3.3.
Figure 7.1: Learning with all agents, learning using only the best individual agent, and learning excluding the worst individual agent.

We ran 38 experiments focusing on single-agent performance, in each of which a single
link agent was learning and active in the domain. The single active agent learned its
policy using neuro-evolution, while the other 37 inactive agents generated a flat output of 1 at
each timestep. Inactive agents did not learn, and did not respond to traffic conditions.
Each experiment had a different single agent active in the system. We then identified
the single agent that attained the best global performance in the system. This was the
best-performing agent, and its performance is shown on Figure 7.1.
Figure 7.1 indicates that the value of a single agent learning can be higher than all
agents learning at once. The single agent trial shows a fast convergence, compared to all
agents learning simultaneously, because there is no noise from other agents within the
system. It converges to its final value around epoch 18, while it takes all agents learning
simultaneously the full 100 epochs to approach a similar performance.
Performance at the beginning of the episode starts better for the single-agent test
than for the all-agent test. This is because the inactive agents do not begin with
random policies; instead, they output a static policy of 1. This baseline provides
better-than-random performance, but converged learned performance is ultimately
preferable with any subset of learning agents.
Figure 7.2: The maximum performance of a single agent graphed against that agent's blocking impactfulness. The linear trend line shows that an agent's blocking impactfulness has a negative correlation with average travel time, indicating that agents that can block traffic tend to have better performance.

The success of the single-agent testing gives us a strong indication that the multiagent
system has difficulty learning quickly because agents cannot tolerate the randomness
introduced by concurrent agent learning. However, selecting a single agent to learn cuts
out the possible contribution of other agents learning in the state space. Though
eliminating the noise in the system results in fast learning, selecting a single agent to
learn could hurt the overall long-term converged performance of the system.
Testing with a single agent at a time requires a full simulation for each individual to
find the best performing agent. In this case, this caused us to run 38 separate tests in
order to determine the best agent in the system. This is an exhaustive pre-processing
step that may not be feasible for larger systems. To avoid this exhaustive step, we looked
at the impactfulness of an agent, and attempted to correlate this metric, which does not
require a full simulation, with the converged performance of a full simulation.
In the case of blocking impactfulness, we see in Figure 7.2 that higher impactfulness
has a weak linear correlation with better performance. However, Figure 7.3 indicates
that higher attracting impactfulness has a weak linear correlation with worse
performance. The lack of a strong correlation for either impactfulness metric indicates
that we cannot use impactfulness as a substitute for an exhaustive simulation to find
which agent will perform best acting alone.
The deeper insight is that an agent's ability to shift optimal paths away from the
distance-optimal route does not necessarily indicate better performance. Due to the
topology of the graph, some agents are situated such that they should not try to attract
or deflect traffic, and should instead output a value that neither pulls nor pushes
traffic. This is reflected in the policies that the agents developed based on the
topology of the map.
7.2.2 Exclude-One
The previous section showed that learning with a single agent could perform better than
all agents together. This presented a static learning problem which was entirely decou-
pled from the policies of other agents. However, we wanted to investigate whether sub-
stantial gains could be acquired by excluding an agent that contributed a large amount
of noise to the system.
In a similar way to how we performed the single-agent tests, we selected a single
agent to output a value of 1 instead of learning a policy. The 37 active agents would
then learn their policy as normal. We would then investigate the converged performance
of the system without this agent. The same graph topology provided in Figure 3.3 was
used in this case as well. We again ran 38 experiments, each deactivating a single agent
at a time.
Figure 7.1 shows that when the correct agent is deactivated, a substantial amount of
noise can be removed from the system, and the converged performance can improve. We
call this agent the worst agent, as the system performs best when it is excluded.
0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1
Impactfulness (attracting)
6
8
10
12
14
16
18
Aver
age
Tra
vel
Tim
e
single agent maxlinear
Figure 7.3: The maximum performance of a single agent graphed against that agent’sattracting impactfulness. The linear trend line shows that an agent’s attracting impact-fulness has a positive correlation with increasing average travel time, indicating agentsthat can attract traffic tend to have worse performance.
Figure 7.4: Learning with 3 randomly-selected agents learning sequentially in the system.At step 100 the second agent is activated, and the first agent is deactivated. At step 200the third agent is activated and the second agent is deactivated.
7.2.3 Sequential Learning
In order to remove the largest amount of noise while still getting adaptation from
several elements in the system, we looked at subsets of agents learning. Similar to the
include-one trials, we examined what happened when a single agent was added to the
system while all others output a value of 1. After 100 epochs, this agent would stop
learning, fix its policy to the best it had found, and then another agent would begin to
learn in the system. The policy found by the first agent remained fixed (but was no
longer 1) while the second agent learned. Though the actions of this first agent were no
longer static, its policy was, and it therefore presented a more learnable problem for
the other agents in the system.
For these trials, we initialized the neural networks with zero weights instead of
random ones. This meant that they would output a constant value before mutation, and
would only keep a mutation if it performed better than outputting a value of 1.
Figure 7.5: Sequential activation of three agents in the system. We see a slight decrease in performance after the third agent is added into the system.

Figure 7.4 shows the effect of selecting 3 agents randomly from the system and having
each learn sequentially. The graph shows spikes of poor performance during the
transition phases, as each newly introduced neural network starts with a zero policy.
A zero policy is slightly different from a constant output of 1, which causes some
increase in total travel time until the agent learns to handle this. We see here that
we have quick convergence when learning with a single agent at a time.
This approach depends entirely on which agents are activated in the system. We see
that when three different agents are active, as in Figure 7.5, the converged average
travel time is above 6, as opposed to the converged performance in
Figure 7.4, which is near 5. In fact, this marks an increase over the global minimum
travel time in the graph, which is found before the third agent begins learning. This
indicates that sometimes an agent does not learn to shut off entirely when it is not
actually needed.
Figure 7.6: Sequential activation of three agents in the system. We see that in this case, there is largely no improvement when more than one agent learns in the system.

Figure 7.6 shows a sequence of three agents that do not provide substantial improvement
over the first agent learning alone. The performance does not substantially improve when
the other agents are added, but it does not decrease either. When the second agent is
first added, there is a substantial drop in performance, as the untrained agent performs
much worse than the first. This is a negative result for this approach. However, it
should be noted that all agents still find solutions that outperform the entire system
learning together.
Our results from sequentially activating agents in the system show that we can sub-
stantially increase convergence speed in the system by reducing the number of agents that
act. This is one way of mitigating the multiagent noise that the full system experiences.
We also look at the effect of learning with varying numbers of agents in the system.
To perform these tests, we sorted the agents by impactfulness, then evaluated which
subset of the n most impactful agents would perform well. Figure 7.7
shows that performance improves when between two and five agents are added to the
system. Beyond this, however, the improvement diminishes as multiagent noise
becomes a problem. The multiagent noise means that the problem takes much longer to
converge to a solution, and the agents are given only 100 epochs to learn. If the learning
process were to continue for longer, there might be a shift to favor more agents in the
system.
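The subset-selection procedure described above can be sketched as follows. This is a minimal Python illustration; the `top_n_subsets` function and the impactfulness scores are hypothetical stand-ins, not the implementation used in our experiments:

```python
def top_n_subsets(impactfulness, max_n):
    """Return candidate agent subsets ordered by impactfulness.

    impactfulness: dict mapping agent id -> impactfulness score.
    Returns the single most impactful agent, the top two, and so on
    up to the top max_n agents.
    """
    ranked = sorted(impactfulness, key=impactfulness.get, reverse=True)
    return [ranked[:n] for n in range(1, max_n + 1)]

# Hypothetical impactfulness scores for five agents.
scores = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.5, "e": 0.1}
subsets = top_n_subsets(scores, 3)
# subsets == [["a"], ["a", "c"], ["a", "c", "d"]]
```

Each returned subset would then be evaluated by running the learning simulation with only those agents active.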
A natural extension of this approach would be to identify convergence of an agent, and
Figure 7.7: Testing with different numbers of agents, ordered by impactfulness (average travel time vs. number of agents active). We see an increase in total travel time as more agents are added, as the multiagent learning problem becomes harder.
then switch to learning with another single agent when convergence has been reached.
This could lead to much faster discovery of a good routing solution. Additionally, our
results indicate that there is a specific subset and ordering of agents learning in the
system that produce good results. Being able to detect these subsets a priori would be
a beneficial extension to this work.
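Such a convergence check could be sketched as follows, assuming fitness is total travel time (lower is better); the window size and tolerance here are illustrative choices, not values tuned in our experiments:

```python
def has_converged(history, window=10, tol=1e-3):
    """Detect convergence: the best fitness over the most recent
    `window` evaluations improved by less than `tol` compared with
    the preceding window."""
    if len(history) < 2 * window:
        return False
    prev_best = min(history[-2 * window:-window])
    curr_best = min(history[-window:])
    return prev_best - curr_best < tol

# A fitness trace (total travel time) that plateaus.
trace = [10.0, 9.0, 8.0, 7.0, 6.5] + [6.4] * 20
plateaued = has_converged(trace)  # True once the trace flattens
```

Once `has_converged` fires for the currently active agent, its policy would be frozen and the next agent in the sequence activated.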
7.3 Discussion
We have investigated the problem of multiagent noise in the UAV traffic management
domain. Multiagent noise occurs when there are many agents acting at once in a system.
In the UAV traffic management domain, we would like to be able to scale to large systems,
and therefore would like to mitigate this multiagent noise problem as much as possible.
Multiagent noise can be reduced if fewer agents act in the system. Our experiments
with impactfulness indicated that some agents had more ability to influence the system
than others. By excluding some agents from the learning process, we aimed to reduce the
noise in the multiagent system. This could potentially be done with minimal effect on
system performance if the agents being removed had a small ability to influence the
system anyway.
We started by isolating single agents in the system, and assessing their learning perfor-
mance in the absence of other agents acting in the system. This represents how an agent
would act without any multiagent noise. We found that in some cases, an agent act-
ing alone in the system could have better learning performance than all agents learning
concurrently. This is significant because it illustrates the degree to which other agents
can confuse the learning process. The advantages of having other agents with the abil-
ity to adapt better policies in the system are overshadowed by the noise that interferes
with the learning process. This is counterintuitive; the baseline policy of outputting one
tends to be highly suboptimal. It would seem that the advantages of learning to output
something different should outweigh a noisy reward signal.
Impactfulness did not have a strictly linear correlation with the performance of a
single agent acting alone in the system. We expected that if an agent could impact
more paths of UAVs in the system, it would correlate to having a greater reward in
the system. By definition, an agent with an impactfulness of zero should have zero
improvement over baseline when learning, because it would not be able to affect the
performance of the system in any way, no matter how high or low it set its value. We do
see a loose correlation of impactfulness with global reward attained by a single learning
agent, but impactfulness turns out not to be a good predictor of received reward.
This indicates that there is another parameter besides impactfulness that matters in
the ability to produce a good global reward: whether an agent should attract or deflect
paths through it. The topology of the map may be such that it is counterproductive to
shift the flow of traffic from an agent. This would not be captured by our impactfulness
metric, and if it is not a good policy for an agent to use its ability to attract or deflect
traffic, it will still have little impact on the system. The highest global value score would
therefore correspond to a high impactfulness, and a general positioning in the system
where it was beneficial to draw or deflect many different UAV paths. As of now, we have
no way of quantifying this position in the graph, and this could be a source for future
investigation.
Similar to the formulation of Difference Rewards, which excludes an agent during the
calculation of a reward, we also experimented with excluding single agents from the system.
This experiment gave us an insight into the working of the multiagent system; some
individual agents produce much more noise than others in the learning process. These
deceptive agents can significantly reduce performance. We saw substantial improvement
when the correct agent was removed from the learning process.
The exclusion trials gave us the best performance in the system compared with all
agents learning or a single agent learning. This validates our multiagent approach; when
we remove a noisy agent, we can get the benefit of all agents adapting in the system,
while substantially reducing the noise from learning.
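The exclusion procedure can be illustrated with a leave-one-out sweep; `toy_evaluate` is a hypothetical stand-in for running the full learning simulation and returning total travel time:

```python
def best_exclusion(agents, evaluate):
    """Evaluate the system with each single agent excluded from
    learning, and return the exclusion yielding the lowest travel
    time. `evaluate(active)` stands in for running the learning
    simulation with only the `active` agents adapting."""
    results = {a: evaluate([b for b in agents if b != a]) for a in agents}
    return min(results, key=results.get)

# Toy evaluation: travel time is worst when the "noisy" agent learns.
def toy_evaluate(active):
    return 5.0 + (2.0 if "noisy" in active else 0.0) + 0.1 * len(active)

agents = ["a", "noisy", "c"]
worst_agent = best_exclusion(agents, toy_evaluate)  # "noisy"
```

In our experiments this sweep is what identified the particularly detrimental agent to remove.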
Performance when excluding one agent also did not correlate strictly linearly with impactfulness.
Again we expected higher impactfulness to correspond to higher performance when an
agent was removed, for the exact opposite reason of the include-one trials; we assumed
that an impactful agent would have the most ability to vary the learning signal in the
system. Again, we find a loose correlation, but this is not sufficient to predict the global
reward value that an agent will produce when acting in the system. The reason for this
is slightly more obvious: an impactful agent may be needed in the system to provide
routing. If it is able to learn a good policy, removing it will be detrimental for the system,
and not remove a substantial amount of noise.
Finally, we looked at sequentially adding agents into the system. In these experiments
we aimed to get the fast convergence observed in the single-agent trials, but to add
adaptability for other agents once that single agent had learned. This provided mixed
results. Certain subsets of agents would perform better than others when learning in
sequence. In some cases, we found fast convergence to good values when the correct
sequence of agents was introduced.
Our expectation was that if agents learned sequentially, and an unhelpful agent was
added in later, it would be able to learn to output a flat value as it had been forced to
output before. In some cases we observed this behavior, but it appeared that sometimes
the agents could learn to reduce their effect, but not completely learn the optimal output
value.
We successfully showed that we can reduce the agent noise substantially, and attain
better performance than all agents learning together. In this section we identified these
subsets of agents experimentally, and were able to achieve substantial reduction in total
travel time. Our goal was to find the optimal mixture of agents working in the system.
We moved toward this by isolating the impact of individual agents, and by introducing
agents sequentially, but we did not find an a priori method that would allow us to
design a smaller multiagent system. Impactfulness was a promising metric, but it did
not correlate entirely to the global reward that an agent could attain (or detract from)
in the system.
Chapter 8: Conclusion
In this work we present an end-to-end construction of a UAV traffic management prob-
lem. We first propose a method to manage UAV traffic at a high level by adding a
two-step process to a UAV’s planning strategy, forcing a UAV to plan the sector order
that it visits before planning a finer-grained physical path across the space. We then
use this planning over the sector order to incentivize or disincentivize travel of UAVs
through sections of the airspace, using a learning multiagent system to assign edge costs.
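The high-level step of this two-step process can be sketched as a shortest-path search over the sector graph using the agent-assigned edge costs. The graph and cost values below are illustrative, not taken from our experiments:

```python
import heapq

def sector_route(graph, costs, start, goal):
    """High-level step: plan a sector order over the abstract graph.

    graph: dict sector -> list of neighbor sectors.
    costs: dict (sector, sector) -> edge cost set by the controller
           agents (hypothetical learned values).
    Returns the cheapest sector sequence from start to goal (Dijkstra).
    """
    frontier = [(0.0, start, [start])]
    seen = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nbr in graph[node]:
            if nbr not in seen:
                heapq.heappush(frontier,
                               (cost + costs[(node, nbr)], nbr, path + [nbr]))
    return None

# A UAV routes around sector "B" because the agents have raised its
# incoming edge cost, simulating congestion.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
costs = {("A", "B"): 5.0, ("A", "C"): 1.0, ("B", "D"): 1.0, ("C", "D"): 1.0}
route = sector_route(graph, costs, "A", "D")  # ["A", "C", "D"]
```

The UAV's own fine-grained planner then produces a physical path within each sector of the returned sequence.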
First we outline the UAV traffic management problem, a new problem that mixes
elements of traditional ground traffic modeling with airspace-based traffic management
approaches. We then develop several performance metrics, and a control methodology
to manipulate air traffic in our problem. We verify that our control methodology is
effective by quantifying the impactfulness of elements in a graph. We then outline two
agent formulations to provide multiagent traffic reduction in our problem.
Our multiagent framework has many advantages. The first is that it is platform-
agnostic. We coordinate UAVs by incentivizing alternate routes when a UAV’s path
may become congested. The UAVs independently calculate their trajectories through
the space with this information. This frees us from assuming anything about the UAV’s
internal structure. Our management strategy is also dynamic. Agents evolve policies that
convert observed traffic to air travel costs. The costs will therefore change throughout the
simulation as different numbers of UAVs travel through the space. A converged solution
represents a real-time costing strategy that responds dynamically to traffic conditions.
We also consider time-extended effects. Our agents learn to reduce delays over an
entire run. This means that downstream delay propagation is captured in the final
learned costing policy. Agents can thus respond proactively to observed traffic patterns
in order to curtail future traffic. Finally, computing a forward pass through a neural
network is fast. The training time may be significant to produce these policies, but an
evolved policy can be used in real time.
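The real-time costing step amounts to a single forward pass through a small network. The sketch below shows a one-hidden-layer policy mapping an observed traffic level to an edge cost; the weights are illustrative placeholders, where in practice they would be produced by neuro-evolution:

```python
import math

def forward(weights_h, bias_h, weights_o, bias_o, traffic):
    """One forward pass of a tiny one-hidden-layer costing policy:
    observed traffic level -> edge cost. The weights stand in for
    values that neuro-evolution would produce."""
    hidden = [math.tanh(w * traffic + b) for w, b in zip(weights_h, bias_h)]
    return sum(w * h for w, h in zip(weights_o, hidden)) + bias_o

cost_low = forward([1.0, -0.5], [0.0, 0.1], [2.0, 1.0], 0.5, traffic=0.1)
cost_high = forward([1.0, -0.5], [0.0, 0.1], [2.0, 1.0], 0.5, traffic=0.9)
# With these illustrative weights, heavier observed traffic yields a
# higher edge cost, deflecting new UAV paths away from congestion.
```

Because inference is only a handful of multiplications and additions, the cost of every edge can be refreshed each time step.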
Experiments using robots demonstrated an end-to-end application of our routing
approach. Using the ROS framework we were able to implement the UAV traffic man-
agement system on Pioneer 3-DX robots in a real maze. In doing this, we show the
feasibility of this approach for a real application. The application presented in this
work is geared largely toward UAVs. However, in showing our work on a Pioneer, we
demonstrate that our approach is platform- and planner-independent at the low level.
In this work we find a tradeoff between the two different UAV traffic management
agent definitions. Sector agents are able to use a much larger amount of the information
in the system, but are unable to provide fine-grained control. Link agents offer a way
to modify individual transitions between the sectors, but observe much less of the state
space and suffer from multiagent noise. The next step in this research is to develop new
multiagent coordination techniques (e.g. difference rewards combined with deep learning
approaches) that can provide effective reward shaping in this complex domain.
Though the link agents had performance similar to the sector agents, it
is contrary to intuition that they were able to learn so effectively. In the 20-node case,
there were 20 agents in the sector case and 102 agents in the link agent case. The link
agent case presents a much larger multiagent system, with agents that could observe
much less of the environment than the sector agent case. There are many techniques
in the field of multiagent learning that can improve the coordination between agents in
large systems. This indicates that there is a promising area of future work in gaining
further system improvement by developing multiagent coordination techniques for the
link agent controlled UTM domain.
We also show that we can gain substantial convergence improvements by looking at
subsets of agents in the multiagent system. We found that a single agent acting alone in
the multiagent system can provide faster convergence and better performance than the
entire multiagent system acting together. By examining a single agent in the system, we
can eliminate multiagent noise entirely. Even better performance is attained when we exclude
a particularly detrimental agent from the system. In this work, we find these subsets
experimentally, but we leave it to future work to develop an autonomous method of
identifying these subsets a priori.
The policies developed at this high level are useful, but we are ultimately testing
on an abstraction of a physical domain. This abstraction may not capture some of
the dynamics that could occur when modeling the airspace. Previous work included a
simulation of pixel-level paths across an obstacle-filled airspace. It may be valuable to
investigate more complex system dynamics using the algorithms presented in this work,
as the presence of obstacles may introduce further challenges in coordination.
Chapter 9: Future Work
There are many extensions that were out of the scope of our current investigation, but
could provide a research opportunity to others looking to use this multiagent system.
Though we have investigated varying many different parameters, including graph
structure, traffic levels, fitness definitions, and agent structure, there are numerous ex-
tensions in the direction of better traffic modeling or agent construction. In this section
we provide an overview of some possible extensions to the work presented in this disser-
tation.
9.1 Airspace Construction
There is substantial room for future work in setting up the optimal architecture
for the airspace of a multiagent system. In particular, our results showed a significant
disparity between the routing ability of different agents in the system based on the graph
topology at a high level.
We began our investigation of UAV traffic management by looking at a height map,
and then hand-picking intersections to define areas of agent control. We then used
Voronoi partitioning to define the bounds of agent control. Both the hand-selection of
points and the Voronoi partitioning are aspects that invite future work on this
topic, as they were chosen arbitrarily. For example, if the Voronoi partitioning is the
ideal method of sectioning the space for agent control, is there a mathematical method of
choosing points in the space that yields optimal learning performance? Finding a better
partitioning of the space could result in a reduction of conflicts.
The approach of using a Voronoi partition may not be well-tailored to this domain. It
provided an impartial way of dividing the space, which reduced the human element of
the state partitioning. However, focusing on an approach that assigns a more even
area to each agent's region could result in performance improvements.
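The Voronoi assignment of the space to agents can be sketched by mapping each free cell to its nearest agent site; the coordinates below are illustrative:

```python
def voronoi_assign(cells, sites):
    """Assign each free cell to the index of the nearest agent site
    (squared Euclidean distance), producing the Voronoi regions of
    agent control. Cell and site coordinates are illustrative."""
    def nearest(cell):
        return min(range(len(sites)),
                   key=lambda i: (cell[0] - sites[i][0]) ** 2
                               + (cell[1] - sites[i][1]) ** 2)
    return {cell: nearest(cell) for cell in cells}

# Two agent sites partition a row of cells into two regions.
sites = [(0, 0), (4, 0)]
cells = [(0, 0), (1, 0), (3, 0), (4, 0)]
regions = voronoi_assign(cells, sites)
# regions == {(0, 0): 0, (1, 0): 0, (3, 0): 1, (4, 0): 1}
```

A different site-selection rule would change these regions directly, which is exactly the degree of freedom the proposed future work would explore.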
9.2 Physical Trials
Future work in the real robot domain will include improving the ability of robots to
sense one another in the space. Additionally, scaling this to a much larger map, with
a potentially more complex topology, could better demonstrate the feasibility of this
method on a large scale for a real-world application. Making a larger map will also
necessitate creating more traffic to manage. Future work will look at adding ‘ghost’
traffic in the system, to study the behavior of robots when they perceive many other
robots in the system.
Another important direction for future work is outdoor trials. Outdoor trials mean a
much less controlled environment, which could cause a problem for laser scan data and
affect localization. However, a large advantage of outdoor experiments is the possibility
of using GPS equipment. This could potentially give a more reliable position signal that
could be communicated to UTM agents. Outdoor trials would also enable actual UAV
testing, which would be an important step for a UAV traffic management
framework.
9.3 Heterogeneous Planners
Our UAV traffic management system manipulates a cost-based planner to route traffic.
The only requirement for applying our learning approach is the assumption that UAVs
rely on cost-based planners. In this work we selected A∗ planners to plan paths for the
UAVs. This was due to the widespread use and optimality guarantees of an A∗ planner,
the availability of a fast implementation in Boost, and the suitability of A∗ for use on
a pixel-based graph.
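For reference, a minimal A∗ search on a 4-connected occupancy grid, of the kind used at the pixel level, can be sketched as follows (an illustrative sketch, not the Boost implementation used in our experiments):

```python
import heapq

def astar(grid, start, goal):
    """Minimal A* on a 4-connected occupancy grid (1 = obstacle).
    Manhattan distance is admissible here because every move has
    unit cost."""
    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    rows, cols = len(grid), len(grid[0])
    frontier = [(h(start), 0, start, [start])]
    seen = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                heapq.heappush(frontier,
                               (g + 1 + h((nr, nc)), g + 1,
                                (nr, nc), path + [(nr, nc)]))
    return None

# The path must detour around the obstacle row.
grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = astar(grid, (0, 0), (2, 0))
```

Replacing the unit move cost with agent-assigned edge weights is all that is needed to couple this planner to the learned costing policies.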
There are other cost-based planning algorithms that are feasible for use by UAVs.
RRT∗ is a fast, sampling-based path planner that has gained popularity in robotic
applications. It provides paths that are not necessarily optimal in finite time, but that
can be generated quickly. This algorithm could replace motion planning at the pixel
level, giving us faster simulations for physical conflict testing.
9.4 Tolerance Testing
Our learning approach has been tested under different traffic conditions, different graph
scales, and different agent architectures. In each of these, however, our focus was on
learning with rational UAV motion, which is not always the case in real-world scenarios.
A UAV, particularly if it is remotely operated and not autonomously guided, may take
suboptimal paths through the space. Investigating how gracefully the performance of
our agents degrades in the presence of non-compliant UAVs is an important step toward
actual application of this approach as a regulatory method.
There may also be substantial differences in the models of UAVs encountered in
the real world. Investigating the effect of heterogeneous mixtures of UAVs would be a
possible extension to this work. Heterogeneous speed, for example, would provide a less
predictable element to routing UAVs. This could complicate the learning of policies by
agents in the system. Additionally, the safety and predictability of some UAVs could
differ. Different platforms mean different sensing capabilities, and differing reaction
times may require some UAVs to keep much more distance than others to interact
safely in the space. An important extension of this work would be to find a
way to incorporate this heterogeneity into the learning process.
A∗ is also not the only or necessarily the most common solution to path planning in an
obstacle-filled environment. Different graph-based planning algorithms may be explored,
using the same evolutionary infrastructure for setting the graph weights. Additionally,
the routing algorithm currently does not account for possible heterogeneity in the system.
Further improvements could be made to this algorithm if heterogeneity in UAV type or
priority could be accommodated.
Bibliography
[1] ADS-B Technologies. What Is ADS-B? http://www.ads-b.com/.
[2] A. Agogino. Evaluating evolution and monte carlo for controlling air traffic flow.In Proceedings of the 11th Annual Conference Companion on Genetic and Evolu-tionary Computation Conference: Late Breaking Papers, GECCO ’09, pages 1957–1962, New York, NY, USA, 2009. ACM.
[3] A. Agogino and K. Tumer. Regulating Air Traffic Flow with Coupled Agents. InProceedings of the Seventh International Joint Conference on Autonomous Agentsand Multiagent Systems, Estoril, Portugal, May 2008.
[4] A. K. Agogino and K. Tumer. Reinforcement Learning in Large Multi-agent Sys-tems. In AAMAS-05 Workshop on Coordination of Large Scale Multiagent Systems.Springer-Verlag, Utrecht, Netherlands, July 2005.
[5] A. K. Agogino and K. Tumer. Analyzing and Visualizing Multiagent Rewardsin Dynamic and Stochastic Environments. Journal of Autonomous Agents andMulti-Agent Systems, 17(2):320–338, 2008.
[6] A. K. Agogino and K. Tumer. Learning Indirect Actions in Complex Domains:Action Suggestions for Air Traffic Control. Advances in Complex Systems, 12(4-5):493–512, 2009.
[7] A. K. Agogino and K. Tumer. A Multiagent Approach to Managing Air TrafficFlow. Autonomous Agents and MultiAgent Systems, 24:1–25, 2012.
[8] A. Albert, F. S. Leira, and L. Imsland. Uav path planning using milp with exper-iments. Modeling, Identification and Control, 38(1):21, 2017.
[9] F. C. Allaire, M. Tarbouchi, G. Labonte, and G. Fusina. FPGA implementation ofgenetic algorithm for UAV real-time path planning. In Unmanned Aircraft Systems,pages 495–510. Springer, 2008.
[10] C. Amato and F. A. Oliehoek. Scalable planning and learning for multiagentPOMDPs. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[11] B. and Walter, A. Sannier, D. Reiners, and J. Oliver. Uav swarm control: Cal-culating digital pheromone fields with the gpu. The Journal of Defense Modelingand Simulation, 3(3):167–176, 2006.
89
[12] J. K. Archibald, J. C. Hill, N. A. Jepsen, W. C. Stirling, and R. L. Frost. ASatisficing Approach to Aircraft Conflict Resolution. Trans. Sys. Man Cyber PartC, 38(4):510–521, July 2008.
[13] M. C. Aubert, S. zmc, A. R. Hutchins, and M. L. Cummings. Toward the de-velopment of a low-altitude air traffic control paradigm for networks of small,autonomous unmanned aerial vehicles. In AIAA Infotech @ Aerosapce, SciTechConference, 2015.
[14] M. A. Azzopardi and J. F. Whidborne. Computational Air Traffic Manage-ment. In Proceedings of the 30th AIAA/IEEE Digital Avionics Systems Conference(DASC2011), Best Paper Session, pages 1.B.5–1, Seattle, WA, USA, 2011.
[15] T. Back, D. B. Fogel, and Z. Michalewicz. Handbook of evolutionary computation.New York: Oxford, 1997.
[16] I. Bekmezci, O. K. Sahingoz, and S. Temel. Flying ad-hoc networks (fanets): Asurvey. Ad Hoc Networks, 11(3):1254–1270, 2013.
[17] D. Bertsimas and S. Patterson. The Air Traffic Flow Management Problem withEnroute Capacities. In May-June 1998, pp. 406-422, 1998.
[18] C. M. Bishop. Pattern recognition. Machine Learning, 128, 2006.
[19] S. Bleuler, M. Brack, L. Thiele, and E. Zitzler. Multiobjective genetic program-ming: Reducing bloat using spea2. In Evolutionary Computation, 2001. Proceed-ings of the 2001 Congress on, volume 1, pages 536–543. IEEE, 2001.
[20] C. Bongiorno, G. Gurtner, F. Lillo, L. VAlori, M. Ducci, B. Monechi, and S. Pozzi.An agent based model of air traffic management. In Proceedings of the ThirdSESAR Innovation Days, 2013.
[21] F. Bullo, E. Frazzoli, M. Pavone, K. Savla, and S. L. Smith. Dynamic vehiclerouting for robotic systems. Proceedings of the IEEE, 99(9):1482–1504, 2011.
[22] D. Burgett. SF Building Heights, 2014.
[23] S. Cameron, S. Hailes, S. Julier, S. McClean, G. Parr, N. Trigoni, M. Ahmed,G. McPhillips, R. De Nardi, J. Nie, et al. Suaave: Combining aerial robots andwireless networking. In 25th Bristol International UAV Systems Conference, 2010.
[24] D. L. Chen and R. J. Mooney. Learning to interpret natural language navigationinstructions from observations. In AAAI, volume 2, pages 1–2, 2011.
90
[25] C. Chlestil, E. Leitgeb, N. Schmitt, S. S. Muhammad, K. Zettl, and W. Rehm.Reliable optical wireless links within uav swarms. In Transparent Optical Networks,2006 International Conference on, volume 4, pages 39–42. IEEE, 2006.
[26] M. A. Christodoulou and S. G. Kodaxakis. Automatic commercial aircraft-collisionavoidance in free flight: the three-dimensional problem. IEEE Transactions onIntelligent Transportation Systems, 7(2):242–249, 2006.
[27] W. J. Curran, A. Agogino, and K. Tumer. Addressing hard constraints in the airtraffic problem through partitioning and difference rewards. In Proceedings of the2013 International Conference on Autonomous Agents and Multi-Agent Systems,pages 1281–1282. International Foundation for Autonomous Agents and MultiagentSystems, 2013.
[28] D. B. DAmbrosio and K. O. Stanley. Scalable multiagent learning through indirectencoding of policy geometry. Evolutionary Intelligence, 6(1):1–26, 2013.
[29] P. Dasgupta. A multiagent swarming system for distributed automatic targetrecognition using unmanned aerial vehicles. IEEE Transactions on Systems, Man,and Cybernetics-Part A: Systems and Humans, 38(3):549–563, 2008.
[30] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition,2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[31] V. Digani, L. Sabattini, and C. Secchi. A Probabilistic Eulerian Traffic Model forthe Coordination of Multiple AGVs in Automatic Warehouses. IEEE Robotics andAutomation Letters, 1(1):26–32, 2016.
[32] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerischemathematik, 1(1):269–271, 1959.
[33] X. C. Ding, A. R. Rahmani, and M. Egerstedt. Multi-uav convoy protection:An optimal approach to path planning and coordination. IEEE Transactions onRobotics, 26(2):256–268, 2010.
[34] H. Emami and F. Derakhshan. An overview on conflict detection and resolutionmethods in air traffic management using multi agent systems. In 16th CSI Inter-national Symposium on Artificial Intelligence and Signal Processing (AISP), pages293–298. IEEE, 2012.
[35] Federal Aviation Administration. NextGen Implementation Plan, 2016.
91
[36] Federal Aviation Administration. Operation and Certification of Small UnmannedAircraft Systems, June 2016.
[37] S. G. Ficici, O. Melnik, and J. B. Pollack. A game-theoretic and dynamical-systems analysis of selection methods in coevolution. Evolutionary Computation,IEEE Transactions on, 9(6):580–602, 2005.
[38] A. G. Foina, C. Krainer, and R. Sengupta. An Unmanned Aerial Traffic Manage-ment solution for cities using an air parcel model. In 2015 International Conferenceon Unmanned Aircraft Systems, pages 1295–1300, 2015.
[39] A. Fraser. Simulation of genetic systems by automatic digital computers vii. effectsof reproductive ra’l’e, and intensity of selection, on genetic structure. AustralianJournal of Biological Sciences, 13(3):344–350, 1960.
[40] E. Gawrilow, E. Khler, R. H. Mhring, and B. Stenzel. Dynamic routing of auto-mated guided vehicles in real-time. In H.-J. Krebs and W. Jger, editors, Mathe-matics - Key Technology for the Future, pages 165–177. Springer, 2008.
[41] B. P. Gerkey. amcl. http://wiki.ros.org/amcl.
[42] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mappingand planning for visual navigation. arXiv preprint arXiv:1702.03920, 2017.
[43] M. Haklay and P. Weber. Openstreetmap: User-generated street maps. IEEEPervasive Computing, 7(4):12–18, 2008.
[44] P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic de-termination of minimum cost paths. IEEE transactions on Systems Science andCybernetics, 4(2):100–107, 1968.
[45] J. C. Hill, F. R. Johnson, J. K. Archibald, R. L. Frost, and W. C. Stirling. Acooperative multi-agent approach to free flight. In Proceedings of the fourth inter-national joint conference on Autonomous agents and multiagent systems, AAMAS’05, pages 1083–1090, New York, NY, USA, 2005. ACM.
[46] C. HolmesParker, A. Agogino, and K. Tumer. CLEAN Rewards for ImprovingMultiagent Coordination in the Presence of Exploration (extended abstract). InProceedings of the Twelveth International Joint Conference on Autonomous Agentsand Multiagent Systems, Minneapolis, MN, May 2013.
[47] S. P. Hoogendoorn and P. H. Bovy. State-of-the-art of vehicular traffic flow mod-elling. Proceedings of the Institution of Mechanical Engineers, Part I: Journal ofSystems and Control Engineering, 215(4):283–303, 2001.
92
[48] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks areuniversal approximators. Neural networks, 2(5):359–366, 1989.
[49] J. P. How, B. BEHIHKE, A. Frank, D. Dale, and J. Vian. Real-time indoorautonomous vehicle test environment. IEEE control systems, 28(2):51–64, 2008.
[50] M. A. Hsieh, A. Cowley, V. Kumar, and C. J. Taylor. Maintaining network connec-tivity and performance in robot teams. Journal of Field Robotics, 25(1-2):111–131,2008.
[51] J. R. Kok, N. Vlassis, et al. Sparse tabular multiagent q-learning. In AnnualMachine Learning Conference of Belgium and the Netherlands, pages 65–71, 2004.
[52] P. Kopardekar. Safely Enabling Low-Altitude Airspace Operations: UnmannedAerial System Traffic Management (UTM). Technical report, NASA Ames Re-search Center, Mountain View, CA, Apr. 2015.
[53] M. Kothari, I. Postlethwaite, and D.-W. Gu. Multi-uav path planning in obstaclerich environments using rapidly-exploring random trees. In Decision and Control,2009 held jointly with the 2009 28th Chinese Control Conference. CDC/CCC 2009.Proceedings of the 48th IEEE Conference on, pages 3069–3074. IEEE, 2009.
[54] G. B. Lamont, J. N. Slear, and K. Melendez. Uav swarm mission planning androuting using multi-objective evolutionary algorithms. In Computational Intelli-gence in Multicriteria Decision Making, IEEE Symposium on, pages 10–20. IEEE,2007.
[55] S. M. LaValle. Rapidly-exploring random trees: A new tool for path planning. TR98-11, October 1998.
[56] P. Long, W. Liu, and J. Pan. Deep-learned collision avoidance policy for distributedmultiagent navigation. IEEE Robotics and Automation Letters, 2(2):656–663, 2017.
[57] B. D. Luders, S. Karaman, and J. P. How. Robust sampling-based motion plan-ning with asymptotic optimality guarantees. In AIAA Guidance, Navigation, andControl (GNC) Conference, page 5097, 2013.
[58] C. Matuszek, E. Herbst, L. Zettlemoyer, and D. Fox. Learning to parse naturallanguage commands to a robot control system. In Experimental Robotics, pages403–415. Springer, 2013.
[59] J. D. R. Millan, D. Posenato, and E. Dedieu. Continuous-action q-learning. Ma-chine Learning, 49(2):247–265, 2002.
93
[60] C. Miyajima, Y. Nishiwaki, K. Ozawa, T. Wakita, K. Itou, K. Takeda, andF. Itakura. Driver modeling based on driving behavior and its evaluation in driveridentification. Proceedings of the IEEE, 95(2):427–437, 2007.
[61] P. Munjal and J. Pahl. An analysis of the boltzmann-type statistical models formulti-lane traffic flow. Transportation Research, 3(1):151–163, 1969.
[62] N. Ozalp and O. K. Sahingoz. Optimal uav path planning in a 3d threat envi-ronment by using parallel evolutionary algorithms. In Unmanned Aircraft Systems(ICUAS), 2013 International Conference on, pages 308–317. IEEE, 2013.
[63] R. A. Paielli. Automated Generation of Air Traffic Encounters for Testing ConflictResolution Software. AIAA Journal of Aerospace Information Systems, 10(5), May2013.
[64] M. Pechoucek and D. Sislak. Agent-based approach to free-flight planning, control,and simulation. Intelligent Systems, IEEE, 24(1):14–17, 2009.
[65] N. Perez-Higueras, R. Ramon-Vigo, F. Caballero, and L. Merino. Robot local nav-igation with learned social cost functions. In Informatics in Control, Automationand Robotics (ICINCO), 2014 11th International Conference on, volume 2, pages618–625. IEEE, 2014.
[66] M. A. Potter and K. A. De Jong. Cooperative coevolution: An architecture forevolving coadapted subcomponents. Evolutionary computation, 8(1):1–29, 2000.
[67] A. Provodin, L. Torabi, B. Flepp, Y. LeCun, M. Sergio, L. D. Jackel, U. Muller,and J. Zbontar. Fast incremental learning for off-road robot navigation. arXivpreprint arXiv:1606.08057, 2016.
[68] L. Qiu, W.-J. Hsu, S.-Y. Huang, and H. Wang. Scheduling and routing algorithmsfor AGVs: a survey. International Journal of Production Research, 40(3):745–760,2002.
[69] Y.-h. Qu, Q. Pan, and J.-g. Yan. Flight path planning of uav based on heuristicallysearch and genetic algorithms. In Industrial Electronics Society, 2005. IECON2005. 31st Annual Conference of IEEE, pages 5–pp. IEEE, 2005.
[70] M. Quigley, K. Conley, B. P. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, andA. Y. Ng. ROS: An open-source robot operating system. In ICRA Workshop onOpen Source Software, 2009.
[71] E. Saad, J. Vian, G. Clark, and S. Bieniawski. Vehicle swarm rapid prototyp-ing testbed. In AIAA Infotech@ Aerospace Conference and AIAA Unmanned...Unlimited Conference, page 1824, 2009.
94
[72] J. Samek, D. Sislak, P. Volf, and M. Pechoucek. Multi-party Collision Avoidance among Unmanned Aerial Vehicles. In Intelligent Agent Technology, 2007. IAT '07. IEEE/WIC/ACM International Conference on, pages 403–406, 2007.
[73] M. Shanmugavel, A. Tsourdos, B. White, and R. Zbikowski. Co-operative path planning of multiple UAVs using Dubins paths with clothoid arcs. Control Engineering Practice, 18(9):1084–1092, 2010.
[74] A. A. Sherstov and P. Stone. Function approximation via tile coding: Automating parameter choice. In International Symposium on Abstraction, Reformulation, and Approximation, pages 194–205. Springer, 2005.
[75] J. G. Siek, L.-Q. Lee, and A. Lumsdaine. The Boost Graph Library: User Guide and Reference Manual. Pearson Education, 2001.
[76] A. Singh, A. Krause, C. Guestrin, W. J. Kaiser, and M. A. Batalin. Efficient planning of informative paths for multiple robots. In Proceedings of IJCAI 2007, pages 2204–2211, 2007.
[77] D. Sislak, P. Volf, and M. Pechoucek. Agent-Based Cooperative Decentralized Airplane-Collision Avoidance. Intelligent Transportation Systems, IEEE Transactions on, 12(1):36–46, 2011.
[78] D. Sislak, P. Volf, M. Pechoucek, and N. Suri. Automated Conflict Resolution Utilizing Probability Collectives Optimizer. IEEE Transactions on Systems, Man, and Cybernetics, 41(3):365–375, May 2011.
[79] A. Sivakumar and C. K.-Y. Tan. UAV swarm coordination using cooperative control for establishing a wireless communications backbone. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, Volume 3, pages 1157–1164. International Foundation for Autonomous Agents and Multiagent Systems, 2010.
[80] SkyRadar Radenna LLC. ADS-B technology (TIS-B, FIS-B). http://www.skyradar.net/skyradar-system/adsbtechnology.html.
[81] W. D. Smart and L. P. Kaelbling. Practical reinforcement learning in continuous spaces. In ICML, pages 903–910, 2000.
[82] K. O. Stanley and R. Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.
[83] D. Stonier, J. Lee, and H. Kim. rocon. http://wiki.ros.org/rocon.
[84] R. Stranders, A. Farinelli, A. Rogers, and N. Jennings. Decentralised coordination of mobile sensors using the max-sum algorithm. In Proceedings of IJCAI 2009, pages 299–304, 2009.
[85] F. Su, Y. Li, H. Peng, and L. Shen. Multi-UCAV cooperative path planning using improved coevolutionary multi-ant-colony algorithm. Emerging Intelligent Computing Technology and Applications, pages 834–845, 2009.
[86] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.
[87] F. Taghaboni-Dutta and J. M. A. Tanchoco. Comparison of dynamic routeing techniques for automated guided vehicle system. International Journal of Production Research, 33(10):2653–2669, 1995.
[88] N. Tao, J. Baxter, and L. Weaver. A multi-agent, policy-gradient approach to network routing. In Proc. of the 18th Int. Conf. on Machine Learning. Citeseer, 2001.
[89] T. Tarnopolskaya and N. Fulton. Synthesis of Optimal Control for Cooperative Collision Avoidance for Aircraft (Ships) with Unequal Turn Capabilities. Springer, 2009. DOI: 10.1007/s10957-009-9597-1.
[90] W. L. Teacy, J. Nie, S. McClean, and G. Parr. Maintaining connectivity in UAV swarm sensing. In GLOBECOM Workshops (GC Wkshps), 2010 IEEE, pages 1771–1776. IEEE, 2010.
[91] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
[92] B. Toledo, V. Munoz, J. Rogan, C. Tenreiro, and J. A. Valdivia. Modeling traffic through a sequence of traffic lights. Physical Review E, 70(1):016107, 2004.
[93] C. Tomlin, G. J. Pappas, and S. Sastry. Conflict resolution for air traffic management: A study in multiagent hybrid systems. Automatic Control, IEEE Transactions on, 43(4):509–521, 1998.
[94] F. H. Tseng, T. T. Liang, C. H. Lee, L. Der Chou, and H. C. Chao. A star search algorithm for civil UAV path planning with 3G communication. In Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), 2014 Tenth International Conference on, pages 942–945. IEEE, 2014.
[95] K. Tumer and A. Agogino. Distributed Agent-Based Air Traffic Flow Management. In Proceedings of the Sixth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 330–337, Honolulu, HI, May 2007.
[96] K. Tumer and A. Agogino. Adaptive Management of Air Traffic Flow: A Multiagent Coordination Approach. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, Nectar Track, Chicago, IL, July 2008.
[97] K. Tumer and N. Khani. Learning from Actions Not Taken in Multiagent Systems. Advances in Complex Systems, 12(4-5):455–473, 2009.
[98] M. Valenti, B. Bethke, D. Dale, A. Frank, J. McGrew, S. Ahrens, J. P. How, and J. Vian. The MIT indoor multi-vehicle flight testbed. In Robotics and Automation, 2007 IEEE International Conference on, pages 2758–2759. IEEE, 2007.
[99] M. Vinyals, J. A. Rodriguez-Aguilar, and J. Cerquides. A survey on sensor networks from a multiagent perspective. The Computer Journal, page bxq018, 2010.
[100] H. Wang, J. Zhou, G. Zheng, and Y. Liang. HAS: Hierarchical A-star algorithm for big map navigation in special areas. In Digital Home (ICDH), 2014 5th International Conference on, pages 222–225. IEEE, 2014.
[101] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[102] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.
[103] S. Whiteson, P. Stone, K. O. Stanley, R. Miikkulainen, and N. Kohl. Automatic feature selection in neuroevolution. In Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, pages 1225–1232. ACM, 2005.
[104] P. Yan, M.-Y. Ding, and C.-P. Zhou. Game-theoretic route planning for team of UAVs. In Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, volume 2, pages 723–728. IEEE, 2004.
[105] K. Yang, S. Keat Gan, and S. Sukkarieh. A Gaussian process-based RRT planner for the exploration of an unknown and cluttered environment with a UAV. Advanced Robotics, 27(6):431–443, 2013.
[106] L. Yliniemi, A. K. Agogino, and K. Tumer. Evolutionary agent-based simulation of the introduction of new technologies in air traffic management. In Proceedings of the 2014 Conference on Genetic and Evolutionary Computation, pages 1215–1222. ACM, 2014.
[107] H. Yserentant. A new class of particle methods. Numerische Mathematik, 76(1):87–109, 1997.
[108] H. Yu, K. Meier, M. Argyle, and R. W. Beard. Cooperative path planning for target tracking in urban environments using unmanned air and ground vehicles. IEEE/ASME Transactions on Mechatronics, 20(2):541–552, 2015.
[109] M. Zhang, R. Batta, and R. Nagi. Modeling of Workflow Congestion and Optimization of Flow Routing in a Manufacturing/Warehouse Facility. Management Science, 55(2):267–280, Feb. 2009.
[110] M. Zhang, C. Su, Y. Liu, M. Hu, and Y. Zhu. Unmanned aerial vehicle route planning in the presence of a threat environment based on a virtual globe platform. ISPRS International Journal of Geo-Information, 5(10):184, 2016.
[111] C. Zheng, M. Ding, C. Zhou, and L. Li. Coevolving and cooperating path planner for multiple unmanned air vehicles. Engineering Applications of Artificial Intelligence, 17(8):887–896, 2004.
[112] E. Zitzler, M. Laumanns, L. Thiele, et al. SPEA2: Improving the strength Pareto evolutionary algorithm, 2001.