
Agent Capability in Persistent Mission Planning using Approximate Dynamic Programming

Brett Bethke, Joshua Redding and Jonathan P. How
Matthew A. Vavrina and John Vian

Abstract— This paper presents an extension of our previous work on the persistent surveillance problem. An extended problem formulation incorporates real-time changes in agent capabilities as estimated by an onboard health monitoring system, in addition to the existing communication constraints, stochastic sensor failure and fuel flow models, and the basic constraints of providing surveillance coverage using a team of autonomous agents. An approximate policy for the persistent surveillance problem is computed using a parallel, distributed implementation of the approximate dynamic programming algorithm known as Bellman Residual Elimination. This paper also presents flight test results which demonstrate that this approximate policy correctly coordinates the team to simultaneously provide reliable surveillance coverage and a communications link for the duration of the mission, and appropriately retasks agents to maintain these services in the event of agent capability degradation.

I. INTRODUCTION

In the context of multiple coordinating agents, many mission scenarios of interest are inherently long-duration and require a high level of agent autonomy due to the expense and logistical complexity of direct human control over individual agents. Hence, autonomous mission planning and control for multi-agent systems is an active area of research [1], [2].

Long-duration missions are practical scenarios that clearly demonstrate the benefits of agent cooperation. However, such persistent missions can accelerate mechanical wear and tear on an agent's hardware platform, increasing the likelihood of related failures. Additionally, unpredictable failures, such as the loss of a critical sensor or damage sustained during the mission, may lead to sub-optimal mission performance. For these reasons, it is important that the planning system accounts for the possibility of such failures when devising a mission plan. In general, planning problems that coordinate the actions of multiple agents, each of which is subject to failures, are referred to as multi-agent health management problems [3], [4].

B. Bethke is the CTO of Feder Aerospace and a Ph.D. graduate of the MIT Aerospace Controls Lab (ACL)
J. Redding is a Ph.D. Candidate in the Aerospace Controls Lab, MIT
J. How is a Professor of Aeronautics and Astronautics, MIT
{bbethke,jredding,jhow}@mit.edu
M. Vavrina is a Research Engineer at Boeing R&T, Seattle, WA
J. Vian is a Technical Fellow at Boeing R&T, Seattle, WA
{matthew.a.vavrina,john.vian}@boeing.com

Fig. 1. Persistent surveillance mission with communication requirements

In general, there are two approaches to dealing with the uncertainties associated with possible agent failure modes [5]. The first is to construct a plan based on a deterministic model of nominal agent performance (in effect, this approach simply ignores the possibility of failures). Then, if failures do occur, a new plan is computed in response to the failure. Since the system does not anticipate the possibility of failures, but rather responds to them after they have already occurred, this approach is referred to as a reactive planner. In contrast, the second approach is to construct a plan based on a stochastic model which captures the inherent possibility of failures. Using a stochastic model increases the complexity of computing the plan, since it may be necessary to optimize a measure of expected performance, where the expectation is taken over all possible scenarios that might occur. However, since this proactive planning approach can anticipate the possibility of failure occurring in the future and take actions to mitigate the consequences, the resulting mission performance may be much better than the performance that can be achieved with a reactive planner. For example, if a UAV tracking a high-value target is known to have a high probability of failure in the future, a proactive planner may assign a backup UAV to also track the target, thus ensuring uninterrupted tracking if one of the vehicles were to fail. In contrast, a reactive planner would only have sent a replacement for an already-failed UAV, likely resulting in the loss of the tracked target, at least for some period of time.

Ref. [6] motivates the use of dynamic programming techniques for formulating and solving a persistent surveillance planning problem. In this problem, there is a group of UAVs equipped with cameras or other types of sensors. The UAVs are initially located at a base location, which is separated by a large distance from the surveillance location. The objective of the problem is to maintain a specified number of UAVs over the surveillance location at all times. Furthermore, the fuel usage dynamics of the UAVs are stochastic, requiring the use of health management techniques to achieve good surveillance coverage. In the approach taken in [6], the planning problem is formulated as a Markov Decision Process (MDP) [7] and solved using value iteration. The resulting optimal control policy for the persistent surveillance problem exhibits a number of desirable properties, including the ability to proactively recall vehicles to base with an extra, reserve quantity of fuel, resulting in fewer vehicle losses and a higher average mission performance. The persistent surveillance MDP formulation relies on an accurate estimate of the system model (i.e., the state transition probabilities) in order to guarantee optimal behavior [8], [9]. If the model used to solve the MDP differs significantly from the true system model, suboptimal performance may result when the control policy is implemented on the real system. To deal with this issue, [10] explored an adaptive framework for estimating the system model online using observations of the real system as it operates, and simultaneously re-solving the MDP using the updated model. Using this framework, improved performance was demonstrated over the standard approach of simply solving the MDP off-line, especially in cases where the initial model estimate is inaccurate and/or the true system model varies with time.

Previous formulations of the persistent mission problem required at least one UAV to loiter between the base and the surveillance area in order to provide a communications link between the base and the UAVs performing the surveillance. Also, a failure model of the sensors onboard each UAV allowed the planner to recognize that the UAVs' sensors may fail during the mission [5], [6], [10]. This research extends these previous formulations to represent a slightly more complex mission scenario and presents flight results of the whole system in operation in the Boeing vehicle swarm rapid prototyping testbed [11], [12]. The extended scenario connects an agent-level Health Monitoring and Diagnostics System (HMDS) to the planner so that agent capabilities, as degraded by failures such as those mentioned previously, can be estimated and accounted for. As in [5], the approximate dynamic programming technique called Bellman Residual Elimination (BRE) [13], [14] is implemented to compute an approximate control policy.

The remainder of the paper is outlined as follows: Section II details the formulation of the persistency planning problem with integrated estimates of agent capability. The health monitoring system is described in Section III, followed by flight results, which are presented and discussed in Section IV. Finally, conclusions and future work are offered in Section V.

II. PROBLEM FORMULATION

Given the qualitative description of the persistent surveillance problem, a precise MDP can now be formulated. An infinite-horizon, discounted MDP is specified by $(S, A, P, g)$, where $S$ is the state space, $A$ is the action space, $P_{ij}(u)$ gives the transition probability from state $i$ to state $j$ under action $u$, and $g(i, u)$ gives the cost of taking action $u$ in state $i$. We assume that the MDP model is known. Future costs are discounted by a factor $0 < \alpha < 1$. A policy of the MDP is denoted by $\mu : S \to A$. Given the MDP specification, the problem is to minimize the cost-to-go function $J_\mu$ over the set of admissible policies $\Pi$:

$$\min_{\mu \in \Pi} J_\mu(i_0) = \min_{\mu \in \Pi} E\left[ \sum_{k=0}^{\infty} \alpha^k g(i_k, \mu(i_k)) \right]. \quad (1)$$

For notational convenience, the cost and state transition functions for a fixed policy $\mu$ are defined as

$$g_i^\mu \equiv g(i, \mu(i)), \quad (2)$$
$$P_{ij}^\mu \equiv P_{ij}(\mu(i)), \quad (3)$$

respectively. The cost-to-go for a fixed policy $\mu$ satisfies the Bellman equation [15]

$$J_\mu(i) = g_i^\mu + \alpha \sum_{j \in S} P_{ij}^\mu J_\mu(j) \quad \forall i \in S, \quad (4)$$

which can also be expressed compactly as $J_\mu = T_\mu J_\mu$, where $T_\mu$ is the (fixed-policy) dynamic programming operator.
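For a small, enumerable state space, Eq. (4) is a linear system in $J_\mu$ and can be solved directly. The following is a minimal sketch of this fixed-policy evaluation step under that assumption; it is illustrative code, not the authors' implementation, and `P_mu`, `g_mu`, and `alpha` are taken as given.

```python
import numpy as np

def evaluate_policy(P_mu, g_mu, alpha):
    """Solve the fixed-policy Bellman equation J = g + alpha * P J (Eq. 4).

    P_mu  : (N, N) array with P_mu[i, j] = P_ij(mu(i)) for the fixed policy mu
    g_mu  : (N,) array with g_mu[i] = g(i, mu(i))
    alpha : discount factor, 0 < alpha < 1
    """
    N = len(g_mu)
    # (I - alpha * P_mu) J = g_mu has a unique solution because alpha < 1
    # makes the fixed-policy operator T_mu a contraction.
    return np.linalg.solve(np.eye(N) - alpha * P_mu, g_mu)
```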

A. Persistent Mission Problem

This section gives the specific formulation of the persistent mission MDP, from the formulation of the problem state space in Section II-A.1 to the cost function in Section II-A.4.

1) State Space $S$: The state of each UAV is given by three scalar variables describing the vehicle's flight status, fuel remaining, and sensor status. The flight status $y_i$ describes the UAV location,

$$y_i \in \{Y_b, Y_0, Y_1, \ldots, Y_s, Y_c\}$$

where $Y_b$ is the base location, $Y_s$ is the surveillance location, $\{Y_0, Y_1, \ldots, Y_{s-1}\}$ are transition states between the base and surveillance locations (capturing the fact that it takes finite time to fly between the two locations), and $Y_c$ is a special state denoting that the vehicle has crashed.

Similarly, the fuel state $f_i$ is described by a discrete set of possible fuel quantities,

$$f_i \in \{0, \Delta f, 2\Delta f, \ldots, F_{max} - \Delta f, F_{max}\}$$

where $\Delta f$ is an appropriate discrete fuel quantity.

The sensor state $s_i$ is described by a discrete set of sensor attributes, $s_i \in \{0, 1, 2\}$, representing each sensor as functional, impaired, and disabled, respectively.

The total system state vector $\mathbf{x}$ is thus given by the states $y_i$, $f_i$ and $s_i$ for each UAV, along with $r$, the number of requested vehicles:

$$\mathbf{x} = (y_1, y_2, \ldots, y_n;\ f_1, f_2, \ldots, f_n;\ s_1, s_2, \ldots, s_n;\ r)^T$$
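To make this state layout concrete, the following sketch encodes each UAV's $(y_i, f_i, s_i)$ tuple and the request count $r$. The class and field names are hypothetical, introduced here only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Flight-status labels; the transition states Y0, ..., Y_{s-1} are plain ints.
Y_BASE, Y_SURVEIL, Y_CRASHED = "Yb", "Ys", "Yc"

@dataclass(frozen=True)
class UAVState:
    y: Union[str, int]  # flight status: Y_BASE, Y_SURVEIL, Y_CRASHED, or 0..s-1
    f: int              # fuel remaining, in units of delta_f
    s: int              # sensor status: 0 functional, 1 impaired, 2 disabled

@dataclass(frozen=True)
class SystemState:
    uavs: Tuple[UAVState, ...]  # one entry per vehicle
    r: int                      # number of vehicles requested on surveillance
```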

2) Control Space $A$: The controls $u_i$ available for the $i$th UAV depend on the UAV's current flight status $y_i$.

• If $y_i \in \{Y_0, \ldots, Y_{s-1}\}$, then the vehicle is in the transition area and may either move away from base or toward base: $u_i \in \{\text{“+”}, \text{“−”}\}$

• If $y_i = Y_c$, then the vehicle has crashed and no action for that vehicle can be taken: $u_i = \emptyset$
• If $y_i = Y_b$, then the vehicle is at base and may either take off or remain at base: $u_i \in \{\text{“take off”}, \text{“remain at base”}\}$
• If $y_i = Y_s$, then the vehicle is at the surveillance location and may loiter there or move toward base: $u_i \in \{\text{“loiter”}, \text{“−”}\}$

Fig. 2. System architecture with n agents

The full control vector $\mathbf{u}$ is thus given by the controls for each UAV:

$$\mathbf{u} = (u_1, \ldots, u_n)^T \quad (5)$$
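A direct transcription of these per-vehicle rules into an action-enumeration helper might look like the sketch below. The function name is hypothetical and the code reuses the illustrative state labels defined above; `s_loc` denotes the index of the surveillance location $Y_s$.

```python
def admissible_actions(y, s_loc):
    """Return the admissible controls u_i for one UAV with flight status y."""
    if y == Y_CRASHED:
        return []                          # crashed: no action can be taken
    if y == Y_BASE:
        return ["take off", "remain at base"]
    if y == Y_SURVEIL:
        return ["loiter", "-"]             # loiter, or move one unit toward base
    # Otherwise y is an int in 0..s_loc-1: the transition area.
    return ["+", "-"]                      # away from base / toward base
```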

3) State Transition Model $P$: The state transition model $P$ captures the qualitative description of the dynamics given at the start of this section. The model can be partitioned into dynamics for each individual UAV.

The dynamics for the flight status $y_i$ are described by the following rules:

• If $y_i \in \{Y_0, \ldots, Y_{s-1}\}$, then the UAV moves one unit away from or toward base as specified by the action $u_i \in \{\text{“+”}, \text{“−”}\}$.
• If $y_i = Y_c$, then the vehicle has crashed and remains in the crashed state forever afterward.
• If $y_i = Y_b$, then the UAV remains at the base location if the action “remain at base” is selected. If the action “take off” is selected, it moves to state $Y_0$.
• If $y_i = Y_s$, then if the action “loiter” is selected, the UAV remains at the surveillance location. Otherwise, if the action “−” is selected, it moves one unit toward base.
• If at any time the UAV's fuel level $f_i$ reaches zero, the UAV transitions to the crashed state ($y_i = Y_c$).

The dynamics for the fuel state $f_i$ are described by the following rules:

• If $y_i = Y_b$, then $f_i$ increases at the rate $F_{refuel}$ (the vehicle refuels).
• If $y_i = Y_c$, then the fuel state remains the same (the vehicle is crashed).
• Otherwise, the vehicle is in a flying state and burns fuel at a stochastically modeled rate: $f_i$ decreases by $F_{burn}$ with probability $p_{nom}$ and decreases by $2F_{burn}$ with probability $(1 - p_{nom})$.
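A minimal sketch of the stochastic fuel-burn rule follows; the function name and arguments are illustrative, and the paper does not prescribe numeric values for $F_{burn}$, $F_{refuel}$, or $p_{nom}$.

```python
import random

def next_fuel(y, f, f_max, f_burn, f_refuel, p_nom):
    """Sample the next fuel level for one UAV under the rules above."""
    if y == Y_BASE:
        return min(f + f_refuel, f_max)    # refueling at base, capped at F_max
    if y == Y_CRASHED:
        return f                           # crashed: fuel state is frozen
    # Flying: burn F_burn with probability p_nom, 2*F_burn otherwise.
    burn = f_burn if random.random() < p_nom else 2 * f_burn
    return max(f - burn, 0)                # reaching 0 forces y -> Y_c
```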

4) Cost Function $g$: The cost function $g(\mathbf{x}, \mathbf{u})$ penalizes several undesirable outcomes in the persistent surveillance mission. First, any gaps in surveillance coverage (i.e., times when fewer vehicles are in the surveillance area than were requested) are penalized with a high cost. Second, a small penalty is incurred for vehicles in the surveillance area with inadequate sensor capability. An additional small cost is associated with each unit of fuel used, which prevents the system from simply launching every UAV on hand; this approach would certainly result in good surveillance coverage but is undesirable from an efficiency standpoint. Finally, a high cost is associated with any vehicle crashes. The cost function can be expressed as

$$g(\mathbf{x}, \mathbf{u}) = C_{loc}\,\delta_n + C_{ds}\,n_{ds}(\mathbf{x}) + C_c\,n_c(\mathbf{x}) + C_f\,n_f(\mathbf{x})$$

where:

• $\delta_n = \max\{0, (r - n_s(\mathbf{x}))\}$,
• $n_s(\mathbf{x})$: number of UAVs in the surveillance area in state $\mathbf{x}$,
• $n_{ds}(\mathbf{x})$: number of UAVs in the surveillance area with a degraded sensor in state $\mathbf{x}$,
• $n_c(\mathbf{x})$: number of crashed UAVs in state $\mathbf{x}$,
• $n_f(\mathbf{x})$: total number of fuel units burned in state $\mathbf{x}$,

and $C_{loc}$, $C_{ds}$, $C_c$, and $C_f$ are respectively the relative costs of loss-of-coverage events, vehicles in the surveillance location with a degraded sensor, crashes, and fuel usage.
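The cost function translates directly into code. The sketch below defines the counting helpers against the illustrative `SystemState` from Section II-A.1 and leaves the relative weights as parameters; all names are hypothetical.

```python
def n_s(x):
    """Number of UAVs currently in the surveillance area."""
    return sum(1 for v in x.uavs if v.y == Y_SURVEIL)

def n_ds(x):
    """UAVs in the surveillance area whose sensor is impaired or disabled."""
    return sum(1 for v in x.uavs if v.y == Y_SURVEIL and v.s > 0)

def n_c(x):
    """Number of crashed UAVs."""
    return sum(1 for v in x.uavs if v.y == Y_CRASHED)

def stage_cost(x, n_fuel, C_loc, C_ds, C_c, C_f):
    """Evaluate g(x, u); n_fuel is the number of fuel units burned this step."""
    delta_n = max(0, x.r - n_s(x))    # shortfall in surveillance coverage
    return (C_loc * delta_n           # loss of coverage (dominant penalty)
            + C_ds * n_ds(x)          # degraded sensor on station
            + C_c * n_c(x)            # vehicle crashes
            + C_f * n_fuel)           # fuel usage
```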

5) Approximate Solutions to Dynamic Programming: Bellman Residual Elimination (BRE) is an approximate policy iteration technique that is closely related to the class of Bellman residual methods [16]–[18]. Bellman residual methods attempt to perform policy evaluation by minimizing an objective function of the form

$$\left( \sum_{i \in \tilde{S}} \left| \tilde{J}_\mu(i) - T_\mu \tilde{J}_\mu(i) \right|^2 \right)^{1/2} \quad (6)$$

where $\tilde{J}_\mu$ is an approximation to the true cost function $J_\mu$ and $\tilde{S} \subset S$ is a set of representative sample states. BRE uses a flexible kernel-based cost approximation architecture to construct $\tilde{J}_\mu$ such that the objective function given by Eq. (6) is identically zero. Details outlining the implementation of the BRE algorithm can be found in [5]. As implemented, the BRE solver runs continuously, being updated with agent states in real-time and queried for a policy as needed.
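For concreteness, the quantity in Eq. (6) can be evaluated over the sample states as in the sketch below. This is a generic Bellman residual computation, not the kernel-based BRE machinery of [5], [13], [14]; all names are illustrative.

```python
import math

def bellman_residual(J_tilde, samples, mu, P, g, alpha):
    """Root-sum-square Bellman residual of Eq. (6) over sample states.

    J_tilde : dict mapping state -> approximate cost-to-go
    samples : iterable of sample states (the set S-tilde)
    mu      : fixed policy, state -> action
    P       : P[i][u] -> list of (j, prob) successor pairs
    g       : stage cost function g(i, u)
    alpha   : discount factor
    """
    total = 0.0
    for i in samples:
        u = mu(i)
        # Apply the fixed-policy operator T_mu at state i.
        TJ = g(i, u) + alpha * sum(p * J_tilde[j] for j, p in P[i][u])
        total += (J_tilde[i] - TJ) ** 2
    return math.sqrt(total)
```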

III. AGENT HEALTH MONITORING & CAPABILITY ASSESSMENT

An onboard health monitoring system collects local data and provides some measure of agent health. The contents of this measure can vary greatly, depending on the application domain. In this research, the onboard health monitor collects available state information and provides a metric indicating the health, or the deviation of actual from nominal efficiency, of each actuator, including onboard sensors. This provides key insight into the capability of the mobile agent and its ability to perform its advertised range of tasks.

A. Capability Assessment

The information provided by a health monitor can be very detailed. However, for the sake of communication and planner efficiency, it is desirable to abstract the detailed information down to a sufficient statistic or minimal covering set of parameters and pass these up from the agent- to the system-level. For example, it is not difficult to imagine that a quadrotor vehicle with a failing left/right motor will not be able to perform intense rolling or yawing maneuvers nominally well. The question then becomes: what level of detail relating to agent health do we wish to pass along to the planner? The abstraction process used in this research represents an obvious first choice: a single scalar value, roughly encapsulating the vehicle capability with respect to nominal actuator performance.

In this formulation of the agent capability assessment module, three possible capability levels are defined based on observed quadrotor performance with a single degraded motor. A quadrotor flying with motors/sensors operating nominally is assigned a capability value of 0 and can execute any tracking or surveillance task or serve as a communication relay. A vehicle with a moderately degraded motor/sensor has a capability value of 1 and is deemed incapable of performing tracking or surveillance, but can be allocated as a communication relay. Lastly, a capability value of 2 is given to any vehicle with severe motor/sensor failures such that it cannot perform any task and must return immediately to base for repairs. The approximate control policy is then computed accounting for this assessment of vehicle capability.
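One way to realize this abstraction is as a simple thresholding of the health monitor's per-actuator efficiency metric. The sketch below is purely illustrative; the paper does not specify thresholds, so the cutoff values here are placeholders.

```python
def capability_level(actuator_efficiencies, impaired=0.8, disabled=0.5):
    """Map per-actuator efficiencies (1.0 = nominal) to a capability level.

    The 0.8 / 0.5 cutoffs are illustrative placeholders, not paper values.
    """
    worst = min(actuator_efficiencies)   # capability limited by worst actuator
    if worst >= impaired:
        return 0   # fully capable: tracking, surveillance, or comm relay
    if worst >= disabled:
        return 1   # degraded: communication relay only
    return 2       # severe: no tasks; return immediately to base for repairs
```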

IV. FLIGHT RESULTS

A. Experimental Setup

Boeing Research and Technology has developed the Vehicle Swarm Technology Laboratory (VSTL), an environment for testing a variety of vehicles in an indoor, controlled environment [11]. VSTL is capable of simultaneously supporting a large number of both air and ground vehicles, thus providing a significant advantage over traditional flight test methods in terms of flight hours logged.

[Figure: block diagram of the VSTL — cameras, position reference system, processing, ground computers, command and control, vehicles]
Fig. 3. Boeing Vehicle Swarm Technology Laboratory [11]

[Figure: mission layout in the X–Y plane (meters), showing base location Yb, transition point Y0, surveillance region Ys, Agents 1–6 at base, and three threats in the surveillance region]
Fig. 4. Mission setup with six autonomous agents and three unknown threats

As seen in Figure 3, the primary components of the VSTL are: 1) a camera-based motion capture system for reference positions, velocities, attitudes, and attitude rates; 2) a cluster of off-board computers for processing the reference data and calculating control inputs; and 3) operator interface software for providing high-level commands to individual and/or teams of agents. These components are networked within a systematic, modular architecture to support rapid development and prototyping of multi-agent algorithms [11].

B. Mission Scenario

The mission scenario implemented is a persistent surveillance mission with static search and dynamic track elements. We employ a heterogeneous team of six agents that begin at some base location $Y_b$ and are tasked to persistently search the surveillance area. As threats are discovered via search, additional agents are called out to provide persistent tracking of the dynamic threats. Figure 4 shows the initial layout of the mission. Agents 1–6 are shown in the base location $Y_b$, while threats are located in the surveillance region, $Y_s$.

C. Results

In this section, we present the results of flight tests of the multi-agent persistent mission planner in both the Boeing vehicle swarm rapid prototyping testbed [11] and the MIT RAVEN facility [19]. Flight tests are divided into two scenarios: with and without induced failures.

Fig. 5. Persistent search and track flight results with six agents and two dynamic threats. Agents rotate in a fuel-optimal manner (given stochastic models) using a BRE-generated policy to keep a sufficient number of agents in the surveillance area to (i) track any known threats and (ii) continually search the surveillance area. Vertical dash-dot lines indicate threat detection events.

Fig. 6. Persistent search and track flight results with six agents and two dynamic threats. In addition to threat detection events, induced failures and associated recoveries are also indicated with annotated vertical dash-dot lines.

Figure 5 presents the results for the no-failure scenario, where a persistent surveillance mission is initiated with known agent capabilities and two unknown, dynamic threats. Figure 5 shows the flight location of each of the agents as well as the persistent coverage of the surveillance and communication regions. In this case, two of the threats were detected early, around 20 and 50 time steps into the mission, by UAV1. These detection events cause UGV1 and UAV2 to be called out to the surveillance region. At around 60 steps, UAV3 launches from base to relay communications and ultimately take the place of UAV1 before it runs out of fuel. Similarly, UAV4 is proactively launched at roughly 110 steps to replace UAV2. In addition to aerial agents, the ground vehicles also swap in a fuel-optimal manner given the stochastic fuel model detailed in Section II-A.3. It is important to note that “track” tasks are generated by the discovery of threats, whereas the search task is continuous. It can be seen in Figure 5 that the surveillance region is persistently surveyed over the duration of the mission. Also of note in Figure 5, the annotated vertical black dash-dot lines indicate important events during the flight test, such as threat discovery.

Fig. 7. Comparison of MDP objective function throughout the flight experiment, reconstructed from logged data. The dashed line shows this value averaged over several test flights, while the solid line is for a single no-failure scenario.

Figure 6 presents the results of a second scenario where a similar mission is initiated with initially known agent capabilities and unknown, dynamic threats. In this case, however, sensor degradations are induced at the agent-level such that the onboard health monitoring module detects a decrease in performance (reactive) and begins analyzing the extent of the “failure”. A new measure for the capability of the agent is then estimated and sent to the planner so it can be factored into future plans (proactive). In each instance of sensor degradation, the planner receives this capability measure and initiates plans to swap the degraded vehicle with one that is fully-functional. Again, the annotated vertical black dash-dot lines indicate the occurrence of important events during the mission, such as sensor failure and recovery events.

Figure 7 compares the cost function $g$ associated with the MDP planner, as formulated in Section II-A.4, throughout the test flights. The test missions including sensor degradations were averaged to account for inherent stochasticity in the induced failure times. As expected, introducing such events causes an increase in cost, but only slightly over the nominal, no-failure case. This slight increase is due to the additional fuel spent swapping the incapable vehicles out for capable ones, as well as the additional cost incurred by the degraded sensor in the surveillance region.

V. CONCLUSIONS

This paper has presented an extended formulation of the persistent surveillance problem, which incorporates communication constraints, a stochastic sensor failure model, stochastic fuel flow dynamics, agent capability constraints, and the basic constraints of providing surveillance coverage using a team of unmanned vehicles. A parallel implementation of the BRE approximate dynamic programming algorithm was used to approximate a policy. An analysis of this policy via actual flight results indicates that it correctly coordinates the actions of the team of UAVs to simultaneously provide surveillance coverage and communications over the course of the mission, even in the event of sensor failures and reduced agent capabilities.

ACKNOWLEDGMENTS

Research supported by Boeing Research & Technology, Seattle; AFOSR FA9550-08-1-0086; and the Hertz Foundation.

REFERENCES

[1] R. Murray, “Recent research in cooperative control of multi-vehicle systems,” ASME Journal of Dynamic Systems, Measurement, and Control, 2007.

[2] M. J. Hirsch, P. M. Pardalos, R. Murphey, and D. Grundel, Eds., Advances in Cooperative Control and Optimization. Proceedings of the 7th International Conference on Cooperative Control and Optimization, ser. Lecture Notes in Control and Information Sciences. Springer, Nov 2007, vol. 369.

[3] M. Valenti, B. Bethke, J. How, D. Pucci de Farias, and J. Vian, “Embedding Health Management into Mission Tasking for UAV Teams,” in American Control Conference, New York, NY, June 2007.

[4] M. Valenti, D. Dale, J. How, and J. Vian, “Mission Health Management for 24/7 Persistent Surveillance Operations,” in AIAA Guidance, Navigation, and Control Conference and Exhibit, Myrtle Beach, SC, August 2007.

[5] B. Bethke, J. P. How, and J. Vian, “Multi-UAV Persistent Surveillance With Communication Constraints and Health Management,” in AIAA Guidance, Navigation, and Control Conference, August 2009.

[6] B. Bethke, J. P. How, and J. Vian, “Group Health Management of UAV Teams With Applications to Persistent Surveillance,” in American Control Conference, June 2008, pp. 3145–3150.

[7] M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.

[8] G. Iyengar, “Robust Dynamic Programming,” Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005.

[9] A. Nilim and L. El Ghaoui, “Robust Solutions to Markov Decision Problems with Uncertain Transition Matrices,” Operations Research, vol. 53, no. 5, 2005.

[10] B. Bethke, L. F. Bertuccelli, and J. P. How, “Experimental Demonstration of Adaptive MDP-Based Planning with Model Uncertainty,” in AIAA Guidance, Navigation, and Control Conference, Honolulu, HI, 2008.

[11] E. Saad, J. Vian, G. J. Clark, and S. Bieniawski, “Vehicle Swarm Rapid Prototyping Testbed,” in AIAA Infotech@Aerospace, Seattle, WA, 2009.

[12] S. R. Bieniawski, P. Pigg, J. Vian, B. Bethke, and J. How, “Exploring Health-Enabled Mission Concepts in the Vehicle Swarm Technology Laboratory,” in AIAA Infotech@Aerospace, Seattle, WA, 2009.

[13] B. Bethke, J. How, and A. Ozdaglar, “Approximate Dynamic Programming Using Support Vector Regression,” in IEEE Conference on Decision and Control, Cancun, Mexico, 2008.

[14] B. Bethke and J. How, “Approximate Dynamic Programming Using Bellman Residual Elimination and Gaussian Process Regression,” in Proceedings of the American Control Conference, St. Louis, MO, 2009.

[15] D. Bertsekas, Dynamic Programming and Optimal Control. Belmont, MA: Athena Scientific, 2007.

[16] P. Schweitzer and A. Seidman, “Generalized polynomial approximation in Markovian decision processes,” Journal of Mathematical Analysis and Applications, vol. 110, pp. 568–582, 1985.

[17] L. C. Baird, “Residual algorithms: Reinforcement learning with function approximation,” in ICML, 1995, pp. 30–37.

[18] R. Munos and C. Szepesvári, “Finite-time bounds for fitted value iteration,” Journal of Machine Learning Research, vol. 1, pp. 815–857, 2008.

[19] M. Valenti, B. Bethke, G. Fiore, J. How, and E. Feron, “Indoor Multi-Vehicle Flight Testbed for Fault Detection, Isolation, and Recovery,” in Proceedings of the AIAA Guidance, Navigation, and Control Conference and Exhibit, Keystone, CO, August 2006.