Control and Decision Making in Uncertain Multi-agent Hierarchical Systems: A Case Study in Learning and Approximate Dynamic Programming
PI Meeting August 1st, 2002
Shankar Sastry
University of California, Berkeley
2
Outline
Hierarchical architecture for multiagent operations
Confronting uncertainty
Partial observation Markov games (POMgame)
Model predictive techniques for dynamic replanning
3
Partial-observation Probabilistic Pursuit-Evasion Game (PEG) with 4 UGVs and 1 UAV
Fully autonomous operation
4
Uncertainty pervades every layer!
Hierarchy in Berkeley Platform
[Block diagram: the Strategy Planner and Map Builder exchange agent positions, detected obstacles, and detected targets over the Communications Network; the Strategy Planner sends desired agent actions to the Tactical Planner & Regulation layer (tactical planner → trajectory planner → regulation), which sends control signals to the UAV and UGV dynamics; vehicle-level sensor fusion combines INS, GPS, ultrasonic altimeter, vision, and actuator encoders to supply inertial positions, actuator positions, height over terrain, and detected obstacles/targets; targets, terrain, and exogenous disturbances act on the vehicle dynamics.]
6
Representing and Managing Uncertainty
Uncertainty is introduced through several channels:
– Sensing → unable to determine the current state of the world
– Prediction → unable to infer the future state of the world
– Actuation → unable to take the desired action to properly affect the state of the world
Different types of uncertainty can be addressed by different approaches:
– Nondeterministic uncertainty: robust control
– Probabilistic uncertainty: (partially observable) Markov decision processes
– Adversarial uncertainty: game theory
POMGAME
7
Markov Games
Framework for sequential multiagent interaction in a Markov environment
8
Policy for Markov Games
The policy of agent i at time t is a mapping from the current state to a probability distribution over its action set.
Agent i wants to maximize the expected discounted sum of rewards it gains by executing the optimal policy starting from that state.
Performance measure: V(s) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s ], where γ ∈ [0, 1) is the discount factor and r_t is the reward received at time t.
Every discounted Markov game has at least one stationary optimal policy, but not necessarily a deterministic one.
Special case: Markov decision processes (MDPs) – can be solved by dynamic programming
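The MDP special case can be solved by dynamic programming; a minimal value-iteration sketch on a made-up two-state, two-action MDP (the transition and reward numbers below are illustrative, not from the deck):

```python
import numpy as np

# Toy MDP: P[a, s, s'] transition probabilities, R[a, s] expected rewards.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.1, 0.9]],   # action 1
])
R = np.array([[0.0, 1.0], [0.5, 0.0]])
gamma = 0.9  # discount factor

V = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup: V(s) = max_a [ R(a,s) + gamma * sum_s' P(a,s,s') V(s') ]
    Q = R + gamma * P @ V          # Q[a, s]
    V_new = Q.max(axis=0)
    if np.abs(V_new - V).max() < 1e-10:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=0)  # greedy policy w.r.t. the converged values
```

The same backup does not directly apply to Markov games, where each state holds a matrix game between the agents rather than a simple max.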
9
Partial Observation Markov Games (POMGame)
10
Policy for POMGames
Agent i wants to receive at least its security level (the maximin value of the game).
Poorly understood: analysis exists only for very specially structured games, such as games with complete information on one side.
Special case: partially observable Markov decision processes (POMDPs)
23
Experimental Results: Pursuit-Evasion Games with 4 UGVs (Spring '01)
24
Experimental Results: Pursuit-Evasion Games with 4 UGVs and 1 UAV (Spring '01)
25
Pursuit-Evasion Game Experiment
PEG with four UGVs
• Global-max pursuit policy
• Simulated camera view (radius 7.5 m with 50° conic view)
• Pursuer 0.3 m/s, evader 0.5 m/s max
27
Experimental Results: Evaluation of Policies for different visibility
The global-max policy performs better than greedy, since the greedy policy selects movements based only on local considerations.
Both policies perform better with the trapezoidal view, since the camera rotates fast enough to compensate for the narrow field of view.
Capture time of the greedy and global-max policies for different pursuer visibility regions
3 pursuers with trapezoidal or omni-directional view; randomly moving evader
28
Experimental Results: Evader’s Speed vs. Intelligence
• A more intelligent evader increases the capture time
• It is harder to capture an intelligent evader at a higher speed
• The capture time of a fast random evader is shorter than that of a slower random evader, when the evader's speed is only slightly higher than the pursuers'
Capture time for different speeds and levels of intelligence of the evader
3 pursuers with trapezoidal view and global-max policy; max pursuer speed 0.3 m/s
29
Game-theoretic Policy Search Paradigm
Solving very small games with partial information, or games with full information, is sometimes computationally tractable.
Many interesting games, including pursuit-evasion, are large games with partial information, and finding optimal solutions is well outside the capability of current algorithms.
An approximate solution is not necessarily bad: there may be simple policies with satisfactory performance.
→ Choose a good policy from a restricted class of policies!
We can find approximately optimal solutions from restricted classes, using sparse sampling and a provably convergent policy search algorithm.
30
Constructing A Policy Class
Given a mission with specific goals, we
– decompose the problem in terms of the functions that need to be achieved for success and the means that are available
– analyze how a human team would solve the problem
– determine a list of important factors that complicate task performance, such as safety or physical constraints
Examples: maximize aerial coverage, stay within communications range, penalize actions that lead an agent into a danger zone, maximize the explored region, minimize fuel usage, …
31
Policy Representation
Quantize the above features and define a feature vector f(a, h) consisting of the estimates of the above quantities for each action a, given the agents' history h.
Estimate the 'goodness' of each action by constructing Q(a) = wᵀ f(a, h), where w is the weighting vector to be learned.
Choose the action that maximizes Q(a), or choose a randomized action according to a distribution that favors high-Q actions, whose spread sets the degree of exploration.
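A minimal sketch of this feature-weighted policy, assuming a softmax distribution whose temperature T is the exploration knob (the slide does not show the exact distribution; the feature values and weights below are invented for illustration):

```python
import numpy as np

def action_distribution(features, w, T=1.0):
    """features: (n_actions, n_features); w: (n_features,). Returns action probs."""
    q = features @ w                 # goodness Q(a) = w . f(a, h)
    z = (q - q.max()) / T            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical features per action: [coverage gain, comms-range ok, danger penalty]
features = np.array([
    [0.8, 1.0, -0.1],
    [0.3, 1.0,  0.0],
    [0.9, 0.0, -0.9],
])
w = np.array([1.0, 0.5, 1.0])        # weighting vector to be learned

p = action_distribution(features, w, T=0.5)   # randomized (exploring) policy
greedy = int((features @ w).argmax())         # deterministic greedy choice
```

As T → 0 the softmax concentrates on the greedy action; larger T explores more.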
32
Policy Search Paradigm
Searching for optimal policies is very difficult, even though there may be simple policies with satisfactory performance.
→ Choose a good policy from a restricted class of policies!
Policy Search Problem
33
PEGASUS (Ng & Jordan, 00)
Given a POMDP M:
Assuming a deterministic simulator, we can construct an equivalent POMDP M′ with deterministic transitions.
For each policy π ∈ Π for M we can construct an equivalent policy π′ ∈ Π′ for M′ such that they have the same value function, i.e. V(π) = V′(π′).
It therefore suffices to find a good policy for the transformed POMDP M′.
The value function can be approximated by a deterministic function: m_s samples are drawn once and reused to compute the value of each candidate policy. → We can then use standard optimization techniques to search for an approximately optimal policy.
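The core trick can be sketched on a toy problem: freeze the simulator's randomness once, so the Monte-Carlo value becomes a deterministic function of the policy parameter and candidate policies are compared on identical "scenarios". The 1-D dynamics, reward, and linear policy below are invented, not the deck's POMDP:

```python
import numpy as np

rng = np.random.default_rng(0)
m_s, horizon = 50, 20
# The "seeds": disturbances drawn once, then reused for every candidate policy.
noise = rng.normal(scale=0.1, size=(m_s, horizon))

def value(theta):
    """Deterministic Monte-Carlo value of the linear policy u = -theta * x."""
    total = 0.0
    for i in range(m_s):
        x = 1.0
        for t in range(horizon):
            u = -theta * x                  # candidate policy
            x = x + 0.5 * u + noise[i, t]   # fixed pre-drawn disturbance
            total += -x * x                 # reward: stay near the origin
    return total / m_s

# With the randomness frozen, evaluation is exactly repeatable, so standard
# optimization (here a coarse grid search) can compare policies fairly.
thetas = np.linspace(0.0, 2.0, 21)
best = thetas[np.argmax([value(th) for th in thetas])]
```

Without reusing the samples, each evaluation would be noisy and the search surface would jitter between calls.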
34
Performance Guarantee & Scalability
Theorem
We are guaranteed to obtain a policy whose value is close to the optimal value within the class.
35
Acting under Partial Observations
Computing the value function is very difficult under partial observations.
Naïve approaches for dealing with partial observations:
– State-free deterministic policy: a mapping from observation to action
  Ignores partial observability (i.e., treats observations as if they were the states of the environment).
  Finding an optimal mapping is NP-hard. Even the best such policy can perform very poorly or lead the agent into a trap.
– State-free stochastic policy: a mapping from observation to a probability distribution over actions
  Finding an optimal mapping is still NP-hard. Agents still cannot learn from rewards or penalties received in the past.
36
Example: Abstraction of Pursuit-Evasion Game
Consider a partial-observation stochastic pursuit-evasion game in a 2-D grid world, between (heterogeneous) teams of n_e evaders and n_p pursuers.
At each time t, each evader and pursuer
– takes an observation over its visibility region,
– updates its belief state, and
– chooses an action from its action set.
Goal: capture of the evader (for pursuers), or survival (for evaders)
37
Example: Policy Feature
Maximize collective aerial coverage → maximize the distance between agents, measured at the location each pursuer would reach by taking a given action from its current position.
Try to visit unexplored regions with a high possibility of detecting an evader, by moving toward the position along the frontier that maximizes the evader-map value.
38
Example: Policy Feature (Continued)
Prioritize actions that are more compatible with the dynamics of agents.
Policy representation
39
Benchmarking Experiments
Performance of two pursuit policies compared in terms of capture time.
Experiment 1: two pursuers against an evader that moves greedily with respect to the pursuers' locations.
Grid size   1-greedy pursuers   Optimized pursuers
10 by 10    (7.3, 4.8)          (5.1, 2.7)
20 by 20    (42.3, 19.2)        (12.3, 4.3)
Experiment 2: when the evader's position at each step is detected by the sensor network with only 10% accuracy, two optimized pursuers took 24.1 steps on average to capture the evader in a 30 by 30 grid, while the one-step greedy pursuers took over 146 steps.
40
Modeling RUAV Dynamics
State: position, spatial velocities, angles, angular rates.
Servo inputs: throttle, longitudinal flapping, lateral flapping, main rotor collective pitch, tail rotor collective pitch.
Modeling pipeline: aerodynamic analysis (body velocities, angular rates) → coordinate transformation → augmented servodynamics → tractable nonlinear model.
41
Benchmarking Trajectory
Example: a PD controller fails to achieve nose-in-circle type trajectories.
Nonlinear, coupled dynamics are intrinsic to pirouette and nose-in-circle trajectories.
42
Reinforcement Learning Policy Search Control Design
1. Aerodynamic/kinematic analysis generates a model to identify.
2. Locally weighted Bayesian regression is used for nonlinear stochastic identification: we obtain the posterior distribution of the parameters and can easily simulate the posterior predictive distribution to check fit and robustness.
3. A controller class is defined from the identification process and physical insight, and we apply the policy search algorithm.
4. We obtain approximately optimal controller parameters by reinforcement learning, i.e., training using the flight data and the reward function.
5. Considering controller performance over a confidence interval of the identification process, we measure the safety and robustness of the control system.
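The non-Bayesian core of the identification in step 2 can be sketched as plain locally weighted linear regression (the full method would also carry a posterior over the local parameters; the data and bandwidth below are invented stand-ins for flight data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(-2, 2, 80)
Y = np.sin(X) + 0.05 * rng.normal(size=X.size)   # stand-in "flight data"

def lwr_predict(x0, X, Y, tau=0.3):
    """Fit y = a + b*x with Gaussian weights centered at query point x0."""
    w = np.exp(-((X - x0) ** 2) / (2 * tau ** 2))   # locality weights
    A = np.stack([np.ones_like(X), X], axis=1)      # design matrix [1, x]
    W = np.diag(w)
    # Weighted least squares: (A^T W A) beta = A^T W y
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ Y)
    return beta[0] + beta[1] * x0

pred = lwr_predict(0.5, X, Y)
```

Each query refits a local linear model, so the global fit can be nonlinear while each local fit stays simple.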
43
Performance of RL Controller
Manual vs. Autonomous Hover; Ascent & 360° ×2 pirouette
44
Toughest Maneuvers for Rotorcraft UAVs
[Figure: example maneuvers: pirouette (heading kept the same during circling) and nose-in circling]
• Any variation of these maneuvers in the x–y direction
• Any combination of these maneuvers
45
Demo of RL controller doing acrobatic maneuvers (Spring 02)
46
More Acrobatic Maneuvers (Spring 02)
47
From PEG to More Realistic Battlefield Scenarios
Adversarial attack
– Reds do not just evade but also attack → Blues cannot blindly pursue reds.
Unknown number/capability of adversaries
→ dynamic selection of the relevant red model from unstructured observations
Deconfliction between layers and teams
Increased number of features
→ diversify possible solutions when uncertainty is high
48
Why General-sum Games?
"All too often in OR dealing with military problems, war is viewed as a zero-sum two-person game with perfect information. Here I must state as forcibly as I know that war is not a zero-sum two-person game with perfect information. Anybody who sincerely believes it is a fool. Anybody who reaches conclusions based on such an assumption and then tries to peddle these conclusions without revealing the quicksand they are constructed on is a charlatan....There is, in short, an urgent need to develop positive-sum game theory and to urge the acceptance of its precepts upon our leaders throughout the world."
Joseph H. Engel, Retiring Presidential Address to the Operations Research Society of America, October 1969
49
General-sum Games
Depending on the cooperation between the players:
– Noncooperative
– Cooperative
Depending on the least expected payoff a player is willing to accept:
– Nash's special/general bargaining solution
By restricting the blue and red policy classes to finite size, we reduce the POMGame to a bimatrix game.
50
From POMGame To Bimatrix Game
A bimatrix game usually has multiple Nash equilibria, with different values.
51
Elucidating Adversarial Intention
The model posterior distribution can be used to predict future observations, or to select the model.
The blue team can then employ the policy suited to the selected red model.
Example implemented: tracking an unknown number of evaders with unknown dynamics using noisy sensors
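A minimal sketch of selecting a red model by posterior probability: two hypothetical "red" motion models (slow vs. fast drift) are scored against a noisy observed track. Everything here, the models, noise level, and data, is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
true_drift, sigma = 0.8, 0.5
obs = np.cumsum(rng.normal(true_drift, sigma, size=30))  # observed red track

def log_likelihood(drift):
    # Per-step increments should be ~ N(drift, sigma) under a drift model.
    steps = np.diff(obs, prepend=0.0)
    return np.sum(-0.5 * ((steps - drift) / sigma) ** 2)

models = {"slow": 0.2, "fast": 0.8}          # candidate red models (drifts)
logp = {m: log_likelihood(d) for m, d in models.items()}  # flat prior

# Normalize (in log space first, for stability) to posterior probabilities.
mx = max(logp.values())
post = {m: np.exp(v - mx) for m, v in logp.items()}
Z = sum(post.values())
post = {m: p / Z for m, p in post.items()}
selected = max(post, key=post.get)
```

The posterior weights could also be kept intact (rather than taking the argmax) to hedge the blue policy across red models.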
52
Dynamic Bayesian Model Selection
• Dynamic Bayesian model selection (DBMS) is a generalized model-selection approach for time-series data in which the number of components can vary over time.
• If K is the number of components at any instant and T is the length of the time series, then there are O(2^{KT}) possible models, which demands an efficient algorithm.
• The problem is formulated using Bayesian hierarchical modeling and solved using suitably adapted reversible-jump MCMC methods.
53
DBMS
54
DBMS: Graphical Representation
– Dirichlet prior
A – transition matrix for m_t
t – Dirichlet prior
w_t – component weights
z_t – allocation variable
F – transition dynamics
55
DBMS
56
DBMS: Multi-target Tracking Example
57
[Figure: estimated target positions, true target trajectories (+), and observations]
58
[Figure: estimated target positions, true target trajectories (+), and observations]
59
Vision-based Landing of an Unmanned Aerial Vehicle
Berkeley Researchers: Rene Vidal, Omid Shakernia, Shankar Sastry
60
What we have accomplished
Real-time motion estimation algorithms– Algorithms: Linear & Nonlinear two-view, Multi-view
Fully autonomous vision-based control/landing
61
Image Processing
62
Vision Monitoring Station
63
Vision System Hardware
Ampro embedded Little Board PC
– Pentium 233 MHz running Linux
– Motion estimation, UAV high-level control
– Pan/tilt/zoom camera tracks target
Motion estimation algorithms
– Written in C++ using LAPACK
– Estimate relative position and orientation at 30 Hz
– Send control to the navigation computer at 10 Hz
UAV Pan/Tilt Camera Onboard Computer
64
Flight Control System Experiments
Position+Heading Lock (Dec 1999)
Position+Heading Lock (May 2000)
Landing scenario with SAS (Dec 1999)
Attitude control with mu-syn (July 2000)
65
Semi-autonomous Landing (8/01)
66
Autonomous Landing (3/02)
67
Autonomous Landing (3/02)
68
Multi-body Motion Estimation and Segmentation
Vidal, Soatto, Sastry
69
Multi-body Motion Estimation
Motivation
– Conflict detection + resolution + formation flight
– Target tracking
Given a set of image points and their optical flows, obtain:
– the number of independently moving objects
– segmentation: the object to which each point belongs
– motion: the rotation and translation of each object
– structure: the depth of each point
Previous work
– Orthographic-projection camera (Costeira-Kanade '95)
– Multiple points moving in straight lines (Shashua-Levin '01)
This work considers full perspective projection, with multiple objects undergoing general motion.
Motion is not fooled by camouflage, unlike other segmentation cues (texture, color, etc.).
70
Image Measurements
Form optical flow matrices (n = number of feature points, m = number of frames).
Optical flow measurements live in a six-dimensional space.
71
Factorization
For one object, the measurement matrix can be factorized into motion and structure components; one can then solve linearly for A and Z.
72
Multiple Moving Objects
For multiple independently moving objects
Obtain number of independent motions
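The idea that the number of independent motions shows up in the rank of a stacked measurement matrix can be illustrated on synthetic data: each rigid motion contributes a low-dimensional column subspace, so rank grows with the number of motions. The rank-3-per-object construction below is a made-up toy, not the paper's perspective-projection model:

```python
import numpy as np

rng = np.random.default_rng(3)

def object_tracks(n_points, n_frames, r=3):
    # Rank-r block: "motion" (2*n_frames x r) times "structure" (r x n_points).
    return rng.normal(size=(2 * n_frames, r)) @ rng.normal(size=(r, n_points))

# Stack measurements of two independently moving objects side by side.
W = np.hstack([object_tracks(20, 10), object_tracks(20, 10)])

# Numerical rank from the singular value spectrum.
s = np.linalg.svd(W, compute_uv=False)
rank = int(np.sum(s > 1e-8 * s[0]))
n_motions = rank // 3   # 3 dimensions per object in this toy construction
```

In practice the threshold on the singular values has to account for measurement noise, which blurs the rank gap.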
73
Segmentation of the image points
[Figure: two 20×20 segmentation matrices grouping the image points by the object they belong to]
74
Experimental Results
75
Experimental Results
77
A Roadmap for Cooperative Operation of Autonomous Vehicles
John Koo, Shannon Zelinski, Shankar Sastry
Department of EECS, UC Berkeley
78
Motivation
Multiple Autonomous Vehicle Applications
– Unmanned aerial vehicles performing missions collectively
– Satellites for distributed sensing
– Autonomous underwater vehicles performing exploration
– Autonomous cars forming platoons on roads
Enabling Technologies
– Hierarchical control of multi-agent systems
– Distributed sensing and actuation
– Computation
– Communication
– Embedded software
79
Formation Flight of Aerial Vehicles
Group Level– Formation Control– Conflict Resolution– Collision Avoidance
Vehicle Level– Vehicle Navigation– Envelope Protection
Design Challenges Different Levels of Centralization Multiple Modes of Operation Organization of Information Flow
80
Possible Formations for a UAV mission
Line Formation
Diamond Formation
Loose Formation
81
Components of Formation Flight
Formation Generation
– Generate a set of feasible formations, where each formation satisfies multiple constraints including vehicle dynamics, communication, and sensing capabilities.
Formation Initialization
– Given an initial and a final formation for a group of autonomous vehicles, the formation initialization problem is to generate collision-free, feasible trajectories and to derive control laws for the vehicles to track the given trajectories simultaneously in finite time.
Formation Control
– Formation control of multiple autonomous vehicles focuses on controlling the individual agents to keep them in formation, while satisfying their dynamic equations and inter-agent formation constraints, given the underlying communication protocol being deployed.
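A minimal formation-keeping sketch in the spirit of the formation-control component: followers steer toward the leader's position plus a fixed slot offset under a proportional law. The single-integrator dynamics, gains, and offsets are illustrative choices, not the deck's vehicle models:

```python
import numpy as np

offsets = np.array([[0.0, 0.0], [-2.0, -2.0], [2.0, -2.0]])  # formation slots
pos = np.array([[0.0, 0.0], [5.0, 1.0], [-4.0, 3.0]])        # initial positions
k, dt = 1.0, 0.1                                             # gain, time step

for _ in range(200):
    leader_vel = np.array([0.3, 0.0])   # leader flies straight ahead
    targets = pos[0] + offsets          # desired slot for each agent
    vel = k * (targets - pos)           # proportional formation law
    vel[0] = leader_vel                 # leader ignores the slots
    pos = pos + dt * vel                # single-integrator update

# Residual slot errors after the transient (followers lag a moving leader
# by roughly |leader_vel| / k under a pure proportional law).
errors = np.linalg.norm(pos - (pos[0] + offsets), axis=1)
```

A real design would add the inter-agent collision-avoidance and communication constraints the slide mentions.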
82
Components of Formation Flight
Formation Generation
– Generate a set of feasible formations, where each formation satisfies multiple constraints including vehicle dynamics, communication, and sensing capabilities.
Leader Trajectory Formation Constraints+ Dynamic Constraints
83
Components of Formation Flight
Formation Initialization
– Given an initial and a final formation for a group of autonomous vehicles, the formation initialization problem is to generate collision-free, feasible trajectories and to derive control laws for the vehicles to track the given trajectories simultaneously in finite time.
Line FormationDiamond Formation
84
Components of Formation Flight
Formation Control
– Formation control of multiple autonomous vehicles focuses on controlling the individual agents to keep them in formation, while satisfying their dynamic equations and inter-agent formation constraints, given the underlying communication protocol being deployed.
85
Formation Initialization
Virtual vehicles
Actual vehicles
86
Elements Of Formation Flight
Information Resources
– Wireless network
– Global Positioning System
– Inertial Navigation System
– Radar System (Local and Active)
– Vision System (Local and Passive)
87
Loose Formation Flight
GPS provides global positioning information to vehicles
Wireless network is used to distribute information between vehicles
Navigation computer on each vehicle calculates relative orientation, distance and velocities
GPS signals
Wireless Network
88
Tight Formation Flight
A vision system equipped with an omni-directional camera can track neighboring vehicles.
Structure-from-motion algorithms running on the vision system provide estimates of relative orientation, distance, and velocities to the navigation computer.
89
Hybrid Control Design for Formation Flight
– Construct a Formation Mode Graph by considering dynamic and formation constraints.
– For each formation, information about the formation is computed offline and stored in each node of the graph. Feasible transitions between formations are specified by edges.
– Given an initial formation, any feasible formation can be efficiently searched for on the graph.
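The Formation Mode Graph search can be sketched as breadth-first search over feasible transitions: nodes are formations, edges are the precomputed feasible changes. The three-formation graph below is a made-up example echoing the formations named earlier in the deck:

```python
from collections import deque

# Hypothetical feasible transitions between formation modes.
feasible = {
    "line":    ["loose"],
    "loose":   ["line", "diamond"],
    "diamond": ["loose"],
}

def formation_path(start, goal):
    """Shortest sequence of feasible formation transitions, or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in feasible[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = formation_path("line", "diamond")
```

Since the per-formation data is computed offline, the online cost is just this graph search.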
90
Back Up Slides
91
Deconfliction between Layers
Each UAV is given a waypoint by the high-level planner.
The shortest trajectories to the waypoints may lead to collisions.
How can we dynamically replan trajectories for the UAVs subject to input saturation and state constraints?
92
(Nonlinear) Model Predictive Control
Find the control input that minimizes a receding-horizon cost.
Common choice: a quadratic penalty on tracking error and control effort over the prediction horizon.
93
Planning of Feasible Trajectories
State saturation
Collision avoidance
The magnitude of each cost element represents the priority of tasks/functionality, or the authority of layers.
94
Hierarchy in Berkeley Platform
[Block diagram repeated from the "Hierarchy in Berkeley Platform" slide earlier in the deck.]
95
Cooperative Path Planning & Control
Example: three UAVs (H0, H1, H2) are given straight-line trajectories that will lead to collision.
Trajectories followed by the 3 UAVs; coordination is based on priority.
Constraints supported: |lin. vel.| < 16.7 ft/s, |angle| < π/6 rad, |control inputs| < 1.
NMPPC dynamically replans and tracks safe trajectories for H1 and H2 under input/state constraints.
96
Unifying Trajectory Generation and Tracking Control
Nonlinear Model Predictive Planning & Control (NMPPC) combines trajectory planning and control into a single problem, using ideas from
– potential-field-based navigation (real-time path planning)
– nonlinear model predictive control (optimal control of nonlinear multi-input, multi-output systems with input/state constraints)
We incorporate tracking performance, a potential function, and state constraints into the cost function to be minimized, and use gradient descent for on-line optimization.
Removes feasibility issues by accounting for the UAV dynamics during trajectory planning.
Robust to parameter uncertainties.
Optimization can be done in real time.
97
Modeling and Control of UAVs
A single, computationally tractable model cannot capture nonlinear UAV dynamics throughout the large flight envelope.
Real control systems are partially observed (noise, hidden variables).
It is impossible to have data for all parts of the high-dimensional state space.
→ Models and control algorithms must be robust to unmodeled dynamics and noise, and must handle MIMO nonlinearity.
Observation: Linear analysis and deterministic robust control techniques fail to do so.