UNIVERSITY OF MASSACHUSETTS, AMHERST • Department of Computer Science

Optimal Fixed-Size Controllers for Decentralized POMDPs
Christopher Amato, Daniel S. Bernstein, Shlomo Zilberstein
University of Massachusetts Amherst
May 9, 2006
Overview
- DEC-POMDPs and their solutions
- Fixing memory with controllers
- Previous approaches
- Representing the optimal controller
- Some experimental results
DEC-POMDPs
Decentralized partially observable Markov decision process (DEC-POMDP)
- Multiagent sequential decision making under uncertainty
- At each stage, each agent receives:
  - A local observation rather than the actual state
  - A joint immediate reward
[Figure: interaction loop with the environment - agent 1 takes action a1 and receives observation o1, agent 2 takes action a2 and receives observation o2, and both receive the joint reward r]
DEC-POMDP definition
A two-agent DEC-POMDP can be defined with the tuple: M = 〈S, A1, A2, P, R, Ω1, Ω2, O〉
- S, a finite set of states with designated initial state distribution b0
- A1 and A2, each agent's finite set of actions
- P, the state transition model: P(s' | s, a1, a2)
- R, the reward model: R(s, a1, a2)
- Ω1 and Ω2, each agent's finite set of observations
- O, the observation model: O(o1, o2 | s', a1, a2)
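As a concrete reading of the tuple, the model components can be held in plain dictionaries. The following is a minimal sketch with a made-up two-state, two-action problem - all state, action, and observation names and all probabilities below are illustrative, not a domain from the talk:

```python
import itertools

# A two-agent DEC-POMDP sketch: dicts index the terms of the tuple
# M = <S, A1, A2, P, R, Omega1, Omega2, O>.
S = ["s0", "s1"]
A1, A2 = ["a", "b"], ["a", "b"]
O1, O2 = ["o0", "o1"], ["o0", "o1"]
b0 = {"s0": 1.0, "s1": 0.0}  # designated initial state distribution

def flip(s):
    """Helper for the toy dynamics: swap the two states."""
    return "s1" if s == "s0" else "s0"

# P[s][(a1, a2)] -> {s': prob}; the joint action ("a", "a") flips the state
P = {s: {(a1, a2): ({flip(s): 1.0} if (a1, a2) == ("a", "a") else {s: 1.0})
         for a1 in A1 for a2 in A2}
     for s in S}

# R[s][(a1, a2)] -> joint immediate reward (shared by both agents)
R = {s: {(a1, a2): (1.0 if s == "s0" and a1 == a2 == "a" else 0.0)
         for a1 in A1 for a2 in A2}
     for s in S}

# O[s'][(a1, a2)] -> {(o1, o2): prob}; here observations simply reveal s'
Obs = {s: {(a1, a2): {(("o0", "o0") if s == "s0" else ("o1", "o1")): 1.0}
           for a1 in A1 for a2 in A2}
       for s in S}

def check_model():
    """Sanity-check that b0 and every conditional distribution sum to 1."""
    assert abs(sum(b0.values()) - 1.0) < 1e-9
    for s in S:
        for joint in itertools.product(A1, A2):
            assert abs(sum(P[s][joint].values()) - 1.0) < 1e-9
            assert abs(sum(Obs[s][joint].values()) - 1.0) < 1e-9
    return True
```

`check_model()` verifies the normalization that any valid DEC-POMDP instance must satisfy.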
DEC-POMDP solutions
- A policy for each agent is a mapping from its observation sequences to actions, Ω* → A, allowing distributed execution
- A joint policy is a policy for each agent
- Goal is to maximize expected discounted reward over an infinite horizon
- Use a discount factor, γ, to calculate this
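The discounted objective can be made concrete in a few lines: the return of a reward stream r_0, r_1, ... is Σ_t γ^t r_t, and for a constant reward r this converges to r / (1 − γ). A small illustration (the function name is ours, not from the talk):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite prefix of a reward stream."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A constant reward of 1 per stage approaches 1 / (1 - 0.9) = 10
approx = discounted_return([1.0] * 500, 0.9)
```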
Example: Grid World
- States: grid cell pairs
- Actions: move up, down, left, right, or stay
- Observations: red lines (shown in the figure)
- Transitions: noisy
- Goal: share same square
Previous work
- Optimal algorithms
  - Very large space and time requirements
  - Can only solve small problems
- Approximation algorithms provide weak optimality guarantees, if any
Policies as controllers
- Finite state controller for each agent i
  - Fixed memory
  - Randomness used to offset memory limitations
  - Action selection, ψ : Qi → ΔAi
  - Transitions, η : Qi × Ai × Oi → ΔQi
- Value for a pair of nodes and a state is given by the Bellman equation:
V(q1, q2, s) = Σ_{a1, a2} P(a1 | q1) P(a2 | q2) [ R(s, a1, a2) +
    γ Σ_{s'} P(s' | s, a1, a2) Σ_{o1, o2} O(o1, o2 | s', a1, a2)
        Σ_{q1', q2'} P(q1' | q1, a1, o1) P(q2' | q2, a2, o2) V(q1', q2', s') ]

where the subscripts denote the agents and lowercase values are elements of the uppercase sets above
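For fixed controller parameters ψ and η, this Bellman equation is linear in V, so the joint value can be computed by iterating it to convergence. A minimal sketch under illustrative data structures - the single-state, one-node example at the bottom is ours, and its value should approach R / (1 − γ):

```python
def evaluate(nodes1, nodes2, states, psi1, psi2, eta1, eta2,
             P, R, O, gamma, iters=2000):
    """Iterate the Bellman equation for fixed controllers to convergence.

    psi_i[q]        = {a: P(a | q)}
    eta_i[q][a][o]  = {q2: P(q' | q, a, o)}
    P[s][(a1, a2)]  = {s2: transition prob}
    R[s][(a1, a2)]  = joint immediate reward
    O[s2][(a1, a2)] = {(o1, o2): observation prob}
    These container layouts are illustrative, not from the talk.
    """
    V = {(q1, q2, s): 0.0 for q1 in nodes1 for q2 in nodes2 for s in states}
    for _ in range(iters):
        newV = {}
        for (q1, q2, s) in V:
            total = 0.0
            for a1, p1 in psi1[q1].items():
                for a2, p2 in psi2[q2].items():
                    nxt = 0.0
                    for s2, ps in P[s][(a1, a2)].items():
                        for (o1, o2), po in O[s2][(a1, a2)].items():
                            for n1, pn1 in eta1[q1][a1][o1].items():
                                for n2, pn2 in eta2[q2][a2][o2].items():
                                    nxt += ps * po * pn1 * pn2 * V[(n1, n2, s2)]
                    total += p1 * p2 * (R[s][(a1, a2)] + gamma * nxt)
            newV[(q1, q2, s)] = total
        V = newV
    return V

# One state, one controller node per agent, constant reward 1:
states = ["s"]
P = {"s": {("a", "a"): {"s": 1.0}}}
R = {"s": {("a", "a"): 1.0}}
O = {"s": {("a", "a"): {("o", "o"): 1.0}}}
psi = {0: {"a": 1.0}}
eta = {0: {"a": {"o": {0: 1.0}}}}
V = evaluate([0], [0], states, psi, psi, eta, eta, P, R, O, gamma=0.9)
# V[(0, 0, "s")] converges to 1 / (1 - 0.9) = 10
```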
Controller example
Stochastic controller for a single agent
- 2 nodes, 2 actions, 2 observations
- Parameters: P(a | q) and P(q' | q, a, o)
[Figure: two-node controller diagram with action choices a1, a2 and observation-labeled transitions, with edge probabilities such as 0.5/0.5, 0.75/0.25, and 1.0]
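At execution time such a controller simply samples an action from P(a | q) and then, once the observation arrives, samples its next node from P(q' | q, a, o). A sketch with hypothetical parameters loosely patterned on the two-node figure (the exact edge probabilities here are illustrative, since the diagram's edge assignments are not fully recoverable):

```python
import random

class StochasticController:
    """Fixed-size stochastic finite-state controller for one agent."""

    def __init__(self, action_probs, trans_probs, start_node):
        self.psi = action_probs   # psi[q] = {a: P(a | q)}
        self.eta = trans_probs    # eta[(q, a, o)] = {q2: P(q' | q, a, o)}
        self.node = start_node
        self.last_action = None

    @staticmethod
    def _sample(dist, rng):
        """Draw one outcome from a {outcome: probability} dict."""
        r, acc = rng.random(), 0.0
        for outcome, p in dist.items():
            acc += p
            if r < acc:
                return outcome
        return outcome  # guard against floating-point rounding

    def act(self, rng=random):
        """Sample an action from P(a | q) at the current node."""
        self.last_action = self._sample(self.psi[self.node], rng)
        return self.last_action

    def observe(self, o, rng=random):
        """Sample the next memory node from P(q' | q, a, o)."""
        self.node = self._sample(self.eta[(self.node, self.last_action, o)], rng)

# Hypothetical two-node parameters: deterministic actions per node,
# stochastic transitions on some observations.
psi = {1: {"a1": 1.0}, 2: {"a2": 1.0}}
eta = {
    (1, "a1", "o1"): {2: 1.0},
    (1, "a1", "o2"): {1: 0.75, 2: 0.25},
    (2, "a2", "o1"): {1: 0.5, 2: 0.5},
    (2, "a2", "o2"): {2: 1.0},
}
ctrl = StochasticController(psi, eta, start_node=1)
```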
Optimal controllers
How do we set the parameters of the controllers?
- Deterministic controllers: traditional methods such as best-first search (Szer and Charpillet 05)
- Stochastic controllers: continuous optimization
Decentralized BPI
- Decentralized Bounded Policy Iteration (DEC-BPI) (Bernstein, Hansen and Zilberstein 05)
- Alternates between improvement and evaluation until convergence
- Improvement: for each node of each agent's controller, find a probability distribution over one-step lookahead values that is greater than the current node's value for all states and all controllers of the other agents
- Evaluation: finds the values of all nodes in all states
DEC-BPI linear program
For a given node q of agent i:
- Variables: ε, x(a) = P(a | q) and x(q', a, o) = P(a, q' | q, o)
- Objective: maximize ε
- Improvement constraints: ∀s ∈ S, q–i ∈ Q–i

  V(s, q⃗) + ε ≤ Σ_{a⃗} P(a⃗ | q⃗) [ R(s, a⃗) +
      γ Σ_{s', o⃗, q⃗'} P(s' | s, a⃗) O(o⃗ | s', a⃗) P(q⃗' | q⃗, a⃗, o⃗) V(s', q⃗') ]

- Probability constraints: ∀a ∈ A, o ∈ Ω: Σ_{q'} x(q', a, o) = x(a)
- Also, all probabilities must sum to 1 and be nonnegative
Problems with DEC-BPI
- Difficult to improve value for all states and other agents' controllers
- May require more nodes for a given start state
- Linear program (one-step lookahead) results in local optimality
- Correlation device can somewhat improve performance
Optimal controllers
- Use nonlinear programming (NLP)
- Consider node value as a variable
- Improvement and evaluation all in one step
- Add constraints to maintain valid values
NLP intuition
- Value variable allows improvement and evaluation at the same time (infinite lookahead)
- While the iterative process of DEC-BPI can "get stuck", the NLP defines the globally optimal solution
NLP representation
Variables: x(q⃗, a⃗) = P(a⃗ | q⃗), y(q⃗, a⃗, o⃗, q⃗') = P(q⃗' | q⃗, a⃗, o⃗), z(q⃗, s) = V(q⃗, s)
Objective: Maximize Σ_s b0(s) z(q⃗0, s)
Value constraints: ∀s ∈ S, q⃗ ∈ Q⃗

  z(q⃗, s) = Σ_{a⃗} x(q⃗, a⃗) [ R(s, a⃗) +
      γ Σ_{s'} P(s' | s, a⃗) Σ_{o⃗} O(o⃗ | s', a⃗) Σ_{q⃗'} y(q⃗, a⃗, o⃗, q⃗') z(q⃗', s') ]

- Linear constraints are needed to ensure controllers are independent
- Also, all probabilities must sum to 1 and be nonnegative
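A feasible point of this program satisfies each value constraint with equality, which gives a cheap numerical check: the residual z(q⃗, s) minus the right-hand side should be zero. A single-state sketch of that check (all names and the data layout are hypothetical):

```python
def value_constraint_residual(z, x, y, R, P, O, gamma,
                              states, joints, acts, obs):
    """Residual z(q, s) - RHS of the NLP value constraint for each (q, s).

    Illustrative encodings of the slide's variables:
      x[(q, a)]        ~ P(a | q)         (joint action probabilities)
      y[(q, a, o, q2)] ~ P(q' | q, a, o)  (joint controller transitions)
      z[(q, s)]        ~ V(q, s)          (value variables)
    At a feasible point of the NLP, every residual is zero.
    """
    res = {}
    for q in joints:
        for s in states:
            rhs = 0.0
            for a in acts:
                inner = R[(s, a)]
                for s2 in states:
                    for o in obs:
                        for q2 in joints:
                            inner += (gamma * P[(s, a, s2)] * O[(s2, a, o)]
                                      * y[(q, a, o, q2)] * z[(q2, s2)])
                rhs += x[(q, a)] * inner
            res[(q, s)] = z[(q, s)] - rhs
    return res

# Tiny self-consistent point: one state, one joint node, reward 1,
# gamma 0.9, so z = 1 / (1 - 0.9) = 10 satisfies the constraint exactly.
gamma = 0.9
x = {("q", "a"): 1.0}
y = {("q", "a", "o", "q"): 1.0}
z = {("q", "s"): 10.0}
R = {("s", "a"): 1.0}
P = {("s", "a", "s"): 1.0}
O = {("s", "a", "o"): 1.0}
res = value_constraint_residual(z, x, y, R, P, O, gamma,
                                ["s"], ["q"], ["a"], ["o"])
```

At z = 10 with R = 1 and γ = 0.9 the residual is exactly zero, matching the fixed point z = R / (1 − γ).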
Optimality
Theorem: An optimal solution of the NLP results in optimal stochastic controllers for the given size and initial state distribution.
Pros and cons of the NLP
Pros:
- Retains fixed memory and efficient policy representation
- Represents optimal policy for given size
- Takes advantage of known start state
Cons:
- Difficult to solve optimally
Experiments
- Nonlinear programming solvers (snopt and filter), based on sequential quadratic programming (SQP)
  - Guarantees a locally optimal solution
- Run on the NEOS server
- 10 random initial controllers for a range of sizes
- Compared the NLP with DEC-BPI, with and without a small correlation device
Results: Broadcast Channel
- Two agents share a broadcast channel (4 states, 5 obs, 2 acts)
- Very simple near-optimal policy
[Figure: mean quality of the NLP and DEC-BPI implementations]
Results: Recycling Robots
[Figure: mean quality of the NLP and DEC-BPI implementations on the recycling robot domain (4 states, 2 obs, 3 acts)]
Results: Grid World
[Figure: mean quality of the NLP and DEC-BPI implementations on the meeting-in-a-grid domain (16 states, 2 obs, 5 acts)]
Results: Running time
- Running time mostly comparable to DEC-BPI with a correlation device
- The increase as controller size grows is offset by better performance
[Figure: running times on the Broadcast, Recycle, and Grid domains]
Conclusion
- Defined the optimal fixed-size stochastic controller using NLP
- Showed consistent improvement over DEC-BPI with locally optimal solvers
- In general, the NLP may allow small optimal controllers to be found
- Also, may provide concise near-optimal approximations of large controllers
Future Work
- Explore more efficient NLP formulations
- Investigate more specialized solution techniques for the NLP formulation
- Greater experimentation and comparison with other methods