Optimal Fixed-Size Controllers for Decentralized POMDPs

Christopher Amato, Daniel S. Bernstein, Shlomo Zilberstein
University of Massachusetts Amherst • Department of Computer Science
May 9, 2006




  • Overview

    DEC-POMDPs and their solutions
    Fixing memory with controllers
    Previous approaches
    Representing the optimal controller
    Some experimental results

  • DEC-POMDPs

    Decentralized partially observable Markov decision process (DEC-POMDP)
    Multiagent sequential decision making under uncertainty
    At each stage, each agent receives:
      A local observation rather than the actual state
      A joint immediate reward

    [Figure: each agent i selects an action ai and receives a local observation oi from the environment, while a single joint reward r is generated]

  • DEC-POMDP definition

    A two-agent DEC-POMDP can be defined with the tuple M = 〈S, A1, A2, P, R, Ω1, Ω2, O〉:
      S, a finite set of states with designated initial state distribution b0
      A1 and A2, each agent's finite set of actions
      P, the state transition model: P(s' | s, a1, a2)
      R, the reward model: R(s, a1, a2)
      Ω1 and Ω2, each agent's finite set of observations
      O, the observation model: O(o1, o2 | s', a1, a2)
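A minimal sketch of how this tuple might be stored for computation (the array shapes and names are my own assumptions, not anything from the talk):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DecPOMDP:
    """Two-agent DEC-POMDP M = <S, A1, A2, P, R, Omega1, Omega2, O>."""
    b0: np.ndarray      # initial state distribution, shape (S,)
    P: np.ndarray       # transition model P(s' | s, a1, a2), shape (S, A1, A2, S)
    R: np.ndarray       # reward model R(s, a1, a2), shape (S, A1, A2)
    O: np.ndarray       # observation model O(o1, o2 | s', a1, a2), shape (S, A1, A2, O1, O2)
    gamma: float = 0.9  # discount factor for the infinite-horizon objective

    def sizes(self):
        """Return (|S|, |A1|, |A2|, |Omega1|, |Omega2|) inferred from the arrays."""
        S, A1, A2, _ = self.P.shape
        O1, O2 = self.O.shape[3], self.O.shape[4]
        return S, A1, A2, O1, O2
```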

  • DEC-POMDP solutions

    A policy for each agent is a mapping from its observation sequences to actions, Ω* → A, allowing distributed execution
    A joint policy is a policy for each agent
    The goal is to maximize expected discounted reward over an infinite horizon
    A discount factor, γ, is used to calculate this (the objective is written out below)
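Written out, the criterion the joint policy π = (π1, π2) maximizes is the standard infinite-horizon discounted expected reward under the initial state distribution b0:

```latex
V^{\pi}(b_0) \;=\; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t^{1}, a_t^{2}) \;\middle|\; b_0, \pi_1, \pi_2 \right]
```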

  • Example: Grid World

    States: grid cell pairs
    Actions: move up, down, left, or right, or stay
    Transitions: noisy
    Observations: the red lines shown on the grid
    Goal: share the same square

  • Previous work

    Optimal algorithms:
      Very large space and time requirements
      Can only solve small problems
    Approximation algorithms provide weak optimality guarantees, if any

  • Policies as controllers

    Finite state controller for each agent i
      Fixed memory
      Randomness used to offset memory limitations
      Action selection, ψ : Qi → ΔAi
      Transitions, η : Qi × Ai × Oi → ΔQi

    The value for a pair of nodes at a state is given by the Bellman equation:

    V(q_1, q_2, s) = \sum_{a_1, a_2} P(a_1 \mid q_1)\, P(a_2 \mid q_2) \Big[ R(s, a_1, a_2)
      + \gamma \sum_{s'} P(s' \mid s, a_1, a_2) \sum_{o_1, o_2} O(o_1, o_2 \mid s', a_1, a_2)
        \sum_{q_1', q_2'} P(q_1' \mid q_1, a_1, o_1)\, P(q_2' \mid q_2, a_2, o_2)\, V(q_1', q_2', s') \Big]

    where the subscript denotes the agent and lowercase values are elements of the uppercase sets above
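Because the controllers are fixed, this Bellman equation is linear in V, so evaluation amounts to solving one linear system over all (q1, q2, s) triples. A minimal sketch of that evaluation step, using the assumed array shapes from the DecPOMDP sketch above (an illustration, not the authors' code):

```python
import numpy as np

def evaluate_controllers(psi1, psi2, eta1, eta2, P, R, O, gamma):
    """Solve the Bellman equation V(q1, q2, s) for fixed stochastic controllers.

    psi_i[q, a]        = P(a | q)          (action selection of agent i)
    eta_i[q, a, o, q'] = P(q' | q, a, o)   (node transitions of agent i)
    P[s, a1, a2, s'], R[s, a1, a2], O[s', a1, a2, o1, o2] as in the model.
    """
    Q1, Q2, S = psi1.shape[0], psi2.shape[0], P.shape[0]

    # Expected immediate reward for each joint node pair and state.
    r = np.einsum('ia,jb,sab->ijs', psi1, psi2, R)

    # Joint transition kernel over (q1, q2, s) -> (q1', q2', s').
    M = np.einsum('ia,jb,sabt,tabmn,iamk,jbnl->ijsklt',
                  psi1, psi2, P, O, eta1, eta2)

    dim = Q1 * Q2 * S
    A = np.eye(dim) - gamma * M.reshape(dim, dim)
    V = np.linalg.solve(A, r.reshape(dim))
    return V.reshape(Q1, Q2, S)
```

Given designated initial nodes (q1, q2) and initial distribution b0, the controller value is then b0 @ V[q1, q2].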

  • Controller example

    Stochastic controller for a single agent
      2 nodes, 2 actions, 2 observations
      Parameters: P(a|q) and P(q'|q,a,o)

    [Figure: a two-node controller; each node selects an action stochastically (e.g., 0.5/0.5 at node 1) and transitions between nodes 1 and 2 with probabilities such as 0.75/0.25 or 1.0, depending on the observation o1 or o2]
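To show what these parameters look like concretely, here is a hypothetical 2-node, 2-action, 2-observation controller in the array shapes used above (the figure's exact numbers are only partly legible, so the values below are illustrative rather than the slide's):

```python
import numpy as np

# psi[q, a] = P(a | q): stochastic action selection at each node.
psi = np.array([
    [0.5, 0.5],    # node 1 mixes both actions equally
    [1.0, 0.0],    # node 2 deterministically takes action a1
])

# eta[q, a, o, q'] = P(q' | q, a, o): stochastic node transitions.
eta = np.zeros((2, 2, 2, 2))
eta[0, :, 0, :] = [0.75, 0.25]   # from node 1 on observation o1
eta[0, :, 1, :] = [0.0, 1.0]     # from node 1 on observation o2
eta[1, :, :, :] = [1.0, 0.0]     # node 2 always returns to node 1

# Every conditional distribution must sum to 1.
assert np.allclose(psi.sum(axis=-1), 1.0)
assert np.allclose(eta.sum(axis=-1), 1.0)
```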

  • Optimal controllers

    How do we set the parameters of the controllers?
      Deterministic controllers: traditional methods such as best-first search (Szer and Charpillet 05)
      Stochastic controllers: continuous optimization

  • Decentralized BPI

    Decentralized Bounded Policy Iteration (DEC-BPI) (Bernstein, Hansen and Zilberstein 05)
    Alternates between improvement and evaluation until convergence
      Improvement: for each node of each agent's controller, find a probability distribution over one-step lookahead values that is greater than the current node's value for all states and all controllers of the other agents
      Evaluation: finds the values of all nodes in all states

  • DEC-BPI - Linear program

    For a given node qi of agent i:
      Variables: ε and x(ai, qi' | oi), the new node parameters P(ai, qi' | qi, oi)
      Objective: maximize ε
      Improvement constraints: ∀s ∈ S, ∀q−i ∈ Q−i

      V(s, \vec{q}) + \varepsilon \;\le\; \sum_{\vec{a}} P(\vec{a} \mid \vec{q}) \Big[ R(s, \vec{a})
        + \gamma \sum_{s', \vec{o}, \vec{q}\,'} P(\vec{q}\,' \mid \vec{q}, \vec{a}, \vec{o})\, P(s' \mid s, \vec{a})\, O(\vec{o} \mid s', \vec{a})\, V(s', \vec{q}\,') \Big]

      where agent i's factors P(ai | qi) and P(qi' | qi, ai, oi) on the right-hand side are replaced by the variables x
      Probability constraints: ∀ai ∈ Ai, ∀oi ∈ Ωi

      \sum_{q_i'} x(q_i', a_i, o_i) = x(a_i)

      Also, all probabilities must sum to 1 and be nonnegative
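A sketch of how this per-node LP could be assembled with scipy.optimize.linprog, written for one node of agent 1 with agent 2 held fixed (array shapes follow the earlier sketches; this is my own illustration of the constraint structure, not the DEC-BPI implementation):

```python
import numpy as np
from scipy.optimize import linprog

def improve_node(q1, psi2, eta2, V, P, R, O, gamma):
    """DEC-BPI improvement LP for node q1 of agent 1 (agent 2 fixed).

    LP variables: eps, x_a[a1] = P(a1 | q1), x_aoq[a1, o1, q1'] = P(a1, q1' | q1, o1).
    Maximize eps subject to, for every (s, q2):
        V(q1, q2, s) + eps <= sum_a1 x_a*C1[s,q2,a1] + sum_{a1,o1,q1'} x_aoq*C2[s,q2,a1,o1,q1']
    """
    S, A1, _, _ = P.shape
    O1 = O.shape[3]
    Q1, Q2 = V.shape[0], V.shape[1]

    # C1[s, q2, a1]: expected immediate reward, with agent 2 acting from node q2.
    C1 = np.einsum('jb,sab->sja', psi2, R)
    # C2[s, q2, a1, o1, q1']: discounted expected successor value.
    C2 = gamma * np.einsum('jb,sabt,tabmn,jbnl,klt->sjamk', psi2, P, O, eta2, V)

    n = 1 + A1 + A1 * O1 * Q1            # variable order: [eps, x_a ..., x_aoq ...]

    # Improvement constraints in <= form: eps - x . C <= -V(q1, q2, s).
    A_ub = np.zeros((S * Q2, n))
    b_ub = np.zeros(S * Q2)
    row = 0
    for s in range(S):
        for j in range(Q2):
            A_ub[row, 0] = 1.0
            A_ub[row, 1:1 + A1] = -C1[s, j]
            A_ub[row, 1 + A1:] = -C2[s, j].ravel()
            b_ub[row] = -V[q1, j, s]
            row += 1

    # Probability constraints: sum_{q1'} x_aoq[a1, o1, :] = x_a[a1]; sum_a1 x_a = 1.
    A_eq, b_eq = [], []
    for a in range(A1):
        for o in range(O1):
            c = np.zeros(n)
            c[1 + a] = -1.0
            start = 1 + A1 + (a * O1 + o) * Q1
            c[start:start + Q1] = 1.0
            A_eq.append(c); b_eq.append(0.0)
    c = np.zeros(n); c[1:1 + A1] = 1.0
    A_eq.append(c); b_eq.append(1.0)

    obj = np.zeros(n); obj[0] = -1.0     # linprog minimizes, so negate eps
    bounds = [(None, None)] + [(0, None)] * (n - 1)
    return linprog(obj, A_ub=A_ub, b_ub=b_ub,
                   A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=bounds)
```

If the optimal ε is positive, the node's parameters can be replaced by the x values found; this is the one-step lookahead improvement that the next slides contrast with the NLP.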

  • Problems with DEC-BPI

    Difficult to improve value for all states and all other agents' controllers
    May require more nodes for a given start state
    Linear program (one-step lookahead) results in local optimality
    A correlation device can somewhat improve performance

  • Optimal controllers

    Use nonlinear programming (NLP)
    Consider node value as a variable
    Improvement and evaluation all in one step
    Add constraints to maintain valid values

  • NLP intuition

    The value variable allows improvement and evaluation at the same time (infinite lookahead)
    While the iterative process of DEC-BPI can "get stuck", the NLP defines the globally optimal solution

  • NLP representation

    Variables:  x(\vec{q}, \vec{a}) = P(\vec{a} \mid \vec{q}),   y(\vec{q}, \vec{a}, \vec{o}, \vec{q}\,') = P(\vec{q}\,' \mid \vec{q}, \vec{a}, \vec{o}),   z(\vec{q}, s) = V(\vec{q}, s)

    Objective: maximize  \sum_{s} b_0(s)\, z(\vec{q}_0, s)

    Value constraints: ∀s ∈ S, ∀\vec{q} ∈ \vec{Q}

    z(\vec{q}, s) = \sum_{\vec{a}} x(\vec{q}, \vec{a}) \Big[ R(s, \vec{a})
      + \gamma \sum_{s'} P(s' \mid s, \vec{a}) \sum_{\vec{o}} O(\vec{o} \mid s', \vec{a}) \sum_{\vec{q}\,'} y(\vec{q}, \vec{a}, \vec{o}, \vec{q}\,')\, z(\vec{q}\,', s') \Big]

    Linear constraints are needed to ensure the controllers are independent
    Also, all probabilities must sum to 1 and be nonnegative
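As a computational sketch of what this program looks like, the functions below evaluate the objective and the value-constraint residuals for candidate x, y, z arrays (the shapes are my own; a solver would maximize the objective subject to these residuals being zero, plus the probability and independence constraints):

```python
import numpy as np

def nlp_objective(z, b0, q0=(0, 0)):
    """Objective: sum_s b0(s) * z(q0, s), with q0 the designated initial node pair."""
    return float(b0 @ z[q0[0], q0[1]])

def value_constraint_residuals(x, y, z, P, R, O, gamma):
    """Residual of z(q, s) = sum_a x(q, a)[R(s, a) + gamma * sum ...] for every (q, s).

    x[q1, q2, a1, a2]                  ~ P(a1, a2 | q1, q2)
    y[q1, q2, a1, a2, o1, o2, k1, k2]  ~ P(k1, k2 | q1, q2, a1, a2, o1, o2)
    z[q1, q2, s]                       ~ V(q1, q2, s)
    A feasible solution must drive every residual to zero.
    """
    immediate = np.einsum('ijab,sab->ijs', x, R)
    future = gamma * np.einsum('ijab,sabt,tabmn,ijabmnkl,klt->ijs', x, P, O, y, z)
    return z - (immediate + future)
```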

  • Optimality

    Theorem: An optimal solution of the NLP results in optimal stochastic controllers for the given size and initial state distribution.

  • Pros and cons of the NLP

    Pros:
      Retains fixed memory and an efficient policy representation
      Represents the optimal policy for a given size
      Takes advantage of the known start state
    Cons:
      Difficult to solve optimally

  • Experiments

    Nonlinear programming solvers (snopt and filter), which use sequential quadratic programming (SQP)
      Guarantee a locally optimal solution
      Run on the NEOS server
    10 random initial controllers for a range of sizes
    Compared the NLP with DEC-BPI, with and without a small correlation device

  • Results: Broadcast Channel

    Two agents share a broadcast channel (4 states, 5 observations, 2 actions)
    Very simple near-optimal policy
    [Figure: mean quality of the NLP and DEC-BPI implementations]

  • Results: Recycling Robots

    [Figure: mean quality of the NLP and DEC-BPI implementations on the recycling robot domain (4 states, 2 observations, 3 actions)]

  • Results: Grid World

    [Figure: mean quality of the NLP and DEC-BPI implementations on the meeting-in-a-grid domain (16 states, 2 observations, 5 actions)]

  • Results: Running time

    Running time is mostly comparable to DEC-BPI with a correlation device
    The increase as controller size grows is offset by better performance
    [Figures: running times on the Broadcast, Recycle, and Grid domains]

  • Conclusion

    Defined the optimal fixed-size stochastic controller using NLP
    Showed consistent improvement over DEC-BPI with locally optimal solvers
    In general, the NLP may allow small optimal controllers to be found
    It may also provide concise near-optimal approximations of large controllers

  • Future Work

    Explore more efficient NLP formulations
    Investigate more specialized solution techniques for the NLP formulation
    Greater experimentation and comparison with other methods