1
Symbolic Dynamic Programming
Alan Fern *
* Based in part on slides by Craig Boutilier
2
Planning in Large State Space MDPs
- You have learned algorithms for computing optimal policies
  - Value Iteration
  - Policy Iteration
- These algorithms explicitly enumerate the state space
  - Often this is impractical
- Simulation-based planning and RL allowed for approximate planning in large MDPs
  - Did not utilize an explicit model of the MDP; only used a strong or weak simulator
- How can we get exact solutions to enormous MDPs?
3
Structured Representations
- Policy iteration and value iteration treat states as atomic entities with no internal structure.
- In most cases, states actually do have internal structure
  - E.g., described by a set of state variables, or objects with properties and relationships
  - Humans exploit this structure to plan effectively
- What if we had a compact, structured representation for a large MDP and could efficiently plan with it?
  - Would allow for exact solutions to very large MDPs
4
A Planning Problem
5
Logical or Feature-based Problems
- For most AI problems, states are not viewed as atomic entities.
  - They contain structure. For example, they are described by a set of boolean propositions/variables: s = (X_1, ..., X_n)
  - |S| is exponential in the number of propositions: |S| = 2^n
- Basic policy and value iteration do nothing to exploit the structure of the MDP when it is available
6
Solution?
- Require structured representations in terms of propositions
  - compactly represent transition function
  - compactly represent reward function
  - compactly represent value functions and policies
- Require structured computation
  - perform steps of PI or VI directly on structured representations
  - can avoid the need to enumerate the state space
- We start by representing the transition structure as dynamic Bayesian networks
7
Propositional Representations
- States decomposable into state variables (we will assume boolean variables): s = (X_1, ..., X_n), |S| = 2^n
- Structured representations are the norm in AI
  - Decision diagrams, Bayesian networks, etc.
  - Describe how actions affect/depend on features
  - Natural, concise, can be exploited computationally
- Same ideas can be used for MDPs
8
Robot Domain as Propositional MDP
- Propositional variables for single-user version
  - Loc (robot's location): Office, Entrance
  - T (lab is tidy): boolean
  - CR (coffee request outstanding): boolean
  - RHC (robot holding coffee): boolean
  - RHM (robot holding mail): boolean
  - M (mail waiting for pickup): boolean
- Actions/Events
  - move to an adjacent location, pickup mail, get coffee, deliver mail, deliver coffee, tidy lab
  - mail arrival, coffee request issued, lab gets messy
- Rewards
  - rewarded for tidy lab, satisfying a coffee request, delivering mail
  - (or penalized for their negation)
9
State Space
- State of MDP: assignment to these six variables
  - 64 states
  - grows exponentially with number of variables
- Transition matrices
  - 4032 parameters required per matrix (each row of the 64 x 64 stochastic matrix sums to 1, leaving 63 free entries: 64 x 63 = 4032)
  - one matrix per action (6 or 7 or more actions)
- Reward function
  - 64 reward values needed
- Factored state and action descriptions will break this exponential dependence (generally)
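As a quick check of these counts, here is a minimal sketch (Python; the variable names are mine):

```python
# Sizes quoted on this slide, recomputed.
n_vars = 6                        # Loc, T, CR, RHC, RHM, M (treated as boolean)
n_states = 2 ** n_vars            # 64 states
# Each row of a 64 x 64 stochastic matrix sums to 1, leaving 63 free entries.
params_per_matrix = n_states * (n_states - 1)
print(n_states, params_per_matrix)   # 64 4032
```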
10
Dynamic Bayesian Networks (DBNs)
- Bayesian networks (BNs) are a common representation for probability distributions
  - A graph (DAG) represents conditional independence
  - Conditional probability tables (CPTs) quantify local probability distributions
- Dynamic Bayes net action representation
  - one Bayes net for each action a, representing the set of conditional distributions Pr(S_{t+1} | A_t, S_t)
  - each state variable occurs at time t and t+1
  - dependence of t+1 variables on t variables depicted by directed arcs
11
DBN Representation: deliver coffee

[Figure: two-slice DBN for the deliver-coffee action, with nodes T, L, CR, RHC, RHM, M at times t and t+1 and directed arcs from time-t variables to time-t+1 variables]

Pr(CR_{t+1} | L_t, CR_t, RHC_t):

L  CR  RHC | CR(t+1)=T  CR(t+1)=F
O  T   T   | 0.2        0.8
E  T   T   | 1.0        0.0
O  F   T   | 0.1        0.9
E  F   T   | 0.1        0.9
O  T   F   | 1.0        0.0
E  T   F   | 1.0        0.0
O  F   F   | 0.1        0.9
E  F   F   | 0.1        0.9

Pr(T_{t+1} | T_t):

T | T(t+1)=T  T(t+1)=F
T | 0.91      0.09
F | 0.0       1.0

Pr(RHM_{t+1} | RHM_t):

RHM | RHM(t+1)=T  RHM(t+1)=F
T   | 1.0         0.0
F   | 0.0         1.0

Pr(S_{t+1} | S_t) is the product of each of the 6 tables.
12
Benefits of DBN Representation

Pr(S_{t+1} | S_t)
  = Pr(RHM_{t+1}, M_{t+1}, T_{t+1}, L_{t+1}, CR_{t+1}, RHC_{t+1} | RHM_t, M_t, T_t, L_t, CR_t, RHC_t)
  = Pr(RHM_{t+1} | RHM_t) * Pr(M_{t+1} | M_t) * Pr(T_{t+1} | T_t)
    * Pr(L_{t+1} | L_t) * Pr(CR_{t+1} | CR_t, RHC_t, L_t) * Pr(RHC_{t+1} | RHC_t, L_t)

- Only 20 parameters vs. 4032 for the full matrix
- Removes global exponential dependence

[Figure: the full 64 x 64 transition matrix over states s1, ..., s64 (entries such as 0.9, 0.05, ..., 0.0) contrasted with the compact two-slice DBN over T, L, CR, RHC, RHM, M]
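To make the benefit concrete, here is a hedged sketch (Python; the helper names are mine) of computing a transition probability as a product of per-variable CPT lookups instead of reading an entry of a 64 x 64 matrix. Only two of the six factors are shown; the probabilities come from the deliver-coffee tables on the previous slide.

```python
# Sketch: Pr(s' | s) as a product of small per-variable factors.

def pr_cr_next(value, s):
    """Pr(CR_{t+1} = value | L_t, CR_t, RHC_t), from the slide's CPT."""
    p_true = {('O', True, True): 0.2, ('E', True, True): 1.0,
              ('O', False, True): 0.1, ('E', False, True): 0.1,
              ('O', True, False): 1.0, ('E', True, False): 1.0,
              ('O', False, False): 0.1, ('E', False, False): 0.1}[
                  (s['L'], s['CR'], s['RHC'])]
    return p_true if value else 1.0 - p_true

def pr_t_next(value, s):
    """Pr(T_{t+1} = value | T_t), from the slide's CPT."""
    p_true = 0.91 if s['T'] else 0.0
    return p_true if value else 1.0 - p_true

def transition_prob(s, s_next):
    # The full model multiplies all six per-variable factors; two shown here.
    return pr_cr_next(s_next['CR'], s) * pr_t_next(s_next['T'], s)

s = {'L': 'O', 'CR': True, 'RHC': True, 'T': True}
print(transition_prob(s, {'CR': True, 'T': True}))   # 0.2 * 0.91 = 0.182
```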
13
Structure in CPTs
- So far we have represented each CPT as a table of size exponential in the number of parents
- Notice that there is regularity in CPTs
  - e.g., Pr(CR_{t+1} | L_t, CR_t, RHC_t) has many similar entries
- Compact function representations for CPTs can be used to great effect
  - decision trees
  - algebraic decision diagrams (ADDs/BDDs)
- Here we show examples of decision trees (DTs)
14
Action Representation – DBN/DT

[Figure: the deliver-coffee DBN, with the CPT for CR_{t+1} drawn as a decision tree]

Decision Tree (DT) for Pr(CR_{t+1}=true | L_t, CR_t, RHC_t):

CR(t)?
├─ f: 0.1
└─ t: RHC(t)?
   ├─ f: 1.0
   └─ t: L(t)?
      ├─ O: 0.2
      └─ E: 1.0

The leaves of the DT give Pr(CR_{t+1}=true | L_t, CR_t, RHC_t).
DTs can often represent conditional probabilities much more compactly than a full conditional probability table.
e.g., if CR(t)=true & RHC(t)=false then CR(t+1)=true with prob. 1
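As a sketch of how such a tree might be encoded (my own minimal encoding, not the paper's): a leaf is a float, and an internal node is a tuple (variable, false-subtree, true-subtree). The location test is encoded as a boolean 'L_is_E' (true = Entrance) for simplicity.

```python
# Pr(CR_{t+1}=true | L_t, CR_t, RHC_t) as the decision tree above.
CR_TREE = ('CR',
           0.1,                          # CR(t) = false
           ('RHC',
            1.0,                         # CR(t) = true, RHC(t) = false
            ('L_is_E', 0.2, 1.0)))       # CR(t) = true, RHC(t) = true: test L

def eval_tree(tree, state):
    """Walk the tree using the boolean assignments in `state`."""
    while not isinstance(tree, float):
        var, low, high = tree
        tree = high if state[var] else low
    return tree

# The 8-row CPT collapses to 4 leaves:
print(eval_tree(CR_TREE, {'CR': True, 'RHC': False, 'L_is_E': False}))  # 1.0
```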
15
Reward Representation
- Rewards represented with DTs in a similar fashion
  - Would require a vector of size 2^n for explicit representation

Reward decision tree:

CR?
├─ t: -100   (high cost for unsatisfied coffee request)
└─ f: M?
   ├─ t: -10   (high, but lower, cost for undelivered mail)
   └─ f: T?
      ├─ f: -1   (cost for lab being untidy)
      └─ t: 1    (small reward for satisfying all of these conditions)
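In the same tuple encoding used for the CPT tree above (again my own sketch, reflecting my reading of the tree on this slide), the reward function is just another tree whose leaves are reward values:

```python
# Reward tree from this slide: leaf = reward value, node = (var, false, true).
REWARD_TREE = ('CR',
               ('M',
                ('T', -1.0, 1.0),        # nothing pending: +/-1 on tidiness
                -10.0),                  # undelivered mail
               -100.0)                   # unsatisfied coffee request
# eval_tree from the previous sketch evaluates it, e.g.:
# eval_tree(REWARD_TREE, {'CR': False, 'M': False, 'T': True}) -> 1.0
```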
16
Structured Computation
- Given our compact decision tree (DBN) representation, can we solve the MDP without explicit state space enumeration?
- Can we avoid O(|S|) computations by exploiting the regularities made explicit by the representation?
- We will study a general approach for doing this called structured dynamic programming
17
Structured Dynamic Programming
- We now consider how to perform dynamic programming techniques such as VI and PI using the problem structure
- VI and PI are based on a few basic operations.
  - Here we will show how to perform these operations directly on tree representations of value functions, policies, and transition functions
- The approach is very general and can be applied to other representations (e.g., algebraic decision diagrams, situation calculus) and other problems once the main idea is understood
- We will focus on VI here, but the paper also describes a version of modified policy iteration
18
Recall Tree-Based Representations
[Figure: the DBN for action A over variables X, Y, Z, with each CPT for X(t+1), Y(t+1), Z(t+1) drawn as a decision tree, alongside the reward function R drawn as a tree over Z with leaves 10 and 0]

For example, the tree for Pr(Y(t+1)=true):

X?
├─ t: 0.9
└─ f: Y?
   ├─ t: 1.0
   └─ f: 0.0

e.g., if X(t)=true then Y(t+1)=true w/ prob 0.9
e.g., if X(t)=false & Y(t)=true then Y(t+1)=true w/ prob 1

Note: we are leaving off time subscripts for readability and using X(t), Y(t), ..., instead.
Recall that each action of the MDP has its own DBN.
19
Structured Dynamic Programming
- Value functions and policies can also have tree representations
  - Often much more compact than tabular representations
- Our goal: compute the tree representations of the policy and value function given the tree representations of the transitions and rewards
20
Recall Value Iteration
Value Iteration:
  V^0(s) = 0                                          ;; could initialize to 0
  Q_a^{k+1}(s) = R(s) + γ Σ_{s'} Pr_a(s'|s) V^k(s')   ;; Bellman backup
  V^{k+1}(s) = max_a Q_a^{k+1}(s)

Suppose that the initial value function V^0 is compactly represented as a tree.
1. Show how to compute compact trees for each Q_a^{k+1}
2. Use a max operation on the Q-trees (returns a single tree)

21
Symbolic Value Iteration

[Figure: the transition trees Pr_{A=a}(S'|S), Pr_{A=b}(S'|S), ..., Pr_{A=z}(S'|S) are each combined with the current value tree V(X) to form Q-trees; a symbolic MAX over these trees returns a single tree for the new value function]
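Putting the slide's two steps into code shape, here is a high-level sketch of the symbolic VI loop (all function names are mine; the tree operations `regress`, `add_trees`, `scale`, and `max_trees` are developed on the following slides):

```python
def symbolic_value_iteration(reward_tree, dbns, gamma, n_iters,
                             regress, add_trees, scale, max_trees):
    """dbns maps each action to its tree-based DBN."""
    v = 0.0                        # V^0 = 0, i.e. the trivial one-leaf tree
    for _ in range(n_iters):
        q_trees = []
        for action, dbn in dbns.items():
            # regress returns the tree for E[V^k(S') | S, action]
            expected = regress(v, dbn)
            # Q_a^{k+1} = R + gamma * E[V^k], computed tree-on-tree
            q_trees.append(add_trees(reward_tree, scale(gamma, expected)))
        v = max_trees(q_trees)     # symbolic MAX: a single tree for V^{k+1}
    return v
```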
22
The MAX Tree Operation

[Figure: two value trees over X and Y. Tree 1: X? (t: 0.9, f: Y? (t: 1.0, f: 0.0)). Tree 2: X? (t: 1.0, f: 0.0)]

A tree partitions the state space, assigning a value to each region.

The state-space max for the above trees is: X? (t: 1.0, f: Y? (t: 1.0, f: 0.0)).

In general, how can we compute the tree representing the max?
23
The MAX Tree Operation

We can simply append one tree to the leaves of the other. The result makes all the distinctions that either tree makes. The max operation is then taken at the leaves of the result.

[Figure: Tree 2 (X? (t: 1.0, f: 0.0)) is appended below each leaf of Tree 1; every leaf of the combined tree then holds a pair of values, e.g. the leaf 0.9 becomes X? (t: (1.0, 0.9), f: (0.0, 0.9)), and taking the max of each pair gives leaves X? (t: 1.0, f: 0.9)]
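A minimal sketch of this append-and-combine step, using the tuple-based trees from slide 14 (the routine and names are mine); passing `max` as the leaf operation gives exactly the tree pictured above:

```python
def combine(t1, t2, op):
    """Append t2 below every leaf of t1, applying `op` at the combined leaves."""
    if isinstance(t1, float):
        if isinstance(t2, float):
            return op(t1, t2)            # e.g. max(0.9, 1.0)
        var, low, high = t2
        return (var, combine(t1, low, op), combine(t1, high, op))
    var, low, high = t1
    return (var, combine(low, t2, op), combine(high, t2, op))

tree1 = ('X', ('Y', 0.0, 1.0), 0.9)      # Tree 1 from the previous slide
tree2 = ('X', 0.0, 1.0)                   # Tree 2
print(combine(tree1, tree2, max))
# ('X', ('Y', ('X', 0.0, 1.0), ('X', 1.0, 1.0)), ('X', 0.9, 1.0))
```

The same routine covers the binary operations of the later slides: `combine(t1, t2, operator.add)` gives the pointwise sum and `combine(t1, t2, operator.mul)` the pointwise product.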
24
The MAX Tree Operation

The resulting tree may have unreachable leaves. We can simplify the tree by removing such paths.

[Figure: in the appended tree, inner tests on X occur below branches that already fixed X, so some leaves are unreachable; after simplification the max tree is X? (t: 1.0, f: Y? (t: 1.0, f: 0.0))]
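Continuing the sketch: simplification can be done by tracking which variables are already fixed along the current path and dropping redundant tests (again my own minimal version, not the paper's algorithm):

```python
def simplify(tree, fixed=None):
    """Remove tests on variables already fixed higher up; merge equal branches."""
    fixed = fixed or {}
    if isinstance(tree, float):
        return tree
    var, low, high = tree
    if var in fixed:                       # redundant test: branch is forced
        return simplify(high if fixed[var] else low, fixed)
    low_s = simplify(low, {**fixed, var: False})
    high_s = simplify(high, {**fixed, var: True})
    return low_s if low_s == high_s else (var, low_s, high_s)

# The appended max tree from the previous slide:
appended = ('X', ('Y', ('X', 0.0, 1.0), ('X', 1.0, 1.0)), ('X', 0.9, 1.0))
print(simplify(appended))                  # ('X', ('Y', 0.0, 1.0), 1.0)
```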
25
BINARY OPERATIONS
(other binary operations are handled similarly to max)
26
MARGINALIZATION

Compute the diagram representing Σ_A f, i.e., sum a variable A out of a function f represented as a tree/ADD.
There are libraries for doing this.
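Here is a hedged sketch of marginalization over the same tuple-based trees (names mine), built from a restriction operation plus the pointwise tree sum (the `combine` routine with `operator.add` from the earlier sketch):

```python
import operator

def restrict(tree, var, value):
    """The subtree obtained by fixing boolean `var` to `value`."""
    if isinstance(tree, float):
        return tree
    v, low, high = tree
    if v == var:
        return restrict(high if value else low, var, value)
    return (v, restrict(low, var, value), restrict(high, var, value))

def sum_out(tree, var):
    """Marginalize: sum_A f = f|_{A=false} + f|_{A=true}, done tree-on-tree."""
    return combine(restrict(tree, var, False),
                   restrict(tree, var, True), operator.add)

f = ('A', ('B', 0.0, 1.0), 2.0)            # a function of A and B
print(sum_out(f, 'A'))                      # ('B', 2.0, 3.0)
```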
Symbolic Bellman Backup

For each action a, compute the Q-tree

  Q_a^{k+1}(S) = R(S) + γ Σ_{S'} Pr_a(S'|S) V^k(S')

where S = (X_1, ..., X_l) and S' = (X'_1, ..., X'_l). Every component (the reward, the CPTs, and the value function) is represented as a tree, and the sum over S' is carried out symbolically by marginalizing out the primed variables one at a time.
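Composing the earlier sketches (`combine`, `sum_out`, `simplify`, plus a leaf map; all names are mine, and this is only a sketch of the approach, not the paper's exact algorithm), one action's symbolic backup might look like:

```python
import operator

def map_leaves(tree, f):
    """Apply f to every leaf value."""
    if isinstance(tree, float):
        return f(tree)
    var, low, high = tree
    return (var, map_leaves(low, f), map_leaves(high, f))

def cpt_factor(x_prime, p_true_tree):
    """Tree over {x'} and its parents for Pr(x' | parents)."""
    return (x_prime, map_leaves(p_true_tree, lambda p: 1.0 - p), p_true_tree)

def q_backup(reward_tree, primed_value_tree, cpts, gamma):
    """cpts: primed variable -> tree over time-t vars for Pr(var = true | S).
    primed_value_tree is V^k with its variables renamed to primed versions."""
    expected = primed_value_tree
    for x_prime, p_true in cpts.items():
        # Multiply in this variable's CPT, then sum the primed variable out.
        weighted = combine(expected, cpt_factor(x_prime, p_true), operator.mul)
        expected = sum_out(weighted, x_prime)
    discounted = map_leaves(expected, lambda v: gamma * v)
    return simplify(combine(reward_tree, discounted, operator.add))
```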
34
SDP: Relative Merits
- Adaptive, nonuniform, exact abstraction method
  - provides exact solution to MDP
  - much more efficient on certain problems (time/space)
  - 400 million state problems in a couple of hours
- Can formulate a similar procedure for modified policy iteration
- Some drawbacks
  - produces piecewise constant VF
  - some problems admit no compact solution representation
    - so the sizes of the trees blow up with enough iterations
  - approximation may be desirable or necessary
35
Approximate SDP
- Easy to approximate the solution using SDP
- Simple pruning of the value function
  - Simply "merge" leaves that have similar values
  - Can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00]
- Gives regions of approximately the same value
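A minimal sketch of such leaf merging (my own version over the tuple-based trees; real implementations prune ADDs). Leaves here are (lo, hi) ranges, with exact values entering as degenerate ranges (v, v):

```python
def is_leaf(t):
    return isinstance(t, tuple) and len(t) == 2    # a (lo, hi) range leaf

def prune(tree, eps):
    """Merge sibling leaves whose combined value span is within eps."""
    if is_leaf(tree):
        return tree
    var, low, high = tree
    low_p, high_p = prune(low, eps), prune(high, eps)
    if is_leaf(low_p) and is_leaf(high_p):
        lo = min(low_p[0], high_p[0])
        hi = max(low_p[1], high_p[1])
        if hi - lo <= eps:
            return (lo, hi)        # drop the test: one approximate region
    return (var, low_p, high_p)

# Values borrowed from the ADD on the next slide:
t = ('HCU', ('W', (9.0, 9.0), (10.0, 10.0)), (7.45, 7.45))
print(prune(t, 1.0))               # ('HCU', (9.0, 10.0), (7.45, 7.45))
```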
36
A Pruned Value ADD

[Figure: a value ADD over variables Loc, HCR, HCU, W, R, U with exact leaf values 10.00, 9.00, 8.45, 8.36, 7.64, 7.45, 6.81, 6.64, 6.19, 5.62, 5.19; after pruning, nearby leaves are merged into interval-valued leaves [9.00, 10.00], [7.45, 8.45], [6.64, 7.64], [5.19, 6.19]]
37
Approximate SDP: Relative Merits
- Fewer regions implies faster computation
  - 30-40 billion state problems in a couple of hours
  - allows fine-grained control of time vs. solution quality with dynamic error bounds
  - technical challenges: variable ordering, convergence, fixed vs. adaptive tolerance, etc.
- Some drawbacks
  - (still) produces piecewise constant VF
  - doesn't exploit additive structure of VF at all
- Bottom line: when a problem matches the structural assumptions of SDP we can gain much, but many problems do not match those assumptions.
38
Ongoing Work
- Factored action spaces
  - Sometimes the action space is large, but has structure
  - For example, cooperative multi-agent systems
- Recent work (at OSU) has studied SDP for factored action spaces
  - Include action variables in the DBNs

[Figure: a DBN containing both action variables and state variables]