
1

Symbolic Dynamic Programming

Alan Fern *

* Based in part on slides by Craig Boutilier

2

Planning in Large State Space MDPs

• You have learned algorithms for computing optimal policies
  – Value Iteration
  – Policy Iteration

• These algorithms explicitly enumerate the state space
  – Often this is impractical

• Simulation-based planning and RL allowed for approximate planning in large MDPs
  – Did not utilize an explicit model of the MDP; only used a strong or weak simulator

• How can we get exact solutions to enormous MDPs?

3

Structured Representations

• Policy iteration and value iteration treat states as atomic entities with no internal structure.

• In most cases, states actually do have internal structure
  – E.g. described by a set of state variables, or objects with properties and relationships
  – Humans exploit this structure to plan effectively

• What if we had a compact, structured representation for a large MDP and could efficiently plan with it?
  – Would allow for exact solutions to very large MDPs

4

A Planning Problem

5

Logical or Feature-based Problems

• For most AI problems, states are not viewed as atomic entities.
  – They contain structure. For example, they are described by a set of boolean propositions/variables
  – |S| is exponential in the number of propositions

• Basic policy and value iteration do nothing to exploit the structure of the MDP when it is available


6

Solution?

• Require structured representations in terms of propositions
  – compactly represent transition function
  – compactly represent reward function
  – compactly represent value functions and policies

• Require structured computation
  – perform steps of PI or VI directly on structured representations
  – can avoid the need to enumerate the state space

• We start by representing the transition structure as dynamic Bayesian networks

7

Propositional Representations

• States decomposable into state variables (we will assume boolean variables)

• Structured representations are the norm in AI
  – Decision diagrams, Bayesian networks, etc.
  – Describe how actions affect/depend on features
  – Natural, concise, can be exploited computationally

• Same ideas can be used for MDPs


8

Robot Domain as Propositional MDP

• Propositional variables for the single-user version
  – Loc (robot's location): Office, Entrance
  – T (lab is tidy): boolean
  – CR (coffee request outstanding): boolean
  – RHC (robot holding coffee): boolean
  – RHM (robot holding mail): boolean
  – M (mail waiting for pickup): boolean

• Actions/Events
  – move to an adjacent location, pickup mail, get coffee, deliver mail, deliver coffee, tidy lab
  – mail arrival, coffee request issued, lab gets messy

• Rewards
  – rewarded for tidy lab, satisfying a coffee request, delivering mail
  – (or penalized for their negation)

9

State Space

• State of MDP: an assignment to these six variables
  – 64 states
  – grows exponentially with the number of variables

• Transition matrices
  – 4032 parameters required per matrix (64 × 63 free entries, since each row sums to 1)
  – one matrix per action (6 or 7 or more actions)

• Reward function
  – 64 reward values needed

• Factored state and action descriptions will break this exponential dependence (generally)

10

Dynamic Bayesian Networks (DBNs)

• Bayesian networks (BNs) are a common representation for probability distributions
  – A graph (DAG) represents conditional independence
  – Conditional probability tables (CPTs) quantify local probability distributions

• Dynamic Bayes net action representation
  – one Bayes net for each action a, representing the set of conditional distributions Pr(St+1 | At, St)
  – each state variable occurs at time t and t+1
  – dependence of t+1 variables on t variables depicted by directed arcs

11

DBN Representation: deliver coffee

[DBN for the deliver coffee action: state variables T, L, CR, RHC, RHM, M at time t, with arcs to their time t+1 counterparts]

CPT for Pr(CRt+1 | Lt, CRt, RHCt):

  L  CR  RHC | Pr(CRt+1 = T)  Pr(CRt+1 = F)
  O  T   T   |      0.2            0.8
  E  T   T   |      1.0            0.0
  O  F   T   |      0.1            0.9
  E  F   T   |      0.1            0.9
  O  T   F   |      1.0            0.0
  E  T   F   |      1.0            0.0
  O  F   F   |      0.1            0.9
  E  F   F   |      0.1            0.9

CPT for Pr(Tt+1 | Tt):

  Tt | Pr(Tt+1 = T)  Pr(Tt+1 = F)
  T  |     0.91          0.09
  F  |     0.0           1.0

CPT for Pr(RHMt+1 | RHMt):

  RHMt | Pr(RHMt+1 = T)  Pr(RHMt+1 = F)
  T    |      1.0             0.0
  F    |      0.0             1.0

The full transition distribution Pr(St+1 | St) is the product of each of the 6 per-variable tables.

12

Benefits of DBN Representation

Pr(St+1 | St) = Pr(RHMt+1, Mt+1, Tt+1, Lt+1, CRt+1, RHCt+1 | RHMt, Mt, Tt, Lt, CRt, RHCt)
             = Pr(RHMt+1 | RHMt) × Pr(Mt+1 | Mt) × Pr(Tt+1 | Tt)
               × Pr(Lt+1 | Lt) × Pr(CRt+1 | CRt, RHCt, Lt) × Pr(RHCt+1 | RHCt, Lt)

• Only 20 parameters vs. 4032 for the full matrix (see the sketch after the matrix comparison below)
• Removes the global exponential dependence

Full matrix (for comparison):

        s1    s2    ...  s64
  s1    0.9   0.05  ...  0.0
  s2    0.0   0.20  ...  0.1
  ...
  s64   0.1   0.0   ...  0.0

[vs. the DBN for deliver coffee from the previous slide, over T, L, CR, RHC, RHM, M at times t and t+1]
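To make the factored representation concrete, here is a minimal Python sketch (not from the slides; the dictionary-based state encoding and function names are illustrative assumptions) that stores two of the per-variable CPTs from the deliver-coffee DBN as small functions and multiplies them, instead of indexing a 64 × 64 matrix.

```python
def pr_cr_next(cr_next, loc, cr, rhc):
    """Pr(CR_{t+1} = cr_next | Loc_t, CR_t, RHC_t) for the deliver-coffee action."""
    if not cr:                                  # no outstanding request
        p_true = 0.1                            # a new request arrives w.p. 0.1
    elif not rhc:                               # request outstanding, robot has no coffee
        p_true = 1.0                            # request stays outstanding
    else:                                       # request outstanding, robot holds coffee
        p_true = 0.2 if loc == "Office" else 1.0
    return p_true if cr_next else 1.0 - p_true

def pr_t_next(t_next, t):
    """Pr(T_{t+1} = t_next | T_t): a tidy lab stays tidy with probability 0.91."""
    p_true = 0.91 if t else 0.0
    return p_true if t_next else 1.0 - p_true

def pr_transition_fragment(s_next, s):
    """Product of two of the six per-variable factors of Pr(S_{t+1} | S_t)."""
    return (pr_cr_next(s_next["CR"], s["Loc"], s["CR"], s["RHC"])
            * pr_t_next(s_next["T"], s["T"]))

s      = {"Loc": "Office", "CR": True, "RHC": True, "T": True}
s_next = {"CR": False, "T": True}
print(pr_transition_fragment(s_next, s))   # 0.8 * 0.91 = 0.728 (up to rounding)
```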

13

Structure in CPTs

h So far we have represented each CPT as a table of size exponential in the number of parents

h Notice that there’s regularity in CPTs5 e.g., Pr(CRt+1 | Lt,CRt,RHCt) has many similar entries

h Compact function representations for CPTs can be used to great effect5 decision trees5 algebraic decision diagrams (ADDs/BDDs)

h Here we show examples of decision trees (DTs)

14

Action Representation – DBN/DT

[Decision tree (DT) for the CRt+1 CPT in the deliver coffee DBN:
   CR(t)? — false → 0.1
          — true  → RHC(t)? — false → 1.0
                            — true  → L(t)? — Office (o)   → 0.2
                                            — Entrance (e) → 1.0
 Leaves of the DT give Pr(CRt+1 = true | Lt, CRt, RHCt)]

DTs can often represent conditional probabilities much more compactly than a full conditional probability table (a small data-structure sketch follows below).

e.g. If CR(t) = true and RHC(t) = false, then CR(t+1) = true with probability 1.0

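As a concrete illustration, here is a minimal sketch (an assumed encoding, not the slides' notation) of this decision tree as a plain data structure: internal nodes name the variable tested, branches are labeled by its values, and leaves carry Pr(CRt+1 = true).

```python
# Decision-tree CPT as data: ("var", {value: subtree, ...}) nodes, probability leaves.
CR_TREE = ("CR", {
    False: 0.1,
    True: ("RHC", {
        False: 1.0,
        True: ("L", {"Office": 0.2, "Entrance": 1.0}),
    }),
})

def evaluate(tree, state):
    """Walk the tree using the variable assignment in `state`; leaves are probabilities."""
    while isinstance(tree, tuple):
        var, children = tree
        tree = children[state[var]]
    return tree

print(evaluate(CR_TREE, {"CR": True, "RHC": True, "L": "Office"}))   # 0.2
```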

15

Reward Representation

• Rewards represented with DTs in a similar fashion
  – Would require a vector of size 2^n for an explicit representation

[Reward decision tree:
   CR? — true  → -100   (high cost for an unsatisfied coffee request)
       — false → M? — true  → -10   (high, but lower, cost for undelivered mail)
                    — false → T? — false → -1   (cost for the lab being untidy)
                                 — true  →  1   (small reward for satisfying all of these conditions)]

16

Structured Computation

• Given our compact decision tree (DBN) representation, can we solve the MDP without explicit state space enumeration?

• Can we avoid O(|S|) computations by exploiting regularities made explicit by the representation?

• We will study a general approach for doing this called structured dynamic programming

17

Structured Dynamic Programming

• We now consider how to perform dynamic programming techniques such as VI and PI using the problem structure

• VI and PI are based on a few basic operations.
  – Here we will show how to perform these operations directly on tree representations of value functions, policies, and transition functions

• The approach is very general and can be applied to other representations (e.g. algebraic decision diagrams, situation calculus) and other problems once the main idea is understood

• We will focus on VI here, but the paper also describes a version of modified policy iteration

18

Recall Tree-Based Representations

[DBN for action A over state variables X, Y, Z with decision-tree CPTs, and a tree-represented reward function R (leaves 10 and 0, testing Z). For example, the CPT tree for Y(t+1) = true tests X(t): true → 0.9; false → test Y(t): true → 1.0, false → 0.0.]

Note: we are leaving off time subscripts for readability and using X(t), Y(t), ..., instead.

e.g. If X(t) = false and Y(t) = true, then Y(t+1) = true with probability 1
e.g. If X(t) = true, then Y(t+1) = true with probability 0.9

Recall that each action of the MDP has its own DBN.

19

Structured Dynamic Programming

• Value functions and policies can also have tree representations
  – Often much more compact than tabular representations

• Our Goal: compute the tree representations of the policy and value function, given the tree representations of the transitions and rewards

20

Recall Value Iteration

Value Iteration:
  V^0(s) = R(s)                                          ;; could initialize to 0
  Q_a^{k+1}(s) = R(s) + γ · Σ_{s'} Pr(s'|s, a) · V^k(s')   (Bellman Backup)
  V^{k+1}(s) = max_a Q_a^{k+1}(s)

Suppose that the initial value function V^0 is compactly represented as a tree (a tabular version of the backup is sketched below for reference).

1. Show how to compute compact trees for the Q-functions Q_a^{k+1}
2. Use a max operation on the Q-trees (returns a single tree)
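For reference, a minimal sketch (not from the slides; P, R, and gamma are illustrative placeholders) of the same backup in ordinary tabular form for a two-state toy MDP. Symbolic VI performs exactly these operations, but on tree representations instead of tables.

```python
def bellman_backup(P, R, V, gamma):
    """One backup: V'(s) = max_a [ R(s) + gamma * sum_s2 P(s2|s,a) * V(s2) ]."""
    states, actions = range(len(R)), P.keys()
    Q = {a: [R[s] + gamma * sum(P[a][s][s2] * V[s2] for s2 in states)
             for s in states]
         for a in actions}                     # one Q-vector per action
    return [max(Q[a][s] for a in actions) for s in states]

R = [0.0, 1.0]                                 # reward per state
P = {"stay": [[1.0, 0.0], [0.0, 1.0]],         # P[a][s][s2] = Pr(s2 | s, a)
     "move": [[0.0, 1.0], [1.0, 0.0]]}
V = [0.0, 0.0]
for _ in range(50):
    V = bellman_backup(P, R, V, 0.9)
print(V)                                       # converging toward [9.0, 10.0]
```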

Symbolic Value Iteration

[Diagram: the tree for V^k is combined with each action's DBN — Pr_{A=a}(S'|S), Pr_{A=b}(S'|S), ..., Pr_{A=z}(S'|S) — to produce one Q-tree per action; a symbolic MAX over these trees yields the tree for V^{k+1}]

22

The MAX Tree Operation

[Two value trees over boolean state variables X and Y:
   Tree A: X? — true → 0.9; false → Y? — true → 1.0; false → 0.0
   Tree B: X? — true → 1.0; false → 0.0
 Each tree partitions the state space, assigning a value to each region.]

The state-space max of the above trees assigns each state the larger of its two values.

In general, how can we compute the tree representing the max?

23

The MAX Tree Operation

Can simply append one tree to the leaves of the other. This makes all the distinctions that either tree makes. The max operation is then taken at the leaves of the result (see the code sketch below).

[Appending Tree B to every leaf of Tree A gives a tree whose leaves hold value pairs, e.g. (1.0, 0.9), (0.0, 0.9), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0). Taking the max at each leaf yields:
   X? — true  → X? — true → 1.0; false → 0.9
      — false → Y? — true  → X? — true → 1.0; false → 1.0
                   — false → X? — true → 1.0; false → 0.0]
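A minimal sketch (assuming the same tuple-based tree encoding used earlier in these notes, not the paper's implementation) of this append-then-max operation on the two trees from the slide:

```python
def tree_max(tree1, tree2):
    """Append tree2 to each leaf of tree1, taking the max where two leaves meet."""
    if not isinstance(tree1, tuple):                 # leaf of tree1: append tree2
        if not isinstance(tree2, tuple):
            return max(tree1, tree2)                 # max is taken at the leaves
        var, kids = tree2
        return (var, {val: tree_max(tree1, sub) for val, sub in kids.items()})
    var, kids = tree1
    return (var, {val: tree_max(sub, tree2) for val, sub in kids.items()})

# The two value trees from the slide (X and Y are boolean state variables):
TREE_A = ("X", {True: 0.9, False: ("Y", {True: 1.0, False: 0.0})})
TREE_B = ("X", {True: 1.0, False: 0.0})

print(tree_max(TREE_A, TREE_B))
# ('X', {True: ('X', {True: 1.0, False: 0.9}),
#        False: ('Y', {True: ('X', {True: 1.0, False: 1.0}),
#                      False: ('X', {True: 1.0, False: 0.0})})})
```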

24

The MAX Tree Operation

The resulting tree may have unreachable leaves. We can simplify the tree by removing such paths (a sketch of this simplification follows below).

[Simplify: a repeated test of X below a branch that has already fixed X is redundant, so its unreachable branches can be pruned, giving:
   X? — true → 1.0; false → Y? — true → 1.0; false → 0.0]
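A matching sketch (same assumed tree encoding) of the simplification step, which prunes branches that re-test a variable already fixed higher up on the path:

```python
def simplify(tree, fixed=None):
    """Remove tests of variables already fixed on the path (their other branches are unreachable)."""
    fixed = fixed or {}                     # variable -> value already decided on this path
    if not isinstance(tree, tuple):
        return tree
    var, kids = tree
    if var in fixed:                        # redundant test: keep only the reachable branch
        return simplify(kids[fixed[var]], fixed)
    return (var, {val: simplify(sub, {**fixed, var: val})
                  for val, sub in kids.items()})

# Simplifying the max tree from the previous slide recovers the compact tree:
MAX_TREE = ('X', {True: ('X', {True: 1.0, False: 0.9}),
                  False: ('Y', {True: ('X', {True: 1.0, False: 1.0}),
                                False: ('X', {True: 1.0, False: 0.0})})})
print(simplify(MAX_TREE))   # ('X', {True: 1.0, False: ('Y', {True: 1.0, False: 0.0})})
```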

25

BINARY OPERATIONS (other binary operations are similar to max)

26

MARGINALIZATION

Given a diagram for a function f that mentions a boolean variable A, compute the diagram representing Σ_A f, i.e. the sum of f restricted to A = true and f restricted to A = false (a sketch follows below).

There are libraries for doing this.
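For instance, a minimal sketch (assumed tree encoding as before, not any particular library's API) of marginalizing a boolean variable out of a tree-represented function by adding its two restrictions:

```python
def restrict(tree, var, value):
    """Fix `var` to `value` everywhere in the tree."""
    if not isinstance(tree, tuple):
        return tree
    v, kids = tree
    if v == var:
        return restrict(kids[value], var, value)
    return (v, {val: restrict(sub, var, value) for val, sub in kids.items()})

def tree_add(t1, t2):
    """Pointwise sum of two trees (append one to the leaves of the other, add at the leaves)."""
    if not isinstance(t1, tuple):
        if not isinstance(t2, tuple):
            return t1 + t2
        v, kids = t2
        return (v, {val: tree_add(t1, sub) for val, sub in kids.items()})
    v, kids = t1
    return (v, {val: tree_add(sub, t2) for val, sub in kids.items()})

def sum_out(tree, var):
    """Tree representing the sum of `tree` over the boolean variable `var`."""
    return tree_add(restrict(tree, var, True), restrict(tree, var, False))

F = ("A", {True: ("X", {True: 2.0, False: 0.0}), False: 1.0})
print(sum_out(F, "A"))                      # ('X', {True: 3.0, False: 1.0})
```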

Symbolic Bellman Backup

for each action a:
   Q_a^{k+1}(S) = R(S) + γ · Σ_{S'} Pr_a(S'|S) · V^k(S'),
   where S = (X_1, …, X_l) and S' = (X'_1, …, X'_l)

Every quantity in the backup — the reward R, the current value function V^k, and each CPT of the action's DBN — is represented as a tree, and every operation (multiplying in the CPTs, marginalizing out the next-state variables X'_i, adding the reward, and maximizing over actions) is performed symbolically on trees (a toy sketch follows below).
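Putting the pieces together, here is a rough, heavily simplified sketch (my own toy construction under the same assumed tree encoding, not the paper's algorithm) of a symbolic Bellman backup for an MDP whose state is a single boolean variable X; all numbers and names are illustrative:

```python
def combine(t1, t2, op):
    """Append t2 to the leaves of t1 and apply `op` where two leaves meet."""
    if not isinstance(t1, tuple):
        if not isinstance(t2, tuple):
            return op(t1, t2)
        var, kids = t2
        return (var, {v: combine(t1, s, op) for v, s in kids.items()})
    var, kids = t1
    return (var, {v: combine(s, t2, op) for v, s in kids.items()})

def simplify(tree, fixed=None):
    """Prune tests of variables already fixed on the path (as on the simplify slide)."""
    fixed = fixed or {}
    if not isinstance(tree, tuple):
        return tree
    var, kids = tree
    if var in fixed:
        return simplify(kids[fixed[var]], fixed)
    return (var, {v: simplify(s, {**fixed, var: v}) for v, s in kids.items()})

def symbolic_backup(R, cpt_true, v_true, v_false, gamma):
    """Q(X) = R(X) + gamma * [Pr(X'=T|X)*V(X'=T) + Pr(X'=F|X)*V(X'=F)], all as trees."""
    expected = combine(combine(cpt_true, v_true, lambda p, v: p * v),
                       combine(cpt_true, v_false, lambda p, v: (1.0 - p) * v),
                       lambda a, b: a + b)           # marginalize X' symbolically
    return simplify(combine(R, expected, lambda r, e: r + gamma * e))

R   = ("X", {True: 1.0, False: 0.0})                 # reward tree (illustrative)
CPT = ("X", {True: 0.9, False: 0.2})                 # Pr(X'=true | X) for some action
print(symbolic_backup(R, CPT, v_true=10.0, v_false=0.0, gamma=0.9))
# -> ('X', {True: 9.1, False: 1.8})   (up to float rounding)
```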

34

SDP: Relative Merits

• Adaptive, nonuniform, exact abstraction method
  – provides an exact solution to the MDP
  – much more efficient on certain problems (time/space)
  – 400 million state problems in a couple of hours

• Can formulate a similar procedure for modified policy iteration

• Some drawbacks
  – produces a piecewise constant value function
  – some problems admit no compact solution representation
      – so the size of the trees blows up with enough iterations
  – approximation may be desirable or necessary

35

Approximate SDP

• Easy to approximate the solution using SDP

• Simple pruning of the value function
  – Simply "merge" leaves that have similar values (see the sketch below)
  – Can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00]

• Gives regions of approximately the same value
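A minimal sketch (assumed tree encoding; the cited papers prune trees/ADDs with more care) of this merging of similar-valued leaves; the leaf values echo the pruned ADD on the next slide:

```python
def leaves(tree):
    """All leaf values under `tree`."""
    if not isinstance(tree, tuple):
        return [tree]
    _, kids = tree
    return [leaf for sub in kids.values() for leaf in leaves(sub)]

def prune(tree, tol):
    """Replace any subtree whose leaf values lie within `tol` by a single midpoint leaf."""
    if not isinstance(tree, tuple):
        return tree
    vals = leaves(tree)
    if max(vals) - min(vals) <= tol:        # all values here are "similar": merge them
        return (max(vals) + min(vals)) / 2.0
    var, kids = tree
    return (var, {val: prune(sub, tol) for val, sub in kids.items()})

V_TREE = ("X", {True: ("Y", {True: 8.45, False: 8.36}), False: 5.19})
print(prune(V_TREE, tol=0.5))   # -> ('X', {True: 8.405, False: 5.19}) up to rounding
```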

36

A Pruned Value ADD

[Value ADD and its pruned version: internal nodes test Loc, HCU, HCR, W, R, U; the unpruned ADD has leaves such as 10.00, 9.00, 8.45, 8.36, 7.64, 7.45, 6.81, 6.64, 6.19, 5.62, 5.19. In the pruned ADD, subtrees with similar values are merged into leaves labeled with value ranges: [9.00, 10.00], [7.45, 8.45], [6.64, 7.64], [5.19, 6.19].]

37

Approximate SDP: Relative Merits

• Relative merits of ASDP: fewer regions implies faster computation
  – 30-40 billion state problems in a couple of hours
  – allows fine-grained control of time vs. solution quality with dynamic error bounds
  – technical challenges: variable ordering, convergence, fixed vs. adaptive tolerance, etc.

• Some drawbacks
  – (still) produces a piecewise constant value function
  – doesn't exploit additive structure of the value function at all

• Bottom line: when a problem matches the structural assumptions of SDP we can gain much, but many problems do not match those assumptions.

38

Ongoing Work

• Factored action spaces
  – Sometimes the action space is large, but has structure
  – For example, cooperative multi-agent systems

• Recent work (at OSU) has studied SDP for factored action spaces
  – Include action variables in the DBNs

[DBN figure with both action variables and state variables]
