
Page 1: Symbolic Dynamic Programming

Alan Fern*

* Based in part on slides by Craig Boutilier

Page 2: Planning in Large State Space MDPs

- You have learned algorithms for computing optimal policies
  - Value Iteration
  - Policy Iteration
- These algorithms explicitly enumerate the state space
  - Often this is impractical
- Simulation-based planning and RL allowed for approximate planning in large MDPs
  - Did not utilize an explicit model of the MDP; only used a strong or weak simulator
- How can we get exact solutions to enormous MDPs?

Page 3: Structured Representations

- Policy iteration and value iteration treat states as atomic entities with no internal structure
- In most cases, states actually do have internal structure
  - E.g., described by a set of state variables, or objects with properties and relationships
  - Humans exploit this structure to plan effectively
- What if we had a compact, structured representation for a large MDP and could efficiently plan with it?
  - Would allow for exact solutions to very large MDPs

Page 4: A Planning Problem

Page 5: Logical or Feature-based Problems

- For most AI problems, states are not viewed as atomic entities
  - They contain structure; for example, they are described by a set of boolean propositions/variables: S = X_1 x X_2 x ... x X_n
  - |S| is exponential in the number of propositions
- Basic policy and value iteration do nothing to exploit the structure of the MDP when it is available

Page 6: Solution?

- Require structured representations in terms of propositions
  - compactly represent the transition function
  - compactly represent the reward function
  - compactly represent value functions and policies
- Require structured computation
  - perform steps of PI or VI directly on the structured representations
  - can avoid the need to enumerate the state space
- We start by representing the transition structure as dynamic Bayesian networks

Page 7: Propositional Representations

- States decomposable into state variables (we will assume boolean variables): S = X_1 x X_2 x ... x X_n
- Structured representations are the norm in AI
  - Decision diagrams, Bayesian networks, etc.
  - Describe how actions affect/depend on features
  - Natural, concise, can be exploited computationally
- The same ideas can be used for MDPs

Page 8: Robot Domain as Propositional MDP

- Propositional variables for the single-user version
  - Loc (robot's location): Office, Entrance
  - T (lab is tidy): boolean
  - CR (coffee request outstanding): boolean
  - RHC (robot holding coffee): boolean
  - RHM (robot holding mail): boolean
  - M (mail waiting for pickup): boolean
- Actions/Events
  - move to an adjacent location, pick up mail, get coffee, deliver mail, deliver coffee, tidy lab
  - mail arrival, coffee request issued, lab gets messy
- Rewards
  - rewarded for tidy lab, satisfying a coffee request, delivering mail (or penalized for their negation)

Page 9: State Space

- State of the MDP: an assignment to these six variables
  - 64 states
  - grows exponentially with the number of variables
- Transition matrices
  - 4032 parameters required per matrix (= 64 x 63, since each of the 64 rows is a distribution that sums to 1)
  - one matrix per action (6 or 7 or more actions)
- Reward function
  - 64 reward values needed
- Factored state and action descriptions will (generally) break this exponential dependence

Page 10: Dynamic Bayesian Networks (DBNs)

- Bayesian networks (BNs) are a common representation for probability distributions
  - A directed acyclic graph (DAG) represents conditional independence
  - Conditional probability tables (CPTs) quantify the local probability distributions
- Dynamic Bayes net action representation
  - one Bayes net for each action a, representing the set of conditional distributions Pr(St+1 | At, St)
  - each state variable occurs at time t and t+1
  - dependence of t+1 variables on t variables is depicted by directed arcs

Page 11: DBN Representation: deliver coffee

[Figure: two-slice DBN for the deliver coffee action over Tt, Lt, CRt, RHCt, RHMt, Mt and their t+1 copies; arcs run from Tt to Tt+1, Lt to Lt+1, RHMt to RHMt+1, Mt to Mt+1, and from Lt, CRt, RHCt to CRt+1.]

Pr(CRt+1 | Lt, CRt, RHCt):

L  CR  RHC | CR(t+1)=T  CR(t+1)=F
O  T   T   |   0.2        0.8
E  T   T   |   1.0        0.0
O  F   T   |   0.1        0.9
E  F   T   |   0.1        0.9
O  T   F   |   1.0        0.0
E  T   F   |   1.0        0.0
O  F   F   |   0.1        0.9
E  F   F   |   0.1        0.9

Pr(Tt+1 | Tt):

T | T(t+1)=T  T(t+1)=F
T |   0.91      0.09
F |   0.0       1.0

Pr(RHMt+1 | RHMt):

RHM | RHM(t+1)=T  RHM(t+1)=F
T   |   1.0         0.0
F   |   0.0         1.0

Pr(St+1 | St) is the product of the six per-variable tables (a code sketch of this product follows).
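A minimal sketch of this product in Python (my encoding, not from the slides): each CPT maps an assignment of its parent variables to the probability that its variable is true at t+1. Only the three CPTs shown on the slide are filled in here.

```python
parents = {"CR": ("L", "CR", "RHC"), "T": ("T",), "RHM": ("RHM",)}
cpts = {
    # L is "O" (Office) or "E" (Entrance); the other variables are booleans.
    "CR": {("O", True, True): 0.2, ("E", True, True): 1.0,
           ("O", False, True): 0.1, ("E", False, True): 0.1,
           ("O", True, False): 1.0, ("E", True, False): 1.0,
           ("O", False, False): 0.1, ("E", False, False): 0.1},
    "T":   {(True,): 0.91, (False,): 0.0},
    "RHM": {(True,): 1.0, (False,): 0.0},
}

def transition_prob(state, next_state):
    """Pr(next_state | state): the product of one entry from each variable's CPT."""
    p = 1.0
    for var, cpt in cpts.items():
        p_true = cpt[tuple(state[v] for v in parents[var])]
        p *= p_true if next_state[var] else 1.0 - p_true
    return p

s = {"L": "O", "CR": True, "RHC": True, "T": True, "RHM": False}
s2 = {"CR": True, "T": True, "RHM": False}
print(transition_prob(s, s2))  # 0.2 * 0.91 * 1.0 = 0.182
```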

Page 12: Benefits of DBN Representation

Pr(St+1 | St) = Pr(RHMt+1, Mt+1, Tt+1, Lt+1, CRt+1, RHCt+1 | RHMt, Mt, Tt, Lt, CRt, RHCt)
             = Pr(RHMt+1 | RHMt) * Pr(Mt+1 | Mt) * Pr(Tt+1 | Tt)
               * Pr(Lt+1 | Lt) * Pr(CRt+1 | CRt, RHCt, Lt) * Pr(RHCt+1 | RHCt, Lt)

- Only 20 parameters vs. 4032 for the full matrix (8 for CR's CPT, 4 for RHC's, and 2 each for RHM, M, T, and L)
- Removes the global exponential dependence

[Figure: the explicit 64 x 64 transition matrix over states s1, ..., s64 vs. the factored two-slice DBN over RHM, M, T, L, CR, RHC.]
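To make the comparison explicit (my arithmetic, using the parent sets in the factorization above): an explicit matrix has 2^n rows, each a distribution over 2^n next states with one entry fixed by normalization, while the factored model needs one parameter per parent assignment per variable.

```latex
\[
|S| = 2^n, \qquad
\underbrace{2^n(2^n - 1)}_{\text{explicit matrix}} \Big|_{n=6} = 64 \times 63 = 4032,
\qquad
\underbrace{\sum_{i=1}^{n} 2^{|\mathrm{Pa}(X_i)|}}_{\text{factored DBN}} = 8 + 4 + 2 + 2 + 2 + 2 = 20 .
\]
```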

Page 13: Structure in CPTs

- So far we have represented each CPT as a table of size exponential in the number of parents
- Notice that there is regularity in CPTs
  - e.g., Pr(CRt+1 | Lt, CRt, RHCt) has many similar entries
- Compact function representations for CPTs can be used to great effect
  - decision trees
  - algebraic decision diagrams (ADDs/BDDs)
- Here we show examples of decision trees (DTs)

Page 14: Action Representation -- DBN/DT

Decision tree (DT) for the leaves of Pr(CRt+1 = true | Lt, CRt, RHCt):

CR(t)
  f: 0.1
  t: RHC(t)
       f: 1.0
       t: L(t)
            o: 0.2
            e: 1.0

DTs can often represent conditional probabilities much more compactly than a full conditional probability table.

e.g., if CR(t) = true & RHC(t) = false, then CR(t+1) = true with prob. 1

(An encoding sketch follows.)
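A sketch of how such a tree might be encoded (my representation, not the paper's): a tree is either a float leaf or a tuple (variable, true-branch, false-branch), and the location test is flattened to a boolean meaning "L is Office". Four leaves replace the eight-row table.

```python
# Pr(CR(t+1) = true | L(t), CR(t), RHC(t)) as a decision tree.
# Encoding (assumed): leaf = float, node = (var, true_branch, false_branch).
cr_tree = ("CR",
           ("RHC",
            ("L", 0.2, 1.0),  # CR & RHC: Office -> 0.2, Entrance -> 1.0
            1.0),             # CR & ~RHC: request stays outstanding w.p. 1
           0.1)               # ~CR: a new request arrives w.p. 0.1

def evaluate(tree, state):
    """Walk the tree using a state given as {variable: bool}."""
    while isinstance(tree, tuple):
        var, t_branch, f_branch = tree
        tree = t_branch if state[var] else f_branch
    return tree

print(evaluate(cr_tree, {"CR": True, "RHC": False, "L": True}))  # -> 1.0
```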

Page 15: Reward Representation

- Rewards are represented with DTs in a similar fashion
  - Would require a vector of size 2^n for an explicit representation

CR
  t: -100          (high cost for an unsatisfied coffee request)
  f: M
       t: -10      (high, but lower, cost for undelivered mail)
       f: T
            t: 1   (small reward for satisfying all of these conditions)
            f: -1  (cost for the lab being untidy)
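The same tuple encoding can represent this reward tree; the branch ordering below is the one reconstructed from the slide's annotations.

```python
# Reward tree, using the (var, true_branch, false_branch) encoding from the
# previous sketch; structure reconstructed from the slide's annotations.
reward_tree = ("CR",
               -100,             # unsatisfied coffee request
               ("M",
                -10,             # undelivered mail
                ("T", 1, -1)))   # tidy lab: +1, untidy: -1

print(evaluate(reward_tree, {"CR": False, "M": False, "T": True}))  # -> 1
```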

Page 16: Structured Computation

- Given our compact decision tree (DBN) representation, can we solve the MDP without explicit state space enumeration?
- Can we avoid O(|S|) computations by exploiting the regularities made explicit by the representation?
- We will study a general approach for doing this, called structured dynamic programming

Page 17: Structured Dynamic Programming

- We now consider how to perform dynamic programming techniques such as VI and PI using the problem structure
- VI and PI are based on a few basic operations
  - Here we will show how to perform these operations directly on tree representations of value functions, policies, and transition functions
- The approach is very general and can be applied to other representations (e.g., algebraic decision diagrams, situation calculus) and other problems once the main idea is understood
- We will focus on VI here, but the paper also describes a version of modified policy iteration

Page 18: Recall Tree-Based Representations

[Figure: a DBN for action A over variables X, Y, Z, with decision-tree CPTs for each next-state variable, and a decision tree for the reward function R.]

Tree for Pr(Y(t+1) = true | X(t), Y(t)):
X
  t: 0.9
  f: Y
       t: 1.0
       f: 0.0

Reward tree R:
Z
  t: 10
  f: 0

e.g., if X(t) = true, then Y(t+1) = true w/ prob 0.9
e.g., if X(t) = false & Y(t) = true, then Y(t+1) = true w/ prob 1

Note: we leave off time subscripts for readability, writing X(t), Y(t), ..., only where needed.

Recall that each action of the MDP has its own DBN.

Page 19: Structured Dynamic Programming

- Value functions and policies can also have tree representations
  - Often much more compact than tables
- Our goal: compute the tree representations of the policy and value function, given the tree representations of the transitions and rewards

Page 20: Recall Value Iteration

Value Iteration:  V^0(s) = R(s)   ;; could initialize to 0

  Bellman backup:  V^{k+1}(s) = R(s) + gamma * max_a sum_{s'} Pr_a(s'|s) V^k(s')

Suppose that the initial V^0 is compactly represented as a tree.

1. Show how to compute compact trees for Q_a^{k+1}(s) = R(s) + gamma * sum_{s'} Pr_a(s'|s) V^k(s')
2. Use a max operation on the Q-trees (returns a single tree): V^{k+1} = max_a Q_a^{k+1}

Page 21: Symbolic Value Iteration

[Figure: the current value tree V^k is combined with each action's transition trees Pr_{A=a}(S'|S), Pr_{A=b}(S'|S), ..., Pr_{A=z}(S'|S) to produce one Q-tree per action; a symbolic MAX over the Q-trees yields the tree for V^{k+1}.]

Page 22: The MAX Tree Operation

Consider taking the max of two value trees:

Tree A:
X
  t: 1.0
  f: 0.0

Tree B:
X
  t: 0.9
  f: Y
       t: 1.0
       f: 0.0

Each tree partitions the state space, assigning a value to each region. The state-space max of the above trees assigns 1.0 to states with X = t, 1.0 to states with X = f & Y = t, and 0.0 to states with X = f & Y = f.

In general, how can we compute the tree representing the max?

Page 23: The MAX Tree Operation

Can simply append one tree to the leaves of the other: the result makes all the distinctions that either tree makes, and the max operation is taken at the leaves. Appending Tree A beneath each leaf of Tree B gives leaves labeled with (A-value, B-value) pairs:

X
  t: X
       t: 1.0, 0.9
       f: 0.0, 0.9
  f: Y
       t: X
            t: 1.0, 1.0
            f: 0.0, 1.0
       f: X
            t: 1.0, 0.0
            f: 0.0, 0.0

Taking the max at each leaf:

X
  t: X
       t: 1.0
       f: 0.9
  f: Y
       t: X
            t: 1.0
            f: 1.0
       f: X
            t: 1.0
            f: 0.0
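A minimal sketch of this append-and-combine step, on the tuple encoding used earlier; `apply_op` works for max and for the other leafwise binary operations mentioned on the following slides.

```python
def apply_op(a, b, op):
    """Append b beneath a's leaves and combine leaf values with op."""
    if isinstance(a, tuple):
        var, t, f = a
        return (var, apply_op(t, b, op), apply_op(f, b, op))
    if isinstance(b, tuple):
        var, t, f = b
        return (var, apply_op(a, t, op), apply_op(a, f, op))
    return op(a, b)

# The two trees from the example:
tree_a = ("X", 1.0, 0.0)
tree_b = ("X", 0.9, ("Y", 1.0, 0.0))
print(apply_op(tree_b, tree_a, max))
# -> ('X', ('X', 1.0, 0.9), ('Y', ('X', 1.0, 1.0), ('X', 1.0, 0.0)))
```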

Page 24: The MAX Tree Operation

The resulting tree may have unreachable leaves: e.g., on the outer X = t branch, the inner X = f leaf (0.9) can never be reached. We can simplify the tree by removing such paths:

X
  t: 1.0
  f: Y
       t: 1.0
       f: 0.0
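A minimal sketch of this simplification: drop tests on variables already decided along the path, and merge identical children (same encoding as above).

```python
def simplify(tree, path=None):
    path = path or {}
    if not isinstance(tree, tuple):
        return tree
    var, t, f = tree
    if var in path:                      # branch is unreachable: keep the live side
        return simplify(t if path[var] else f, path)
    t2 = simplify(t, {**path, var: True})
    f2 = simplify(f, {**path, var: False})
    return t2 if t2 == f2 else (var, t2, f2)

maxed = ("X", ("X", 1.0, 0.9), ("Y", ("X", 1.0, 1.0), ("X", 1.0, 0.0)))
print(simplify(maxed))  # -> ('X', 1.0, ('Y', 1.0, 0.0))
```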

Page 25: Binary Operations

(other binary operations are computed similarly to max)

Page 26: Marginalization

Compute the diagram representing sum_A f(A, X_1, ..., X_n), i.e., sum the variable A out of a diagram f. There are libraries for doing this. (A sketch follows.)
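A sketch of summing a boolean variable out of a tree, built from a `restrict` helper plus the `apply_op` append operation from the MAX sketch (my helpers, not a particular library's API):

```python
def restrict(tree, var, val):
    """The subtree obtained by fixing var to val (drops tests on var)."""
    if not isinstance(tree, tuple):
        return tree
    v, t, f = tree
    if v == var:
        return restrict(t if val else f, var, val)
    return (v, restrict(t, var, val), restrict(f, var, val))

def sum_out(tree, var):
    """Marginalize: tree[var=true] + tree[var=false], combined leafwise."""
    return apply_op(restrict(tree, var, True),
                    restrict(tree, var, False),
                    lambda x, y: x + y)
```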

Page 27: Symbolic Bellman Backup

For each action a, compute the Q-tree

  Q_a^{k+1}(S) = R(S) + gamma * sum_{S'} Pr_a(S'|S) V^k(S'),   where S = (X_1, ..., X_l), S' = (X'_1, ..., X'_l)

with every quantity (R, the CPT trees of the DBN for a, and V^k) represented as a tree; the sum over S' is computed by marginalizing out the primed variables one at a time.

[Pages 28-32 step through this computation graphically on the tree representations.]

Page 33: Symbolic Bellman Backup

For each action a, combine the reward tree, the action's CPT trees, and the current value tree into a Q-tree, then take the symbolic MAX of the Q-trees (a code sketch follows).
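Putting the pieces together, a sketch of the symbolic backup in terms of the helpers above (`apply_op`, `restrict`, `simplify`); the per-variable expectation trick sums out one primed variable at a time. Names like `prime` and the `dbns` layout are my assumptions, not the paper's.

```python
from functools import reduce

def prime(tree):
    """Rename every variable X in a tree to X'."""
    if not isinstance(tree, tuple):
        return tree
    v, t, f = tree
    return (v + "'", prime(t), prime(f))

def expect_out(tree, var, p_tree):
    """Sum a primed variable out of tree, weighted by p_tree = Pr(var = true)."""
    hi = apply_op(p_tree, restrict(tree, var, True), lambda p, v: p * v)
    lo = apply_op(p_tree, restrict(tree, var, False), lambda p, v: (1 - p) * v)
    return simplify(apply_op(hi, lo, lambda a, b: a + b))

def bellman_backup(R, dbns, V, gamma=0.9):
    """R, V: trees over state vars; dbns: {action: {primed_var: Pr-true tree}}."""
    q_trees = []
    for a, cpts in dbns.items():
        q = prime(V)                        # V^k now ranges over primed variables
        for var_p, p_tree in cpts.items():  # sum out each X'_i in turn
            q = expect_out(q, var_p, p_tree)
        q_trees.append(apply_op(R, q, lambda r, ev: r + gamma * ev))
    return simplify(reduce(lambda u, v: apply_op(u, v, max), q_trees))
```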

Page 34: SDP: Relative Merits

- Adaptive, nonuniform, exact abstraction method
  - provides an exact solution to the MDP
  - much more efficient on certain problems (time/space)
  - problems with 400 million states solved in a couple of hours
- Can formulate a similar procedure for modified policy iteration
- Some drawbacks
  - produces a piecewise constant value function
  - some problems admit no compact solution representation
    - so the size of the trees blows up with enough iterations
  - approximation may be desirable or necessary

Page 35: Approximate SDP

- Easy to approximate the solution using SDP
- Simple pruning of the value function
  - Simply "merge" leaves that have similar values
  - Can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00]
- Gives regions of approximately the same value (a code sketch follows)
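One simple way such merging might look (a sketch; note that the ADD pruning of [StaubinHoeyBou00], as on the next slide, keeps value ranges on merged leaves rather than midpoints):

```python
def prune(tree, tol):
    """Merge sibling leaves whose values are within tol, using their midpoint."""
    if not isinstance(tree, tuple):
        return tree
    var, t, f = tree
    t, f = prune(t, tol), prune(f, tol)
    if isinstance(t, float) and isinstance(f, float) and abs(t - f) <= tol:
        return (t + f) / 2.0               # collapse near-equal leaves
    return (var, t, f)

# Example values borrowed from the next slide's ADD:
print(prune(("W", ("R", 8.45, 8.36), 7.45), 1.5))  # -> 7.9275 (a single leaf)
```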

Page 36: A Pruned Value ADD

[Figure: an exact value ADD tests Loc, HCR, HCU and then W, R, U, with leaves such as 10.00, 9.00, 8.45, 8.36, 7.64, 7.45, 6.81, 6.64, 6.19, 5.62, 5.19. In the pruned ADD, each subtree over W, R, U collapses to a single leaf labeled with a value range: [9.00, 10.00], [7.45, 8.45], [6.64, 7.64], [5.19, 6.19].]

Page 37: Approximate SDP: Relative Merits

- Relative merits of ASDP: fewer regions implies faster computation
  - problems with 30-40 billion states solved in a couple of hours
  - allows fine-grained control of time vs. solution quality with dynamic error bounds
  - technical challenges: variable ordering, convergence, fixed vs. adaptive tolerance, etc.
- Some drawbacks
  - (still) produces a piecewise constant value function
  - doesn't exploit additive structure of the value function at all
- Bottom line: when a problem matches the structural assumptions of SDP, we can gain much; but many problems do not match the assumptions.

Page 38: Ongoing Work

- Factored action spaces
  - Sometimes the action space is large, but has structure
  - For example, cooperative multi-agent systems
- Recent work (at OSU) has studied SDP for factored action spaces
  - Include action variables in the DBNs

[Figure: a DBN whose time-t slice contains both action variables and state variables, each with arcs into the time-t+1 state variables.]