
1

Symbolic Dynamic Programming

Alan Fern *

* Based in part on slides by Craig Boutilier

2

Planning in Large State Space MDPs

• You have learned algorithms for computing optimal policies
  – Value Iteration
  – Policy Iteration

• These algorithms explicitly enumerate the state space
  – Often this is impractical

• Simulation-based planning and RL allowed for approximate planning in large MDPs
  – Did not utilize an explicit model of the MDP; only used a strong or weak simulator

• How can we get exact solutions to enormous MDPs?

3

Structured Representations

• Policy iteration and value iteration treat states as atomic entities with no internal structure.

• In most cases, states actually do have internal structure
  – E.g. described by a set of state variables, or objects with properties and relationships
  – Humans exploit this structure to plan effectively

• What if we had a compact, structured representation for a large MDP and could efficiently plan with it?
  – Would allow for exact solutions to very large MDPs

4

A Planning Problem

5

Logical or Feature-based Problems

• For most AI problems, states are not viewed as atomic entities.
  – They contain structure. For example, they are described by a set of boolean propositions/variables
  – |S| is exponential in the number of propositions

• Basic policy and value iteration do nothing to exploit the structure of the MDP when it is available


6

Solution?

• Require structured representations in terms of propositions
  – compactly represent transition function
  – compactly represent reward function
  – compactly represent value functions and policies

• Require structured computation
  – perform steps of PI or VI directly on structured representations
  – can avoid the need to enumerate the state space

• We start by representing the transition structure as dynamic Bayesian networks

7

Propositional Representations

• States decomposable into state variables (we will assume boolean variables)

• Structured representations are the norm in AI
  – Decision diagrams, Bayesian networks, etc.
  – Describe how actions affect/depend on features
  – Natural, concise, can be exploited computationally

• Same ideas can be used for MDPs


8

Robot Domain as Propositional MDP

• Propositional variables for the single-user version
  – Loc (robot's location): Office, Entrance
  – T (lab is tidy): boolean
  – CR (coffee request outstanding): boolean
  – RHC (robot holding coffee): boolean
  – RHM (robot holding mail): boolean
  – M (mail waiting for pickup): boolean

• Actions/Events
  – move to an adjacent location, pickup mail, get coffee, deliver mail, deliver coffee, tidy lab
  – mail arrival, coffee request issued, lab gets messy

• Rewards
  – rewarded for tidy lab, satisfying a coffee request, delivering mail
  – (or penalized for their negation)

9

State Space

• State of MDP: an assignment to these six variables
  – 64 states
  – grows exponentially with the number of variables

• Transition matrices
  – 4032 parameters required per matrix (64 × 63 free entries, since each row sums to 1)
  – one matrix per action (6 or 7 or more actions)

• Reward function
  – 64 reward values needed

• Factored state and action descriptions will break this exponential dependence (generally)

10

Dynamic Bayesian Networks (DBNs)

• Bayesian networks (BNs) are a common representation for probability distributions
  – A graph (DAG) represents conditional independence
  – Conditional probability tables (CPTs) quantify local probability distributions

• Dynamic Bayes net action representation
  – one Bayes net for each action a, representing the set of conditional distributions Pr(St+1 | At, St)
  – each state variable occurs at time t and t+1
  – dependence of t+1 variables on t variables depicted by directed arcs

11

DBN Representation: deliver coffee

[DBN for the deliver coffee action: state variables T, L, CR, RHC, RHM, M at time t, with arcs to their time t+1 counterparts]

CPT for Pr(CRt+1 | Lt, CRt, RHCt):

  L  CR  RHC | Pr(CRt+1 = T)  Pr(CRt+1 = F)
  O  T   T   |      0.2            0.8
  E  T   T   |      1.0            0.0
  O  F   T   |      0.1            0.9
  E  F   T   |      0.1            0.9
  O  T   F   |      1.0            0.0
  E  T   F   |      1.0            0.0
  O  F   F   |      0.1            0.9
  E  F   F   |      0.1            0.9

CPT for Pr(Tt+1 | Tt):

  Tt | Pr(Tt+1 = T)  Pr(Tt+1 = F)
  T  |     0.91          0.09
  F  |     0.0           1.0

CPT for Pr(RHMt+1 | RHMt):

  RHMt | Pr(RHMt+1 = T)  Pr(RHMt+1 = F)
  T    |      1.0             0.0
  F    |      0.0             1.0

The full transition distribution Pr(St+1 | St) is the product of each of the 6 per-variable tables.

12

Benefits of DBN Representation

Pr(St+1 | St) = Pr(RHMt+1, Mt+1, Tt+1, Lt+1, CRt+1, RHCt+1 | RHMt, Mt, Tt, Lt, CRt, RHCt)
             = Pr(RHMt+1 | RHMt) × Pr(Mt+1 | Mt) × Pr(Tt+1 | Tt)
               × Pr(Lt+1 | Lt) × Pr(CRt+1 | CRt, RHCt, Lt) × Pr(RHCt+1 | RHCt, Lt)

• Only 20 parameters vs. 4032 for the full matrix (see the sketch after the matrix comparison below)
• Removes the global exponential dependence

Full matrix (for comparison):

        s1    s2    ...  s64
  s1    0.9   0.05  ...  0.0
  s2    0.0   0.20  ...  0.1
  ...
  s64   0.1   0.0   ...  0.0

[vs. the DBN for deliver coffee from the previous slide, over T, L, CR, RHC, RHM, M at times t and t+1]
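To make the factored representation concrete, here is a minimal Python sketch (not from the slides; the dictionary-based state encoding and function names are illustrative assumptions) that stores two of the per-variable CPTs from the deliver-coffee DBN as small functions and multiplies them, instead of indexing a 64 × 64 matrix.

```python
def pr_cr_next(cr_next, loc, cr, rhc):
    """Pr(CR_{t+1} = cr_next | Loc_t, CR_t, RHC_t) for the deliver-coffee action."""
    if not cr:                                  # no outstanding request
        p_true = 0.1                            # a new request arrives w.p. 0.1
    elif not rhc:                               # request outstanding, robot has no coffee
        p_true = 1.0                            # request stays outstanding
    else:                                       # request outstanding, robot holds coffee
        p_true = 0.2 if loc == "Office" else 1.0
    return p_true if cr_next else 1.0 - p_true

def pr_t_next(t_next, t):
    """Pr(T_{t+1} = t_next | T_t): a tidy lab stays tidy with probability 0.91."""
    p_true = 0.91 if t else 0.0
    return p_true if t_next else 1.0 - p_true

def pr_transition_fragment(s_next, s):
    """Product of two of the six per-variable factors of Pr(S_{t+1} | S_t)."""
    return (pr_cr_next(s_next["CR"], s["Loc"], s["CR"], s["RHC"])
            * pr_t_next(s_next["T"], s["T"]))

s      = {"Loc": "Office", "CR": True, "RHC": True, "T": True}
s_next = {"CR": False, "T": True}
print(pr_transition_fragment(s_next, s))   # 0.8 * 0.91 = 0.728 (up to rounding)
```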

13

Structure in CPTs

h So far we have represented each CPT as a table of size exponential in the number of parents

h Notice that there’s regularity in CPTs5 e.g., Pr(CRt+1 | Lt,CRt,RHCt) has many similar entries

h Compact function representations for CPTs can be used to great effect5 decision trees5 algebraic decision diagrams (ADDs/BDDs)

h Here we show examples of decision trees (DTs)

14

Action Representation – DBN/DT

[Decision tree (DT) for the CRt+1 CPT in the deliver coffee DBN:
   CR(t)? — false → 0.1
          — true  → RHC(t)? — false → 1.0
                            — true  → L(t)? — Office (o)   → 0.2
                                            — Entrance (e) → 1.0
 Leaves of the DT give Pr(CRt+1 = true | Lt, CRt, RHCt)]

DTs can often represent conditional probabilities much more compactly than a full conditional probability table (a small data-structure sketch follows below).

e.g. If CR(t) = true and RHC(t) = false, then CR(t+1) = true with probability 1.0

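As a concrete illustration, here is a minimal sketch (an assumed encoding, not the slides' notation) of this decision tree as a plain data structure: internal nodes name the variable tested, branches are labeled by its values, and leaves carry Pr(CRt+1 = true).

```python
# Decision-tree CPT as data: ("var", {value: subtree, ...}) nodes, probability leaves.
CR_TREE = ("CR", {
    False: 0.1,
    True: ("RHC", {
        False: 1.0,
        True: ("L", {"Office": 0.2, "Entrance": 1.0}),
    }),
})

def evaluate(tree, state):
    """Walk the tree using the variable assignment in `state`; leaves are probabilities."""
    while isinstance(tree, tuple):
        var, children = tree
        tree = children[state[var]]
    return tree

print(evaluate(CR_TREE, {"CR": True, "RHC": True, "L": "Office"}))   # 0.2
```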

15

Reward Representation

• Rewards represented with DTs in a similar fashion
  – Would require a vector of size 2^n for an explicit representation

[Reward decision tree:
   CR? — true  → -100   (high cost for an unsatisfied coffee request)
       — false → M? — true  → -10   (high, but lower, cost for undelivered mail)
                    — false → T? — false → -1   (cost for the lab being untidy)
                                 — true  →  1   (small reward for satisfying all of these conditions)]

16

Structured Computation

• Given our compact decision tree (DBN) representation, can we solve the MDP without explicit state space enumeration?

• Can we avoid O(|S|) computations by exploiting regularities made explicit by the representation?

• We will study a general approach for doing this called structured dynamic programming

17

Structured Dynamic Programming

• We now consider how to perform dynamic programming techniques such as VI and PI using the problem structure

• VI and PI are based on a few basic operations.
  – Here we will show how to perform these operations directly on tree representations of value functions, policies, and transition functions

• The approach is very general and can be applied to other representations (e.g. algebraic decision diagrams, situation calculus) and other problems once the main idea is understood

• We will focus on VI here, but the paper also describes a version of modified policy iteration

18

Recall Tree-Based Representations

[DBN for action A over state variables X, Y, Z with decision-tree CPTs, and a tree-represented reward function R (leaves 10 and 0, testing Z). For example, the CPT tree for Y(t+1) = true tests X(t): true → 0.9; false → test Y(t): true → 1.0, false → 0.0.]

Note: we are leaving off time subscripts for readability and using X(t), Y(t), ..., instead.

e.g. If X(t) = false and Y(t) = true, then Y(t+1) = true with probability 1
e.g. If X(t) = true, then Y(t+1) = true with probability 0.9

Recall that each action of the MDP has its own DBN.

19

Structured Dynamic Programming

• Value functions and policies can also have tree representations
  – Often much more compact than tabular representations

• Our Goal: compute the tree representations of the policy and value function, given the tree representations of the transitions and rewards

20

Recall Value Iteration

Value Iteration:
  V^0(s) = R(s)                                          ;; could initialize to 0
  Q_a^{k+1}(s) = R(s) + γ · Σ_{s'} Pr(s'|s, a) · V^k(s')   (Bellman Backup)
  V^{k+1}(s) = max_a Q_a^{k+1}(s)

Suppose that the initial value function V^0 is compactly represented as a tree (a tabular version of the backup is sketched below for reference).

1. Show how to compute compact trees for the Q-functions Q_a^{k+1}
2. Use a max operation on the Q-trees (returns a single tree)
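For reference, a minimal sketch (not from the slides; P, R, and gamma are illustrative placeholders) of the same backup in ordinary tabular form for a two-state toy MDP. Symbolic VI performs exactly these operations, but on tree representations instead of tables.

```python
def bellman_backup(P, R, V, gamma):
    """One backup: V'(s) = max_a [ R(s) + gamma * sum_s2 P(s2|s,a) * V(s2) ]."""
    states, actions = range(len(R)), P.keys()
    Q = {a: [R[s] + gamma * sum(P[a][s][s2] * V[s2] for s2 in states)
             for s in states]
         for a in actions}                     # one Q-vector per action
    return [max(Q[a][s] for a in actions) for s in states]

R = [0.0, 1.0]                                 # reward per state
P = {"stay": [[1.0, 0.0], [0.0, 1.0]],         # P[a][s][s2] = Pr(s2 | s, a)
     "move": [[0.0, 1.0], [1.0, 0.0]]}
V = [0.0, 0.0]
for _ in range(50):
    V = bellman_backup(P, R, V, 0.9)
print(V)                                       # converging toward [9.0, 10.0]
```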

Symbolic Value Iteration

[Diagram: the tree for V^k is combined with each action's DBN — Pr_{A=a}(S'|S), Pr_{A=b}(S'|S), ..., Pr_{A=z}(S'|S) — to produce one Q-tree per action; a symbolic MAX over these trees yields the tree for V^{k+1}]

22

The MAX Tree Operation

[Two value trees over boolean state variables X and Y:
   Tree A: X? — true → 0.9; false → Y? — true → 1.0; false → 0.0
   Tree B: X? — true → 1.0; false → 0.0
 Each tree partitions the state space, assigning a value to each region.]

The state-space max of the above trees assigns each state the larger of its two values.

In general, how can we compute the tree representing the max?

23

The MAX Tree Operation

Can simply append one tree to the leaves of the other. This makes all the distinctions that either tree makes. The max operation is then taken at the leaves of the result (see the code sketch below).

[Appending Tree B to every leaf of Tree A gives a tree whose leaves hold value pairs, e.g. (1.0, 0.9), (0.0, 0.9), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0). Taking the max at each leaf yields:
   X? — true  → X? — true → 1.0; false → 0.9
      — false → Y? — true  → X? — true → 1.0; false → 1.0
                   — false → X? — true → 1.0; false → 0.0]
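A minimal sketch (assuming the same tuple-based tree encoding used earlier in these notes, not the paper's implementation) of this append-then-max operation on the two trees from the slide:

```python
def tree_max(tree1, tree2):
    """Append tree2 to each leaf of tree1, taking the max where two leaves meet."""
    if not isinstance(tree1, tuple):                 # leaf of tree1: append tree2
        if not isinstance(tree2, tuple):
            return max(tree1, tree2)                 # max is taken at the leaves
        var, kids = tree2
        return (var, {val: tree_max(tree1, sub) for val, sub in kids.items()})
    var, kids = tree1
    return (var, {val: tree_max(sub, tree2) for val, sub in kids.items()})

# The two value trees from the slide (X and Y are boolean state variables):
TREE_A = ("X", {True: 0.9, False: ("Y", {True: 1.0, False: 0.0})})
TREE_B = ("X", {True: 1.0, False: 0.0})

print(tree_max(TREE_A, TREE_B))
# ('X', {True: ('X', {True: 1.0, False: 0.9}),
#        False: ('Y', {True: ('X', {True: 1.0, False: 1.0}),
#                      False: ('X', {True: 1.0, False: 0.0})})})
```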

24

The MAX Tree Operation

The resulting tree may have unreachable leaves. We can simplify the tree by removing such paths (a sketch of this simplification follows below).

[Simplify: a repeated test of X below a branch that has already fixed X is redundant, so its unreachable branches can be pruned, giving:
   X? — true → 1.0; false → Y? — true → 1.0; false → 0.0]
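A matching sketch (same assumed tree encoding) of the simplification step, which prunes branches that re-test a variable already fixed higher up on the path:

```python
def simplify(tree, fixed=None):
    """Remove tests of variables already fixed on the path (their other branches are unreachable)."""
    fixed = fixed or {}                     # variable -> value already decided on this path
    if not isinstance(tree, tuple):
        return tree
    var, kids = tree
    if var in fixed:                        # redundant test: keep only the reachable branch
        return simplify(kids[fixed[var]], fixed)
    return (var, {val: simplify(sub, {**fixed, var: val})
                  for val, sub in kids.items()})

# Simplifying the max tree from the previous slide recovers the compact tree:
MAX_TREE = ('X', {True: ('X', {True: 1.0, False: 0.9}),
                  False: ('Y', {True: ('X', {True: 1.0, False: 1.0}),
                                False: ('X', {True: 1.0, False: 0.0})})})
print(simplify(MAX_TREE))   # ('X', {True: 1.0, False: ('Y', {True: 1.0, False: 0.0})})
```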

25

BINARY OPERATIONS (other binary operations are similar to max)

26

MARGINALIZATION

Given a diagram for a function f that mentions a boolean variable A, compute the diagram representing Σ_A f, i.e. the sum of f restricted to A = true and f restricted to A = false (a sketch follows below).

There are libraries for doing this.
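For instance, a minimal sketch (assumed tree encoding as before, not any particular library's API) of marginalizing a boolean variable out of a tree-represented function by adding its two restrictions:

```python
def restrict(tree, var, value):
    """Fix `var` to `value` everywhere in the tree."""
    if not isinstance(tree, tuple):
        return tree
    v, kids = tree
    if v == var:
        return restrict(kids[value], var, value)
    return (v, {val: restrict(sub, var, value) for val, sub in kids.items()})

def tree_add(t1, t2):
    """Pointwise sum of two trees (append one to the leaves of the other, add at the leaves)."""
    if not isinstance(t1, tuple):
        if not isinstance(t2, tuple):
            return t1 + t2
        v, kids = t2
        return (v, {val: tree_add(t1, sub) for val, sub in kids.items()})
    v, kids = t1
    return (v, {val: tree_add(sub, t2) for val, sub in kids.items()})

def sum_out(tree, var):
    """Tree representing the sum of `tree` over the boolean variable `var`."""
    return tree_add(restrict(tree, var, True), restrict(tree, var, False))

F = ("A", {True: ("X", {True: 2.0, False: 0.0}), False: 1.0})
print(sum_out(F, "A"))                      # ('X', {True: 3.0, False: 1.0})
```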

Symbolic Bellman Backup

for each action a:
   Q_a^{k+1}(S) = R(S) + γ · Σ_{S'} Pr_a(S'|S) · V^k(S'),
   where S = (X_1, …, X_l) and S' = (X'_1, …, X'_l)

Every quantity in the backup — the reward R, the current value function V^k, and each CPT of the action's DBN — is represented as a tree, and every operation (multiplying in the CPTs, marginalizing out the next-state variables X'_i, adding the reward, and maximizing over actions) is performed symbolically on trees (a toy sketch follows below).
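Putting the pieces together, here is a rough, heavily simplified sketch (my own toy construction under the same assumed tree encoding, not the paper's algorithm) of a symbolic Bellman backup for an MDP whose state is a single boolean variable X; all numbers and names are illustrative:

```python
def combine(t1, t2, op):
    """Append t2 to the leaves of t1 and apply `op` where two leaves meet."""
    if not isinstance(t1, tuple):
        if not isinstance(t2, tuple):
            return op(t1, t2)
        var, kids = t2
        return (var, {v: combine(t1, s, op) for v, s in kids.items()})
    var, kids = t1
    return (var, {v: combine(s, t2, op) for v, s in kids.items()})

def simplify(tree, fixed=None):
    """Prune tests of variables already fixed on the path (as on the simplify slide)."""
    fixed = fixed or {}
    if not isinstance(tree, tuple):
        return tree
    var, kids = tree
    if var in fixed:
        return simplify(kids[fixed[var]], fixed)
    return (var, {v: simplify(s, {**fixed, var: v}) for v, s in kids.items()})

def symbolic_backup(R, cpt_true, v_true, v_false, gamma):
    """Q(X) = R(X) + gamma * [Pr(X'=T|X)*V(X'=T) + Pr(X'=F|X)*V(X'=F)], all as trees."""
    expected = combine(combine(cpt_true, v_true, lambda p, v: p * v),
                       combine(cpt_true, v_false, lambda p, v: (1.0 - p) * v),
                       lambda a, b: a + b)           # marginalize X' symbolically
    return simplify(combine(R, expected, lambda r, e: r + gamma * e))

R   = ("X", {True: 1.0, False: 0.0})                 # reward tree (illustrative)
CPT = ("X", {True: 0.9, False: 0.2})                 # Pr(X'=true | X) for some action
print(symbolic_backup(R, CPT, v_true=10.0, v_false=0.0, gamma=0.9))
# -> ('X', {True: 9.1, False: 1.8})   (up to float rounding)
```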

34

SDP: Relative Merits

• Adaptive, nonuniform, exact abstraction method
  – provides an exact solution to the MDP
  – much more efficient on certain problems (time/space)
  – 400 million state problems in a couple of hours

• Can formulate a similar procedure for modified policy iteration

• Some drawbacks
  – produces a piecewise constant value function
  – some problems admit no compact solution representation
      – so the size of the trees blows up with enough iterations
  – approximation may be desirable or necessary

35

Approximate SDP

• Easy to approximate the solution using SDP

• Simple pruning of the value function
  – Simply "merge" leaves that have similar values (see the sketch below)
  – Can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00]

• Gives regions of approximately the same value
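A minimal sketch (assumed tree encoding; the cited papers prune trees/ADDs with more care) of this merging of similar-valued leaves; the leaf values echo the pruned ADD on the next slide:

```python
def leaves(tree):
    """All leaf values under `tree`."""
    if not isinstance(tree, tuple):
        return [tree]
    _, kids = tree
    return [leaf for sub in kids.values() for leaf in leaves(sub)]

def prune(tree, tol):
    """Replace any subtree whose leaf values lie within `tol` by a single midpoint leaf."""
    if not isinstance(tree, tuple):
        return tree
    vals = leaves(tree)
    if max(vals) - min(vals) <= tol:        # all values here are "similar": merge them
        return (max(vals) + min(vals)) / 2.0
    var, kids = tree
    return (var, {val: prune(sub, tol) for val, sub in kids.items()})

V_TREE = ("X", {True: ("Y", {True: 8.45, False: 8.36}), False: 5.19})
print(prune(V_TREE, tol=0.5))   # -> ('X', {True: 8.405, False: 5.19}) up to rounding
```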

36

A Pruned Value ADD

[Value ADD and its pruned version: internal nodes test Loc, HCU, HCR, W, R, U; the unpruned ADD has leaves such as 10.00, 9.00, 8.45, 8.36, 7.64, 7.45, 6.81, 6.64, 6.19, 5.62, 5.19. In the pruned ADD, subtrees with similar values are merged into leaves labeled with value ranges: [9.00, 10.00], [7.45, 8.45], [6.64, 7.64], [5.19, 6.19].]

37

Approximate SDP: Relative Merits

• Relative merits of ASDP: fewer regions implies faster computation
  – 30-40 billion state problems in a couple of hours
  – allows fine-grained control of time vs. solution quality with dynamic error bounds
  – technical challenges: variable ordering, convergence, fixed vs. adaptive tolerance, etc.

• Some drawbacks
  – (still) produces a piecewise constant value function
  – doesn't exploit additive structure of the value function at all

• Bottom line: when a problem matches the structural assumptions of SDP we can gain much, but many problems do not match those assumptions.

38

Ongoing Work

• Factored action spaces
  – Sometimes the action space is large, but has structure
  – For example, cooperative multi-agent systems

• Recent work (at OSU) has studied SDP for factored action spaces
  – Include action variables in the DBNs

[DBN figure with both action variables and state variables]
