
Stochastic Dynamic Programming with Factored Representations

Presentation by Dafna Shahaf (Boutilier, Dearden, Goldszmidt 2000)

The Problem

Standard MDP algorithms require explicit state-space enumeration: the curse of dimensionality.
Need: a compact representation (intuition: STRIPS).
Need: versions of the standard dynamic programming algorithms that work directly on that representation.

A Glimpse of the Future

[Figures: a policy tree and the corresponding value tree]

A Glimpse of the Future: Some Experimental Results

Roadmap

- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions

MDPs: Reminder

An MDP is a tuple ⟨S, A, T, R⟩ (states, actions, transitions, rewards).

We consider the discounted infinite-horizon setting and stationary policies
π: S → A (an action to take at each state s).

Value functions: V_π^k(s) is the k-stage-to-go value function for policy π.
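As a concrete reference point, an explicit (flat) MDP of this kind can be written down directly. This is a minimal sketch; the names (FlatMDP, transition, reward) are illustrative placeholders, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = int
Action = str

@dataclass
class FlatMDP:
    """Explicit MDP <S, A, T, R> with enumerated states (hypothetical helper)."""
    states: List[State]
    actions: List[Action]
    # transition[a][s][s'] = Pr(s' | s, a)
    transition: Dict[Action, Dict[State, Dict[State, float]]]
    reward: Callable[[State], float]
    gamma: float = 0.9
```

The point of the paper is precisely that this enumerated representation blows up exponentially in the number of state variables.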

Roadmap

- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions

Representing MDPs as Bayesian Networks: the Coffee World

Variables:
- O: robot is in the office
- W: robot is wet
- U: robot has an umbrella
- R: it is raining
- HCR: robot has coffee
- HCO: owner has coffee

Actions:
- Go: switch location
- BuyC: buy coffee
- DelC: deliver coffee
- GetU: get the umbrella

The effects of the actions may be noisy, so we need to provide a distribution for each effect.

Representing Actions: DelC

[Figure: the DBN and decision-tree CPTs for the DelC action]
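To make the decision-tree CPT idea concrete, here is a minimal sketch of how such a tree could be encoded. The tree structure and the probability values below are illustrative assumptions, not the numbers from the paper's DelC network.

```python
# A decision-tree CPT: internal nodes test a parent variable, leaves give
# Pr(X' = true) after the action. Structure and numbers are made-up placeholders.
DELC_HCO_TREE = {
    "test": "HCO",                 # owner already has coffee?
    "true": 1.0,                   # then HCO stays true
    "false": {
        "test": "O",               # robot in the office?
        "true": {
            "test": "HCR",         # robot holding coffee?
            "true": 0.8,           # hypothetical success probability
            "false": 0.0,
        },
        "false": 0.0,
    },
}

def prob_true(tree, state):
    """Walk the tree using the current state to get Pr(variable = true)."""
    while isinstance(tree, dict):
        tree = tree["true"] if state[tree["test"]] else tree["false"]
    return tree

# Example: prob_true(DELC_HCO_TREE, {"HCO": False, "O": True, "HCR": True}) -> 0.8
```

The tree only mentions the parent variables that actually matter for this effect, which is what makes the representation compact.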

Representing Actions: Interesting Points

- No need to provide a marginal distribution over the pre-action variables.
- Markov property: we only need the previous state.
- For now, no synchronic arcs.
- The frame problem?
- A single network vs. a separate network for each action.
- Why decision trees?

Representing Reward

The reward is generally determined by only a subset of the features, so it too can be represented compactly as a tree.

Policies and Value Functions

The optimal choice of action may depend only on certain variables (given the values of some others).

[Figures: a policy tree and the corresponding value tree. Internal nodes test features (e.g. HCR = T / HCR = F); leaves hold actions in the policy tree and values in the value tree.]
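A minimal sketch of how such policy and value trees could be represented in code; the encoding and the leaf contents below are illustrative assumptions.

```python
# Trees are nested dicts: internal nodes test one boolean feature,
# leaves hold the payload (a value or an action name).
value_tree = {"test": "HCR", "true": 10.0, "false": 0.0}        # illustrative values
policy_tree = {"test": "HCR", "true": "DelC", "false": "BuyC"}  # illustrative actions

def evaluate(tree, state):
    """Descend from the root to the leaf matching the given boolean state."""
    while isinstance(tree, dict):
        tree = tree["true"] if state[tree["test"]] else tree["false"]
    return tree

print(evaluate(value_tree, {"HCR": True}))    # -> 10.0
print(evaluate(policy_tree, {"HCR": False}))  # -> 'BuyC'
```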

Roadmap

- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions

Bellman Backup

Q-function: the value of performing action a in state s, given value function v:

    Q_a^v(s) = R(s) + γ · Σ_{s'} Pr(s' | s, a) · v(s')

Value Iteration: Reminder

    V^{k+1}(s) = max_a Q_a^{V^k}(s) = R(s) + max_a { γ · Σ_{s'} Pr(s' | s, a) · V^k(s') }
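For reference, a minimal sketch of the flat (unstructured) value iteration that the structured algorithm will mirror, assuming the explicit transition dictionary and reward function from the earlier snippet:

```python
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-4):
    """Flat VI: V[s] <- R(s) + max_a gamma * sum_s' T[a][s][s'] * V[s']."""
    V = {s: R(s) for s in states}                     # V^0 = R
    while True:
        Q = {
            (s, a): R(s) + gamma * sum(p * V[s2] for s2, p in T[a][s].items())
            for s in states for a in actions
        }
        V_new = {s: max(Q[(s, a)] for a in actions) for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < eps:  # termination criterion
            return V_new
        V = V_new
```

Every backup touches every enumerated state, which is exactly what the structured version avoids.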

Structured Value Iteration: Overview

Input: Tree(R). Output: Tree(V*).

1. Set Tree(V^0) = Tree(R).
2. Repeat
   (a) Compute Tree(Q_a^{V^k}) = Regress(Tree(V^k), a) for each action a.
   (b) Merge (via maximization) the trees Tree(Q_a^{V^k}) to obtain Tree(V^{k+1}).
   until the termination criterion is met.
3. Return Tree(V^{k+1}).
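A skeleton of this loop in code. The helpers regress, merge_max, and max_difference stand for the tree operations named above and are passed in as assumed arguments, not defined here.

```python
def structured_value_iteration(reward_tree, actions, regress, merge_max,
                               max_difference, eps=1e-4):
    """Structured VI skeleton: every quantity is a decision tree, never a flat table."""
    V = reward_tree                                   # Tree(V^0) = Tree(R)
    while True:
        q_trees = [regress(V, a) for a in actions]    # Tree(Q_a^{V^k}) for each action
        V_next = merge_max(q_trees)                   # leaf-wise max -> Tree(V^{k+1})
        if max_difference(V_next, V) < eps:           # termination criterion
            return V_next
        V = V_next
```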

Example World

Step 2a: Calculating Q-Functions

    Q_a^V(s) = R(s) + γ · Σ_{s'} Pr(s' | s, a) · V(s')

1. Compute the expected future value.
2. Discount the future value.
3. Add the immediate reward.

How do we use the structure of the trees?
Tree(Q_a^V) should distinguish only those conditions under which action a makes a branch of Tree(V) true with different odds.

Calculating Tree(Q_a^1):

Tree(V^0): the initial value tree, Tree(V^0) = Tree(R).

PTree(Q_a^1): found by identifying the conditions under which a will have distinct expected value with respect to V^0.

FVTree(Q_a^1): the undiscounted expected future value of performing action a with one stage to go; each leaf value is the probability-weighted sum of the V^0 values it can reach (e.g. 1·10 + 0·0 = 10).

Tree(Q_a^1): obtained by discounting FVTree (by 0.9) and adding the immediate reward function.

An Alternative View (a more complicated example):

Starting from Tree(V^1), we build a partial PTree(Q_a^2), then the unsimplified PTree(Q_a^2), the simplified PTree(Q_a^2), FVTree(Q_a^2), and finally Tree(Q_a^2).


The Algorithm: Regress

Input: Tree(V), action a. Output: Tree(Q_a^V).

1. PTree(Q_a^V) = PRegress(Tree(V), a)   (simplified)
2. Construct FVTree(Q_a^V): for each branch b of the PTree, with leaf node l(b):
   (a) Pr_b = the product of the individual distributions at l(b)
   (b) v_b = Σ_{b' ∈ Tree(V)} Pr_b(b') · V(b')
   (c) Re-label leaf l(b) with v_b.
3. Discount FVTree(Q_a^V) by γ and append Tree(R).
4. Return FVTree(Q_a^V).
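A rough sketch of Regress over decision trees. All tree utilities (pregress, branches, prob_of_branch, relabel_leaves, scale_leaves, append_reward) are assumed helpers passed in as arguments; this is an outline of the steps above, not the paper's exact data structures.

```python
def regress(value_tree, action, reward_tree, pregress, branches,
            prob_of_branch, relabel_leaves, scale_leaves, append_reward,
            gamma=0.9):
    """Sketch of Regress: Tree(V), a -> Tree(Q_a^V). Tree helpers are passed in."""
    # 1. PTree(Q_a^V): distributions over the variables that Tree(V) tests
    ptree = pregress(value_tree, action)

    # 2. FVTree(Q_a^V): expected future value at every branch of the PTree
    new_labels = {}
    for b, leaf_dists in branches(ptree):
        # (a) Pr_b is the product of the independent distributions at leaf l(b);
        # (b) v_b = sum over branches b' of Tree(V) of Pr_b(b') * V(b')
        v_b = sum(prob_of_branch(leaf_dists, b_prime) * v_prime
                  for b_prime, v_prime in branches(value_tree))
        new_labels[b] = v_b                      # (c) re-label leaf l(b) with v_b
    fv_tree = relabel_leaves(ptree, new_labels)

    # 3. Discount FVTree by gamma and append the immediate reward Tree(R)
    # 4. Return the result
    return append_reward(scale_leaves(fv_tree, gamma), reward_tree)
```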


The Algorithm: PRegress

Input: Tree(V), action a. Output: PTree(Q_a^V).

1. If Tree(V) is a single node, return the empty tree.
2. X = the variable at the root of Tree(V).
   T_X^P = the tree for CPT_a(X) (label its leaves with the distribution over X).
3. T_{X=t}^V, T_{X=f}^V = the subtrees of Tree(V) for X = t and X = f.
4. T_{X=t}^P, T_{X=f}^P = the results of calling PRegress on T_{X=t}^V, T_{X=f}^V.
5. For each leaf l of T_X^P, append T_{X=t}^P, T_{X=f}^P, or both, according to the distribution at l (use union to combine the labels).
6. Return T_X^P.
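A recursive sketch of PRegress. Here cpt_tree(action, var) stands for the lookup of the action's decision-tree CPT, and helpers is a simple namespace bundling assumed tree utilities (is_leaf, root_var, subtree, leaves, attach); leaves are assumed to expose their probability as leaf.prob. All of these are illustrative stand-ins.

```python
def pregress(value_tree, action, cpt_tree, helpers):
    """Sketch of PRegress: Tree(V), a -> PTree(Q_a^V). Tree utilities are assumed."""
    # 1. A single-node value tree needs no distinctions: return the empty tree
    if helpers.is_leaf(value_tree):
        return None

    # 2. X = root variable of Tree(V); T_X^P = the tree for CPT_a(X)
    x = helpers.root_var(value_tree)
    p_tree = cpt_tree(action, x)                 # leaves hold Pr(X' = t | parents)

    # 3.-4. Recurse on the X = t and X = f subtrees of Tree(V)
    p_true = pregress(helpers.subtree(value_tree, x, True), action, cpt_tree, helpers)
    p_false = pregress(helpers.subtree(value_tree, x, False), action, cpt_tree, helpers)

    # 5. At each leaf of T_X^P, append p_true, p_false, or both, depending on
    #    whether Pr(X' = t) there is 1, 0, or strictly in between;
    #    the attach helper is assumed to union the variable labels.
    for leaf in helpers.leaves(p_tree):
        needed = [t for t, keep in ((p_true, leaf.prob > 0), (p_false, leaf.prob < 1))
                  if keep and t is not None]
        helpers.attach(leaf, needed)

    # 6. Return the completed PTree
    return p_tree
```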

Step 2b: Maximization

Merge the trees Tree(Q_a^{V^k}) by maximizing over actions at each leaf to obtain Tree(V^{k+1}). One step of structured value iteration is now complete.
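A runnable but deliberately naive sketch of the maximization step over the dict-style trees used in the earlier snippets. Instead of merging the tree structures directly (as the paper does), it enumerates assignments to a given list of variables and rebuilds a possibly non-minimal tree; the helper names and this enumeration strategy are simplifying assumptions.

```python
def merge_max(q_trees, variables):
    """Leaf-wise maximization of several Q-trees over the listed boolean variables."""
    def evaluate(tree, state):
        while isinstance(tree, dict):
            tree = tree["true"] if state[tree["test"]] else tree["false"]
        return tree

    def build(remaining, state):
        if not remaining:
            return max(evaluate(t, state) for t in q_trees)   # max over actions
        var, rest = remaining[0], remaining[1:]
        return {"test": var,
                "true": build(rest, {**state, var: True}),
                "false": build(rest, {**state, var: False})}

    return build(list(variables), {})

# Example (illustrative): merge two one-variable Q-trees.
qa = {"test": "HCR", "true": 10.0, "false": 0.0}
qb = {"test": "HCR", "true": 8.0, "false": 3.0}
print(merge_max([qa, qb], ["HCR"]))   # -> {'test': 'HCR', 'true': 10.0, 'false': 3.0}
```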

Roadmap

- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions

Experimental Results

Worst case: [figure]

Best case: [figure]

Roadmap

- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions

Extensions

- Synchronic edges
- POMDPs
- Rewards
- Approximation

Questions?

Backup slides

Here be dragons.

Regression through a Policy

Improving Policies: Example

Maximization Step, Improved Policy
