Value and Planning in MDPs


Page 1: Value and Planning in MDPs

Value and Planning in MDPs

Page 2: Value and Planning in MDPs

Administrivia

•Reading 3 assigned today

•Mahadevan, S., “Representation Policy Iteration”. In Proc. of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-2005).

•http://www.cs.umass.edu/~mahadeva/papers/uai-final-paper.pdf

•Due: Apr 20

•Groups assigned this time

Page 3: Value and Planning in MDPs

Where we are

•Last time:

•Expected value of policies

•Principle of maximum expected utility

•The Bellman equation

•Today:

•A little intuition (pictures)

•Finding π*: the policy iteration algorithm

•The Q function

•On to actual learning (maybe?)

Page 4: Value and Planning in MDPs

The Bellman equation

•The final recursive equation is known as the Bellman equation:

•Unique soln to this eqn gives value of a fixed policy π when operating in a known MDP M = ⟨S,A,T,R⟩

•When the state/action spaces are discrete and finite, can think of V and R as vectors and Tπ as a matrix, and get a matrix eqn (both equations are written out below):
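Spelled out for state-based rewards R(s) and a fixed policy π (standard notation; the symbols V^π and T^π are conventions adopted here):

  V^π(s) = R(s) + γ · Σ_{s′} T(s, π(s), s′) · V^π(s′)

  V^π = R + γ T^π V^π   ⟹   V^π = (I − γ T^π)^{−1} R

where T^π is the |S|×|S| matrix with entries T^π_{ij} = T(s_i, π(s_i), s_j).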

Page 5: Value and Planning in MDPs

Exercise

•Solve the matrix Bellman equation, i.e., find V (a numerical sketch follows this slide’s bullets)

•I formulated the Bellman equations for “state-based” rewards: R(s)

•Formulate & solve the B.E. for:

•“state-action” rewards (R(s,a))

•“state-action-state” rewards (R(s,a,s’))
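A minimal numerical sketch of the state-based case, assuming a made-up 3-state chain (the transition matrix, rewards, and γ below are illustrative placeholders, not the slide’s example):

import numpy as np

# Solve V = R + gamma * T_pi V  =>  V = (I - gamma * T_pi)^{-1} R
gamma = 0.9
T_pi = np.array([[0.8, 0.2, 0.0],      # T_pi[i, j] = Pr(s_j | s_i, pi(s_i))
                 [0.1, 0.7, 0.2],
                 [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])          # state-based rewards R(s)

V = np.linalg.solve(np.eye(3) - gamma * T_pi, R)
print(V)

For “state-action” rewards, replace R with the vector Rπ with entries Rπ(s) = R(s, π(s)); for “state-action-state” rewards, use Rπ(s) = Σ_{s′} T(s, π(s), s′) · R(s, π(s), s′). The same linear solve then applies.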

Page 7: Value and Planning in MDPs

Policy values in practice

“Robot” navigation in a grid maze

Goal state

Page 8: Value and Planning in MDPs

The MDP formulation

•State space:

•Action space:

•Reward function:

•Transition function: ...

Page 9: Value and Planning in MDPs

The MDP formulation

•Transition function (a code sketch follows this list):

•If desired direction is unblocked

•Move in desired direction with probability 0.7

•Stay in same place w/ prob 0.1

•Move “forward right” w/ prob 0.1

•Move “forward left” w/ prob 0.1

•If desired direction is blocked (wall)

•Stay in same place w/ prob 1.0
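As a concrete reading of this model, a sketch in Python, assuming (row, col) grid states, a caller-supplied blocked() predicate for walls, and the interpretation that “forward right”/“forward left” mean the diagonal cells ahead of the robot (all of these are assumptions, not given on the slide):

import random

# Intended-move directions as (drow, dcol) offsets on the grid.
DIRS = {"NORTH": (-1, 0), "SOUTH": (1, 0), "EAST": (0, 1), "WEST": (0, -1)}

def right_of(d):
    dr, dc = d
    return (dc, -dr)       # 90 degrees clockwise

def left_of(d):
    dr, dc = d
    return (-dc, dr)       # 90 degrees counter-clockwise

def sample_next_state(state, action, blocked):
    """state: (row, col); action: a key of DIRS; blocked(s) -> True if s is a wall."""
    d = DIRS[action]
    ahead = (state[0] + d[0], state[1] + d[1])
    if blocked(ahead):
        return state                                   # desired direction blocked: stay w.p. 1.0
    r, l = right_of(d), left_of(d)
    outcomes = [
        (0.7, ahead),                                  # move in desired direction
        (0.1, state),                                  # stay in same place
        (0.1, (ahead[0] + r[0], ahead[1] + r[1])),     # "forward right" (diagonal interpretation)
        (0.1, (ahead[0] + l[0], ahead[1] + l[1])),     # "forward left"  (diagonal interpretation)
    ]
    x, acc = random.random(), 0.0
    for p, s_next in outcomes:
        acc += p
        if x < acc:
            # Assumption: if the sampled cell is itself a wall, the robot stays put.
            return s_next if not blocked(s_next) else state
    return state

sample_next_state can drive a simulator loop; for planning (policy iteration, later in the deck) the same probabilities would instead be tabulated into T(s, a, s′).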

Page 10: Value and Planning in MDPs

Policy values in practice

Optimal policy, π*

(Figure: arrows show the optimal action in each grid cell; legend: EAST, SOUTH, WEST, NORTH)

Page 11: Value and Planning in MDPs

Policy values in practice

Value function for optimal policy, V*

Why does it look like this?

Page 12: Value and Planning in MDPs

A harder “maze”...

(Figure: grid maze with walls and doors)

Page 13: Value and Planning in MDPs

A harder “maze”...

Optimal policy, π*

Page 14: Value and Planning in MDPs

A harder “maze”...

Value function for optimal policy, V*

Page 16: Value and Planning in MDPs

Still more complex...

Page 17: Value and Planning in MDPs

Still more complex...

Optimal policy, π*

Page 18: Value and Planning in MDPs

Still more complex...

Value function for optimal policy, V*

Page 21: Value and Planning in MDPs

Planning: finding π*

•So we know how to evaluate a single policy, π

•How do you find the best policy?

•Remember: still assuming that we know M = ⟨S,A,T,R⟩

•Non-solution: iterate through all possible π, evaluating each one, and keep the best (there are |A|^|S| deterministic policies, so this is hopeless for nontrivial MDPs)

Page 22: Value and Planning in MDPs

Policy iteration & friends

•Many different solutions available.

•All exploit some characteristics of MDPs:

•For infinite-horizon discounted reward in a discrete, finite MDP, there exists at least one optimal, stationary policy (there may be more than one, all with the same value)

•The Bellman equation expresses the recursive structure of an optimal policy

•This leads to a family of closely related solution methods: policy iteration, value iteration, generalized policy iteration, etc.

Page 23: Value and Planning in MDPs

The policy iteration alg.

Function: policy_iteration
Input: MDP M = ⟨S,A,T,R⟩, discount γ
Output: optimal policy π*; opt. value func. V*

Initialization: choose π_0 arbitrarily
Repeat {
  V_i = eval_policy(M, π_i, γ)            // from Bellman eqn
  π_{i+1} = local_update_policy(π_i, V_i)
} Until (π_{i+1} == π_i)

Function: π’ = local_update_policy(π, V)
for i = 1..|S| {
  π’(s_i) = argmax_{a∈A}( sum_j( T(s_i, a, s_j) * V(s_j) ) )
}
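A runnable sketch of this algorithm for the tabular, state-based-reward case, assuming arrays T[s, a, s′] and R[s] (the array layout, exact policy evaluation via a linear solve, and numpy’s tie-breaking in argmax are choices made here, not specified by the slides):

import numpy as np

def eval_policy(T, R, pi, gamma):
    """Exact policy evaluation: solve V = R + gamma * T_pi V."""
    n = len(R)
    T_pi = T[np.arange(n), pi, :]                    # T_pi[i, j] = T(s_i, pi(s_i), s_j)
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

def local_update_policy(T, V):
    """Greedy (local) policy update w.r.t. the current value estimate."""
    return np.argmax(T @ V, axis=1)                  # argmax_a sum_j T(s,a,s_j) V(s_j)

def policy_iteration(T, R, gamma):
    n, n_actions, _ = T.shape
    pi = np.zeros(n, dtype=int)                      # arbitrary initial policy
    while True:
        V = eval_policy(T, R, pi, gamma)
        new_pi = local_update_policy(T, V)
        if np.array_equal(new_pi, pi):               # stop when the policy is unchanged
            return pi, V
        pi = new_pi

Note that with state-based rewards R(s), omitting R and γ inside the argmax (as the pseudocode above does) does not change which action wins, since they shift or scale every action’s score equally.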

Page 24: Value and Planning in MDPs

Why does this work?

•2 explanations:

•Theoretical:

•The local update w.r.t. the policy value is a contractive mapping, ergo a fixed point exists and will be reached

•See “contraction mapping”, “Banach fixed-point theorem”, etc.

•http://math.arizona.edu/~restrepo/475A/Notes/sourcea/node22.html

•http://planetmath.org/encyclopedia/BanachFixedPointTheorem.html

•Contracts w.r.t. the Bellman Error:
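One standard way to write the contraction, with B^π denoting the Bellman backup operator for a fixed policy π (notation chosen here, not taken from the slide):

  (B^π V)(s) = R(s) + γ · Σ_{s′} T(s, π(s), s′) · V(s′)

  ‖B^π V − B^π V′‖_∞ ≤ γ · ‖V − V′‖_∞

In particular the Bellman error ‖B^π V − V‖_∞ shrinks by at least a factor of γ with each backup, which is what drives convergence to the fixed point V^π.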

Page 25: Value and Planning in MDPs

Why does this work?

•The intuitive explanation

•It’s doing a dynamic-programming “backup” of reward from reward “sources”

•At every step, the policy is locally updated to take advantage of new information about reward that is propagated back by the evaluation step

•Value “propagates away” from sources and the policy is able to say “hey! there’s reward over there! I can get some of that action if I change a bit!”

Page 26: Value and Planning in MDPs

P.I. in action

(Figures: policy and value function, iteration 0)

Page 27: Value and Planning in MDPs

P.I. in action

(Figures: policy and value function, iteration 1)

Page 28: Value and Planning in MDPs

P.I. in action

(Figures: policy and value function, iteration 2)

Page 29: Value and Planning in MDPs

P.I. in action

(Figures: policy and value function, iteration 3)

Page 30: Value and Planning in MDPs

P.I. in action

(Figures: policy and value function, iteration 4)

Page 31: Value and Planning in MDPs

P.I. in action

(Figures: policy and value function, iteration 5)

Page 32: Value and Planning in MDPs

P.I. in action

(Figures: policy and value function, iteration 6: done)

Page 33: Value and Planning in MDPs

Properties

•Policy iteration

•Known to converge (provable)

•Observed to converge exponentially quickly

•# iterations is O(ln(|S|))

•Empirical observation; strongly believed but no proof (yet)

•O(|S|³) time per iteration (policy evaluation)

Page 34: Value and Planning in MDPs

Variants

•Other methods possible

•Linear program (poly time soln exists)

•Value iteration (a sketch follows this list)

•Generalized policy iter. (often best in practice)
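Since value iteration is only named here, a minimal sketch under the same tabular assumptions as the policy-iteration sketch above (T[s, a, s′], state-based R[s]; the stopping tolerance is an arbitrary choice):

import numpy as np

def value_iteration(T, R, gamma, tol=1e-6):
    """Iterate the Bellman optimality backup until the sup-norm change is below tol."""
    n = len(R)
    V = np.zeros(n)
    while True:
        Q = R[:, None] + gamma * (T @ V)     # Q[s, a] = R(s) + gamma * sum_j T(s,a,s_j) V(s_j)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new   # greedy policy and value estimate
        V = V_new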