Value and Planning in MDPs

Administrivia

•Reading 3 assigned today

•Mahadevan, S., “Representation Policy Iteration”. In Proc. of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-2005).

•http://www.cs.umass.edu/~mahadeva/papers/uai-final-paper.pdf

•Due: Apr 20

•Groups assigned this time

Where we are

•Last time:

•Expected value of policies

•Principle of maximum expected utility

•The Bellman equation

•Today:

•A little intuition (pictures)

•Finding π*: the policy iteration algorithm

•The Q function

•On to actual learning (maybe?)

The Bellman equation

•The final recursive equation is known as the Bellman equation:

•Unique soln to this eqn gives value of a fixed policy π when operating in a known MDP M = 〈S, A, T, R〉

•When state/action spaces are discrete, can think of V and R as vectors and T^π as matrix, and get matrix eqn:
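For reference, assuming state-based rewards R(s), a deterministic policy π, and a discount factor γ, the Bellman equation and its matrix form can be written as follows. The notation T(s, a, s') for the transition probability and T^π for the transition matrix induced by π is an assumption chosen to match the rest of these notes.

```latex
V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S} T\big(s, \pi(s), s'\big)\, V^{\pi}(s')
\qquad\Longrightarrow\qquad
\mathbf{V} = \mathbf{R} + \gamma\, T^{\pi}\, \mathbf{V},
\quad \text{where } T^{\pi}_{ij} = T\big(s_i, \pi(s_i), s_j\big)
```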

Exercise

•Solve the matrix Bellman equation (i.e., find V):

•I formulated the Bellman equations for “state-based” rewards: R(s)

•Formulate & solve the B.E. for:

•“state-action” rewards (R(s,a))

•“state-action-state” rewards (R(s,a,s’))
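A minimal sketch of one way to approach the first part, assuming state-based rewards and γ < 1: rearranging V = R + γ T^π V gives the linear system (I − γ T^π) V = R, so V = (I − γ T^π)^(-1) R. The two-state chain, its rewards, and the helper name `solve_policy_value` are made up for illustration.

```python
def solve_policy_value(T_pi, R, gamma):
    """Solve (I - gamma * T_pi) V = R by Gaussian elimination with partial pivoting."""
    n = len(R)
    # Build the augmented matrix [I - gamma * T_pi | R].
    A = [[(1.0 if i == j else 0.0) - gamma * T_pi[i][j] for j in range(n)] + [R[i]]
         for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        # Eliminate this column from every other row.
        for row in range(n):
            if row != col:
                factor = A[row][col] / A[col][col]
                A[row] = [a - factor * b for a, b in zip(A[row], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Hypothetical 2-state chain under a fixed policy:
# state 0 always moves to state 1; state 1 is absorbing and pays reward 1.
T_pi = [[0.0, 1.0],
        [0.0, 1.0]]
R = [0.0, 1.0]
V = solve_policy_value(T_pi, R, gamma=0.9)
```

For the other two reward formulations, only the reward vector changes under a deterministic policy: with R(s,a) the entry for state s becomes R(s, π(s)), and with R(s,a,s') it becomes the expected reward Σ_{s'} T(s, π(s), s') R(s, π(s), s'). The linear solve itself is unchanged.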

Policy values in practice

“Robot” navigation in a grid maze

[Figure: grid maze with the goal state marked]

The MDP formulation

•State space:

•Action space:

•Reward function:

•Transition function:

•If desired direction is unblocked:

  •Move in desired direction with probability 0.7

  •Stay in same place w/ prob 0.1

  •Move “forward right” w/ prob 0.1

  •Move “forward left” w/ prob 0.1

•If desired direction is blocked (wall):

  •Stay in same place w/ prob 1.0
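The transition rule above can be sketched as a function that returns a distribution over next states. Several details here are assumptions, not taken from the slides: states are (row, col) cells, the four actions are compass headings, “forward right” and “forward left” are read as the diagonal cells ahead of the robot, and a slip into a blocked diagonal cell leaves the robot where it is.

```python
# Unit steps (row, col) for each action; rows grow downward.
STEPS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
# Heading obtained by turning right (clockwise) from each action.
RIGHT_OF = {"N": "E", "E": "S", "S": "W", "W": "N"}


def transition(state, action, walls, n_rows, n_cols):
    """Return {next_state: probability} for one (state, action) pair."""
    def blocked(cell):
        r, c = cell
        return not (0 <= r < n_rows and 0 <= c < n_cols) or cell in walls

    r, c = state
    dr, dc = STEPS[action]
    forward = (r + dr, c + dc)
    if blocked(forward):
        # Desired direction is blocked: stay in place with probability 1.0.
        return {state: 1.0}

    rr, rc = STEPS[RIGHT_OF[action]]
    forward_right = (forward[0] + rr, forward[1] + rc)
    forward_left = (forward[0] - rr, forward[1] - rc)

    dist = {}
    for cell, p in [(forward, 0.7), (state, 0.1),
                    (forward_right, 0.1), (forward_left, 0.1)]:
        # Assumption: a slip into a blocked diagonal cell means staying put.
        target = state if blocked(cell) else cell
        dist[target] = dist.get(target, 0.0) + p
    return dist


# Example on a hypothetical 3x3 grid with no interior walls.
center_north = transition((1, 1), "N", set(), 3, 3)
```

Each returned distribution sums to 1, which is the property the transition function T must satisfy for every state-action pair.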

Policy values in practice

[Figure: optimal policy, π*]

[Figure: value function for optimal policy, V*]

Why does it look like this?

A harder “maze”...

[Figure: optimal policy, π*]

[Figure: value function for optimal policy, V*]

Still more complex...

[Figure: optimal policy, π*]

[Figure: value function for optimal policy, V*]

Planning: finding π*

•So we know how to evaluate a single policy, π

•How do you find the best policy?

•Remember: still assuming that we know M = 〈S, A, T, R〉

•Non-solution: iterate through all possible π, evaluating each one; keep best (there are |A|^|S| deterministic policies, so this blows up exponentially in |S|)

Policy iteration & friends

•Many different solutions available.

•All exploit some characteristics of MDPs:

•For infinite-horizon discounted reward in a discrete, finite MDP, there exists at least one optimal, stationary policy (there may be more than one equivalent optimal policy)

•The Bellman equation expresses recursive structure of an optimal policy

•Leads to a series of closely related policy solutions: policy iteration, value iteration, generalized policy iteration, etc.

The policy iteration alg.

Function: policy_iteration

Input: MDP M = 〈S, A, T, R〉, discount γ

Output: optimal policy π*; opt. value func. V*

Initialization: choose π_0 arbitrarily

Repeat {

    V_i = eval_policy(M, π_i, γ)    // from Bellman eqn

    π_{i+1} = local_update_policy(π_i, V_i)

} Until (π_{i+1} == π_i)

Function: π’ = local_update_policy(π, V)

for i = 1..|S| {

    π’(s_i) = argmax_{a∈A} ( sum_j T(s_i, a, s_j) * V(s_j) )

}
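The algorithm above can be sketched in runnable form as follows. This is an illustration under stated assumptions, not the lecture's implementation: rewards are state-based, `T[s][a][s2]` is a nested list of transition probabilities, `eval_policy` uses repeated Bellman backups in place of a direct linear solve, and the local update keeps the current action on near-ties so the loop cannot oscillate between equally good policies. The three-state chain is hypothetical.

```python
def eval_policy(T, R, policy, gamma, tol=1e-10):
    """Approximate V^pi by sweeping the Bellman equation until values settle."""
    n = len(R)
    V = [0.0] * n
    while True:
        V_new = [R[s] + gamma * sum(T[s][policy[s]][s2] * V[s2] for s2 in range(n))
                 for s in range(n)]
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def local_update_policy(T, policy, V, n_actions):
    """Greedy one-step lookahead; keeps the current action on (near-)ties."""
    n = len(V)
    new_policy = list(policy)
    for s in range(n):
        def q(a):
            # Expected next-state value of taking action a in state s.
            return sum(T[s][a][s2] * V[s2] for s2 in range(n))
        best = max(range(n_actions), key=q)
        if q(best) > q(policy[s]) + 1e-9:
            new_policy[s] = best
    return new_policy


def policy_iteration(T, R, gamma, n_actions):
    policy = [0] * len(R)  # arbitrary initial policy
    while True:
        V = eval_policy(T, R, policy, gamma)
        new_policy = local_update_policy(T, policy, V, n_actions)
        if new_policy == policy:
            return policy, V
        policy = new_policy


# Hypothetical 3-state chain: action 0 = stay, action 1 = move right.
# State 2 is the goal (reward 1) and is absorbing under both actions.
T = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
R = [0.0, 0.0, 1.0]
pi_star, V_star = policy_iteration(T, R, gamma=0.9, n_actions=2)
```

Starting from the all-stay policy, the first update changes only state 1 (it is adjacent to the goal), and the second changes state 0, which shows how value information spreads backward one evaluation at a time.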

Why does this work?

•2 explanations:

•Theoretical:

•The local update w.r.t. the policy value is a contractive mapping, ergo a fixed point exists and will be reached

•See “contraction mapping”, “Banach fixed-point theorem”, etc.

•http://math.arizona.edu/~restrepo/475A/Notes/sourcea/node22.html

•http://planetmath.org/encyclopedia/BanachFixedPointTheorem.html

•Contracts w.r.t. the Bellman Error:
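One standard statement of this property, which may differ in form from the one intended here, uses the Bellman backup operator B (a name introduced for this sketch) under state-based rewards and the max norm. Applying B shrinks the distance between any two value functions by at least a factor of γ, so for γ < 1 the Banach fixed-point theorem gives a unique fixed point V*.

```latex
(BV)(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} T(s, a, s')\, V(s')
\qquad
\| BV - BV' \|_{\infty} \le \gamma\, \| V - V' \|_{\infty}
\quad \text{for all } V, V'
```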

Why does this work?

•The intuitive explanation

•It’s doing a dynamic-programming “backup” of reward from reward “sources”

•At every step, the policy is locally updated to take advantage of new information about reward that is propagated back by the evaluation step

•Value “propagates away” from sources and the policy is able to say “hey! there’s reward over there! I can get some of that action if I change a bit!”

P.I. in action

[Figure: policy and value function at iterations 0 through 6; policy iteration is done at iteration 6]

Properties

•Policy iteration

•Known to converge (provable)

•Observed to converge exponentially quickly

•# iterations is O(ln(|S|))

•Empirical observation; strongly believed but no proof (yet)

•O(|S|^3) time per iteration (policy evaluation)

Variants

•Other methods possible

•Linear program (poly time soln exists)

•Value iteration

•Generalized policy iter. (often best in practice)
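As a point of comparison, here is a minimal sketch of value iteration, one of the variants listed above. It folds the max over actions directly into the value update and extracts a policy only at the end. The nested-list layout `T[s][a][s2]`, the state-based rewards, the stopping tolerance, and the three-state chain are all assumptions made for illustration.

```python
def value_iteration(T, R, gamma, tol=1e-10):
    """Repeat the Bellman optimality backup until values stop changing."""
    n = len(R)
    n_actions = len(T[0])
    V = [0.0] * n
    while True:
        V_new = [
            R[s] + gamma * max(
                sum(T[s][a][s2] * V[s2] for s2 in range(n))
                for a in range(n_actions)
            )
            for s in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(V, V_new)) < tol
        V = V_new
        if converged:
            break
    # Extract a greedy policy from the final values (first action wins ties).
    policy = [
        max(range(n_actions),
            key=lambda a: sum(T[s][a][s2] * V[s2] for s2 in range(n)))
        for s in range(n)
    ]
    return policy, V


# Hypothetical 3-state chain: action 0 = stay, action 1 = move right.
# State 2 is the goal (reward 1) and is absorbing under both actions.
T = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
R = [0.0, 0.0, 1.0]
vi_policy, vi_V = value_iteration(T, R, gamma=0.9)
```

Each sweep costs O(|S|^2 |A|) with dense transitions, which is cheaper than a full policy evaluation, but many more sweeps are typically needed than policy iteration needs iterations.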
