Click here to load reader

Quiz 7: MDPs

  • View

  • Download

Embed Size (px)


Quiz 7: MDPs. MDPs must have terminal states. False Every MDP is also a search problem. False “ Markov ” means the future is independent of the past, given the present. True There are many (>17) ways to define stationary utilities over sequences of rewards. False - PowerPoint PPT Presentation

Text of Quiz 7: MDPs

  • Quiz 7: MDPsMDPs must have terminal states. FalseEvery MDP is also a search problem. False

    Markov means the future is independent of the past, given the present. True There are many (>17) ways to define stationary utilities over sequences of rewards. FalseDiscounting with 0

  • CS 511a: Artificial IntelligenceSpring 2013Lecture 10: MEU / Markov Processes02/18/2012Robert Pless, Course adapted from one taught by Kilian Weinberge, with many slides over the course adapted from either Dan Klein, Stuart Russell or Andrew Moore*

  • AnnouncementsProject 2 outdue Thursday, Feb 28, 11:59pm (1 minute grace period).HW 1 outdue Friday, March 1, 5pm, accepted with no penalty (or charged late days) until Monday, March 4, 10am.Midterm Wednesday March 6.Class Wednesday? No!Instead (if possible) attend talk by Alexis Battle, Friday, 11am, Lopata 101.


  • Recap: Defining MDPsMarkov decision processes:States SStart state s0Actions ATransitions P(s|s,a) (or T(s,a,s))Rewards R(s,a,s) Discount

    MDP quantities so far:Policy = Choice of action for each stateUtility (or return) = sum of discounted rewards*

  • Grid WorldThe agent lives in a gridWalls block the agents pathThe agents actions do not always go as planned:80% of the time, the action North takes the agent North (if there is no wall there)10% of the time, North takes the agent West; 10% EastIf there is a wall in the direction the agent would have been taken, the agent stays putSmall (negative) reward each stepBig rewards come at the endGoal: maximize sum of rewards*

  • DiscountingTypically discount rewards by < 1 each time stepSooner rewards have higher utility than later rewardsAlso helps the algorithms converge*

  • Optimal UtilitiesFundamental operation: compute the values (optimal expectimax utilities) of states s

    Why? Optimal values define optimal policies!

    Define the value of a state s:V*(s) = expected utility starting in s and acting optimally

    Define the value of a q-state (s,a):Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally

    Define the optimal policy:*(s) = optimal action from state s*Quiz: Define V* in terms of Q* and vice versa!V*(s)Q*(s,a)V*(s)

  • The Bellman EquationsDefinition of optimal utility leads to a simple one-step lookahead relationship amongst optimal utility values:Optimal rewards = maximize over first action and then follow optimal policy



  • Solving MDPsWe want to find the optimal policy *

    Proposal 1: modified expectimax search, starting from each state s:*a=*(s)V*(s)Q*(s,a)V*(s)

  • Why Not Search Trees?Why not solve with expectimax?

    Problems:This tree is usually infinite (why?)Same states appear over and over (why?)We would search once per state (why?)

    Idea: Value iterationCompute optimal values for all states all at once using successive approximationsWill be a bottom-up dynamic program similar in cost to memoizationDo all planning offline, no replanning needed!


  • Value EstimatesCalculate estimates Vk*(s)Not the optimal value of s!The optimal value considering only next k time steps (k rewards)As k , it approaches the optimal valueWhy:If discounting, distant rewards become negligibleIf terminal states reachable from everywhere, fraction of episodes not ending becomes negligibleOtherwise, can get infinite expected utility and then this approach actually wont work


  • Value IterationIdea, Bellman equation is property of optimal valuations:

    Start with V0*(s) = 0, which we know is right (why?)Given Vi*, calculate the values for all states for depth i+1:

    This is called a value update or Bellman updateRepeat until convergence

    Theorem: will converge to unique optimal valuesBasic idea: approximations get refined towards optimal valuesPolicy may converge long before values do*

  • Example: Bellman Updates*max happens for a=right, other actions not shownExample: =0.9, living reward=0, noise=0.2=0.8[0+0.91]+0.1[0+0.90]+0.1[0+0.90]

  • *Gridworld

  • Gridworld*

  • Gridworld*

  • Gridworld*

  • Gridworld*

  • Convergence*Define the max-norm:

    Theorem: For any two approximations U and V

    I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U and value iteration converges to a unique, stable, optimal solutionTheorem:

    I.e. once the change in our approximation is small, it must also be close to correct**

  • Example*

  • Policy Iteration*Why do we compute V* or Q*, if all we care about is the best policy *?

  • Utilities for Fixed PoliciesAnother basic operation: compute the utility of a state s under a fix (general non-optimal) policyDefine the utility of a state s, under a fixed policy :V(s) = expected total discounted rewards (return) starting in s and following Recursive relation (one-step look-ahead / Bellman equation):*V (s)Q*(s,a)a=(s)

  • Policy EvaluationHow do we calculate the Vs for a fixed policy?

    Idea one: modify Bellman updates

    Idea two: Optimal solution is stationary point (equality). Then its just a linear system, solve with Matlab (or whatever)*

  • Policy IterationPolicy evaluation: with fixed current policy , find values with simplified Bellman updates:Iterate until values converge

    Policy improvement: with fixed utilities, find the best action according to one-step look-ahead


  • ComparisonIn value iteration:Every pass (or backup) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)Policy might not change between updates (wastes computation)

    In policy iteration:Several passes to update utilities with frozen policyOccasional passes to update policiesValue update can be solved as linear systemCan be faster, if policy changes infequently

    Hybrid approaches (asynchronous policy iteration):Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often*

  • Asynchronous Value IterationIn value iteration, we update every state in each iteration

    Actually, any sequences of Bellman updates will converge if every state is visited infinitely often

    In fact, we can update the policy as seldom or often as we like, and we will still converge

    Idea: Update states whose value we expect to change:If is large then update predecessors of s

  • How does this all play out?for blackjack


    a faculty candidate for Wash U. CS dept, Graduating PhD student of Daphne Koller one of the most Famous AI faculty, now a founding member of Coursera.

    Alexis works on trying to learn the network structure of how genes interact, in order, among other things, to make better predictions about drug responses. She develops new versions of Bayes Nets, whichIs something we will introduce in a few weeks.*python -g MazeGrid

    *python -a value -i 100 --noise 0.0 --livingReward -0.01 --pause*Notice! V*(S) is a double loop, have to take the max over all actions, and for each action, you have to consider all the states that you might have gone to.*Expecitmax search with Max layers and with expected value layers.*V_i is value at depth 0 (after taking 0 actions). Since you get rewards when you take actions, you cant have any rewards yet.Optimization strategy: You have a rule that defines an optimal constraint of states relative to other states.How do you find a solution consistent with that rule? Create new values at each state consistent with all current values. done? No, because everything has changed. so iterate. Works *surprisingly often**Here, a more carefully fleshed out summation compared to last time.*Let U be any valuation of the states (s) so U(s) is the (approx.) value of s.Let V be any other valuation of the states (s).If you apply the Bellman equations to update U,V THEN the max difference between U,V becomes closer by the factor gamma. [can prove this. Plug them into the bellman eq.] I wont do that. COOL FACT. V could be the optimal valuation, where the Bellman equations dont change anything!.

    *Point out that most times the policy does change.*V is state valuationQ is q-state valuation (value of state,action pairs).*Why is this optimization better?!

    (because youve gotten rid of the max operator, so this converges much faster!).

    *V = AV****