
Examples of MDPs


Page 1: Examples of MDPs

Examples of MDPs

Goal-directed, Indefinite Horizon, Cost Minimization MDP
• <S, A, Pr, C, G, s0>
• Most often studied in the planning community

Infinite Horizon, Discounted Reward Maximization MDP
• <S, A, Pr, R, γ>
• Most often studied in reinforcement learning

Goal-directed, Finite Horizon, Prob. Maximization MDP
• <S, A, Pr, G, s0, T>
• Also studied in the planning community

Oversubscription Planning: Non-absorbing Goals, Reward Maximization MDP
• <S, A, Pr, G, R, s0>
• A relatively recent model
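To make the tuple notation concrete, here is a minimal sketch of the first variant as a Python container. The field names (transition, cost, init, etc.) are illustrative, not from the slides; the other variants differ only in which components they carry, e.g. the discounted variant replaces C and G with R and γ.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Hashable, Tuple

State = Hashable
Action = Hashable

@dataclass(frozen=True)
class CostMinMDP:
    """Goal-directed, indefinite-horizon, cost-minimization MDP <S, A, Pr, C, G, s0>."""
    states: FrozenSet[State]                                     # S
    actions: FrozenSet[Action]                                   # A
    transition: Dict[Tuple[State, Action], Dict[State, float]]   # Pr(s' | s, a)
    cost: Dict[Tuple[State, Action], float]                      # C(s, a)
    goals: FrozenSet[State]                                      # G (absorbing)
    init: State                                                  # s0
```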

Page 2: Examples of MDPs

SSPP: Stochastic Shortest Path Problem. An MDP with init and goal states.

• MDPs don’t have a notion of an “initial” and a “goal” state (a process orientation instead of a “task” orientation).
  – Goals are, in effect, modeled by reward functions.
    • Allows pretty expressive goals (in theory).
  – Normal MDP algorithms don’t use initial-state information (since the policy is supposed to cover the entire state space anyway).
• Could consider “envelope extension” methods:
  – Compute a “deterministic” plan (which gives the policy for some of the states); extend the policy to other states that are likely to be encountered during execution.
  – RTDP methods.

• SSPPs are a special case of MDPs where:
  – (a) the initial state is given,
  – (b) there are absorbing goal states, and
  – (c) actions have costs and all states have zero rewards.
• A proper policy for an SSPP is a policy that is guaranteed to ultimately put the agent in one of the absorbing goal states.
• For SSPPs, it is worth finding a partial policy that covers only the “relevant” states (states that lie on the way from the initial state to a goal state under some optimal policy).
  – Value/Policy Iteration don’t consider the notion of relevance.
  – Consider “heuristic state search” algorithms.
    • The heuristic can be seen as an “estimate” of the value of a state.

Page 3: Examples of MDPs

Bellman Equations for Cost-Minimization MDP (absorbing goals) [also called Stochastic Shortest Path]

<S, A, Pr, C, G, s0>

Define J*(s) {optimal cost} as the minimum expected cost to reach a goal from this state. J* should satisfy the following equations:

  J*(s) = 0                      if s ∈ G
  J*(s) = min_{a ∈ A} Q*(s, a)   otherwise, where
  Q*(s, a) = C(s, a) + Σ_{s′ ∈ S} Pr(s′ | s, a) · J*(s′)
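A minimal value-iteration sketch for these equations, written against the hypothetical CostMinMDP container sketched earlier. It assumes every non-goal state has at least one applicable action, and that iterating the backup converges (which holds when a proper policy exists).

```python
def bellman_backup_cost(mdp, J, s):
    """One backup: J(s) <- min_a [ C(s,a) + sum_s' Pr(s'|s,a) * J(s') ];
    goal states are absorbing with cost 0."""
    if s in mdp.goals:
        return 0.0
    return min(
        mdp.cost[(s, a)]
        + sum(p * J[s2] for s2, p in mdp.transition[(s, a)].items())
        for a in mdp.actions if (s, a) in mdp.transition
    )

def value_iteration_cost(mdp, eps=1e-6):
    """Sweep backups over all states until the largest change drops below eps."""
    J = {s: 0.0 for s in mdp.states}
    while True:
        residual = 0.0
        for s in mdp.states:
            new = bellman_backup_cost(mdp, J, s)
            residual = max(residual, abs(new - J[s]))
            J[s] = new
        if residual < eps:
            return J
```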

Page 4: Examples of MDPs

Bellman Equations for Infinite-Horizon Discounted Reward Maximization MDP

<S, A, Pr, R, s0, γ>

Define V*(s) {optimal value} as the maximum expected discounted reward obtainable from this state. V* should satisfy the following equation:

  V*(s) = max_{a ∈ A} [ R(s, a) + γ · Σ_{s′ ∈ S} Pr(s′ | s, a) · V*(s′) ]
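The same backup with max, reward, and the discount γ gives value iteration for this variant. A minimal sketch over plain dicts (argument names illustrative):

```python
def value_iteration_discounted(states, actions, transition, reward, gamma, eps=1e-6):
    """V(s) <- max_a [ R(s,a) + gamma * sum_s' Pr(s'|s,a) * V(s') ],
    swept over all states until the largest change drops below eps.
    Converges for 0 <= gamma < 1 (the backup is a gamma-contraction)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                reward[(s, a)]
                + gamma * sum(p * V[s2] for s2, p in transition[(s, a)].items())
                for a in actions if (s, a) in transition
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V
```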

Page 5: Examples of MDPs

Bellman Equations for Probability-Maximization MDP

<S, A, Pr, G, s0, T>

Define P*(s, t) {optimal probability} as the maximum probability of reaching a goal from this state with t timesteps remaining. P* should satisfy the following equations:

  P*(s, t) = 1   if s ∈ G
  P*(s, 0) = 0   if s ∉ G
  P*(s, t) = max_{a ∈ A} Σ_{s′ ∈ S} Pr(s′ | s, a) · P*(s′, t−1)   otherwise
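Because the horizon is finite, this variant needs no convergence test: a single backward pass over t = 1..T suffices. A sketch over plain dicts (argument names illustrative):

```python
def prob_of_reaching_goal(states, actions, transition, goals, T):
    """Finite-horizon DP: P(s,t) = max_a sum_s' Pr(s'|s,a) * P(s', t-1),
    with P(s,t) = 1 for goal states and P(s,0) = 0 otherwise."""
    P = {(s, 0): (1.0 if s in goals else 0.0) for s in states}
    for t in range(1, T + 1):
        for s in states:
            if s in goals:
                P[(s, t)] = 1.0
                continue
            P[(s, t)] = max(
                (sum(p * P[(s2, t - 1)] for s2, p in transition[(s, a)].items())
                 for a in actions if (s, a) in transition),
                default=0.0,  # no applicable action: goal unreachable from s
            )
    return P
```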

Page 6: Examples of MDPs

Modeling Softgoal Problems as Deterministic MDPs

• Consider the net-benefit problem, where actions have costs and goals have utilities, and we want a plan with the highest net benefit.
• How do we model this as an MDP?
  – (Wrong idea): Make every state in which any subset of goals holds into a sink state, with reward equal to the cumulative sum of the utilities of those goals.
    • Problem: what if achieving g1 & g2 necessarily leads you through a state where g1 is already true?
  – (Correct version): Make a new fluent called “done” and a dummy action called Done-Deal. Done-Deal is applicable in any state and asserts the fluent “done”. All “done” states are sink states; the reward of each is the sum of the utilities of the goals that hold in it.
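A minimal sketch of the “done” compilation, assuming states are represented as frozensets of fluents (the helper names are hypothetical):

```python
DONE = "done"  # the new fluent

def done_deal(state):
    """The Done-Deal dummy action: applicable in any state, asserts "done".
    The resulting state is a sink."""
    return frozenset(state) | {DONE}

def sink_reward(state, goal_utilities):
    """Reward of a "done" sink state: the sum of the utilities of the goals
    that hold in it (goal_utilities maps goal fluents to utilities)."""
    return sum(u for g, u in goal_utilities.items() if g in state)
```

Since termination now happens only when the agent chooses Done-Deal, passing through a state where g1 already holds no longer forces early termination, which fixes the problem with the wrong idea above.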

Page 7: Examples of MDPs

Ideas for Efficient Algorithms

• Use heuristic search (and reachability information)
  – LAO*, RTDP
• Use execution and/or simulation
  – “Actual execution”: reinforcement learning (the main motivation for RL is to “learn” the model)
  – “Simulation”: simulate the given model to sample possible futures
    • Policy rollout, hindsight optimization, etc. (a rollout sketch follows this list)
• Use “factored” representations
  – Factored representations for actions, reward functions, values, and policies
  – Directly manipulating the factored representations during the Bellman update
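As an example of the simulation-based ideas, here is a minimal policy-rollout sketch; simulate(s, a) is an assumed interface returning a sampled (next_state, reward) pair from the given model:

```python
def rollout_value(simulate, policy, s, horizon, n_samples=100):
    """Monte-Carlo policy rollout: estimate the value of state s under a base
    policy by averaging the returns of sampled trajectories."""
    total = 0.0
    for _ in range(n_samples):
        state, ret = s, 0.0
        for _ in range(horizon):
            state, r = simulate(state, policy(state))
            ret += r
        total += ret
    return total / n_samples
```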

Page 8: Examples of MDPs

Heuristic Search vs. Dynamic Programming (Value/Policy Iteration)

• VI and PI approaches use the dynamic programming update: set the value of a state in terms of the maximum expected value achievable by doing actions from that state.
• They do the update for every state in the state space.
  – Wasteful if we know the initial state(s) the agent is starting from.
• Heuristic search (e.g., A*/AO*) explores only the part of the state space that is actually reachable from the initial state.
• Even within the reachable space, heuristic search can avoid visiting many of the states, depending on the quality of the heuristic used.
• But what is the heuristic?
  – An admissible heuristic is a lower bound on the cost to reach a goal from any given state.
  – It is a lower bound on J*!

Page 9: Examples of MDPs

Real-Time Dynamic Programming [Barto, Bradtke & Singh ’95]

Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on each visited state.

RTDP: repeat trials until the cost function converges.

RTDP was originally introduced for reinforcement learning. For RL, instead of “simulate” you “execute”, and you also have to do “exploration” in addition to “exploitation”: with probability p, follow the greedy policy; with probability 1−p, pick a random action.

What if we simulate the action’s effect with noise (rather than exactly with respect to its transition probabilities)?
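A sketch of one trial for the cost-minimization setting, reusing the hypothetical CostMinMDP container from the first slide. For the RL setting, the sampled successor would come from executing the action, and an ε-greedy choice would replace the strict min.

```python
import random

def rtdp_trial(mdp, J, max_steps=1000):
    """Simulate the greedy policy from the start state, backing up each
    visited state: J(s) <- min_a [ C(s,a) + sum_s' Pr(s'|s,a) * J(s') ]."""
    s = mdp.init
    for _ in range(max_steps):
        if s in mdp.goals:
            break
        q = {a: mdp.cost[(s, a)]
                + sum(p * J[s2] for s2, p in mdp.transition[(s, a)].items())
             for a in mdp.actions if (s, a) in mdp.transition}
        a_greedy = min(q, key=q.get)   # greedy action under current J
        J[s] = q[a_greedy]             # Bellman backup on the visited state
        succ = mdp.transition[(s, a_greedy)]
        # Sample the next state wrt the action's transition probabilities.
        s = random.choices(list(succ), weights=list(succ.values()))[0]
    return J
```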

Page 10: Examples of MDPs

RTDP Trial

[Figure: one RTDP trial from the start state s0. For each action a1, a2, a3, Q_{n+1}(s0, a) is computed (a Min backup) from the current estimates J_n of its successor states; J_{n+1}(s0) = min_a Q_{n+1}(s0, a). The greedy action a_greedy = a2 is then simulated, moving the trial one level closer to the Goal.]

Note that the value function is updated at each level of the trial. How about waiting until you hit the goal and then updating every state on the path?

Page 11: Examples of MDPs

Greedy “On-Policy” RTDP without Execution

Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize.

Page 12: Examples of MDPs

Labeled RTDP [Bonet & Geffner ’03]

• Initialise J0 with an admissible heuristic ⇒ Jn monotonically increases.
• Label a state as solved if Jn for that state has converged.
• Backpropagate the ‘solved’ labeling.
• Stop trials when they reach any solved state.
• Terminate when s0 is solved.

[Figure: a state s whose best action leads to the goal G while its other actions have high Q-costs ⇒ J(s) won’t change. A second diagram shows a state s reaching G through a state t; s and t get solved together.]
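A simplified sketch of the residual test behind the ‘solved’ label. The full algorithm in Bonet & Geffner’s paper traverses all states reachable under the current greedy policy before labeling; this version shows only the one-step check, again over the hypothetical CostMinMDP container.

```python
def residual_converged(mdp, J, s, eps=1e-4):
    """One-step residual test: s is a candidate for the 'solved' label when a
    fresh backup would change J(s) by less than eps."""
    if s in mdp.goals:
        return True
    backed_up = min(
        mdp.cost[(s, a)]
        + sum(p * J[s2] for s2, p in mdp.transition[(s, a)].items())
        for a in mdp.actions if (s, a) in mdp.transition
    )
    return abs(backed_up - J[s]) < eps
```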

Page 13: Examples of MDPs
Page 14: Examples of MDPs

Properties

• Admissible J0 ⇒ converges to the optimal J*.
• Heuristic-guided: explores only a subset of the reachable state space.
• Anytime: focusses attention on the more probable states.
• Fast convergence: focusses attention on unconverged states.
• Terminates in finite time.

Page 15: Examples of MDPs

Other Advances

• Ordering the Bellman backups to maximise information flow: [Wingate & Seppi ’05], [Dai & Hansen ’07].
• Partitioning the state space and combining value iterations from different partitions: [Wingate & Seppi ’05], [Dai & Goldsmith ’07].
• External-memory version of value iteration: [Edelkamp, Jabbar & Bonet ’07].