Markov Decision Processes (Mausam, CSE 515)



Page 1:

Markov Decision Processes

Mausam

CSE 515

Page 2:

Markov Decision Process

Operations Research, Artificial Intelligence, Machine Learning, Graph Theory, Robotics, Neuroscience/Psychology, Control Theory, and Economics all use MDPs to model the sequential decision making of a rational agent.

Page 3:

A Statistician’s view to MDPs

Markov Chain

• sequential process

• models state transitions

• autonomous process

One-step Decision Theory

• one-step process

• models choice

• maximizes utility

Markov Decision Process

• Markov chain + choice

• Decision theory + sequentiality

• sequential process

• models state transitions

• models choice

• maximizes utility

Page 4:

A Planning View

The agent receives Percepts from the Environment, chooses Actions, and repeatedly asks: what action next? Planning problems differ along several dimensions:

• Static vs. Dynamic environment

• Fully vs. Partially Observable

• Perfect vs. Noisy percepts

• Deterministic vs. Stochastic actions

• Instantaneous vs. Durative actions

• Predictable vs. Unpredictable

Page 5:

Classical Planning

What action next? Classical planning assumes an environment that is:

• Static

• Fully Observable

• Perfect

• Predictable

• Instantaneous

• Deterministic

Page 6:

Deterministic, fully observable

Page 7:

Stochastic Planning: MDPs

What action next? Stochastic planning with MDPs assumes an environment that is:

• Static

• Fully Observable

• Perfect

• Stochastic

• Instantaneous

• Unpredictable

Page 8:

Stochastic, Fully Observable

Page 9:

Markov Decision Process (MDP)

• S: A set of states

• A: A set of actions

• Pr(s’|s,a): transition model

• C(s,a,s’): cost model

• G: set of goals

• s0: start state

• γ: discount factor

• R(s,a,s’): reward model

The state space may be factored (giving a Factored MDP); goal states may be absorbing or non-absorbing.
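As a concrete (made-up) illustration, these components could be held in plain Python dictionaries; the states, probabilities, and costs below are invented for the example, not taken from the slides:

```python
# A tiny goal-directed MDP <S, A, Pr, C, G, s0>; all numbers are invented.
S = {"s0", "s1", "goal"}
A = {"a1", "a2"}
G = {"goal"}                                   # absorbing goal state
s0 = "s0"

# Pr[(s, a)] -> list of (s', probability); C[(s, a, s')] -> cost of that transition
Pr = {
    ("s0", "a1"): [("s1", 0.8), ("s0", 0.2)],
    ("s0", "a2"): [("goal", 0.1), ("s0", 0.9)],
    ("s1", "a1"): [("goal", 1.0)],
    ("s1", "a2"): [("s0", 1.0)],
}
C = {(s, a, sp): 1.0 for (s, a), succs in Pr.items() for sp, _ in succs}
```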

Page 10:

Objective of an MDP

• Find a policy π: S → A

• which optimizes

• minimizes expected cost to reach a goal

• maximizes expected reward

• maximizes expected (reward-cost)

• given a ____ horizon

• finite

• infinite

• indefinite

• assuming full observability

(discounted or undiscounted)

Page 11:

Role of Discount Factor (γ)

• Keep the total reward/total cost finite

• useful for infinite horizon problems

• Intuition (economics):

• Money today is worth more than money tomorrow.

• Total reward: r1 + γr2 + γ²r3 + …

• Total cost: c1 + γc2 + γ²c3 + …
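For example, with γ = 0.9 a reward of 1 at every step sums to 1 + 0.9 + 0.81 + … = 1/(1-γ) = 10, so the total stays finite even over an infinite horizon.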

Page 12:

Examples of MDPs

• Goal-directed, Indefinite Horizon, Cost Minimization MDP

• <S, A, Pr, C, G, s0>

• Most often studied in planning, graph theory communities

• Infinite Horizon, Discounted Reward Maximization MDP

• <S, A, Pr, R, γ>

• Most often studied in machine learning, economics, operations research communities

• Goal-directed, Finite Horizon, Prob. Maximization MDP

• <S, A, Pr, G, s0, T>

• Also studied in planning community

• Oversubscription Planning: Non absorbing goals, Reward Max. MDP

• <S, A, Pr, G, R, s0>

• Relatively recent model

The infinite-horizon, discounted-reward formulation is the most popular.

Page 13:

Bellman Equations for MDP1

• <S, A, Pr, C, G, s0>

• Define J*(s) {optimal cost} as the minimum

expected cost to reach a goal from this state.

• J* should satisfy the following equation:
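In its standard form, with Ap(s) the actions applicable in s:

J*(s) = 0 if s ∈ G

J*(s) = mina∈Ap(s) Σs'∈S Pr(s'|s,a) [ C(s,a,s') + J*(s') ] otherwise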

Page 14:

Bellman Equations for MDP2

• <S, A, Pr, R, s0, γ>

• Define V*(s) {optimal value} as the maximum

expected discounted reward from this state.

• V* should satisfy the following equation:
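In its standard form:

V*(s) = maxa∈Ap(s) Σs'∈S Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]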

Page 15:

Bellman Equations for MDP3

• <S, A, Pr, G, s0, T>

• Define P*(s,t) {optimal prob.} as the maximum probability of reaching a goal from this state, starting at the tth timestep.

• P* should satisfy the following equation:
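One standard way to write it, assuming goal states are absorbing and T is the horizon:

P*(s,t) = 1 if s ∈ G

P*(s,t) = 0 if t = T and s ∉ G

P*(s,t) = maxa∈Ap(s) Σs'∈S Pr(s'|s,a) P*(s', t+1) otherwise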

Page 16:

Bellman Backup (MDP2)

• Given an estimate of V* function (say Vn)

• Backup Vn function at state s

• calculate a new estimate (Vn+1) :

• Qn+1(s,a): value/cost of the strategy: execute action a in s, then execute πn subsequently

• πn = argmaxa∈Ap(s) Qn(s,a)

Qn+1(s,a) = Σs'∈S Pr(s'|s,a) [ R(s,a,s') + γ Vn(s') ]

Vn+1(s) = maxa∈Ap(s) Qn+1(s,a)

Page 17:

Bellman Backup

Example (with γ = 1): from state s0, actions a1, a2, a3 reach successors s1, s2, s3, whose current estimates are V0(s1) = 0, V0(s2) = 1, V0(s3) = 2.

Q1(s0,a1) = 2 + 0 = 2

Q1(s0,a2) = 5 + 0.9×1 + 0.1×2 = 6.1

Q1(s0,a3) = 4.5 + 2 = 6.5

Taking the max gives V1(s0) = 6.5, with greedy action agreedy = a3.

Page 18:

Value iteration [Bellman’57]

• assign an arbitrary value V0(s) to each state.

• repeat

• for all states s

• compute Vn+1(s) by Bellman backup at s.

• until maxs |Vn+1(s) – Vn(s)| < ε

Each pass over all states is iteration n+1; |Vn+1(s) – Vn(s)| is Residual(s), and the stopping test is called ε-convergence.
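A minimal sketch of this loop for the discounted-reward formulation, using an invented dictionary representation (Pr[(s,a)] is a list of (s', prob) pairs, R[(s,a,s')] a reward); illustrative only, not the course's code:

```python
def value_iteration(S, A, Pr, R, gamma=0.9, eps=1e-6):
    """Iterate Bellman backups until the largest residual drops below eps."""
    V = {s: 0.0 for s in S}                     # arbitrary V0
    while True:
        V_new = {}
        for s in S:                             # Bellman backup at s
            V_new[s] = max(
                (sum(p * (R[(s, a, sp)] + gamma * V[sp])
                     for sp, p in Pr[(s, a)])
                 for a in A if (s, a) in Pr),
                default=0.0,                    # states with no applicable actions
            )
        if max(abs(V_new[s] - V[s]) for s in S) < eps:   # eps-convergence
            return V_new
        V = V_new
```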

Page 19:

Comments

• Decision-theoretic Algorithm

• Dynamic Programming

• Fixed Point Computation

• Probabilistic version of the Bellman-Ford Algorithm

• for shortest path computation

• MDP1 : Stochastic Shortest Path Problem

Time Complexity

• one iteration: O(|S|2|A|)

• number of iterations: poly(|S|, |A|, 1/(1-γ))

Space Complexity: O(|S|)

Factored MDPs

• exponential space, exponential time

Page 20:

Convergence Properties

• Vn → V* in the limit as n → ∞

• ε-convergence: the Vn function is within ε of V*

• Optimality: the current greedy policy is then within 2ε/(1-γ) of optimal

• Monotonicity

• V0 ≤p V* ⇒ Vn ≤p V* (Vn monotonic from below)

• V0 ≥p V* ⇒ Vn ≥p V* (Vn monotonic from above)

• otherwise Vn non-monotonic

Page 21:

Policy Computation

Optimal policy is stationary and time-independent.

• for infinite/indefinite horizon problems

Policy Evaluation

A system of linear equations in |S| variables.

π*(s) = argmaxa∈Ap(s) Σs'∈S Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]

Vπ(s) = Σs'∈S Pr(s'|s,π(s)) [ R(s,π(s),s') + γ Vπ(s') ]
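That linear system can be solved directly; a small numpy sketch, under the same invented dictionary representation as before:

```python
import numpy as np

def evaluate_policy(states, policy, Pr, R, gamma):
    """Solve (I - gamma * P_pi) V = R_pi for the fixed policy pi."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))
    r = np.zeros(n)
    for s in states:
        a = policy[s]
        for sp, p in Pr[(s, a)]:
            P[idx[s], idx[sp]] += p
            r[idx[s]] += p * R[(s, a, sp)]
    return np.linalg.solve(np.eye(n) - gamma * P, r)   # one value per state, in order
```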

Page 22:

Changing the Search Space

• Value Iteration

• Search in value space

• Compute the resulting policy

• Policy Iteration

• Search in policy space

• Compute the resulting value

Page 23:

Policy iteration [Howard’60]

• start with an arbitrary policy π0 (an arbitrary action for each state).

• repeat

• Policy Evaluation: compute Vn+1, the evaluation of πn

• Policy Improvement: for all states s

• compute πn+1(s) = argmaxa∈Ap(s) Qn+1(s,a)

• until πn+1 = πn

Advantage

• searching in a finite (policy) space as opposed to

uncountably infinite (value) space ⇒ convergence faster.

• all other properties follow!

Policy evaluation is costly: O(n³) to solve the linear system. Approximating it by value iteration with the policy held fixed gives Modified Policy Iteration.
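A compact sketch of the loop above, with policy evaluation done by iterating the fixed-policy backup until it settles (representation and helper names are invented; assumes every state has an applicable action):

```python
def policy_iteration(S, A, Pr, R, gamma=0.9, eval_eps=1e-8):
    """Alternate policy evaluation and greedy policy improvement."""
    def q(s, a, V):
        return sum(p * (R[(s, a, sp)] + gamma * V[sp]) for sp, p in Pr[(s, a)])

    # pi_0: an arbitrary applicable action per state
    pi = {s: next(a for a in A if (s, a) in Pr) for s in S}
    while True:
        # Policy Evaluation: iterate the fixed-policy backup until it settles
        V = {s: 0.0 for s in S}
        while True:
            V_new = {s: q(s, pi[s], V) for s in S}
            if max(abs(V_new[s] - V[s]) for s in S) < eval_eps:
                break
            V = V_new
        # Policy Improvement: greedy action at every state
        new_pi = {s: max((a for a in A if (s, a) in Pr),
                         key=lambda a: q(s, a, V_new))
                  for s in S}
        if new_pi == pi:
            return pi, V_new
        pi = new_pi
```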

Page 24:

Modified Policy iteration

• start with an arbitrary policy π0 (an arbitrary action for each state).

• repeat

• Policy Evaluation: compute Vn+1, the approximate evaluation of πn

• Policy Improvement: for all states s

• compute πn+1(s) = argmaxa∈Ap(s) Qn+1(s,a)

• until πn+1 = πn

Advantage

• probably the most competitive synchronous dynamic

programming algorithm.

Page 25:

Asynchronous Value Iteration

States may be backed up in any order

• instead of sweeping over all states iteration by iteration

As long as all states are backed up infinitely often,

• asynchronous value iteration converges to the optimal value function

Page 26:

Asynch VI: Prioritized Sweeping

Why back up a state if the values of its successors have not changed?

Prefer backing up a state

• whose successors had the most change

Maintain a priority queue of (state, expected change in value)

Back up states in order of priority

After backing up a state, update the priority queue

• for all of its predecessors
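A rough sketch of that bookkeeping using Python's heapq as the priority queue (stale duplicate entries are tolerated; representation and priorities are my simplification):

```python
import heapq

def prioritized_sweeping(S, A, Pr, R, gamma=0.9, eps=1e-4, max_backups=100000):
    """Back up states in order of how much their value is expected to change."""
    V = {s: 0.0 for s in S}
    preds = {s: set() for s in S}               # predecessors of each state
    for (s, a), succs in Pr.items():
        for sp, _ in succs:
            preds[sp].add(s)

    def backup(s):                              # one Bellman backup value at s
        return max((sum(p * (R[(s, a, sp)] + gamma * V[sp]) for sp, p in Pr[(s, a)])
                    for a in A if (s, a) in Pr), default=0.0)

    heap = [(-abs(backup(s) - V[s]), s) for s in S]   # priority = expected change
    heapq.heapify(heap)
    done = 0
    while heap and done < max_backups:
        neg_prio, s = heapq.heappop(heap)       # stale duplicates are harmless
        if -neg_prio < eps:
            break
        V[s] = backup(s)
        done += 1
        for p in preds[s]:                      # predecessors may now be stale
            heapq.heappush(heap, (-abs(backup(p) - V[p]), p))
    return V
```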

Page 27:

Asynch VI: Real Time Dynamic Programming

[Barto, Bradtke, Singh’95]

• Trial: simulate greedy policy starting from start state;

perform Bellman backup on visited states

• RTDP: repeat Trials until value function converges
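A sketch of RTDP trials for the goal-directed, cost-minimization setting, with successors sampled via random.choices (representation invented; assumes every non-goal state has an applicable action):

```python
import random

def rtdp(S, A, Pr, C, G, s0, n_trials=1000, max_steps=100):
    """Trial-based asynchronous VI: back up only states the greedy policy visits."""
    J = {s: 0.0 for s in S}                       # admissible (optimistic) initial costs

    def q(s, a):
        return sum(p * (C[(s, a, sp)] + J[sp]) for sp, p in Pr[(s, a)])

    for _ in range(n_trials):
        s = s0
        for _ in range(max_steps):                # one trial
            if s in G:
                break
            a = min((act for act in A if (s, act) in Pr), key=lambda act: q(s, act))
            J[s] = q(s, a)                        # Bellman backup on the visited state
            succs = Pr[(s, a)]
            s = random.choices([sp for sp, _ in succs],
                               weights=[p for _, p in succs])[0]
    return J
```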

Page 28:

RTDP Trial (figure): from the start state s0, compute Qn+1(s0,a) for each action a1, a2, a3 from the successor estimates Vn, back up Vn+1(s0) as the min, follow the greedy action (here agreedy = a2) to a sampled successor, and repeat until the Goal is reached.

Page 29:

Comments

• Properties

• if all states are visited infinitely often then Vn → V*

• Advantages

• Anytime: more probable states explored quickly

• Disadvantages

• complete convergence can be slow!

Page 30:

Reinforcement Learning

Page 31:

Reinforcement Learning

Still have an MDP

• Still looking for policy

New twist: don’t know Pr and/or R

• i.e. don’t know which states are good

• and what the actions do

Must actually try out actions to learn

Page 32:

Model based methods

Visit different states, perform different actions

Estimate Pr and R

Once the model is built, do planning using V.I. or other methods

Con: requires huge amounts of data
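Concretely, the model can be estimated by counting over observed experience tuples (s, a, r, s'); a sketch with an invented data format:

```python
from collections import defaultdict

def estimate_model(experience):
    """experience: list of (s, a, r, s') tuples observed while acting."""
    counts = defaultdict(lambda: defaultdict(int))     # counts[(s,a)][s'] = N(s,a,s')
    reward_sum = defaultdict(float)                    # total reward seen for (s,a,s')
    for s, a, r, sp in experience:
        counts[(s, a)][sp] += 1
        reward_sum[(s, a, sp)] += r

    Pr, R = {}, {}
    for (s, a), succ_counts in counts.items():
        total = sum(succ_counts.values())
        Pr[(s, a)] = [(sp, n / total) for sp, n in succ_counts.items()]
        for sp, n in succ_counts.items():
            R[(s, a, sp)] = reward_sum[(s, a, sp)] / n
    return Pr, R                                       # then plan with V.I. etc.
```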

Page 33:

Model free methods

Directly learn Q*(s,a) values

sample = R(s,a,s') + γ maxa' Qn(s',a')

Nudge the old estimate towards the new sample:

Qn+1(s,a) ← (1-α) Qn(s,a) + α·[sample]
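Written as code, one tabular Q-learning update might look like this (the representation and default parameters are invented for illustration):

```python
from collections import defaultdict

Q = defaultdict(float)                 # Q[(s, a)]; unseen pairs default to 0

def q_update(s, a, r, s_next, actions, gamma=0.9, alpha=0.1):
    """One model-free update from the observed transition (s, a, r, s_next)."""
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```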

Page 34:

Properties

Converges to optimal if

• If you explore enough

• If you make the learning rate (α) small enough

• But do not decrease it too quickly

• ∑i α(s,a,i) = ∞

• ∑i α(s,a,i)² < ∞

where i is the number of visits to (s,a)
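For example, the schedule αi = 1/i, where i counts the visits to (s,a), satisfies both conditions: the harmonic series diverges while ∑ 1/i² converges.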

Page 35:

Model based vs. Model Free RL

Model based

• estimate O(|S|2|A|) parameters

• requires relatively larger data for learning

• can make use of background knowledge easily

Model free

• estimate O(|S||A|) parameters

• requires relatively less data for learning

Page 36:

Exploration vs. Exploitation

Exploration: choose actions that visit new states in

order to obtain more data for better learning.

Exploitation: choose actions that maximize the

reward given current learnt model.

ε-greedy

• Each time step flip a coin

• With prob ε, take an action randomly

• With prob 1-ε, take the current greedy action

Lower ε over time

• increase exploitation as more learning has happened
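A small sketch of ε-greedy action selection over a tabular Q (the decay schedule in the comment is an invented example):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With prob epsilon explore (random action), otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(s, a)])

# Lowering epsilon over time, e.g. epsilon_t = max(0.05, 1.0 / (1 + t)),
# shifts behaviour from exploration towards exploitation as learning proceeds.
```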

Page 37:

Q-learning

Problems

• Too many states to visit during learning

• Q(s,a) is still a BIG table

We want to generalize from a small set of training examples

Techniques

• Value function approximators

• Policy approximators

• Hierarchical Reinforcement Learning

Page 38:

Task Hierarchy: MAXQ Decomposition [Dietterich’00]

Task hierarchy (figure): Root → Fetch, Deliver; Fetch → Navigate(loc), Take; Deliver → Navigate(loc), Give; Take → Extend-arm, Grab; Give → Extend-arm, Release; Navigate(loc) → Moven, Moves, Movee, Movew.

Children of a task are unordered.

Page 39:

Partially Observable Markov Decision Processes

Page 40:

Partially Observable MDPs

What action next? A POMDP assumes an environment that is:

• Static

• Partially Observable

• Noisy (percepts)

• Stochastic

• Instantaneous

• Unpredictable

Page 41:

Stochastic, Fully Observable

Page 42:

Stochastic, Partially Observable

Page 43:

POMDPs

In POMDPs we apply the very same idea as in MDPs.

Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states.
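Concretely, after executing action a and receiving observation o, the belief is updated by Bayesian filtering: b'(s') = η O(o|s',a) Σs Pr(s'|s,a) b(s), where O is the observation model and η a normalizing constant.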

Let b be the belief of the agent about the current state

POMDPs compute a value function over belief space:

V(b) = maxa [ r(b,a) + γ Σo Pr(o|a,b) V(bao) ], where bao is the belief after executing a and observing o

Page 44:

POMDPs

Each belief is a probability distribution, so

• the value function is a function of an entire probability distribution.

This is problematic, since probability distributions are continuous.

Also, we have to deal with the huge complexity of belief spaces.

For finite worlds with finite state, action, and observation

spaces and finite horizons,

• we can represent the value functions by piecewise linear

functions.

Page 45:

Applications

Robotic control

• helicopter maneuvering, autonomous vehicles

• Mars rover - path planning, oversubscription planning

• elevator planning

Game playing - backgammon, tetris, checkers

Neuroscience

Computational Finance, Sequential Auctions

Assisting the elderly in simple tasks

Spoken dialog management

Communication Networks – switching, routing, flow control

War planning, evacuation planning