Markov Decision Processes & Reinforcement Learning
Megan Smith, Lehigh University, Fall 2006

Page 1:

Markov Decision Processes & Reinforcement Learning
Megan Smith
Lehigh University, Fall 2006

Page 2:

Outline

Stochastic Process
Markov Property
Markov Chain
Markov Decision Process
Reinforcement Learning
RL Techniques
Example Applications

Page 3:

Stochastic Process

Quick definition: A Random Process

Often viewed as a collection of indexed random variables

Useful to us: Set of states with probabilities of being in those states indexed over time

We’ll deal with discrete stochastic processes

http://en.wikipedia.org/wiki/Image:AAMarkov.jpg

Page 4:

Stochastic Process Example

Classic example: the random walk

Start at state X0 at time t0

At time ti, move a step Zi where P(Zi = -1) = p and P(Zi = 1) = 1 - p

At time ti, state Xi = X0 + Z1 +…+ Zi

http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
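A minimal sketch of this random walk in Python; the choices p = 0.5 and a 100-step horizon are illustrative, not taken from the slides:

```python
import random

def random_walk(p=0.5, steps=100, x0=0):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i with P(Z = -1) = p, P(Z = 1) = 1 - p."""
    x = x0
    path = [x]
    for _ in range(steps):
        z = -1 if random.random() < p else 1
        x += z
        path.append(x)
    return path

print(random_walk()[:10])  # first few states of one sample path
```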

Page 5:

Markov Property

Also thought of as the “memoryless” property

A stochastic process is said to have the Markov property if the probability that state Xn+1 takes any given value depends only on the current state Xn, not on earlier states

Whether the property holds depends very much on how the states are described

Page 6:

Markov Property Example: Checkers

Current State: The current configuration of the board

Contains all information needed for transition to next state

Thus, each configuration can be said to have the Markov property

Page 7:

Markov Chain

A discrete-time stochastic process with the Markov property

Industry example: Google's PageRank algorithm, a probability distribution representing the likelihood that random link-following ends up on a given page
http://en.wikipedia.org/wiki/PageRank
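As an illustration of a Markov chain in code, the sketch below estimates the stationary distribution of a small 3-state chain by repeated multiplication with its transition matrix. The matrix values are made up for the example, and this is not PageRank itself (which adds a damping factor); it only shows the chain mechanics:

```python
import numpy as np

# Rows are current states, columns are next states; each row sums to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.3, 0.2],
    [0.0, 0.6, 0.4],
])

dist = np.array([1.0, 0.0, 0.0])  # start in state 0
for _ in range(100):              # power iteration toward the stationary distribution
    dist = dist @ P

print(dist)  # long-run probability of being in each state
```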

Page 8:

Markov Decision Process (MDP)

A discrete-time stochastic control process

An extension of Markov chains; the differences are:
the addition of actions (choice)
the addition of rewards (motivation)

If the actions are fixed, an MDP reduces to a Markov chain

Page 9:

Description of MDPs

A tuple (S, A, Pa(·,·), R(·))
S -> state space
A -> action space
Pa(s, s') = Pr(st+1 = s' | st = s, at = a)
R(s) = immediate reward at state s

Goal is to maximize some cumulative function of the rewards

Finite MDPs have finite state and action spaces

Page 10:

Simple MDP Example

Recycling Robot MDP

The robot can search for a trashcan, wait for someone to bring a trashcan, or go home and recharge its battery

Has two energy levels – high and low

Searching runs down the battery, waiting does not, and a depleted battery has a very low reward

news.bbc.co.uk

Page 11:

Transition Probabilities

s = st s’ = st+1 a = at Pass’ Ra

ss’

high high search α Rsearch

high low search 1 - α Rsearch

low high search 1 - β -3

low low search β Rsearch

high high wait 1 Rwait

high low wait 0 Rwait

low high wait 0 Rwait

low low wait 1 Rwait

low high recharge 1 0

low low recharge 0 0
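A sketch of this transition table as a Python dictionary, so it can be fed to the solution methods discussed later. The numeric values chosen for alpha, beta, Rsearch, and Rwait are illustrative assumptions; the slides leave them symbolic:

```python
# Transition model for the recycling robot, keyed by (state, action).
# Each entry maps next_state -> (probability, reward).
# alpha, beta, R_SEARCH, and R_WAIT are illustrative values, not from the slides.
alpha, beta = 0.8, 0.6
R_SEARCH, R_WAIT = 2.0, 1.0

transitions = {
    ("high", "search"):   {"high": (alpha, R_SEARCH), "low": (1 - alpha, R_SEARCH)},
    ("low",  "search"):   {"high": (1 - beta, -3.0),  "low": (beta, R_SEARCH)},
    ("high", "wait"):     {"high": (1.0, R_WAIT),     "low": (0.0, R_WAIT)},
    ("low",  "wait"):     {"high": (0.0, R_WAIT),     "low": (1.0, R_WAIT)},
    ("low",  "recharge"): {"high": (1.0, 0.0),        "low": (0.0, 0.0)},
}

# Sanity check: outgoing probabilities from every (state, action) pair sum to 1.
for (s, a), nexts in transitions.items():
    assert abs(sum(p for p, _ in nexts.values()) - 1.0) < 1e-9
```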

Page 12:

Transition Graph

(Figure: transition graph for the recycling robot, drawn with state nodes and action nodes.)

Page 13:

Solution to an MDP = Policy π

A policy gives the action to take from a given state, regardless of history

The solution uses two arrays indexed by state:
V, the value function: the discounted sum of rewards obtained, on average, by following the policy
π, an array of the actions to be taken in each state (the policy)

Two basic update steps:
V(s) := R(s) + γ ∑s' Pπ(s)(s,s') V(s')
π(s) := argmaxa ∑s' Pa(s,s') V(s')

Page 14:

Variants

Value Iteration
Policy Iteration
Modified Policy Iteration
Prioritized Sweeping

All are built from the same two basic steps:
(1) value update: V(s) := R(s) + γ ∑s' Pπ(s)(s,s') V(s')
(2) policy update: π(s) := argmaxa ∑s' Pa(s,s') V(s')

Page 15:

Value Iteration

V(s) = R(s) + γ maxa ∑s' Pa(s,s') V(s')

k    Vk(PU)   Vk(PF)   Vk(RU)   Vk(RF)
1    0        0        10       10
2    0        4.5      14.5     19
3    2.03     8.55     18.55    24.18
4    4.76     11.79    19.26    29.23
5    7.45     15.30    20.81    31.82
6    10.23    17.67    22.72    33.68
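A minimal value iteration sketch in Python, written against a transition model of the same shape as the recycling-robot dictionary above (Pa(s,s') probabilities with Ra(s,s') rewards). The discount factor gamma and the stopping threshold theta are illustrative choices, not values from the slides:

```python
def value_iteration(states, actions, transitions, gamma=0.9, theta=1e-6):
    """transitions[(s, a)] maps next_state -> (probability, reward).
    actions(s) returns the list of actions available in state s.
    Returns the value function V and a greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (r + gamma * V[s2]) for s2, (p, r) in transitions[(s, a)].items())
                for a in actions(s)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:          # stop when values change very little
            break
    policy = {
        s: max(actions(s), key=lambda a: sum(
            p * (r + gamma * V[s2]) for s2, (p, r) in transitions[(s, a)].items()))
        for s in states
    }
    return V, policy
```

For the recycling robot this could be called, for example, as value_iteration(["high", "low"], lambda s: ["search", "wait"] if s == "high" else ["search", "wait", "recharge"], transitions).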

Page 16:

Why So Interesting?

If the transition probabilities are known, this becomes a straightforward computational problem, however…

If the transition probabilities are unknown, then this is a problem for reinforcement learning.

Page 17:

Typical Agent

In reinforcement learning (RL), the agent observes a state and takes an action.

Afterward, the agent receives a reward.

Page 18:

Mission: Optimize Reward

Rewards are calculated in the environment

Used to teach the agent how to reach a goal state

Must signal what we ultimately want achieved, not necessarily subgoals

May be discounted over time

In general, seek to maximize the expected return
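For reference, the discounted return has the standard form below; the notation follows Sutton and Barto (cited in the references), since the slides themselves do not write it out:

```latex
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad 0 \le \gamma \le 1
```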

Page 19:

Value Functions

Vπ is a value function (how good is it to be in this state?)

Vπ is the unique solution to its Bellman Equation

Expresses relationship between a state and its successor states

Bellman equation (state-value function for policy π):

Vπ(s) = ∑a π(s,a) ∑s' Pa(s,s') [Ra(s,s') + γ Vπ(s')]

Page 20:

Another Value Function

Qπ defines the value of taking action a in state s under policy π

It is the expected return starting from s, taking action a, and thereafter following policy π

Backup diagrams for (a) Vπ and (b) Qπ

Action-value function for policy π

Page 21:

Dynamic Programming

Classically, a collection of algorithms used to compute optimal policies given a perfect model of the environment as an MDP

The classical view is not so useful in practice since we rarely have a perfect environment model

Provides foundation for other methods

Not practical for large problems

Page 22:

DP Continued…

Use value functions to organize and structure the search for good policies

Turn Bellman equations into update rules

Iterative policy evaluation using full backups

Page 23:

Policy Improvement

When should we change the policy?

If picking a new action a from state s and thereafter following the current policy π yields a value at least as large as Vπ(s), then the policy π' that picks a from s is at least as good overall: Vπ' >= Vπ.

Results from the policy improvement theorem

Page 24:

Policy Iteration

Continue improving the policy π and recalculating V(π)

A finite MDP has a finite number of policies, so convergence is guaranteed in a finite number of iterations
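A compact policy iteration sketch in Python, using the same transition-model shape as the value iteration sketch above; gamma and theta are illustrative choices:

```python
def policy_iteration(states, actions, transitions, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable.
    transitions[(s, a)] maps next_state -> (probability, reward);
    actions(s) returns the list of actions available in state s."""
    policy = {s: actions(s)[0] for s in states}   # arbitrary initial policy
    V = {s: 0.0 for s in states}

    def backup(s, a):
        # Expected one-step reward plus discounted value of the successor state.
        return sum(p * (r + gamma * V[s2]) for s2, (p, r) in transitions[(s, a)].items())

    while True:
        # Policy evaluation: iterate the Bellman equation for the current policy.
        while True:
            delta = 0.0
            for s in states:
                v = backup(s, policy[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in states:
            best = max(actions(s), key=lambda a: backup(s, a))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return V, policy
```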

Page 25:

Remember Value Iteration?

Value iteration truncates policy iteration by combining, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement.

Page 26:

Monte Carlo Methods

Requires only episodic experience – on-line or simulated

Based on averaging sample returns

Value estimates and policies only changed at the end of each episode, not on a step-by-step basis

Page 27:

Policy Evaluation

Compute average returns as the episode runs

Two methods: first-visit and every-visit

First-visit is most widely studied

First-visit MC method

Page 28:

Estimation of Action Values

State values are not enough without a model – we need action values as well

Qπ(s, a) expected return when starting in state s, taking action a, and thereafter following policy π

Exploration vs. Exploitation

Exploring starts

Page 29:

Example Monte Carlo Algorithm

First-visit Monte Carlo assuming exploring starts
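The algorithm box from the slide is not reproduced here; the following is a rough Python sketch of first-visit Monte Carlo control with exploring starts. The generate_episode(policy, s0, a0) helper (returning a list of (state, action, reward) triples), the episode count, and gamma are assumptions made for illustration:

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, generate_episode, num_episodes=10000, gamma=1.0):
    """First-visit Monte Carlo control with exploring starts.
    states is a list; actions(s) returns the list of actions available in s."""
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    policy = {s: random.choice(actions(s)) for s in states}

    for _ in range(num_episodes):
        # Exploring starts: every state-action pair has a chance to start an episode.
        s0 = random.choice(states)
        a0 = random.choice(actions(s0))
        episode = generate_episode(policy, s0, a0)   # [(state, action, reward), ...]

        G = 0.0
        # Work backwards through the episode, accumulating the return.
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            # Update only on the first visit to (s, a) within the episode.
            if all((s, a) != (x[0], x[1]) for x in episode[:t]):
                returns_sum[(s, a)] += G
                returns_cnt[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_cnt[(s, a)]
                policy[s] = max(actions(s), key=lambda act: Q[(s, act)])
    return Q, policy
```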

Page 30:

Another MC Algorithm

On-line, first-visit, ε-greedy MC without exploring starts
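Again the algorithm box itself is not reproduced; the key ingredient this variant adds is ε-greedy action selection, sketched below. The epsilon value is an illustrative choice:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```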

Page 31:

Temporal-Difference Learning

Central and novel to reinforcement learning

Combines Monte Carlo and DP methods

Can learn from experience without a model – like MC

Updates estimates based on other learned estimates (bootstraps) – like DP

Page 32:

TD(0)

The simplest TD method

Uses a sample backup from a single successor state or state-action pair instead of the full backup of DP methods
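The slide does not show the update rule itself; the standard TD(0) backup looks like this, with the step size alpha and discount gamma as illustrative parameter values:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) backup from a single sampled successor state:
    V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])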

Page 33:

SARSA – On-policy Control

Quintuple of events (st, at, rt+1, st+1, at+1)

Continually estimate Qπ while changing π
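The standard SARSA backup, driven by that quintuple, is sketched below; alpha and gamma are illustrative parameter values:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One on-policy SARSA backup using the quintuple (s, a, r, s', a'):
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```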

Page 34:

Q-Learning – Off-policy Control

Learned action-value function, Q, directly approximates Q*, the optimal action-value function, independent of policy being followed
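The corresponding one-step Q-learning backup, which bootstraps off the greedy action rather than the action actually taken next, is sketched below; alpha and gamma are illustrative parameter values:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One off-policy Q-learning backup:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)].
    Q is assumed to hold an entry (e.g. via a defaultdict) for every (state, action) pair."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```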

Page 35:

Case Study

Job-Shop Scheduling

Temporal and resource constraints

Find constraint-satisfying schedules of short duration

In its general form, NP-complete

Page 36:

NASA Space Shuttle Payload Processing Problem (SSPPP)

Schedule tasks required for installation and testing of shuttle cargo bay payloads

Typical: 2-6 shuttle missions, each requiring 34-164 tasks

Zhang and Dietterich (1995, 1996; Zhang, 1996)

First successful instance of RL applied in plan-space: states = complete plans, actions = plan modifications

Page 37:

SSPPP – continued…

Each state was an entire schedule

Two types of actions:
REASSIGN-POOL operators: reassign a resource to a different pool
MOVE operators: move a task to the first earlier or later time with satisfied resource constraints

Small negative reward for each step

Resource dilation factor (RDF) formula for rewarding final schedule’s duration

Page 38:

Even More SSPPP…

Used TD(λ) to learn the value function

Actions selected by a decreasing ε-greedy policy with one-step lookahead

Function approximation used multilayer neural networks

Training generally took 10,000 episodes

Each resulting network represented a different scheduling algorithm – not a schedule for a specific instance!

Page 39:

RL and CBR

Example: CBR used to store various policies, with RL used to learn and modify those policies (Ashwin Ram and Juan Carlos Santamaría, 1993: autonomous robotic control)

Job-shop scheduling: RL used to repair schedules, CBR used to determine which repair to make

Similar methods can be used for IDSS

Page 40:

References

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.

Stochastic Processes, www.hanoivn.net

http://en.wikipedia.org/wiki/PageRank

http://en.wikipedia.org/wiki/Markov_decision_process

Zeng, D. and Sycara, K. Using Case-Based Reasoning as a Reinforcement Learning Framework for Optimization with Changing Criteria, 1995.