REINFORCEMENT LEARNING
Multi-state RL

OVERVIEW

• Reinforcement Learning
• Markov Decision Process
• Q-Learning & Sarsa
• Convergence
• Planning & learning
• Actor-Critic
• Monte Carlo
• Reinforcement Learning in Normal Form Games

MARKOV DECISION PROCESS

• It is often useful to assume that all relevant information is present in the current state: the Markov property
• If an RL task has the Markov property, it is essentially a Markov Decision Process (MDP)
• Assuming finite state and action spaces, it is a finite MDP

$P(s_{t+1}, r_{t+1} \mid s_t, a_t) = P(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, a_0, s_0)$

MARKOV DECISION PROCESS

An MDP is defined by:
• State and action sets
• A transition function
  $P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$
• A reward function
  $R^a_{ss'} = E(r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s')$
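As a concrete illustration of these definitions, a finite MDP can be written down directly as the transition probabilities $P^a_{ss'}$ and expected rewards $R^a_{ss'}$. The two-state, two-action example below (states, actions and numbers) is a hypothetical sketch for illustration only, not part of the slides:

```python
import random

# Minimal sketch of a finite MDP as plain data (hypothetical two-state example).
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[s][a][s'] is the transition probability P^a_{ss'}
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}

# R[s][a][s'] is the expected reward R^a_{ss'}
R = {
    "s0": {"stay": {"s0": 0.0}, "move": {"s1": 1.0, "s0": 0.0}},
    "s1": {"stay": {"s1": 0.0}, "move": {"s0": 0.0, "s1": 0.0}},
}

def sample_transition(rng, s, a):
    """Sample model: draw s' ~ P^a_{s.} and return (s', R^a_{ss'})."""
    successors, probs = zip(*P[s][a].items())
    s_next = rng.choices(successors, weights=probs)[0]
    return s_next, R[s][a][s_next]

rng = random.Random(0)
print(sample_transition(rng, "s0", "move"))
```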

AGENT-ENVIRONMENT INTERFACE

[Figure: the agent-environment loop. At each time step t the agent observes state $s_t$ and takes action $a_t$; the environment returns reward $r_{t+1}$ and next state $s_{t+1}$, producing the trajectory $\ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$]

VALUE FUNCTIONS

• Goal: learn $\pi : S \rightarrow A$, given experience tuples $\langle\langle s, a\rangle, r\rangle$
• When following a fixed policy $\pi$ we can define the value of a state s under that policy as
  $V^\pi(s) = E_\pi(R_t \mid s_t = s) = E_\pi\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right)$
• Similarly, we can define the value of taking action a in state s as
  $Q^\pi(s, a) = E_\pi(R_t \mid s_t = s, a_t = a)$
• Optimal policy: $\pi^* = \arg\max_\pi V^\pi(s)$

BACKUP DIAGRAMS

$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$

$Q^\pi(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$

For the optimal policy: $V^*(s) = \max_a Q^*(s, a)$
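The Bellman equation for $V^\pi$ translates directly into iterative policy evaluation. The sketch below reuses the `states`, `actions`, `P` and `R` structures from the hypothetical MDP sketch above and assumes a uniformly random policy; both are illustrative assumptions:

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, theta=1e-8):
    """Iteratively apply V(s) <- sum_a pi(s,a) sum_s' P^a_ss' [R^a_ss' + gamma V(s')]
    until the largest change in any state value falls below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                for s_next, p in P[s][a].items():
                    v_new += pi[s][a] * p * (R[s][a][s_next] + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Evaluate a uniformly random policy on the two-state example MDP above
pi_uniform = {s: {a: 1.0 / len(actions) for a in actions} for s in states}
print(policy_evaluation(states, actions, P, R, pi_uniform))
```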

STATE VALUES & STATE-ACTION VALUES


MODEL BASED: DYNAMIC PROGRAMMING

[Figure: full backup diagram over all successor states $s_{t+1}$, rewards $r_{t+1}$ and terminal nodes T]

$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s\}$

MODEL FREE: REINFORCEMENT LEARNING

[Figure: sample backup along a single experienced transition $s_t, r_{t+1}, s_{t+1}$, rather than a full backup over all successors]

Q-LEARNING

One-step Q-learning:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$
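A minimal tabular sketch of this update with ε-greedy action selection. The environment interface (env.reset() returning a state, env.step(a) returning (next state, reward, done)) is an assumption, not something defined in the slides:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, rng, eps=0.1):
    """With probability eps pick a random action, otherwise argmax_a Q(s, a)."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning_episode(env, Q, actions, rng, alpha=0.1, gamma=0.95, eps=0.1):
    """Run one episode of one-step Q-learning (off-policy TD control)."""
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, actions, rng, eps)
        s_next, r, done = env.step(a)
        # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

Q = defaultdict(float)  # tabular Q, defaults to 0 for unseen (s, a) pairs
```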

Q-LEARNING: EXAMPLE

[Figure: example MDP with six states (1-6), actions a-d, rewards R = 1, 1, 4, 5, 2, 10 on the transitions, and transition probabilities 0.2/0.8, 0.3/0.7 and 1.0]

• Epoch 1: 1, 2, 4
• Epoch 2: 1, 6
• Epoch 3: 1, 3
• Epoch 4: 1, 2, 5
• Epoch 6: 2, 5

UPDATING Q: IN PRACTICE


CONVERGENCE OF DETERMINISTIC Q-LEARNING

Q-learning is guaranteed to converge in a Markovian setting, i.e. $\hat{Q}$ converges to $Q$ when each (s, a) pair is visited infinitely often.

Extra material: Tsitsiklis, J. N. (1994). Asynchronous Stochastic Approximation and Q-learning. Machine Learning, 16:185-202.

CONVERGENCE OF DETERMINISTIC Q-LEARNING

Proof:

• Let a full interval be an interval during which each (s, a) is visited
• Let $\hat{Q}_n$ be the Q-table after n updates
• $\Delta_n$ is the maximum error in $\hat{Q}_n$:

$\Delta_n = \max_{s,a} |\hat{Q}_n(s, a) - Q(s, a)|$

CONVERGENCE OF DETERMINISTIC Q-LEARNING

For any table entry $\hat{Q}_n(s, a)$ updated on iteration n+1, the error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is

$|\hat{Q}_{n+1}(s, a) - Q(s, a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s', a')) - (r + \gamma \max_{a'} Q(s', a'))|$
$\qquad = |\gamma \max_{a'} \hat{Q}_n(s', a') - \gamma \max_{a'} Q(s', a')|$
$\qquad \le \gamma \max_{a'} |\hat{Q}_n(s', a') - Q(s', a')|$
$\qquad \le \gamma \max_{s'', a'} |\hat{Q}_n(s'', a') - Q(s'', a')|$

$|\hat{Q}_{n+1}(s, a) - Q(s, a)| \le \gamma \Delta_n < \Delta_n$

SARSA: ON-POLICY TD-CONTROL

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$

Q-LEARNING VS SARSA

• One-step Q-learning (off-policy):

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$

• Sarsa (on-policy):

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
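The only difference between the two rules is the TD target. A small, hypothetical helper makes the contrast explicit (the Q-table and action list follow the conventions of the earlier sketch):

```python
def td_target(Q, actions, r, s_next, a_next, gamma, method):
    """TD target for the two updates above.
    Q-learning (off-policy): r + gamma * max_a' Q(s', a')
    Sarsa      (on-policy) : r + gamma * Q(s', a'), with a' the action actually taken."""
    if method == "q_learning":
        return r + gamma * max(Q[(s_next, a)] for a in actions)
    if method == "sarsa":
        return r + gamma * Q[(s_next, a_next)]
    raise ValueError(f"unknown method: {method}")
```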

CLIFF WALKING EXAMPLE

[Figure: cliff-walking gridworld from S to G with the cliff (r = -100) along the bottom edge; the safe path runs away from the cliff, the optimal path along its edge. Plot: reward per episode over 500 episodes, with Sarsa obtaining higher online reward than Q-learning.]

• Actions: up, down, left, right
• Reward: cliff = -100, goal = 0, default = -1
• ε-greedy, with ε = 0.1

PLANNING AND LEARNING

• Model: anything the agent can use to predict how the environment will respond to its actions
• Distribution model: a description of all possibilities and their probabilities, e.g. $P^a_{ss'}$ and $R^a_{ss'}$ for all s, s' and $a \in A(s)$
• Sample model: produces sample experiences, e.g. a simulation model
• Both types of models can be used to produce simulated experience
• Sample models are often easier to come by

PLANNING

• Planning is any computational process that uses a model to create or improve a policy:
  model → (planning) → policy
• Planning in AI:
  - state-space planning (e.g. heuristic search methods)
• We take the following (unusual) view:
  - all state-space planning methods involve computing value functions, either explicitly or implicitly
  - they all apply backups to simulated experience:
  model → simulated experience → (backups) → values → policy

PLANNING

Two uses of real experience:

• model learning: to improve the model
• direct RL: to directly improve the value function and policy

Improving the value function and/or policy via a model is sometimes called indirect RL, model-based RL or planning.

[Figure: loop connecting experience, model and value/policy — acting produces experience; experience drives model learning and direct RL; the model drives planning back into the value/policy.]

INDIRECT VS DIRECT RL

Indirect methods:
• make fuller use of experience: get a better policy with fewer environment interactions

Direct methods:
• simpler
• not affected by bad models

These are closely related; planning, acting, model learning and direct RL can occur simultaneously and in parallel.

DYNA-Q ALGORITHM

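A minimal tabular Dyna-Q sketch, combining direct RL, model learning and planning as described on the previous slides. It reuses the epsilon_greedy helper and the env.reset()/env.step() interface assumed in the earlier Q-learning sketch:

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes, n_planning=5, alpha=0.1, gamma=0.95,
           eps=0.1, seed=0):
    """Tabular Dyna-Q: each real step performs (a) a direct Q-learning update,
    (b) a model-learning update, and (c) n_planning simulated backups drawn
    from the learned (deterministic) model."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}  # (s, a) -> (r, s') last observed for that pair
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, rng, eps)
            s_next, r, done = env.step(a)
            # (a) direct RL on the real transition
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions)
                                  - Q[(s, a)])
            # (b) model learning: remember what the environment did
            model[(s, a)] = (r, s_next)
            # (c) planning: n_planning backups on simulated experience
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, b)] for b in actions)
                                        - Q[(ps, pa)])
            s = s_next
    return Q
```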

DYNA-Q IN A MAZE

[Figure: maze gridworld from S to G, and a plot of steps per episode (up to 800) against episodes (up to 50) for 0 planning steps (direct RL only), 5 planning steps and 50 planning steps; more planning steps converge in far fewer episodes.]

• Reward = 0, except 1 at the goal
• ε-greedy, ε = 0.1
• learning rate = 0.1
• initial Q-values = 0
• discount factor = 0.95

DYNA-Q: SNAPSHOTS

[Figure: learned policies in the maze, without planning (N = 0) vs with planning (N = 50).]

DYNA-Q: WRONG MODEL

• Easier environment: the maze changes so that a better path becomes available

[Figure: maze before and after the change, and cumulative reward over 6000 time steps for Dyna-Q+, Dyna-Q and Dyna-AC.]

DYNA-Q: WRONG MODEL

• Harder environment: the maze changes so that the learned path is blocked

[Figure: maze before and after the change, and cumulative reward over 3000 time steps for Dyna-Q+, Dyna-Q and Dyna-AC.]

WHAT IS DYNA-Q+?

Uses an 'exploration bonus':

• Keep track of the time since each state-action pair was last tried for real
• An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting
• The agent effectively "plans" how to visit long-unvisited states

Planning backups use the reward $r + k\sqrt{n}$, with k a weight factor and n the time since the pair was last tried.
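A minimal sketch of that bonus as it would be applied to the reward used in planning backups; the function name and the default value of k are illustrative assumptions:

```python
import math

def dyna_q_plus_reward(r, steps_since_tried, k=1e-3):
    """Reward used in Dyna-Q+ planning backups: r + k * sqrt(n), where n is the
    number of time steps since this (s, a) pair was last tried for real."""
    return r + k * math.sqrt(steps_since_tried)
```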

ACTOR-CRITIC METHODS

• Explicit representation of the policy as well as the value function
• Minimal computation to select actions
• Can learn an explicitly stochastic policy
• Can put constraints on policies
• Appealing as psychological and neural models

[Figure: actor-critic architecture — the actor (policy) selects actions; the critic (value function) turns the environment's state and reward into a TD error that updates both.]

ACTOR-CRITIC DETAILS

If actions are determined by preferences p(s, a) as follows:

$\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\} = \dfrac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$

then you can update the preferences like this:

$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$

where the TD error is used to evaluate actions:

$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
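A minimal tabular sketch of these updates: a Gibbs (softmax) actor over preferences p(s, a) and a TD-error critic. The step sizes α (critic) and β (actor) follow the formulas above; everything else (names, defaults) is an illustrative assumption:

```python
import math
import random
from collections import defaultdict

def gibbs_action(p, s, actions, rng):
    """Sample a ~ pi_t(s, .) = exp(p(s, a)) / sum_b exp(p(s, b))."""
    prefs = [p[(s, a)] for a in actions]
    m = max(prefs)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in prefs]
    return rng.choices(actions, weights=weights)[0]

def actor_critic_update(p, V, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.95):
    """Compute delta_t = r + gamma V(s') - V(s); the critic updates V and the
    actor updates the preference p(s, a) with the same TD error."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta        # critic update
    p[(s, a)] += beta * delta    # actor update
    return delta

p = defaultdict(float)  # actor: action preferences p(s, a)
V = defaultdict(float)  # critic: state values V(s)
```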

MONTE CARLO METHODS

• Monte Carlo methods learn from complete sample returns
  - only defined for episodic tasks
  - no bootstrapping
• Monte Carlo methods learn directly from experience
  - online: no model necessary and still attains optimality
  - simulated: no need for a full model

MONTE CARLO POLICY EVALUATION

• Goal: learn $V^\pi(s)$
• Given: some number of episodes under $\pi$ which contain s
• Idea: average the returns observed after visits to s
• Every-visit MC: average the returns for every time s is visited in an episode
• First-visit MC: average the returns only for the first time s is visited in an episode
• Both converge asymptotically

FIRST-VISIT MONTE CARLO POLICY EVALUATION

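A minimal sketch of first-visit MC policy evaluation, assuming episodes are already available as lists of (state, reward) pairs; the toy data at the end is purely illustrative:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation.
    `episodes` is a list of episodes; each episode is a list of (s, r) pairs,
    with r the reward received after leaving s. Returns V(s) as the average of
    the returns that follow the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:
        # returns computed backwards: G_t = r_{t+1} + gamma * G_{t+1}
        G = [0.0] * (len(episode) + 1)
        for t in reversed(range(len(episode))):
            G[t] = episode[t][1] + gamma * G[t + 1]
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            returns[s].append(G[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Hypothetical toy usage: two episodes of (state, reward) pairs
episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)], [("B", 1.0)]]
print(first_visit_mc(episodes))  # averages the first-visit returns per state
```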

BLACKJACK EXAMPLE

• Objective: have your card sum be greater than the dealer's without exceeding 21
• States (200 in total):
  - current sum (12-21)
  - dealer's showing card (ace-10)
  - do I have a usable ace?
• Reward: +1 for winning, 0 for a draw, -1 for losing
• Actions: stick (stop receiving cards), hit (receive another card)
• Policy: stick if my sum is 20 or 21, else hit
• Dealer's policy: stick on any sum of 17 or greater, otherwise hit

BLACKJACK VALUE FUNCTIONS


BACKUP SCHEME: DP

$V(s_t) \leftarrow E_\pi\{r_{t+1} + \gamma V(s_{t+1})\}$

[Figure: full backup diagram over all successor states $s_{t+1}$, rewards $r_{t+1}$ and terminal nodes T]

BACKUP SCHEME: TD

$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$

[Figure: sample backup along the single experienced transition $s_t, r_{t+1}, s_{t+1}$]

SIMPLE MONTE CARLO

$V(s_t) \leftarrow V(s_t) + \alpha\left[R_t - V(s_t)\right]$, where $R_t$ is the actual return following state $s_t$

[Figure: backup along an entire sampled episode from $s_t$ to a terminal state]

MONTE CARLO ESTIMATION OF ACTION VALUES (Q)

$Q^\pi(s, a)$ = average return starting from state s and action a, following $\pi$

Also converges asymptotically if every state-action pair is visited.

Exploring starts: every state-action pair has a non-zero probability of being the starting pair.

MONTE CARLO CONTROL

• MC policy iteration: policy evaluation using MC methods, followed by policy improvement
• Policy improvement step: greedify with respect to the value (or action-value) function
• The greedified policy meets the conditions for policy improvement
• This assumes exploring starts and an infinite number of episodes for MC policy evaluation
• To work around the latter:
  - update only to a given level of performance
  - alternate between evaluation and improvement per episode

CONVERGENCE OF MC CONTROL


MONTE CARLO EXPLORING STARTS


BLACKJACK EXAMPLE CONTINUED

• Exploring starts
• Initial policy as described before

ON-POLICY MONTE CARLO CONTROL

• On-policy: learn about the policy currently being executed
• The policy can also be non-deterministic, e.g. an ε-soft policy:
  - probability of selecting a non-best action: $\epsilon / |A(s)|$
  - probability of selecting the best action: $1 - \epsilon + \epsilon / |A(s)|$
• Similar to GPI: move the policy towards the greedy policy (i.e. ε-soft)
• Converges to the best ε-soft policy

ON-POLICY MC CONTROL

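A minimal sketch of on-policy first-visit MC control with an ε-soft (here ε-greedy) policy, using the same assumed env.reset()/env.step() interface as the earlier sketches:

```python
import random
from collections import defaultdict

def epsilon_soft_action(Q, s, actions, rng, eps=0.1):
    """epsilon-soft behaviour: every action keeps probability at least eps/|A|."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def on_policy_mc_control(env, actions, n_episodes, gamma=1.0, eps=0.1, seed=0):
    """On-policy first-visit MC control for an epsilon-soft policy: generate an
    episode with the current policy, then update Q towards the observed
    first-visit returns (sample-average updates)."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_episodes):
        # generate one episode following the current epsilon-soft policy
        trajectory, s, done = [], env.reset(), False
        while not done:
            a = epsilon_soft_action(Q, s, actions, rng, eps)
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next
        # backward returns and first-visit indices per (s, a) pair
        G = [0.0] * (len(trajectory) + 1)
        for t in reversed(range(len(trajectory))):
            G[t] = trajectory[t][2] + gamma * G[t + 1]
        first_visit = {}
        for t, (s, a, _) in enumerate(trajectory):
            first_visit.setdefault((s, a), t)
        for (s, a), t in first_visit.items():
            counts[(s, a)] += 1
            Q[(s, a)] += (G[t] - Q[(s, a)]) / counts[(s, a)]
    return Q
```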
