
Model-Free vs. Model-Based RL: Q, SARSA, & E3

Administrivia

•Reminder:

•Office hours tomorrow truncated

•9:00-10:15 AM

•Can schedule other times if necessary

•Final projects

•Final presentations Dec 2, 7, 9

•20 min (max) presentations

•3 or 4 per day

•Sign up for presentation slots today!

The Q-learning algorithm

Algorithm: Q_learn
Inputs: State space S; Action space A;
        Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
Repeat {
  s=get_current_world_state()
  a=pick_next_action(Q,s)
  (r,s')=act_in_world(a)
  Q(s,a)=Q(s,a)+α*(r+γ*max_a'(Q(s',a'))-Q(s,a))
} Until (bored)
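A minimal Python sketch of the same loop, for concreteness. The environment interface here (env.n_actions, env.current_state(), env.step(a) returning (s', r)) and the ε-greedy pick_next_action are illustrative assumptions, not part of the slides.

import random
from collections import defaultdict

def q_learn(env, n_steps=10000, gamma=0.9, alpha=0.1, epsilon=0.1):
    # Q table: Q[(s, a)] defaults to 0.0
    Q = defaultdict(float)

    def pick_next_action(s):
        # epsilon-greedy over the current Q estimates (one simple choice
        # of exploration policy; the slides leave this open)
        if random.random() < epsilon:
            return random.randrange(env.n_actions)
        return max(range(env.n_actions), key=lambda a: Q[(s, a)])

    s = env.current_state()                   # get_current_world_state()
    for _ in range(n_steps):                  # "Repeat ... Until (bored)"
        a = pick_next_action(s)
        s_next, r = env.step(a)               # act_in_world(a) -> (r, s')
        best_next = max(Q[(s_next, a2)] for a2 in range(env.n_actions))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
    return Q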

SARSA-learning algorithm

Algorithm: SARSA_learn
Inputs: State space S; Action space A;
        Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
s=get_current_world_state()
a=pick_next_action(Q,s)
Repeat {
  (r,s')=act_in_world(a)
  a'=pick_next_action(Q,s')
  Q(s,a)=Q(s,a)+α*(r+γ*Q(s',a')-Q(s,a))
  a=a'; s=s';
} Until (bored)
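The same sketch adapted to SARSA; the only change from the Q-learning sketch is that the value of the action actually chosen at s' is backed up. Same assumed environment interface as above.

import random
from collections import defaultdict

def sarsa_learn(env, n_steps=10000, gamma=0.9, alpha=0.1, epsilon=0.1):
    Q = defaultdict(float)

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.randrange(env.n_actions)
        return max(range(env.n_actions), key=lambda a: Q[(s, a)])

    s = env.current_state()
    a = pick_next_action(s)
    for _ in range(n_steps):
        s_next, r = env.step(a)               # act_in_world(a) -> (r, s')
        a_next = pick_next_action(s_next)     # action that will actually be taken next
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
    return Q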

SARSA vs. Q

•SARSA and Q-learning are very similar

•SARSA updates Q(s,a) for the policy it's actually executing

•Lets the pick_next_action() function choose the action a' whose value is backed up

•Q-learning updates Q(s,a) for the greedy policy w.r.t. the current Q

•Uses max_a' to choose the action whose value is backed up

•This might differ from the action it actually executes at s' (see the sketch below)

•In practice: Q-learning will learn the “true” π*, while SARSA learns the value of the policy it is actually executing

•Exploration can get Q-learning in trouble...
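To make the contrast concrete, here is the one line where the two methods differ, written as hedged helper functions (the names are illustrative, not from the slides):

def q_learning_target(Q, r, s_next, actions, gamma):
    # backs up the *greedy* action at s' (off-policy)
    return r + gamma * max(Q[(s_next, a)] for a in actions)

def sarsa_target(Q, r, s_next, a_next, gamma):
    # backs up the action a' that will actually be executed (on-policy)
    return r + gamma * Q[(s_next, a_next)]

With a purely greedy pick_next_action the two targets coincide; they differ only on steps where exploration selects a non-greedy a'.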

Radioactive breadcrumbs

•Can now define eligibility traces for SARSA

•In addition to Q(s,a) table, keep an e(s,a) table

•Records “eligibility” (real number) for each state/action pair

•At every step ((s,a,r,s’,a’) tuple):

•Increment e(s,a) for current (s,a) pair by 1

•Update all Q(s’’,a’’) vals in proportion to their e(s’’,a’’)

•Decay all e(s’’,a’’) by factor of λγ

•Leslie Kaelbling calls this the “radioactive breadcrumbs” form of RL

SARSA(λ)-learning algorithm

Algorithm: SARSA(λ)_learn
Inputs: S, A, γ (0<=γ<1), α (0<=α<1), λ (0<=λ<1)
Outputs: Q
e(s,a)=0 // for all s,a
s=get_current_world_state(); a=pick_next_action(Q,s);
Repeat {
  (r,s')=act_in_world(a)
  a'=pick_next_action(Q,s')
  δ=r+γ*Q(s',a')-Q(s,a)
  e(s,a)+=1
  foreach (s'',a'') pair in (S×A) {
    Q(s'',a'')=Q(s'',a'')+α*e(s'',a'')*δ
    e(s'',a'')*=λγ
  }
  a=a'; s=s';
} Until (bored)
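A NumPy sketch of the same algorithm with accumulating traces; env.n_states, env.n_actions, env.current_state(), and env.step(a) returning (s', r) are assumed for illustration.

import random
import numpy as np

def sarsa_lambda_learn(env, n_steps=10000, gamma=0.9, alpha=0.1, lam=0.8, epsilon=0.1):
    Q = np.zeros((env.n_states, env.n_actions))
    e = np.zeros_like(Q)                      # eligibility table e(s,a)

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.randrange(env.n_actions)
        return int(np.argmax(Q[s]))

    s = env.current_state()
    a = pick_next_action(s)
    for _ in range(n_steps):
        s_next, r = env.step(a)               # act_in_world(a) -> (r, s')
        a_next = pick_next_action(s_next)
        delta = r + gamma * Q[s_next, a_next] - Q[s, a]
        e[s, a] += 1.0                        # accumulating trace for current (s,a)
        Q += alpha * delta * e                # update every (s'',a'') in proportion to e
        e *= lam * gamma                      # decay all traces by λγ
        s, a = s_next, a_next
    return Q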

The trail of crumbs

[Figure: the trail of eligibility “crumbs” along the agent's path (one panel labeled λ=0); Sutton & Barto, Sec 7.5]

Eligibility for a single state

[Figure: eligibility e(s_i,a_j) for a single state/action pair over time, marked at the 1st visit, 2nd visit, ...; Sutton & Barto, Sec 7.5]

Eligibility trace followup

•Eligibility trace allows:

•Tracking where the agent has been

•Backup of rewards over longer periods

•Credit assignment: state/action pairs rewarded for having contributed to getting to the reward

•Why does it work?

The “forward view” of eligibility

•Original SARSA did a “one step” backup: only rt and Q(st+1,at+1) are backed up into Q(s,a); the rest of the trajectory is ignored

•Could also do a “two step” backup: back up rt, rt+1, and Q(st+2,at+2) into Q(s,a)

•Or even an “n step” backup (written out below)

[Figure: one-step and two-step backup diagrams showing the backed-up info vs. the rest of the trajectory]
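For reference, in standard (Sutton & Barto style) notation the n-step backup target and the corresponding update are:

\[
G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n}\, Q(s_{t+n}, a_{t+n}),
\qquad
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\bigl[G_t^{(n)} - Q(s_t, a_t)\bigr].
\]

With n=1 this reduces to the one-step SARSA update above.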

The “forward view” of eligibility

•Small-step backups (n=1, n=2, etc.) are slow and nearsighted

•Large-step backups (n=100, n=1000, n=∞) are expensive and may miss near-term effects

•Want a way to combine them

•Can take a weighted average of different backups

•E.g.:

The “forward view” of eligibility

[Figure: example weighted average of two backups with weights 1/3 and 2/3]

The “forward view” of eligibility

•How do you know which number of steps to average over? And what should the weights be?

•Accumulating eligibility traces are just a clever way to easily average over all n:

[Figure: the 1-step, 2-step, ..., n-step backups weighted by λ0, λ1, λ2, ..., λn-1]
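Written out (standard λ-return notation, added here for reference), weighting the n-step backups this way gives:

\[
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_t^{(n)}.
\]

The factor (1-λ) makes the weights sum to 1 (since Σ_{n≥1} λ^(n-1) = 1/(1-λ)); λ=0 recovers the one-step backup, and λ→1 approaches a full Monte Carlo return.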

Replacing traces

•The kind just described are accumulating eligibility traces

•Every time you return to a state/action pair, add an extra 1 to e(s,a)

•There are also replacing eligibility traces (see the sketch below)

•Every time you return to a state/action pair, reset e(s,a) to 1

•Works better sometimes

Sutton & Barto, Sec 7.8
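A one-line sketch of the difference between the two trace types (hypothetical helper, not from the slides):

def bump_trace(e, s, a, replacing=False):
    # e is a 2-D array (or dict) of eligibilities indexed by (s, a)
    if replacing:
        e[s, a] = 1.0      # replacing trace: reset to 1 on each visit
    else:
        e[s, a] += 1.0     # accumulating trace: visits add up
    return e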

Model-free vs. Model-based

What do you know?

•Both Q-learning and SARSA(λ) are model-free methods

•A.k.a., value-based methods

•Learn a Q function

•Never learn T or R explicitly

•At the end of learning, agent knows how to act, but doesn’t explicitly know anything about the environment

•Also, no guarantees about explore/exploit tradeoff

•Sometimes, want one or both of the above

Model-based methods

•Model-based methods, on the other hand, do explicitly learn T & R

•At the end of learning, have the entire model M=〈S,A,T,R〉

•Also have π*

•At least one model-based method also guarantees explore/exploit tradeoff properties

E3

•Efficient Explore & Exploit algorithm

•Kearns & Singh, Machine Learning 49, 2002

•Explicitly keeps a T matrix and an R table

•Plan (policy iteration) with current T & R -> current π

•Every state/action entry in T and R:

•Can be marked known or unknown

•Has a #visits counter, nv(s,a)

•After every 〈s,a,r,s'〉 tuple, update T & R (running average)

•When nv(s,a)>NVthresh, mark cell as known & re-plan

•When all states known, done learning & have π*

The E3 algorithm

Algorithm: E3_learn_sketch // only an overview
Inputs: S, A, γ (0<=γ<1), NVthresh, Rmax, Varmax
Outputs: T, R, π*
Initialization:
  R(s)=Rmax // for all s
  T(s,a,s')=1/|S| // for all s,a,s'
  known(s,a)=0; nv(s,a)=0; // for all s,a
  π=policy_iter(S,A,T,R)

The E3 algorithm

Algorithm: E3_learn_sketch // cont'd
Repeat {
  s=get_current_world_state()
  a=π(s)
  (r,s')=act_in_world(a)
  T(s,a,s')=(1+T(s,a,s')*nv(s,a))/(nv(s,a)+1)
  nv(s,a)++;
  if (nv(s,a)>NVthresh) {
    known(s,a)=1;
    π=policy_iter(S,A,T,R)
  }
} Until (all (s,a) known)
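A rough Python sketch of this loop. Here T is estimated from visit counts, which is a normalized form of the running average in the pseudocode; policy_iter and the environment interface (env.n_states, env.n_actions, env.current_state(), env.step) are assumed stubs, not part of the slides.

import numpy as np

def e3_learn_sketch(env, policy_iter, gamma=0.9, nv_thresh=20, r_max=1.0, max_steps=100000):
    S, A = env.n_states, env.n_actions
    R = np.full(S, r_max)                     # optimistic reward init (Rmax)
    T = np.full((S, A, S), 1.0 / S)           # uniform initial transition model
    counts = np.zeros((S, A, S))              # transition counts
    r_sum = np.zeros(S)                       # running-average reward bookkeeping
    r_cnt = np.zeros(S)
    nv = np.zeros((S, A), dtype=int)          # visit counter nv(s,a)
    known = np.zeros((S, A), dtype=bool)
    pi = policy_iter(T, R, gamma)             # assumed: returns array pi[s] -> action

    for _ in range(max_steps):
        s = env.current_state()
        a = pi[s]
        s_next, r = env.step(a)               # act_in_world(a) -> (r, s')
        counts[s, a, s_next] += 1
        nv[s, a] += 1
        T[s, a] = counts[s, a] / nv[s, a]     # running-average estimate of T(s,a,.)
        r_sum[s] += r
        r_cnt[s] += 1
        R[s] = r_sum[s] / r_cnt[s]            # simplified running-average reward
        if nv[s, a] > nv_thresh and not known[s, a]:
            known[s, a] = True
            pi = policy_iter(T, R, gamma)     # re-plan with the updated model
        if known.all():                       # all (s,a) known -> done
            break
    return T, R, pi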
