Lecture 19: Reinforcement Learning – part II (RL algorithms)
Sanjeev Arora, Elad Hazan
COS 402 – Machine Learning and Artificial Intelligence, Fall 2016
Admin
• (Programming) exercise MCMC – due next class
• We will have at least 1 more exercise on RL
• Last lecture of the course: “ask us anything”, Prof. Arora + myself. Exercise: submit a question the lecture before (graded)
Markov Decision Process
Markov Decision Process, definition:
• Tuple $(S, P, R, A, \gamma)$ where
• $S$ = states, including start state
• $A$ = set of possible actions
• $P$ = transition matrix, $P_{ss'}^{a} = \Pr[S_{t+1} = s' \mid S_t = s, A_t = a]$
• $R$ = reward function, $R_{s}^{a} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
• $\gamma \in [0,1]$ = discount factor
• Return $G_t = \sum_{k=1}^{\infty} \gamma^{k-1} R_{t+k}$
• Goal: take actions to maximize expected return
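A concrete way to hold such a tuple in code (not from the lecture; the 2-state, 2-action numbers below are invented purely for illustration) is a pair of numpy arrays indexed as P[a, s, s'] and R[s, a]:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
n_states, n_actions = 2, 2

# P[a, s, s'] = Pr[S_{t+1} = s' | S_t = s, A_t = a]; each row is a distribution.
P = np.array([
    [[0.9, 0.1],    # action 0 taken in state 0
     [0.2, 0.8]],   # action 0 taken in state 1
    [[0.5, 0.5],    # action 1 taken in state 0
     [0.0, 1.0]],   # action 1 taken in state 1
])

# R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

gamma = 0.9  # discount factor

assert np.allclose(P.sum(axis=2), 1.0)  # sanity check: transition rows sum to 1
```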
Policies
The Markovian structure ⇒ the best action depends only on the current state!
• Policy = mapping from state to distribution over actions, $\pi : S \mapsto \Delta(A)$, $\pi(a \mid s) = \Pr[A_t = a \mid S_t = s]$
• Given a policy, the MDP reduces to a Markov Reward Process
Reminder: MDP 1
[Figure: grid-world MDP with a START state. Rewards: +1 at cell [4,3], −1 at cell [4,2], and −0.04 for each step. Actions: UP, DOWN, LEFT, RIGHT. Actions are noisy; e.g. UP moves up with probability 80%, left with probability 10%, right with probability 10%.]
• States
• Actions
• Rewards
• Policies?
Reminder 2
• States? Actions? Rewards? Policy?
Policies: action = study
[Figure: the student MDP with states Class (R = 2), Party (R = 1), Sleep (R = 0) and Facebook (R = −1), drawn under the policy that always chooses the action “study”; edges are labeled with the resulting transition probabilities.]
Policies: action = facebook
[Figure: the same student MDP, with states Class (R = 2), Party (R = 1), Sleep (R = 0) and Facebook (R = −1), drawn under the policy that always chooses the action “facebook”; edges are labeled with the resulting transition probabilities.]
Fixed policy → Markov Reward Process
• Given a policy, the MDP reduces to a Markov Reward Process
• $P_{ss'}^{\pi} = \sum_{a \in A} \pi(a \mid s)\, P_{ss'}^{a}$
• $R_{s}^{\pi} = \sum_{a \in A} \pi(a \mid s)\, R_{s}^{a}$
• Value function for policy: $v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$
• Action-value function for policy: $q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$
• How to compute the best policy?
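This reduction is easy to write down; a minimal numpy sketch, reusing the hypothetical P[a, s, s'] and R[s, a] layout from the earlier sketch, with a stochastic policy stored as pi[s, a]:

```python
import numpy as np

def induced_mrp(P, R, pi):
    """Reduce MDP (P, R) plus stochastic policy pi to the induced MRP (P_pi, R_pi)."""
    # P_pi[s, s'] = sum_a pi(a|s) * P[a, s, s']
    P_pi = np.einsum('sa,asn->sn', pi, P)
    # R_pi[s] = sum_a pi(a|s) * R[s, a]
    R_pi = np.einsum('sa,sa->s', pi, R)
    return P_pi, R_pi
```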
The Bellman equation
• Policies satisfy the Bellman equation:
$v_{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma\, v_{\pi}(S_{t+1}) \mid S_t = s \right] = R_{s}^{\pi} + \gamma \sum_{s'} P_{ss'}^{\pi}\, v_{\pi}(s')$
• And similarly for the value-action function:
$v_{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, q_{\pi}(s, a)$
• Optimal value function and value-action function:
$v^{*}(s) = \max_{\pi} v_{\pi}(s), \qquad q^{*}(s, a) = \max_{\pi} q_{\pi}(s, a)$
• Important: $v^{*}(s) = \max_{a} q^{*}(s, a)$. Why?
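For a fixed policy the Bellman equation is a linear system in $v_{\pi}$, so it can be solved exactly as $v_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}$; a short sketch under the same hypothetical array conventions as above:

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, gamma):
    """Solve the Bellman equation v = R_pi + gamma * P_pi @ v as a linear system."""
    P_pi = np.einsum('sa,asn->sn', pi, P)   # P_pi[s, s'] = sum_a pi(a|s) P[a, s, s']
    R_pi = np.einsum('sa,sa->s', pi, R)     # R_pi[s]     = sum_a pi(a|s) R[s, a]
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```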
Bellman optimality equations
• There exists an optimal policy $\pi^{*}$ (it is deterministic!)
• All optimal policies achieve the same optimal value $v^{*}(s)$ at every state, and the same optimal value-action function $q^{*}(s, a)$ at every state and for every action.
• How can we find it? The Bellman equation $v^{*}(s) = \max_{a} q^{*}(s, a)$ implies the Bellman optimality equations (why?):
$q^{*}(s, a) = R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a} \max_{a'} q^{*}(s', a')$
$v^{*}(s) = \max_{a} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v^{*}(s') \right]$
Non-linear! Solution → optimal policy. Why? (later)
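One way to see the first equation (a brief sketch of the reasoning; the slide leaves it as a question): after taking action $a$ in state $s$, the optimal return is the immediate reward plus the discounted optimal return from the next state, so
$$q^{*}(s, a) = R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v^{*}(s') = R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a} \max_{a'} q^{*}(s', a'),$$
where the last step uses $v^{*}(s') = \max_{a'} q^{*}(s', a')$. Substituting this back into $v^{*}(s) = \max_{a} q^{*}(s, a)$ gives the second equation.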
Algorithms – finding an optimal policy
$q^{*}(s, a) = R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a} \max_{a'} q^{*}(s', a')$
$v^{*}(s) = \max_{a} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v^{*}(s') \right]$
• Iterative methods based on the Bellman equations: dynamic programming
• Policy iteration
• Value iteration
Policy iteration
[Diagram: start from an arbitrary policy, then alternate “evaluate policy” and “improve policy” in a loop; finally compute the resulting policy.]
Policy iteration
• Start with arbitrary policy $\pi_0 : S \mapsto A$
• While (not converged to optimality) do:
• Evaluate current policy:
$v_k = R^{\pi_k} + \gamma P^{\pi_k} v_k$
• Improve policy by greedy step (from some or all states):
$\pi_{k+1}(s) = \arg\max_{a \in A} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v_k(s') \right]$
Policy iteration
• Start with arbitrary policy $\pi_0 : S \mapsto A$
• While (not converged to optimality) do:
• Evaluate current policy: $v_k = R^{\pi_k} + \gamma P^{\pi_k} v_k$
• Improve policy by greedy step (from some or all states):
$\pi_{k+1}(s) = \arg\max_{a \in A} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v_k(s') \right] = \arg\max_{a \in A} q_{\pi_k}(s, a)$
(compute $q_{\pi_k}(s, a)$ from $v_k(s)$)
• Measure of distance to optimality, e.g. $\| v_{k+1} - v_k \|_{\infty}$
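A minimal sketch of the full loop (exact evaluation by solving the linear system, then the greedy improvement step), again assuming the hypothetical P[a, s, s'] and R[s, a] layout; this is an illustration, not the course's reference implementation:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration for a tabular MDP with P[a, s, s'] and R[s, a]."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)            # arbitrary deterministic start
    while True:
        # evaluation: solve v = R^pi + gamma * P^pi v exactly
        P_pi = P[policy, np.arange(n_states)]          # P_pi[s, :] = P[policy[s], s, :]
        R_pi = R[np.arange(n_states), policy]          # R_pi[s]   = R[s, policy[s]]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # greedy improvement: q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] v[s']
        q = R + gamma * np.einsum('asn,n->sa', P, v)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):         # converged: policy is greedy w.r.t. itself
            return policy, v
        policy = new_policy
```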
Policy iteration
• Intuitive
• Provably converging (prove?)
• Computationally somewhat expensive
• Better method with easier convergence analysis next…
Value iteration
[Diagram: start from state values corresponding to an arbitrary policy, repeatedly improve the values, then compute the final policy.]
Value iteration
$v_{k+1}(s) = \max_{a \in A} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v_k(s') \right]$
In matrix form:
$v_{k+1} = \max_{a} \left[ R^{a} + \gamma P^{a} v_k \right]$
• Initialize $v_0(s) = 1$
• While $\| v_{k+1} - v_k \|_{\infty} > \epsilon$ do:
• Update $v_{k+1}(s)$ from $v_k(s)$ for all states
• Compute (near-)optimal policy $\pi(s) = \arg\max_{a \in A} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v_k(s') \right]$
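A minimal sketch of this loop under the same hypothetical array conventions as before:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """Value iteration for a tabular MDP with P[a, s, s'] and R[s, a]."""
    n_actions, n_states, _ = P.shape
    v = np.ones(n_states)                               # v_0(s) = 1, as on the slide
    while True:
        # q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * v[s']
        q = R + gamma * np.einsum('asn,n->sa', P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) <= eps:             # sup-norm stopping rule
            v = v_new
            break
        v = v_new
    # greedy policy with respect to the final value estimate
    policy = (R + gamma * np.einsum('asn,n->sa', P, v)).argmax(axis=1)
    return policy, v
```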
Value iteration
• Faster computationally (no policy till the end)
• Easier convergence analysis:
Theorem: $v^{*}$ is unique, and value iteration converges geometrically:
$\| v_{k+1} - v^{*} \|_{\infty} \le \gamma^{k} \| v_1 - v^{*} \|_{\infty}$
Value iteration - proof
Define the operator $T$ as $T(v) = \max_{a} \left[ R^{a} + \gamma P^{a} v \right]$.
Then $T(v^{*}) = v^{*}$, and
$\| v_{k+1} - v^{*} \|_{\infty} = \| T(v_k) - T(v^{*}) \|_{\infty}$
$= \left\| \max_{a} \left[ R^{a} + \gamma P^{a} v_k \right] - \max_{a} \left[ R^{a} + \gamma P^{a} v^{*} \right] \right\|_{\infty}$
$\le \max_{a} \left\| \left( R^{a} + \gamma P^{a} v_k \right) - \left( R^{a} + \gamma P^{a} v^{*} \right) \right\|_{\infty}$
$= \gamma \max_{a} \| P^{a} (v_k - v^{*}) \|_{\infty} \le \gamma \| v_k - v^{*} \|_{\infty}$
(using $|\max_x f(x) - \max_y g(y)| \le \max_z |f(z) - g(z)|$, and that each $P^{a}$ is a transition matrix, so $\| P^{a} x \|_{\infty} \le \| x \|_{\infty}$)
Value iteration - proof
Thus $\| v_{k+1} - v^{*} \|_{\infty} \le \gamma \| v_k - v^{*} \|_{\infty}$,
and recursively $\| v_{k+1} - v^{*} \|_{\infty} \le \gamma^{k} \| v_1 - v^{*} \|_{\infty}$.
Uniqueness of $v^{*}$?
Value and policy iteration
Exercise: give running time bounds in terms of # of states, actions
Exact computation of optimal policy / values?
$v^{*}(s) = \max_{a} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v^{*}(s') \right]$
Equivalent to: $\forall a \in A.\ v^{*} \ge R^{a} + \gamma P^{a} v^{*}$
Linear program!
Reminder: linear programming
Formulation: $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^{n}$, $b \in \mathbb{R}^{m}$, $m \ge n$,
$A x \ge b$
Special case of convex programming / optimization. Admits polynomial-time algorithms, and practical algorithms such as the simplex method.
Admits approximation algorithms that run in linear time, or even sublinear time $O\!\left(\frac{m+n}{\epsilon^{2}}\right)$.
MDP LP: $\forall a \in A.\ v^{*} \ge R^{a} + \gamma P^{a} v^{*}$
So for us, $n = |S|$, $m = |A| \cdot |S|$. If $\gamma = 1$?
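A sketch of solving this LP with scipy.optimize.linprog, using the standard objective of minimizing $\sum_s v(s)$ subject to the constraints above (that objective is the usual way to turn the inequalities into a program whose solution is exactly $v^{*}$); array layout as in the earlier sketches:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma):
    """Compute v* by LP: minimize sum_s v(s) s.t. v >= R^a + gamma * P^a v for all a."""
    n_actions, n_states, _ = P.shape
    c = np.ones(n_states)                             # objective: minimize sum of values
    A_ub, b_ub = [], []
    for a in range(n_actions):
        # rewrite v >= R[:, a] + gamma P[a] v  as  -(I - gamma P[a]) v <= -R[:, a]
        A_ub.append(-(np.eye(n_states) - gamma * P[a]))
        b_ub.append(-R[:, a])
    res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=[(None, None)] * n_states, method='highs')
    return res.x                                      # the optimal value function v*
```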
Model-free RL
Thus far: assumed we know the transition matrices, rewards, states, and that they are not too large. What if transitions / rewards are:
1. unknown
2. too many to keep in memory / compute over
“Model free” = we do not have the “model” = transition matrix P and reward vector R
Model-free RL
Monte-Carlo policy evaluation: instead of computing, estimate $v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$ by random walk:
• The first time state $s$ is visited, update counter $N(s)$ (increment every time it is visited again)
• Keep track of all rewards from this point onwards
• Estimate of $G_t$ is the sum of rewards divided by $N(s)$
• Claim: this estimator has expectation $v_{\pi}(s)$, and converges to it by the law of large numbers
• Similarly can estimate the value-action function $q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$
• What do we do with the estimated values?
• Policy iteration requires rewards + transitions
• Model-free policy improvement:
$\pi(s) = \arg\max_{a} q_{\pi}(s, a)$
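A minimal sketch of the (every-visit) Monte-Carlo estimator described above; sample_episode is a hypothetical helper, not part of the lecture, that follows the fixed policy in the environment and returns a list of (state, reward) pairs for one episode:

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, n_episodes, gamma):
    """Estimate v_pi(s) by averaging observed discounted returns from each visit to s."""
    returns_sum = defaultdict(float)   # total return observed from each state
    visits = defaultdict(int)          # N(s): number of visits to each state
    for _ in range(n_episodes):
        episode = sample_episode()     # [(S_t, R_{t+1}), ...] under the fixed policy pi
        G = 0.0
        # walk backwards so G accumulates the discounted return from each step
        for state, reward in reversed(episode):
            G = reward + gamma * G
            visits[state] += 1
            returns_sum[state] += G
    # v_pi(s) is estimated by the average return observed from s
    return {s: returns_sum[s] / visits[s] for s in visits}
```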
Summary
• Algorithms for solving MDPs based on dynamic programming (Bellman equation)
• Value iteration, policy iteration
• Proved convergence, convergence rate
• Linear programming view
• Partial observation – estimation of the state-action function (model free)
• Next: function approximation