Lecture 19: Reinforcement Learning – part II (RL algorithms)
Sanjeev Arora, Elad Hazan
COS 402 – Machine Learning and Artificial Intelligence, Fall 2016
Admin
• (Programming) exercise MCMC – due next class
• We will have at least 1 more exercise on RL
• Last lecture of the course: “ask us anything”, Prof. Arora + myself. Exercise: submit a question the lecture before (graded)
Markov Decision Process
Markov Decision Process, definition:
• Tuple $(S, P, R, A, \gamma)$ where
• $S$ = states, including start state
• $A$ = set of possible actions
• $P$ = transition matrix, $P_{ss'}^{a} = \Pr[S_{t+1} = s' \mid S_t = s, A_t = a]$
• $R$ = reward function, $R_{s}^{a} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
• $\gamma \in [0,1]$ = discount factor
• Return $G_t = \sum_{k=1}^{\infty} \gamma^{k-1} R_{t+k}$
• Goal: take actions to maximize expected return
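A concrete way to hold such a tuple in code (not from the lecture; the 2-state, 2-action numbers below are invented purely for illustration) is a pair of numpy arrays indexed as P[a, s, s'] and R[s, a]:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
n_states, n_actions = 2, 2

# P[a, s, s'] = Pr[S_{t+1} = s' | S_t = s, A_t = a]; each row is a distribution.
P = np.array([
    [[0.9, 0.1],    # action 0 taken in state 0
     [0.2, 0.8]],   # action 0 taken in state 1
    [[0.5, 0.5],    # action 1 taken in state 0
     [0.0, 1.0]],   # action 1 taken in state 1
])

# R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

gamma = 0.9  # discount factor

assert np.allclose(P.sum(axis=2), 1.0)  # sanity check: transition rows sum to 1
```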
Policies
The Markovian structure ⇒ the best action depends only on the current state!
• Policy = mapping from state to distribution over actions, $\pi : S \mapsto \Delta(A)$, $\pi(a \mid s) = \Pr[A_t = a \mid S_t = s]$
• Given a policy, the MDP reduces to a Markov Reward Process
Reminder: MDP 1
[Figure: grid-world MDP with a START state. Rewards: +1 at cell [4,3], −1 at cell [4,2], and −0.04 for each step. Actions: UP, DOWN, LEFT, RIGHT. Actions are noisy; e.g. UP moves up with probability 80%, left with probability 10%, right with probability 10%.]
• States
• Actions
• Rewards
• Policies?
Reminder 2
• States? Actions? Rewards? Policy?
Policies: action = study
[Figure: the student MDP with states Class (R = 2), Party (R = 1), Sleep (R = 0) and Facebook (R = −1), drawn under the policy that always chooses the action “study”; edges are labeled with the resulting transition probabilities.]
Policies: action = facebook
[Figure: the same student MDP, with states Class (R = 2), Party (R = 1), Sleep (R = 0) and Facebook (R = −1), drawn under the policy that always chooses the action “facebook”; edges are labeled with the resulting transition probabilities.]
Fixed policy → Markov Reward Process
• Given a policy, the MDP reduces to a Markov Reward Process
• $P_{ss'}^{\pi} = \sum_{a \in A} \pi(a \mid s)\, P_{ss'}^{a}$
• $R_{s}^{\pi} = \sum_{a \in A} \pi(a \mid s)\, R_{s}^{a}$
• Value function for policy: $v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$
• Action-value function for policy: $q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$
• How to compute the best policy?
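This reduction is easy to write down; a minimal numpy sketch, reusing the hypothetical P[a, s, s'] and R[s, a] layout from the earlier sketch, with a stochastic policy stored as pi[s, a]:

```python
import numpy as np

def induced_mrp(P, R, pi):
    """Reduce MDP (P, R) plus stochastic policy pi to the induced MRP (P_pi, R_pi)."""
    # P_pi[s, s'] = sum_a pi(a|s) * P[a, s, s']
    P_pi = np.einsum('sa,asn->sn', pi, P)
    # R_pi[s] = sum_a pi(a|s) * R[s, a]
    R_pi = np.einsum('sa,sa->s', pi, R)
    return P_pi, R_pi
```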
The Bellman equation
• Policies satisfy the Bellman equation:
$v_{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma\, v_{\pi}(S_{t+1}) \mid S_t = s \right] = R_{s}^{\pi} + \gamma \sum_{s'} P_{ss'}^{\pi}\, v_{\pi}(s')$
• And similarly for the value-action function:
$v_{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, q_{\pi}(s, a)$
• Optimal value function and value-action function:
$v^{*}(s) = \max_{\pi} v_{\pi}(s), \qquad q^{*}(s, a) = \max_{\pi} q_{\pi}(s, a)$
• Important: $v^{*}(s) = \max_{a} q^{*}(s, a)$. Why?
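For a fixed policy the Bellman equation is a linear system in $v_{\pi}$, so it can be solved exactly as $v_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}$; a short sketch under the same hypothetical array conventions as above:

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, gamma):
    """Solve the Bellman equation v = R_pi + gamma * P_pi @ v as a linear system."""
    P_pi = np.einsum('sa,asn->sn', pi, P)   # P_pi[s, s'] = sum_a pi(a|s) P[a, s, s']
    R_pi = np.einsum('sa,sa->s', pi, R)     # R_pi[s]     = sum_a pi(a|s) R[s, a]
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```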
Bellman optimality equations
• There exists an optimal policy $\pi^{*}$ (it is deterministic!)
• All optimal policies achieve the same optimal value $v^{*}(s)$ at every state, and the same optimal value-action function $q^{*}(s, a)$ at every state and for every action.
• How can we find it? The Bellman equation $v^{*}(s) = \max_{a} q^{*}(s, a)$ implies the Bellman optimality equations (why?):
$q^{*}(s, a) = R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a} \max_{a'} q^{*}(s', a')$
$v^{*}(s) = \max_{a} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v^{*}(s') \right]$
Non-linear! Solution → optimal policy. Why? (later)
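One way to see the first equation (a brief sketch of the reasoning; the slide leaves it as a question): after taking action $a$ in state $s$, the optimal return is the immediate reward plus the discounted optimal return from the next state, so
$$q^{*}(s, a) = R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v^{*}(s') = R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a} \max_{a'} q^{*}(s', a'),$$
where the last step uses $v^{*}(s') = \max_{a'} q^{*}(s', a')$. Substituting this back into $v^{*}(s) = \max_{a} q^{*}(s, a)$ gives the second equation.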
Algorithms – finding an optimal policy
$q^{*}(s, a) = R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a} \max_{a'} q^{*}(s', a')$
$v^{*}(s) = \max_{a} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v^{*}(s') \right]$
• Iterative methods based on the Bellman equations: dynamic programming
• Policy iteration
• Value iteration
Policy iteration
[Diagram: start from an arbitrary policy, then alternate “evaluate policy” and “improve policy” in a loop; finally compute the resulting policy.]
Policy iteration
• Start with arbitrary policy $\pi_0 : S \mapsto A$
• While (not converged to optimality) do:
• Evaluate current policy:
$v_k = R^{\pi_k} + \gamma P^{\pi_k} v_k$
• Improve policy by greedy step (from some or all states):
$\pi_{k+1}(s) = \arg\max_{a \in A} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v_k(s') \right]$
Policy iteration
• Start with arbitrary policy $\pi_0 : S \mapsto A$
• While (not converged to optimality) do:
• Evaluate current policy: $v_k = R^{\pi_k} + \gamma P^{\pi_k} v_k$
• Improve policy by greedy step (from some or all states):
$\pi_{k+1}(s) = \arg\max_{a \in A} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v_k(s') \right] = \arg\max_{a \in A} q_{\pi_k}(s, a)$
(compute $q_{\pi_k}(s, a)$ from $v_k(s)$)
• Measure of distance to optimality, e.g. $\| v_{k+1} - v_k \|_{\infty}$
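A minimal sketch of the full loop (exact evaluation by solving the linear system, then the greedy improvement step), again assuming the hypothetical P[a, s, s'] and R[s, a] layout; this is an illustration, not the course's reference implementation:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration for a tabular MDP with P[a, s, s'] and R[s, a]."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)            # arbitrary deterministic start
    while True:
        # evaluation: solve v = R^pi + gamma * P^pi v exactly
        P_pi = P[policy, np.arange(n_states)]          # P_pi[s, :] = P[policy[s], s, :]
        R_pi = R[np.arange(n_states), policy]          # R_pi[s]   = R[s, policy[s]]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # greedy improvement: q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] v[s']
        q = R + gamma * np.einsum('asn,n->sa', P, v)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):         # converged: policy is greedy w.r.t. itself
            return policy, v
        policy = new_policy
```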
Policy iteration
• Intuitive
• Provably converging (prove?)
• Computationally somewhat expensive
• Better method with easier convergence analysis next…
Value iteration
[Diagram: start from state values corresponding to an arbitrary policy, repeatedly improve the values, then compute the final policy.]
Value iteration
$v_{k+1}(s) = \max_{a \in A} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v_k(s') \right]$
In matrix form:
$v_{k+1} = \max_{a} \left[ R^{a} + \gamma P^{a} v_k \right]$
• Initialize $v_0(s) = 1$
• While $\| v_{k+1} - v_k \|_{\infty} > \epsilon$ do:
• Update $v_{k+1}(s)$ from $v_k(s)$ for all states
• Compute (near-)optimal policy $\pi(s) = \arg\max_{a \in A} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v_k(s') \right]$
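A minimal sketch of this loop under the same hypothetical array conventions as before:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """Value iteration for a tabular MDP with P[a, s, s'] and R[s, a]."""
    n_actions, n_states, _ = P.shape
    v = np.ones(n_states)                               # v_0(s) = 1, as on the slide
    while True:
        # q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * v[s']
        q = R + gamma * np.einsum('asn,n->sa', P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) <= eps:             # sup-norm stopping rule
            v = v_new
            break
        v = v_new
    # greedy policy with respect to the final value estimate
    policy = (R + gamma * np.einsum('asn,n->sa', P, v)).argmax(axis=1)
    return policy, v
```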
Value iteration
• Faster computationally (no policy till the end)
• Easier convergence analysis:
Theorem: $v^{*}$ is unique, and value iteration converges geometrically:
$\| v_{k+1} - v^{*} \|_{\infty} \le \gamma^{k} \| v_1 - v^{*} \|_{\infty}$
Value iteration - proof
Define the operator $T$ as $T(v) = \max_{a} \left[ R^{a} + \gamma P^{a} v \right]$.
Then $T(v^{*}) = v^{*}$, and
$\| v_{k+1} - v^{*} \|_{\infty} = \| T(v_k) - T(v^{*}) \|_{\infty}$
$= \left\| \max_{a} \left[ R^{a} + \gamma P^{a} v_k \right] - \max_{a} \left[ R^{a} + \gamma P^{a} v^{*} \right] \right\|_{\infty}$
$\le \max_{a} \left\| \left( R^{a} + \gamma P^{a} v_k \right) - \left( R^{a} + \gamma P^{a} v^{*} \right) \right\|_{\infty}$
$= \gamma \max_{a} \| P^{a} (v_k - v^{*}) \|_{\infty} \le \gamma \| v_k - v^{*} \|_{\infty}$
(using $|\max_x f(x) - \max_y g(y)| \le \max_z |f(z) - g(z)|$, and that each $P^{a}$ is a transition matrix, so $\| P^{a} x \|_{\infty} \le \| x \|_{\infty}$)
Value iteration - proof
Thus $\| v_{k+1} - v^{*} \|_{\infty} \le \gamma \| v_k - v^{*} \|_{\infty}$,
and recursively $\| v_{k+1} - v^{*} \|_{\infty} \le \gamma^{k} \| v_1 - v^{*} \|_{\infty}$.
Uniqueness of $v^{*}$?
Value and policy iteration
Exercise: give running time bounds in terms of # of states, actions
Exact computation of optimal policy / values?
$v^{*}(s) = \max_{a} \left[ R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a}\, v^{*}(s') \right]$
Equivalent to: $\forall a \in A.\ v^{*} \ge R^{a} + \gamma P^{a} v^{*}$
Linear program!
Reminder: linear programming
Formulation: $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^{n}$, $b \in \mathbb{R}^{m}$, $m \ge n$,
$A x \ge b$
Special case of convex programming / optimization. Admits polynomial-time algorithms, and practical algorithms such as the simplex method.
Admits approximation algorithms that run in linear time, or even sublinear time $O\!\left(\frac{m+n}{\epsilon^{2}}\right)$.
MDP LP: $\forall a \in A.\ v^{*} \ge R^{a} + \gamma P^{a} v^{*}$
So for us, $n = |S|$, $m = |A| \cdot |S|$. If $\gamma = 1$?
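A sketch of solving this LP with scipy.optimize.linprog, using the standard objective of minimizing $\sum_s v(s)$ subject to the constraints above (that objective is the usual way to turn the inequalities into a program whose solution is exactly $v^{*}$); array layout as in the earlier sketches:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma):
    """Compute v* by LP: minimize sum_s v(s) s.t. v >= R^a + gamma * P^a v for all a."""
    n_actions, n_states, _ = P.shape
    c = np.ones(n_states)                             # objective: minimize sum of values
    A_ub, b_ub = [], []
    for a in range(n_actions):
        # rewrite v >= R[:, a] + gamma P[a] v  as  -(I - gamma P[a]) v <= -R[:, a]
        A_ub.append(-(np.eye(n_states) - gamma * P[a]))
        b_ub.append(-R[:, a])
    res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=[(None, None)] * n_states, method='highs')
    return res.x                                      # the optimal value function v*
```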
Model-free RL
Thus far: assumed we know the transition matrices, rewards, states, and that they are not too large. What if transitions / rewards are:
1. unknown
2. too many to keep in memory / compute over
“Model free” = we do not have the “model” = transition matrix P and reward vector R
Model-free RL
Monte-Carlo policy evaluation: instead of computing, estimate $v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$ by random walk:
• The first time state $s$ is visited, update counter $N(s)$ (increment every time it is visited again)
• Keep track of all rewards from this point onwards
• Estimate of $G_t$ is the sum of rewards divided by $N(s)$
• Claim: this estimator has expectation $v_{\pi}(s)$, and converges to it by the law of large numbers
• Similarly can estimate the value-action function $q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$
• What do we do with the estimated values?
• Policy iteration requires rewards + transitions
• Model-free policy improvement:
$\pi(s) = \arg\max_{a} q_{\pi}(s, a)$
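A minimal sketch of the (every-visit) Monte-Carlo estimator described above; sample_episode is a hypothetical helper, not part of the lecture, that follows the fixed policy in the environment and returns a list of (state, reward) pairs for one episode:

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, n_episodes, gamma):
    """Estimate v_pi(s) by averaging observed discounted returns from each visit to s."""
    returns_sum = defaultdict(float)   # total return observed from each state
    visits = defaultdict(int)          # N(s): number of visits to each state
    for _ in range(n_episodes):
        episode = sample_episode()     # [(S_t, R_{t+1}), ...] under the fixed policy pi
        G = 0.0
        # walk backwards so G accumulates the discounted return from each step
        for state, reward in reversed(episode):
            G = reward + gamma * G
            visits[state] += 1
            returns_sum[state] += G
    # v_pi(s) is estimated by the average return observed from s
    return {s: returns_sum[s] / visits[s] for s in visits}
```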
Summary
• Algorithms for solving MDPs based on dynamic programming (Bellman equation)
• Value iteration, policy iteration
• Proved convergence, convergence rate
• Linear programming view
• Partial observation – estimation of the state-action function (model free)
• Next: function approximation