An Introduction to Reinforcement Learning

Anand Subramoney, anand [at] igi.tugraz.at
Institute for Theoretical Computer Science, TU Graz
http://www.igi.tugraz.at/

Machine Learning Graz Meetup, 12th October 2017
Outline
• Introduction
• Value estimation
• Q-learning
• Policy gradient
• DQN
• A3C
What is Reinforcement Learning?
• An agent learns while interacting with the environment
• The agent receives a "reward" for each action it takes
• The goal of the agent is to maximize the reward it receives
• The agent is not told what the "right" action is, i.e. it is not supervised
Notation
• The state of the environment is $s_t$ at time $t$
  • Examples of state: the (x, y) coordinates, image pixels etc.
• At each time step $t$, the agent takes action $a_t$ (knowing $s_t$)
  • Examples of actions: move right/left/up/down, acceleration of a car etc.
• Then the agent gets a reward $r_t$
  • Could be 0/1 or points in the game
• The agent plays for one "episode"
  • Called "episodic" RL
  • E.g. one game until it wins/loses etc.
  • Non-episodic RL is also possible
Notation
• Model: $\mathcal{P}^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
  • What is the next state given the current state and action taken?
  • The environment can be stochastic, in which case this is a probability distribution
• Reward: $\mathcal{R}^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$
  • Expected value of the reward when going from one state to another taking a certain action
  • In the most general case, the reward is not deterministic
Policy
• The agent has a certain mapping between states and actions
• This is called the policy of the agent
• Denoted by $\pi(s, a)$
• In the stochastic case, it's the probability distribution over actions at a given state: $\pi(s, a) = P(a_t = a \mid s_t = s)$
The goal of reinforcement learning
• Is to find a policy that maximizes the total expected reward
  • also called the "return"
• In an episode:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

• $\gamma$ is called the "discounting factor"
• Small $\gamma$ produces short-sighted, large $\gamma$ far-sighted policies.
• $R_t$ is always finite if $\gamma < 1$ and the local rewards are from a bounded set of numbers.
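As an aside (not from the slides), the recursion $R_t = r_{t+1} + \gamma R_{t+1}$ gives a one-pass way to compute returns from sampled rewards; a minimal Python sketch with illustrative values:

```python
# A minimal sketch: compute the discounted return R_t from a finite list of
# sampled rewards [r_{t+1}, r_{t+2}, ...]. Rewards and gamma are illustrative.
def discounted_return(rewards, gamma=0.9):
    R = 0.0
    # Work backwards through the episode: R_t = r_{t+1} + gamma * R_{t+1}
    for r in reversed(rewards):
        R = r + gamma * R
    return R

print(discounted_return([-0.001, -0.001, 1.0]))  # ~0.808
```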
Example environment

The agent receives -0.001 reward every step. When it reaches the goal or a pit, it obtains rewards of +1.0 or -1.0 respectively, and the episode is terminated.
The goal of reinforcement learning
• How can the agent quantify the desirability of intermediate states (where no, or no relevant, reward is given)?
• The difficulty is that the desirability of an intermediate state depends on:
  • the concrete selection of actions AFTER being in such an intermediate state,
  • AND on the desirability of subsequent intermediate states.
• The value function allows us to do this
The value function
• Defined as:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$$

• The value of a state $s$ is the expected return starting from that state and following policy $\pi$
• Satisfies the Bellman equations
Bellman equation for $V^\pi$:

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$$

• a system of $|S|$ simultaneous linear equations
• Note that it's a recursive formulation of the value function
Example value function
Calculating the value function
• If the model $\mathcal{P}^a_{ss'}$ and reward $\mathcal{R}^a_{ss'}$ are known, calculate $V^\pi(s)$ using iterative policy evaluation.

http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
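A rough illustration of iterative policy evaluation, assuming tabular arrays `P[s, a, s']` (transition probabilities), `R[s, a, s']` (expected rewards) and `pi[s, a]` (policy probabilities); these names are illustrative, not from the slides:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-6):
    V = np.zeros(P.shape[0])
    while True:
        # Bellman backup: V(s) = sum_a pi(s,a) sum_s' P[s,a,s'] (R[s,a,s'] + gamma V(s'))
        V_new = np.einsum('sa,sap,sap->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```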
Why a value function?
• There exists a natural partial order on all possible policies:

$$\pi' \geq \pi \ \text{if and only if} \ V^{\pi'}(s) \geq V^\pi(s) \ \text{for all } s \in S$$

• Definition: a policy $\pi'$ is called optimal if $\pi' \geq \pi$ for all policies $\pi$
• The existence of at least one optimal policy is guaranteed, and optimal policies satisfy the Bellman optimality equations.
The action-value function
• Defined as:

$$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right\}$$

• This is called the "Q function"
• The value of taking action $a$ in state $s$, following policy $\pi$ thereafter
• Also satisfies the Bellman equations:

$$Q^\pi(s, a) = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$$
Finding an optimal policy
• Define a new policy $\pi'$ that is greedy with respect to $V^\pi$:
  • For all states $s$: $\pi'(s) = \arg\max_a Q^\pi(s, a)$ (a small sketch follows below)
• This policy satisfies $Q^\pi(s, \pi'(s)) \geq V^\pi(s)$
• It can be shown that:
  • $\pi' \geq \pi$ for $\gamma < 1$
  • Repeating this improvement step eventually converges to an optimal policy
• This works only if $V^\pi(s)$ can be calculated
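A minimal sketch of the greedy step, assuming a tabular `Q` of shape (n_states, n_actions); illustrative, not from the slides:

```python
import numpy as np

def greedy_policy(Q):
    # pi'(s) = argmax_a Q(s, a), computed for all states at once
    return np.argmax(Q, axis=1)
```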
Other ways to calculate V/Q
• Monte Carlo policy evaluation
  • Sample one episode and update the value function for each state:
    $V(s_t) \leftarrow V(s_t) + \alpha (R_t - V(s_t))$
  • Asymptotically converges to the true value function
• Temporal Difference (TD) learning (sketched in code below)
  • For each step of each episode:
    • Take action $a$, observe reward $r_{t+1}$ and next state $s_{t+1}$
    • $V(s_t) \leftarrow V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1}) - V(s_t))$
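The TD(0) update in code, assuming `V` is a mutable array (or dict) of state values; names and constants are illustrative:

```python
def td0_update(V, s_t, r_next, s_next, alpha=0.1, gamma=0.9):
    # V(s_t) <- V(s_t) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(s_t))
    td_error = r_next + gamma * V[s_next] - V[s_t]
    V[s_t] += alpha * td_error
```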
Temporal Difference
Learning the Q-function (SARSA)
• Q can be used to define a policy:
  • take action $a = \arg\max_a Q(s, a)$ at every state with probability $1 - \epsilon$
  • with probability $\epsilon$ take a random action (exploration)
• Use temporal difference learning to learn the Q-function
  • For each step of each episode:
    • Take action $a$, observe reward $r_{t+1}$ and next state $s_{t+1}$
    • $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))$
• The action $a_{t+1}$ used for learning can be taken from this same policy
• Called SARSA (a sketch follows below)
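A minimal SARSA sketch with an epsilon-greedy policy; it assumes a tabular `Q` array and a Gym-style `env` with `reset()`/`step()` (an assumption, the slides name no API):

```python
import numpy as np

def epsilon_greedy(Q, s, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])  # explore
    return int(np.argmax(Q[s]))               # exploit

def sarsa_episode(env, Q, alpha=0.1, gamma=0.9, eps=0.1):
    s = env.reset()
    a = epsilon_greedy(Q, s, eps)
    done = False
    while not done:
        s_next, r, done, _ = env.step(a)
        a_next = epsilon_greedy(Q, s_next, eps)
        # On-policy: the bootstrap term uses the action actually taken next
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next
```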
Q-learning
• Use temporal difference learning to learn the Q-function
  • For each step of each episode:
    • Take action $a$, observe reward $r_{t+1}$ and next state $s_{t+1}$
    • $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t))$
• For convergence to the optimal policy, Q-learning requires that rewards are sampled for each pair $(s, a)$ infinitely often (a sketch follows below).

http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html
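For comparison, a Q-learning sketch under the same illustrative assumptions as the SARSA code above; the only change is the off-policy bootstrap term, $\max_a Q(s', a)$:

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.9, eps=0.1):
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, eps)
        s_next, r, done, _ = env.step(a)
        # Off-policy: bootstrap with the greedy value, not the action taken next
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
```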
Function approximation
• The Q-function can be approximated with a neural network (or any other function approximator)
• The targets for the network would be $r_{t+1} + \gamma \max_a Q(s_{t+1}, a)$
• Train the neural network with backpropagation
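One way this might look with a small network; PyTorch and all sizes here are assumptions, not from the slides:

```python
import torch
import torch.nn as nn

# Illustrative Q-network: 4-dimensional states, 2 actions
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_loss(s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s_t, a_t)
    with torch.no_grad():  # the target r + gamma max_a Q(s', a) is held fixed
        target = r + gamma * q_net(s_next).max(dim=1).values * (1 - done)
    return nn.functional.mse_loss(q_sa, target)
```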
The goal of reinforcement learning (repeated)
• Is to find a policy that maximizes the total expected reward
  • also called the "return"

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

• $\gamma$ is called the "discounting factor"
• Small $\gamma$ produces short-sighted, large $\gamma$ far-sighted policies.
• $R_t$ is always finite if $\gamma < 1$ and the local rewards are from a bounded set of numbers.
Policy Gradient
• Why not learn the policy directly?
• Define the cost function as the total expected reward:

$$J(\theta) = E\left\{\sum_{k=0}^{\infty} a_k r_k\right\} = E\{r(\tau)\}$$

• $a_k$ is some discounting factor
• $r_k$ is the reward at step $k$
• $\tau$ is a trajectory and $r(\tau) = \sum_{k=0}^{\infty} a_k r_k$
• Learn this using gradient ascent:

$$\theta_{t+1} = \theta_t + \eta \nabla_\theta J(\theta)$$

• Problems?
  • The gradient of $J$ cannot be computed directly
Policy Gradient
• It is possible to empirically estimate the gradient (Williams 1992):

$$\nabla_\theta J(\theta) = E\{\nabla_\theta \log p_\theta(\tau) \, (r(\tau) - b)\} \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, (R_t - b)$$

• Uses the log-likelihood trick (or REINFORCE trick)
• The baseline $b$ is used to reduce the variance of the gradient estimator
• The baseline doesn't introduce bias
• DEMO
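A minimal REINFORCE sketch (an assumption; the slides give no code). `policy_net` maps a batch of states to action logits; minimizing the surrogate loss below performs gradient ascent on $J(\theta)$:

```python
import torch

def reinforce_update(policy_net, optimizer, states, actions, returns, baseline=0.0):
    logits = policy_net(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Gradient of this loss is -sum_t grad log pi_theta(a_t|s_t) (R_t - b)
    loss = -(log_probs * (returns - baseline)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```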
DQN and A3C
DQN
• Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
• Uses a deep neural network to learn the Q-values
DQN: Two key ideas
• Experience replay:
  • Store earlier transitions and apply Q-learning updates on random minibatches sampled from this memory (a sketch follows below)
• Compute the targets with a separate target network that is updated only once every C steps
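A minimal experience-replay sketch; the capacity and batch size are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random minibatches break correlations between consecutive steps
        return random.sample(self.buffer, batch_size)
```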
DQN
A3C
• Mnih, V. et al. Asynchronous Methods for Deep Reinforcement Learning. arXiv:1602.01783 [cs] (2016).
• A3C: Asynchronous Advantage Actor-Critic
• Uses policy gradient with a baseline that is the value function:
$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, (R_t - V(s_t))$$
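A rough sketch of the corresponding loss, assuming separate `actor` (action logits) and `critic` ($V(s)$) networks; this is an illustration, not the authors' implementation:

```python
import torch

def a2c_loss(actor, critic, states, actions, returns):
    values = critic(states).squeeze(-1)                 # V(s_t)
    advantages = returns - values.detach()              # R_t - V(s_t)
    log_probs = torch.distributions.Categorical(
        logits=actor(states)).log_prob(actions)
    policy_loss = -(log_probs * advantages).sum()       # policy-gradient term
    value_loss = torch.nn.functional.mse_loss(values, returns)  # critic fit
    return policy_loss + value_loss
```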
Advantage Actor-Critic
A3C
Resources
• Book: Reinforcement Learning: An Introduction, Richard Sutton and Andrew Barto
  • Available online: http://www.incompleteideas.net/sutton/book/the-book-1st.html
• Course: Autonomously Learning Systems, IGI TU Graz
  • 2016 website: http://www.igi.tugraz.at/lehre/Autonomously_learning_systems/WS16/
  • Next course in 2018
  • Lecture slides available there
• DQN: https://deepmind.com/research/dqn/
• OpenAI Gym: https://gym.openai.com/envs
• Deep Reinforcement Learning: Pong from Pixels (Andrej Karpathy): https://karpathy.github.io/2016/05/31/rl/
• Book: Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville
  • Available online: http://www.deeplearningbook.org
• RLPy: https://rlpy.readthedocs.io/en/latest/ (Python 2.7 only)