An Introduction to Reinforcement Learning

Anand Subramoney, anand [at] igi.tugraz.at
Institute for Theoretical Computer Science, TU Graz
http://www.igi.tugraz.at/

Machine Learning Graz Meetup, 12th October 2017
Outline
• Introduction
• Value estimation
• Q-learning
• Policy gradient
• DQN
• A3C
What is Reinforcement Learning?
• An agent learns while interacting with the environment
• The agent receives a "reward" for each action it takes
• The goal of the agent is to maximize the reward it receives
• The agent is not told what the "right" action is, i.e. it is not supervised
Notation
• The state of the environment is $s_t$ at time $t$
  • Examples of state: the (x, y) coordinates, image pixels etc.
• At each time step $t$, the agent takes action $a_t$ (knowing $s_t$)
  • Examples of actions: move right/left/up/down, acceleration of a car etc.
• Then the agent gets a reward $r_t$
  • Could be 0/1 or points in the game
• The agent plays for one "episode"
  • Called "episodic" RL
  • E.g. one game until it wins/loses etc.
  • Non-episodic RL is also possible
Notation
• Model: $\mathcal{P}^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
  • What is the next state given the current state and action taken?
  • The environment can be stochastic, in which case this is a probability distribution
• Reward: $\mathcal{R}^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$
  • Expected value of the reward when going from one state to another taking a certain action
  • In the most general case, the reward is not deterministic
Policy
• The agent has a certain mapping between states and actions
• This is called the policy of the agent
• Denoted by $\pi(s, a)$
• In the stochastic case, it's the probability distribution over actions at a given state: $\pi(s, a) = P(a_t = a \mid s_t = s)$
The goal of reinforcement learning
• Is to find a policy that maximizes the total expected reward
  • also called the "return"
• In an episode:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

• $\gamma$ is called the "discounting factor"
• Small $\gamma$ produces short-sighted, large $\gamma$ far-sighted policies.
• $R_t$ is always finite if $\gamma < 1$ and the local rewards are from a bounded set of numbers.
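As an aside (not from the slides), the recursion $R_t = r_{t+1} + \gamma R_{t+1}$ gives a one-pass way to compute returns from sampled rewards; a minimal Python sketch with illustrative values:

```python
# A minimal sketch: compute the discounted return R_t from a finite list of
# sampled rewards [r_{t+1}, r_{t+2}, ...]. Rewards and gamma are illustrative.
def discounted_return(rewards, gamma=0.9):
    R = 0.0
    # Work backwards through the episode: R_t = r_{t+1} + gamma * R_{t+1}
    for r in reversed(rewards):
        R = r + gamma * R
    return R

print(discounted_return([-0.001, -0.001, 1.0]))  # ~0.808
```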
Example environment

The agent receives -0.001 reward every step. When it reaches the goal or a pit, it obtains rewards of +1.0 or -1.0 respectively, and the episode is terminated.
The goal of reinforcement learning
• How can the agent quantify the desirability of intermediate states (where no, or no relevant, reward is given)?
• The difficulty is that the desirability of an intermediate state depends on:
  • the concrete selection of actions AFTER being in such an intermediate state,
  • AND on the desirability of subsequent intermediate states.
• The value function allows us to do this
The value function
• Defined as:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$$

• The value of a state $s$ is the expected return starting from that state and following policy $\pi$
• Satisfies the Bellman equations
Bellman equation for $V^\pi$:

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$$

• a system of $|S|$ simultaneous linear equations
• Note that it's a recursive formulation of the value function
Example value function
Calculating the value function
• If the model $\mathcal{P}^a_{ss'}$ and reward $\mathcal{R}^a_{ss'}$ are known, calculate $V^\pi(s)$ using iterative policy evaluation.

http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
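A rough illustration of iterative policy evaluation, assuming tabular arrays `P[s, a, s']` (transition probabilities), `R[s, a, s']` (expected rewards) and `pi[s, a]` (policy probabilities); these names are illustrative, not from the slides:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-6):
    V = np.zeros(P.shape[0])
    while True:
        # Bellman backup: V(s) = sum_a pi(s,a) sum_s' P[s,a,s'] (R[s,a,s'] + gamma V(s'))
        V_new = np.einsum('sa,sap,sap->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```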
Why a value function?
• There exists a natural partial order on all possible policies:

$$\pi' \geq \pi \ \text{if and only if} \ V^{\pi'}(s) \geq V^\pi(s) \ \text{for all } s \in S$$

• Definition: a policy $\pi'$ is called optimal if $\pi' \geq \pi$ for all policies $\pi$
• The existence of at least one optimal policy is guaranteed, and optimal policies satisfy the Bellman optimality equations.
The action-value function
• Defined as:

$$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right\}$$

• This is called the "Q function"
• The value of taking action $a$ in state $s$, following policy $\pi$ thereafter
• Also satisfies the Bellman equations:

$$Q^\pi(s, a) = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$$
Finding an optimal policy
• Define a new policy $\pi'$ that is greedy with respect to $V^\pi$:
  • For all states $s$: $\pi'(s) = \arg\max_a Q^\pi(s, a)$ (a small sketch follows below)
• This policy satisfies $Q^\pi(s, \pi'(s)) \geq V^\pi(s)$
• It can be shown that:
  • $\pi' \geq \pi$ for $\gamma < 1$
  • Repeating this improvement step eventually converges to an optimal policy
• This works only if $V^\pi(s)$ can be calculated
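A minimal sketch of the greedy step, assuming a tabular `Q` of shape (n_states, n_actions); illustrative, not from the slides:

```python
import numpy as np

def greedy_policy(Q):
    # pi'(s) = argmax_a Q(s, a), computed for all states at once
    return np.argmax(Q, axis=1)
```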
Other ways to calculate V/Q
• Monte Carlo policy evaluation
  • Sample one episode and update the value function for each state:
    $V(s_t) \leftarrow V(s_t) + \alpha (R_t - V(s_t))$
  • Asymptotically converges to the true value function
• Temporal Difference (TD) learning (sketched in code below)
  • For each step of each episode:
    • Take action $a$, observe reward $r_{t+1}$ and next state $s_{t+1}$
    • $V(s_t) \leftarrow V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1}) - V(s_t))$
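The TD(0) update in code, assuming `V` is a mutable array (or dict) of state values; names and constants are illustrative:

```python
def td0_update(V, s_t, r_next, s_next, alpha=0.1, gamma=0.9):
    # V(s_t) <- V(s_t) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(s_t))
    td_error = r_next + gamma * V[s_next] - V[s_t]
    V[s_t] += alpha * td_error
```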
Temporal Difference
Learning the Q-function (SARSA)
• Q can be used to define a policy:
  • take action $a = \arg\max_a Q(s, a)$ at every state with probability $1 - \epsilon$
  • with probability $\epsilon$ take a random action (exploration)
• Use temporal difference learning to learn the Q-function
  • For each step of each episode:
    • Take action $a$, observe reward $r_{t+1}$ and next state $s_{t+1}$
    • $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))$
• The action $a_{t+1}$ used for learning can be taken from this same policy
• Called SARSA (a sketch follows below)
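A minimal SARSA sketch with an epsilon-greedy policy; it assumes a tabular `Q` array and a Gym-style `env` with `reset()`/`step()` (an assumption, the slides name no API):

```python
import numpy as np

def epsilon_greedy(Q, s, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])  # explore
    return int(np.argmax(Q[s]))               # exploit

def sarsa_episode(env, Q, alpha=0.1, gamma=0.9, eps=0.1):
    s = env.reset()
    a = epsilon_greedy(Q, s, eps)
    done = False
    while not done:
        s_next, r, done, _ = env.step(a)
        a_next = epsilon_greedy(Q, s_next, eps)
        # On-policy: the bootstrap term uses the action actually taken next
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next
```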
Q-learning
• Use temporal difference learning to learn the Q-function
  • For each step of each episode:
    • Take action $a$, observe reward $r_{t+1}$ and next state $s_{t+1}$
    • $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t))$
• For convergence to the optimal policy, Q-learning requires that rewards are sampled for each pair $(s, a)$ infinitely often (a sketch follows below).

http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html
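For comparison, a Q-learning sketch under the same illustrative assumptions as the SARSA code above; the only change is the off-policy bootstrap term, $\max_a Q(s', a)$:

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.9, eps=0.1):
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, eps)
        s_next, r, done, _ = env.step(a)
        # Off-policy: bootstrap with the greedy value, not the action taken next
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
```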
Function approximation
• The Q-function can be approximated with a neural network (or any other function approximator)
• The targets for the network would be $r_{t+1} + \gamma \max_a Q(s_{t+1}, a)$
• Train the neural network with backpropagation
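One way this might look with a small network; PyTorch and all sizes here are assumptions, not from the slides:

```python
import torch
import torch.nn as nn

# Illustrative Q-network: 4-dimensional states, 2 actions
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_loss(s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s_t, a_t)
    with torch.no_grad():  # the target r + gamma max_a Q(s', a) is held fixed
        target = r + gamma * q_net(s_next).max(dim=1).values * (1 - done)
    return nn.functional.mse_loss(q_sa, target)
```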
The goal of reinforcement learning (repeated)
• Is to find a policy that maximizes the total expected reward
  • also called the "return"

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

• $\gamma$ is called the "discounting factor"
• Small $\gamma$ produces short-sighted, large $\gamma$ far-sighted policies.
• $R_t$ is always finite if $\gamma < 1$ and the local rewards are from a bounded set of numbers.
Policy Gradient
• Why not learn the policy directly?
• Define the cost function as the total expected reward:

$$J(\theta) = E\left\{\sum_{k=0}^{\infty} a_k r_k\right\} = E\{r(\tau)\}$$

• $a_k$ is some discounting factor
• $r_k$ is the reward at step $k$
• $\tau$ is a trajectory and $r(\tau) = \sum_{k=0}^{\infty} a_k r_k$
• Learn this using gradient ascent:

$$\theta_{t+1} = \theta_t + \eta \nabla_\theta J(\theta)$$

• Problems?
  • The gradient of $J$ cannot be computed directly
Policy Gradient
• It is possible to empirically estimate the gradient (Williams 1992):

$$\nabla_\theta J(\theta) = E\{\nabla_\theta \log p_\theta(\tau) \, (r(\tau) - b)\} \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, (R_t - b)$$

• Uses the log-likelihood trick (or REINFORCE trick)
• The baseline $b$ is used to reduce the variance of the gradient estimator
• The baseline doesn't introduce bias
• DEMO
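A minimal REINFORCE sketch (an assumption; the slides give no code). `policy_net` maps a batch of states to action logits; minimizing the surrogate loss below performs gradient ascent on $J(\theta)$:

```python
import torch

def reinforce_update(policy_net, optimizer, states, actions, returns, baseline=0.0):
    logits = policy_net(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Gradient of this loss is -sum_t grad log pi_theta(a_t|s_t) (R_t - b)
    loss = -(log_probs * (returns - baseline)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```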
DQN and A3C
DQN
• Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
• Uses a deep neural network to learn the Q-values
DQN: Two key ideas
• Experience replay:
  • Store earlier transitions and apply Q-learning updates on random minibatches sampled from this memory (a sketch follows below)
• Compute the targets with a separate target network that is updated only once every C steps
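A minimal experience-replay sketch; the capacity and batch size are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random minibatches break correlations between consecutive steps
        return random.sample(self.buffer, batch_size)
```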
DQN
A3C
• Mnih, V. et al. Asynchronous Methods for Deep Reinforcement Learning. arXiv:1602.01783 [cs] (2016).
• A3C: Asynchronous Advantage Actor-Critic
• Uses policy gradient with a baseline that is the value function:
$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, (R_t - V(s_t))$$
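A rough sketch of the corresponding loss, assuming separate `actor` (action logits) and `critic` ($V(s)$) networks; this is an illustration, not the authors' implementation:

```python
import torch

def a2c_loss(actor, critic, states, actions, returns):
    values = critic(states).squeeze(-1)                 # V(s_t)
    advantages = returns - values.detach()              # R_t - V(s_t)
    log_probs = torch.distributions.Categorical(
        logits=actor(states)).log_prob(actions)
    policy_loss = -(log_probs * advantages).sum()       # policy-gradient term
    value_loss = torch.nn.functional.mse_loss(values, returns)  # critic fit
    return policy_loss + value_loss
```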
Advantage Actor-Critic
A3C
Resources
• Book: Reinforcement Learning: An Introduction, Richard Sutton and Andrew Barto
  • Available online: http://www.incompleteideas.net/sutton/book/the-book-1st.html
• Course: Autonomously Learning Systems, IGI TU Graz
  • 2016 website: http://www.igi.tugraz.at/lehre/Autonomously_learning_systems/WS16/
  • Next course in 2018
  • Lecture slides available there
• DQN: https://deepmind.com/research/dqn/
• OpenAI Gym: https://gym.openai.com/envs
• Deep Reinforcement Learning: Pong from Pixels (Andrej Karpathy): https://karpathy.github.io/2016/05/31/rl/
• Book: Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville
  • Available online: http://www.deeplearningbook.org
• RLPy: https://rlpy.readthedocs.io/en/latest/ (Python 2.7 only)