Scaling up RL with function approximation
Human-level control through deep reinforcement learning, Mnih et al., Nature 518, Feb 2015. http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
• pixel input
• 18 joystick/button positions as output
• change in game score as feedback
• convolutional net representing Q
• backpropagation for training!
human-level game control
neural network
convolution, weight sharing, and pooling
[figure: pixel input → shared feature detector / kernel / filter w → feature map → pooled via max(window)]
fewer parameters due to sharing and pooling!
reverse projections of neuron outputs in pixel space
what does a deep neural network do?
compositional features for a compositional problem
solving multiplication (circuit design)
<— composed of adding numbers
<— composed of adding bits
input: x and y → adding bits → adding numbers → multiply → output: x·y
human knowledge organisation
find roots of a linear expression
<— composed of setting the expression to zero and solving linear equations
<— composed of rearranging terms
input: x, +, 2, =, 0 → rearrange terms → set x + 2 = 0 → solve → output: x = -2
deep layers let knowledge and processes be represented with fewer neurons!
backpropagation? What is the target against which to minimise error?
practically speaking… minimise MSE by SGD
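The SGD step above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: a tabular stand-in replaces the convolutional Q network, and the GAMMA and ALPHA values are assumptions.

```python
GAMMA = 0.99   # discount factor (assumed value)
ALPHA = 0.01   # SGD step size (assumed value)

# Toy tabular stand-in for the Q network: Q[state] is a list of action values.
Q = {s: [0.0, 0.0] for s in range(3)}

def td_target(r, s_next, done):
    # target y = r + gamma * max_a' Q(s', a'); no bootstrap at episode end
    return r if done else r + GAMMA * max(Q[s_next])

def sgd_step(s, a, r, s_next, done):
    # one SGD step on the squared error (y - Q(s, a))^2
    y = td_target(r, s_next, done)
    Q[s][a] += ALPHA * (y - Q[s][a])

sgd_step(s=0, a=1, r=1.0, s_next=1, done=False)
```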
experience replay
save the current transition (s, a, r, s') in memory every time step
randomly sample a set of (s, a, r, s') from memory for training the Q network
(instead of learning from the current state transition)
every step ⇒ closer to i.i.d. + learning from the past
[diagram: agent takes a_t in s_t, receives r_{t+1} and moves to s_{t+1}]
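A replay memory of this kind can be sketched in a few lines; the capacity and batch size here are assumed values for illustration.

```python
import random
from collections import deque

memory = deque(maxlen=10_000)  # replay memory (capacity is an assumed value)

def store(s, a, r, s_next):
    # save the current transition (s, a, r, s') every time step
    memory.append((s, a, r, s_next))

def sample_batch(batch_size=4):
    # a random minibatch breaks the correlation between consecutive
    # transitions, so training data is closer to i.i.d.
    return random.sample(memory, batch_size)

for t in range(100):          # stand-in for an interaction loop
    store(t, t % 2, 1.0, t + 1)

batch = sample_batch()
```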
freezing the target Q
a moving target ⇒ oscillations
stabilise learning by freezing the target network, moving it only every now and then
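The freezing idea can be sketched as a periodic copy of the online parameters into a target copy; the update period and the fake gradient step are assumptions for illustration.

```python
import copy

TARGET_UPDATE_EVERY = 1000  # assumed update period

q_params = {"w": 0.0}                    # online network parameters
target_params = copy.deepcopy(q_params)  # frozen target network

def maybe_update_target(step):
    # keep the target fixed, copying the online params only now and then
    global target_params
    if step % TARGET_UPDATE_EVERY == 0:
        target_params = copy.deepcopy(q_params)

for step in range(1, 2001):
    q_params["w"] += 0.001   # stand-in for an SGD update on the online net
    maybe_update_target(step)
```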
double DQN
decouple selection of the target action from its evaluation:
selection: a* = argmax_{a'} Q(s', a'; θ) with the online network
evaluation: Q(s', a*; θ⁻) with the target network, instead of max_{a'} Q(s', a'; θ⁻)
Deep Reinforcement Learning with Double Q-learning, van Hasselt et al., AAAI 2016. https://arxiv.org/pdf/1509.06461v3.pdf
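The decoupling can be sketched by contrasting the two targets; the Q-value lists below are made-up numbers chosen so that the two networks disagree, which is exactly when double DQN differs from standard DQN.

```python
GAMMA = 0.99  # assumed discount factor

def dqn_target(r, q_target_next):
    # standard DQN: the target net both selects and evaluates -> max operator
    return r + GAMMA * max(q_target_next)

def double_dqn_target(r, q_online_next, q_target_next):
    # double DQN: the online net selects the action, the target net evaluates it
    a_star = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return r + GAMMA * q_target_next[a_star]

q_online = [1.0, 0.5]   # online net prefers action 0
q_target = [0.2, 0.9]   # target net overestimates action 1
y_single = dqn_target(0.0, q_target)
y_double = double_dqn_target(0.0, q_online, q_target)
```

When the networks agree, the two targets coincide; the decoupling only tempers the overestimation that the max operator introduces.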
prioritised experience replay
sample (s, a, r, s') from memory
based on surprise (the magnitude of the TD error)
Prioritised Experience Replay, Schaul et al., ICLR 2016. https://arxiv.org/pdf/1511.05952v4.pdf
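Proportional prioritisation can be sketched as sampling with probability proportional to a power of the surprise; the transitions, surprise values, and the exponent alpha here are illustrative assumptions, not taken from the paper's experiments.

```python
import random

# (transition, surprise); surprise stands in for |TD error|, values made up
memory = [
    (("s0", 0, 1.0, "s1"), 0.1),
    (("s1", 1, 0.0, "s2"), 2.0),
    (("s2", 0, 0.5, "s3"), 0.4),
]

def sample_prioritised(k=2, alpha=0.6):
    # probability proportional to surprise^alpha; alpha interpolates between
    # uniform sampling (alpha=0) and greedy prioritisation (large alpha)
    weights = [surprise ** alpha for _, surprise in memory]
    return random.choices([tr for tr, _ in memory], weights=weights, k=k)

batch = sample_prioritised()
```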
Combining decoupling (double DQN), prioritised replay, and duelling helps!
duelling architecture
Q(s, a) = V(s; u) + A(s, a; v)
(V: state-value stream with parameters u; A: advantage stream with parameters v; both feed the shared Q output)
Dueling Network Architectures for Deep RL, Wang et al., ICML 2016. https://arxiv.org/pdf/1511.06581v3.pdf
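The combination of the two streams can be sketched as below; following Wang et al., the mean advantage is subtracted so that V and A are identifiable. The scalar values are made up for illustration.

```python
def dueling_q(v, advantages):
    # Q(s, a) = V(s; u) + A(s, a; v) - mean_a' A(s, a'; v)
    # subtracting the mean advantage keeps V and A identifiable
    mean_a = sum(advantages) / len(advantages)
    return [v + a - mean_a for a in advantages]

# V(s) = 1.0 and per-action advantages (illustrative numbers)
q_values = dueling_q(v=1.0, advantages=[0.5, -0.5, 0.0])
```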
however training is
SLOW
making deep RL faster and wilder (more applicable in the real world)!
data-efficient exploration?
parallelism?
transfer learning?
making use of a model?
[diagram: many parallel learners, each with its own Q and target Q_t copies]
shared parameters for Q and target Q
parallel learners getting individual experiences
lock-free parameter updates
Asynchronous Methods for Deep Reinforcement Learning, Mnih et al., ICML 2016. http://jmlr.org/proceedings/papers/v48/mniha16.pdf
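The lock-free update pattern can be sketched with plain threads; this is a Hogwild!-style toy, not the paper's actor-learner setup, and the learner count, update count, and fake gradient are assumptions.

```python
import threading

shared = {"w": 0.0}  # globally shared parameters, updated without locks

def learner(n_updates):
    # each parallel learner gathers its own experience and applies
    # lock-free updates to the shared parameters; occasional overwritten
    # updates are tolerated rather than serialised with a lock
    for _ in range(n_updates):
        grad = 0.001            # stand-in for a real gradient
        shared["w"] += grad     # no lock around the read-modify-write

threads = [threading.Thread(target=learner, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because updates are not serialised, the final value may fall slightly short of 4 × 1000 × 0.001; the bet is that the lost updates cost less than locking would.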
code for you to play with...
Telenor's own implementation of asynchronous deep RL: https://github.com/traai/async-deep-rl
Let's keep the conversation going: https://openrl.slack.com