Sample z ~ P(z) and trajectories, then use MAML to adapt the meta baseline V_θ(s) for policy gradient:
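The adaptation step named above can be sketched roughly as follows. This is a minimal illustration, not the poster's implementation: it assumes a linear baseline in place of a value network, plain gradient steps on the squared error, and a split of the rollouts for one sampled input into an adaptation half and an evaluation half. The names `adapt_baseline`, `meta_w`, and `true_w` are hypothetical, and meta-training of `meta_w` is omitted.

```python
import numpy as np

def adapt_baseline(meta_w, states, returns, step_size=0.1, steps=1):
    """Take a few gradient steps on the mean squared error of a linear
    baseline V_w(s) = s @ w, starting from the meta parameters (assumed form)."""
    w = np.array(meta_w, dtype=float)
    for _ in range(steps):
        residual = states @ w - returns
        w = w - step_size * 2.0 * states.T @ residual / len(returns)
    return w

def mse(w, states, returns):
    return float(np.mean((states @ w - returns) ** 2))

rng = np.random.default_rng(0)
meta_w = np.zeros(3)                      # meta baseline parameters (toy initial value)
true_w = np.array([1.0, -1.0, 4.0])       # toy value function under one sampled input z
states = rng.normal(size=(64, 3))
states[:, -1] = 1.0                       # constant bias feature
returns = states @ true_w                 # toy returns observed under that input

half = len(states) // 2
# Adapt on the first half of the rollouts for this input ...
w_adapted = adapt_baseline(meta_w, states[:half], returns[:half], steps=5)
# ... and use the adapted baseline on the held-out half.
baseline_values = states[half:] @ w_adapted
```

Adapting on one half and evaluating on the other keeps the baseline from being fit to the same returns it is subtracted from, which is the design choice assumed in this sketch.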
Input-Dependent Baselines
State-dependent baseline: b(s_t) = V(s_t), ∀ z_{t:∞}
Input-dependent baseline: b(s_t, z_{t:∞}) = V(s_t | z_{t:∞})
Depend on the entire future input sequence {z_t, z_{t+1}, …, z_∞} during training
Input-dependent baselines are bias-free for policy gradients:
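One way to see the bias-free property is the standard score-function argument, sketched below. It assumes that the action a_t is drawn from π_θ(· | s_t) and that, given s_t, it does not depend on the future inputs z_{t:∞} (the input process is exogenous); the notation π_θ is an assumption, since the poster text above does not name the policy.

```latex
\begin{align*}
\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t, z_{t:\infty})\big]
&= \mathbb{E}_{s_t,\, z_{t:\infty}}\Big[ b(s_t, z_{t:\infty}) \sum_{a_t} \pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big] \\
&= \mathbb{E}_{s_t,\, z_{t:\infty}}\Big[ b(s_t, z_{t:\infty})\, \nabla_\theta \sum_{a_t} \pi_\theta(a_t \mid s_t) \Big] \\
&= \mathbb{E}_{s_t,\, z_{t:\infty}}\big[ b(s_t, z_{t:\infty})\, \nabla_\theta 1 \big] = 0 .
\end{align*}
```

Because the baseline term has zero expectation, subtracting b(s_t, z_{t:∞}) from the return changes the variance of the gradient estimate but not its mean.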
Implementations of input-dependent baselines:
Experiments
Motivation Example
Input-Driven Processes
Variance Reduction for Reinforcement Learning in Input-Driven Environments
Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh
MIT Computer Science and Artificial Intelligence Laboratory
[Figure: load balancing example. Labels: time, job size, load balancer, Server 1, Server 2, Input 1, Input 2.]
[Figure: input-dependent vs. state-dependent baselines. Panels: TRPO, A2C, Robust Adversarial RL, Meta-Policy Optimization.]
Input-dependent baselines are applicable to many policy gradient methods, such as A2C, TRPO, and PPO, and they are complementary and orthogonal to robust adversarial RL methods such as RARL (Pinto et al., 2017) and meta-policy optimization such as MPO (Clavera et al., 2018).
[Figure: graphical models over states s_t, inputs z_t, and actions a_t for (a) a standard MDP, (b) an input-driven MDP, and (c) an input-driven POMDP.]
[Figure: load balancing example. Panels: policy variance, reward, policy visualization.]
Sample z_i from {z_1, z_2, …, z_N}, use the corresponding value net for policy gradient:
Sample z ~ P(z), use an LSTM to compute V(s, z) for policy gradient:
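The first of these two variants, one value function per fixed input sequence, can be sketched as below. This is only an illustration under stated assumptions: each "value net" is replaced by a linear least-squares fit, the class name `MultiValueBaseline` and the toy data are hypothetical, and the inputs are identified by an integer index.

```python
import numpy as np

class MultiValueBaseline:
    """One linear value function per fixed input sequence (hypothetical helper)."""

    def __init__(self, num_inputs, state_dim):
        # One weight vector (plus bias) for each of the fixed input sequences.
        self.weights = np.zeros((num_inputs, state_dim + 1))

    @staticmethod
    def _features(states):
        states = np.atleast_2d(states)
        return np.hstack([states, np.ones((states.shape[0], 1))])

    def fit(self, input_index, states, returns):
        # Least-squares fit of returns observed under this input sequence only.
        X = self._features(states)
        self.weights[input_index], *_ = np.linalg.lstsq(X, returns, rcond=None)

    def predict(self, input_index, states):
        # Baseline value V(s | z_i) for the chosen input sequence.
        return self._features(states) @ self.weights[input_index]

rng = np.random.default_rng(0)
baseline = MultiValueBaseline(num_inputs=3, state_dim=2)
offsets = np.array([-10.0, 0.0, 10.0])    # toy effect of each input sequence on the return
for i in range(3):
    states = rng.normal(size=(100, 2))
    returns = states @ np.array([1.0, -2.0]) + offsets[i]
    baseline.fit(i, states, returns)
```

Training then samples an index i, rolls out the policy under input z_i, and subtracts `baseline.predict(i, states)` from the returns. The cost is one set of parameters per input sequence, so this form only suits a small, fixed set of inputs.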
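[Figure: three implementations of the input-dependent baseline. LSTM: input z ~ P(z) is fed to an LSTM that outputs the baseline. Multi-value-network: Input 1, Input 2, and Input 3 each have their own value net. MAML: for input z ~ P(z), a meta baseline is adapted with MAML to give the baseline. One panel is annotated "(Not efficient)".]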
Examples of inputs: workload, network condition, wind, buoyancy, moving target
Environments with exogenous, stochastic input processes that affect the dynamics
Since the reward is partially dictated by the input process, the state alone provides only limited information for estimating the average return. Thus, policy gradient methods with standard state-dependent baselines suffer from high variance.
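The effect can be illustrated with a toy one-step problem, sketched below under assumptions that are not from the poster: a Bernoulli policy with logit `theta`, a Gaussian input z, and a reward r = z + a in which the input dominates. With no state to condition on, the state-dependent baseline reduces to the constant E[r], while the input-dependent baseline also sees z.

```python
import numpy as np

def gradient_samples(theta, z, a, baseline):
    """Per-sample policy gradient estimates for a Bernoulli policy with
    P(a = 1) = sigmoid(theta) and toy reward r = z + a."""
    p = 1.0 / (1.0 + np.exp(-theta))
    reward = z + a
    score = a - p                     # d/dtheta of log pi_theta(a)
    return score * (reward - baseline)

rng = np.random.default_rng(0)
theta = 0.5
p = 1.0 / (1.0 + np.exp(-theta))
n = 200_000
z = rng.normal(0.0, 5.0, size=n)                 # exogenous input, unaffected by actions
a = (rng.random(n) < p).astype(float)            # actions sampled from the policy

g_state = gradient_samples(theta, z, a, baseline=p)       # b = E[r] = E[z] + p = p
g_input = gradient_samples(theta, z, a, baseline=z + p)   # b(z) = E[r | z] = z + p

true_gradient = p * (1.0 - p)                    # d/dtheta of E[r] = d/dtheta of p
print("mean    :", g_state.mean(), g_input.mean(), true_gradient)
print("variance:", g_state.var(), g_input.var())
```

Both estimators target the same gradient, but the one with the constant baseline carries the full variance of z, whereas conditioning the baseline on z removes it.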
[Figure legends: b(s, z) = V(s|z) (MAML); b(s, z) = V(s|z) (10 values); b(s) = V(s), ∀z; heuristic.]
[Figure legends: TRPO, b(s, z) = V(s|z); RARL, b(s, z) = V(s|z); TRPO, b(s) = V(s), ∀z; RARL, b(s) = V(s), ∀z.]
[Figure legends: b(s, z) = V(s|z), TRPO; b(s, z) = V(s|z), MPO; b(s) = V(s), ∀z, TRPO; b(s) = V(s), ∀z, MPO.]