Transcript
Page 1: Variance Reduction for Reinforcement Learning in Input-Driven Environments (people.csail.mit.edu/hongzi/var-website/content/poster.pdf)

Sample z ~ P(z) and trajectories, then use MAML to adapt the meta baseline V(s) for the policy gradient:
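A rough Python sketch of this adaptation step (hypothetical network sizes and variable names; a first-order simplification that omits the outer meta-update across input sequences):

```python
import copy
import torch
import torch.nn as nn

# First-order sketch: a shared meta value network is adapted with a few gradient
# steps on trajectories generated under one sampled input sequence z, and the
# adapted copy then serves as the input-dependent baseline V(s | z).
def adapt_meta_baseline(meta_value_net, states, returns, inner_lr=1e-2, steps=5):
    adapted = copy.deepcopy(meta_value_net)              # start from meta parameters
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(steps):
        loss = ((adapted(states).squeeze(-1) - returns) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()                                       # inner-loop value regression
    return adapted                                       # approximates V(s | z)

state_dim = 8
meta_value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))

# Trajectories collected under one sampled input sequence z ~ P(z) (random stand-ins).
states, returns = torch.randn(128, state_dim), torch.randn(128)

baseline_z = adapt_meta_baseline(meta_value_net, states, returns)
advantages = returns - baseline_z(states).squeeze(-1).detach()   # used in the policy gradient
```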

Input-Dependent Baselines

State-dependent baseline: b(s_t) = V(s_t), ∀ z_{t:∞}
Input-dependent baseline: b(s_t, z_{t:∞}) = V(s_t | z_{t:∞})

Depends on the entire future input sequence {z_t, z_{t+1}, …, z_∞} during training.
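For illustration only (hypothetical numbers, not from the poster), a minimal sketch of how the two baselines behave for the same states visited under two different input sequences:

```python
import numpy as np

def advantages_state_dependent(returns, v_state):
    """Standard baseline: subtract V(s_t), which averages over all input sequences z."""
    return returns - v_state

def advantages_input_dependent(returns, v_state_given_z):
    """Input-dependent baseline: subtract V(s_t | z_{t:inf}), the value conditioned
    on the specific input sequence seen by this trajectory."""
    return returns - v_state_given_z

# Two trajectories visiting the same states under different input sequences.
returns_z_good = np.array([10.0, 6.0, 2.0])    # favorable input sequence
returns_z_bad  = np.array([2.0, 1.0, 0.5])     # unfavorable input sequence

v_state      = np.array([6.0, 3.5, 1.25])      # V(s): average over both inputs
v_given_good = np.array([10.0, 6.0, 2.0])      # V(s | z_good)
v_given_bad  = np.array([2.0, 1.0, 0.5])       # V(s | z_bad)

# With the state-dependent baseline the advantages swing with the input sequence
# (high variance); the input-dependent baseline removes that input-driven swing.
print(advantages_state_dependent(returns_z_good, v_state))       # 4.0, 2.5, 0.75
print(advantages_state_dependent(returns_z_bad, v_state))        # -4.0, -2.5, -0.75
print(advantages_input_dependent(returns_z_good, v_given_good))  # 0.0, 0.0, 0.0
print(advantages_input_dependent(returns_z_bad, v_given_bad))    # 0.0, 0.0, 0.0
```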

Input-dependent baselines are bias-free for policy gradients:
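A sketch of the standard argument: given s_t and z_{t:∞}, the baseline is a constant with respect to the action, so its contribution to the policy gradient vanishes in expectation:

E_{a_t ~ π(·|s_t)} [ ∇ log π(a_t|s_t) · b(s_t, z_{t:∞}) ] = b(s_t, z_{t:∞}) · ∇ ∫ π(a_t|s_t) da_t = b(s_t, z_{t:∞}) · ∇ 1 = 0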

Implementations of input-dependent baselines:

Experiments

Motivating Example

Input-Driven Processes

Variance Reduction for Reinforcement Learning in Input-Driven Environments

Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh. MIT Computer Science and Artificial Intelligence Laboratory

[Figure: load balancing example. Jobs of a given size arrive over time to a load balancer that dispatches them to Server 1 or Server 2, under input sequences Input 1 and Input 2.]

[Plots: learning curves for input-dependent vs. state-dependent baselines with TRPO and A2C.]

[Plot panel titles: Robust Adversarial RL, Meta-Policy Optimization.]

Input-dependent baselines are applicable to many policy gradient methods, such as A2C, TRPO, and PPO, and they are complementary and orthogonal to robust adversarial RL methods such as RARL (Pinto et al., 2017) and meta-policy optimization such as MPO (Clavera et al., 2018).
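A minimal sketch (hypothetical names) of why the baseline composes with these methods: only the baseline subtracted when forming advantages changes, while the surrounding policy-gradient update is untouched:

```python
import torch

# Only the baseline swap changes; the policy-gradient update (A2C-style here,
# but TRPO/PPO or RARL/MPO training loops work the same way) stays as-is.
def policy_gradient_loss(log_probs, returns, baseline_values):
    advantages = (returns - baseline_values).detach()   # baseline may be b(s) or b(s, z)
    return -(log_probs * advantages).mean()

log_probs = torch.randn(32, requires_grad=True)   # log pi(a_t | s_t), stand-ins
returns = torch.randn(32)                          # Monte Carlo returns
v_state_only = torch.randn(32)                     # b(s)    = V(s)
v_input_dep  = torch.randn(32)                     # b(s, z) = V(s | z), from any implementation on the poster

loss = policy_gradient_loss(log_probs, returns, v_input_dep)
loss.backward()                                    # gradient w.r.t. policy parameters
```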

[Figure: graphical models showing states s_{t-1}, s_t, s_{t+1}, actions a_{t-1}, a_t, a_{t+1}, and, in the input-driven cases, inputs z_{t-1}, z_t, z_{t+1}.]

(a) Standard MDP
(b) Input-Driven MDP
(c) Input-Driven POMDP

[Plots for the load balancing example: policy variance, reward, and policy visualization.]

Sample z_i from {z_1, z_2, …, z_N}, then use the corresponding value network for the policy gradient:
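A minimal Python sketch of this scheme, assuming N pre-sampled input sequences and hypothetical network sizes:

```python
import random
import torch
import torch.nn as nn

# Multi-value-network scheme: one value net per pre-sampled input sequence
# z_1..z_N; a rollout under z_i uses its own net as the baseline V(s | z_i).
state_dim, N = 8, 10
value_nets = [nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
              for _ in range(N)]

i = random.randrange(N)                        # which input sequence was rolled out
states = torch.randn(32, state_dim)            # states visited under z_i (stand-ins)
returns = torch.randn(32)                      # returns under z_i (stand-ins)
baseline = value_nets[i](states).squeeze(-1)   # V(s | z_i)
advantages = returns - baseline.detach()       # used in the policy gradient
# Drawback noted on the poster: the N networks do not share training data,
# so this does not scale with the number of input sequences ("not efficient").
```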

Sample z ~ P(z), then use an LSTM to compute V(s, z) for the policy gradient:
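A minimal Python sketch of this scheme, with hypothetical network sizes and tensor shapes:

```python
import torch
import torch.nn as nn

# LSTM scheme: encode the sampled input sequence z with an LSTM and combine
# its summary with the state to predict the baseline V(s, z).
class LSTMBaseline(nn.Module):
    def __init__(self, state_dim, input_dim, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(state_dim + hidden, 64),
                                  nn.Tanh(), nn.Linear(64, 1))

    def forward(self, states, z_seq):
        _, (h, _) = self.encoder(z_seq)          # z_seq: (batch, T, input_dim)
        z_summary = h[-1]                        # final hidden state summarizes z
        return self.head(torch.cat([states, z_summary], dim=-1)).squeeze(-1)

baseline_net = LSTMBaseline(state_dim=8, input_dim=3)
states = torch.randn(32, 8)                      # states s_t (stand-ins)
z_seq = torch.randn(32, 20, 3)                   # sampled input sequence per trajectory
values = baseline_net(states, z_seq)             # V(s, z) used as the baseline
```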

[Diagram of the LSTM implementation: input z ~ P(z) feeds an LSTM, which produces the baseline V(s, z).]

[Diagram of the multi-value-network implementation: inputs z_1, z_2, …, z_N each have their own value net V_{z_1}(s), V_{z_2}(s), …, V_{z_N}(s).]

[Diagram of the meta-learning implementation: input z ~ P(z); MAML adapts the meta baseline V(s) into an input-specific baseline V(s | z).]

(Not efficient: this requires a separate value network per input sequence.)

workload, network condition, wind, buoyancy, moving target

Environments with exogenous, stochastic input processes that affect the dynamics.

Since the reward is partially dictated by the input process, the state alone provides only limited information for estimating the average return. Thus, policy gradient methods with standard state-dependent baselines suffer from high variance.

[Plot legends: b(s, z) = V(s|z) (MAML); b(s, z) = V(s|z) (10 value nets); b(s) = V(s), ∀z; heuristic.]

[Plot legends: TRPO and RARL, each with b(s, z) = V(s|z) and with b(s) = V(s), ∀z.]

[Plot legends: TRPO and MPO, each with b(s, z) = V(s|z) and with b(s) = V(s), ∀z.]

[Poster label: Recommended]