Variance Reduction for Reinforcement Learning in Input-Driven Environments

Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh
MIT Computer Science and Artificial Intelligence Laboratory

Input-Driven Processes

Environments with exogenous, stochastic input processes that affect the dynamics. Examples: workload, network condition, wind, buoyancy, moving target.

[Figure: graphical models of (a) Standard MDP, (b) Input-Driven MDP, and (c) Input-Driven POMDP over states $s_{t-1}, s_t, s_{t+1}$, inputs $z_{t-1}, z_t, z_{t+1}$, and actions $a_{t-1}, a_t, a_{t+1}$.]

Motivation Example

[Figure: load balancing example. A load balancer dispatches jobs to Server 1 and Server 2; panels show job size over time for Input 1 and Input 2, plus policy variance, reward, and policy visualization for input-dependent and state-dependent baselines.]

Since the reward is partially dictated by the input process, the state alone only provides limited information to estimate the average return. Thus, policy gradient methods with standard state-dependent baselines suffer from high variance.

Input-Dependent Baselines

State-dependent baseline: $b(s_t) = V(s_t), \forall z_{t:\infty}$
Input-dependent baseline: $b(s_t, z_{t:\infty}) = V(s_t \mid z_{t:\infty})$, which depends on the entire future input sequence $\{z_t, z_{t+1}, \ldots, z_\infty\}$ during training.

Input-dependent baselines are bias-free for policy gradients.

Implementations of input-dependent baselines:

Sample $z_i \sim \{z_1, z_2, \ldots, z_N\}$, use the corresponding value net for policy gradient.
Sample $z \sim P(z)$, use an LSTM to compute $V(s, z)$ for policy gradient.
Sample $z \sim P(z)$ and trajectories, then use MAML to adapt the meta baseline $V(s)$ for policy gradient.

[Figure: diagrams of the three implementations: one value net per input $z_1, z_2, \ldots, z_N$; an LSTM that takes the input $z \sim P(z)$ and outputs the baseline $V(s, z)$; and a meta baseline adapted by MAML to the sampled input $z \sim P(z)$. One diagram carries the annotation "(Not efficient)".]

Experiments

Input-dependent baselines are applicable to many policy gradient methods, such as A2C, TRPO, and PPO, and they are complementary and orthogonal to robust adversarial RL methods such as RARL (Pinto et al., 2017) and meta-policy optimization such as MPO (Clavera et al., 2018).

[Figure: experiment panels for TRPO, A2C, Robust Adversarial RL, and Meta-Policy Optimization. Legend entries: $b(s, z) = V(s \mid z)$ (MAML); $b(s, z) = V(s \mid z)$ (10 values); $b(s) = V(s), \forall z$; heuristic. The RARL panel compares TRPO and RARL, and the MPO panel compares TRPO and MPO, each with $b(s, z) = V(s \mid z)$ and with $b(s) = V(s), \forall z$.]
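The bias-free claim for input-dependent baselines follows from the standard baseline argument. The sketch below is not the poster's own derivation. It assumes a discrete action space and writes the policy as $\pi_\theta(a_t \mid s_t)$. It also assumes that, because the input process is exogenous, the action $a_t$ is conditionally independent of the future inputs given what the policy observes. Under those assumptions the baseline term is a constant inside the expectation over $a_t$:

```latex
\begin{aligned}
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}
  \big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t, z_{t:\infty}) \big]
&= b(s_t, z_{t:\infty}) \sum_{a} \pi_\theta(a \mid s_t)\,
   \nabla_\theta \log \pi_\theta(a \mid s_t) \\
&= b(s_t, z_{t:\infty}) \sum_{a} \nabla_\theta \pi_\theta(a \mid s_t) \\
&= b(s_t, z_{t:\infty})\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t) \\
&= b(s_t, z_{t:\infty})\, \nabla_\theta 1 = 0 .
\end{aligned}
```

Subtracting $b(s_t, z_{t:\infty})$ from the return therefore leaves the expected policy gradient unchanged, while the baseline is free to absorb the part of the return that the input sequence determines.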

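The variance argument can be illustrated with a small toy problem. The sketch below is a hypothetical example, not one of the poster's experiments: a single state, two actions, a Bernoulli policy with one logit, and an exogenous input z (think of a job size) that dominates the reward. All names and numbers (`Z_VALUES`, `P_ACTION_1`, `reward`) are assumptions made for illustration. It enumerates every (action, input) pair to compute the exact mean and variance of the one-sample gradient estimator under a state-dependent baseline and under an input-dependent one.

```python
from itertools import product

# Hypothetical toy setup (not from the poster).
Z_VALUES = [1.0, 10.0]   # possible exogenous inputs, e.g. job sizes
Z_PROBS = [0.5, 0.5]     # input distribution P(z), independent of the action
P_ACTION_1 = 0.5         # pi(a = 1) for a Bernoulli policy with a single logit


def action_dist():
    """Return (action, probability) pairs for the policy."""
    return [(0, 1.0 - P_ACTION_1), (1, P_ACTION_1)]


def reward(a, z):
    """Reward is mostly dictated by the input z; the action adds a small bonus."""
    return a - z


def input_baseline(z):
    """Input-dependent baseline b(s, z) = E[r | z]."""
    return sum(pa * reward(a, z) for a, pa in action_dist())


def state_baseline(_z):
    """State-dependent baseline b(s) = E[r], averaged over all inputs."""
    return sum(pz * input_baseline(z) for z, pz in zip(Z_VALUES, Z_PROBS))


def gradient_stats(baseline):
    """Exact mean and variance of (a - p) * (r - b) by full enumeration."""
    mean, second_moment = 0.0, 0.0
    for (a, pa), (z, pz) in product(action_dist(), zip(Z_VALUES, Z_PROBS)):
        score = a - P_ACTION_1  # derivative of log pi(a) w.r.t. the logit
        g = score * (reward(a, z) - baseline(z))
        mean += pa * pz * g
        second_moment += pa * pz * g * g
    return mean, second_moment - mean ** 2


mean_state, var_state = gradient_stats(state_baseline)
mean_input, var_input = gradient_stats(input_baseline)
print("state-dependent baseline: mean", mean_state, "variance", var_state)
print("input-dependent baseline: mean", mean_input, "variance", var_input)
```

Both baselines give the same expected gradient, which matches the bias-free property, but only the input-dependent baseline removes the variance that comes from z. In real environments the input sequence is long and the baseline has to be learned, which is what the three implementations on the poster address.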
Source: people.csail.mit.edu/hongzi/var-website/content/poster.pdf


