Persistence Length Based Exploration for Continuous Control
Riashat Islam (joint work with Maziar Gomrokchi, Susan Amin & Doina Precup)
Reasoning and Learning Lab
20th April 2017
Exploration in Continuous Control
- Exploring the environment ←→ exploiting good behaviour
- In continuous control, the default exploration is through random control noise (see the sketch below)
- High-dimensional continuous actions
- Many standard exploration methods (ε-greedy, Boltzmann) are limited to discrete action spaces
- Current exploration strategies are insufficient

We propose a trajectory-based exploration method suited for continuous control tasks.
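For concreteness, a minimal Python sketch of that default baseline: i.i.d. Gaussian control noise added to a deterministic policy's action. The noise scale `sigma` and the action bounds are illustrative assumptions, not values from the talk.

```python
import numpy as np

def gaussian_noise_action(policy_action, sigma=0.2, low=-1.0, high=1.0):
    """Default undirected exploration in continuous control: perturb
    the deterministic policy action with i.i.d. Gaussian noise and
    clip to the valid action range. sigma/low/high are assumed values."""
    noisy = policy_action + sigma * np.random.randn(*policy_action.shape)
    return np.clip(noisy, low, high)
```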
Motivation
Off-Policy Actor-Critic
- DDPG in continuous control [Lillicrap et al., 2016; Silver et al., 2014]

However, there is no good exploration strategy for collecting the off-policy samples.

- This talk: we propose an exploration method for off-policy actor-critic in continuous control (an illustrative data-collection loop follows this list)
- Related current benchmark: VIME with on-policy TRPO [Houthooft et al., 2016]
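As a hedged illustration of the off-policy setup (not code from the talk): a behaviour policy, here the actor plus exploration noise, generates transitions that go into a replay buffer, and the actor-critic learner later samples minibatches from that buffer. `env`, `actor`, and `explore` are hypothetical placeholders.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of off-policy transitions (assumed helper)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def collect(env, actor, explore, buffer, steps):
    """Roll out the behaviour policy (actor + exploration noise) and
    store transitions for later off-policy actor-critic updates."""
    state = env.reset()
    for _ in range(steps):
        action = explore(actor(state))        # behaviour policy action
        next_state, reward, done, _ = env.step(action)
        buffer.add((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
```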
Persistence Length Exploration
Intuition:

- The choice of the next exploratory action should depend on the trajectory so far
- Trajectories should fill up the entire state space
Persistence Length Exploration
- Mechanism of a locally self-avoiding random walk (see the sketch below)
- Adopted from the physics literature, where it describes the behaviour of polymer chains
- Consider the trajectory up to the current state to decide the next action
- Pure exploration → plan the trajectory to fill up the entire environment
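A sketch of the underlying intuition, not the exact PolyRL sampling rule: each new movement direction in action space is drawn so that its correlation with the previous direction is set by the persistence length `lp`, keeping the chain locally stiff (hence locally self-avoiding) while still randomizing over long horizons.

```python
import numpy as np

def next_direction(prev_dir, lp, step=1.0):
    """One step of a 'stiff' random walk in d-dimensional action space
    (assumes d >= 2 and unit-norm prev_dir). The expected correlation
    <u_t . u_{t+1}> equals exp(-step / lp): larger lp gives a stiffer,
    straighter chain that covers the space faster."""
    corr = np.exp(-step / lp)                    # target directional correlation
    noise = np.random.randn(*prev_dir.shape)
    noise -= noise.dot(prev_dir) * prev_dir      # keep only the orthogonal part
    noise /= np.linalg.norm(noise) + 1e-8        # unit orthogonal perturbation
    new_dir = corr * prev_dir + np.sqrt(1.0 - corr**2) * noise
    return new_dir / np.linalg.norm(new_dir)
```

With lp → 0 this reduces to an uncorrelated random walk; with large lp the trajectory travels in nearly straight lines before turning.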
Persistence Length Exploration
- Self-avoiding chains in d-dimensional action space
- Self-avoiding trajectory
- Travel quickly around the environment, depending on the parameterization
- Persistence length Lp quantifies the stiffness of the chain (standard definition below)
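For reference, the standard polymer-physics definition the slide alludes to: Lp is the decay constant of the tangent-tangent correlation along the chain.

```latex
% u(s) is the unit tangent of the chain at arc length s; the
% correlation between tangents a distance \Delta s apart decays
% exponentially, with decay constant L_p (the persistence length):
\langle \mathbf{u}(s) \cdot \mathbf{u}(s + \Delta s) \rangle = e^{-\Delta s / L_p}
```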
PolyRL Exploration (2D Action Space)
Figure: Exploratory action trajectory. (a) Episode 1; (b) Episode 2.
Policy Gradients on MuJoCo Tasks
Few Benchmark Results (Max Return)

Task         Action Dim   TRPO    DDPG
Swimmer      2D            110     150
Reacher      2D           -6.7    -6.6
Hopper       3D           2486    2604
HalfCheetah  6D           4734    7490
Walker       6D           3567    3626
Humanoid     17D           918     552