
Persistence Length Based Exploration for Continuous Control

Riashat Islam (joint work with Maziar Gomrokchi, Susan Amin & Doina Precup)

Reasoning and Learning Lab

20th April 2017

Deep Reinforcement Learning: Locomotion Tasks

Exploration in Continuous Control

• Exploring the environment ←→ exploiting good behaviour

• In continuous control, the default exploration is through random control noise (sketched below)

• High-dimensional continuous actions

• Many standard exploration methods (ε-greedy, Boltzmann) are limited to discrete action spaces

• Current exploration strategies are insufficient

We propose a trajectory-based exploration method suited for continuous control tasks.
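For reference, the default noise-based exploration mentioned above can be sketched as follows. This is a minimal NumPy illustration only; the policy function, noise scale, and action bounds are assumptions for the sketch, not code from this work.

```python
import numpy as np

def noisy_action(policy, state, noise_scale=0.1, low=-1.0, high=1.0):
    """Default continuous-control exploration: perturb the deterministic
    policy's action with uncorrelated Gaussian control noise."""
    action = np.asarray(policy(state), dtype=float)
    action = action + noise_scale * np.random.randn(*action.shape)
    return np.clip(action, low, high)
```

Because the noise is uncorrelated across time steps, the resulting behaviour is diffusive and tends to stay near the starting region, which is the limitation the trajectory-based method targets.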

Motivation

Off-Policy Actor-Critic

• DDPG in continuous control [Lillicrap et al., 2016; Silver et al., 2014] (update sketched below)

However, there is no good exploration strategy for collecting off-policy samples.

• This talk: propose an exploration method for off-policy actor-critic in continuous control

• Related current benchmark: VIME with on-policy TRPO [Houthooft et al., 2016]
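For context, here is a minimal sketch of the standard DDPG update from replay-buffer (off-policy) samples, following Lillicrap et al. (2016). The PyTorch module signatures (e.g. critic(s, a)) and hyperparameters are illustrative assumptions, not code from this talk.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG update from a replay-buffer batch of off-policy transitions."""
    s, a, r, s2, done = batch  # tensors sampled from the replay buffer; done is a float mask

    # Critic: regress Q(s, a) onto the bootstrapped target r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        q_target = r + gamma * (1.0 - done) * target_critic(s2, target_actor(s2))
    critic_loss = F.mse_loss(critic(s, a), q_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, ascend Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Polyak averaging of the target networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```

Because the update only needs transitions stored in the replay buffer, any behaviour policy can generate them, which is exactly where a better exploration strategy plugs in.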

Persistence Length Exploration

Intuition:

• The choice of the next exploratory action should depend on the trajectory so far

• Trajectories should fill up the entire state space

Persistence Length Exploration

• Mechanism of a locally self-avoiding random walk

• Adopted from the physics literature, where it describes the behaviour of polymer chains

• Consider the trajectory up to the current state to decide the next action

• Pure exploration → plan the trajectory to fill up the entire environment

Persistence Length Exploration

• Self-avoiding chains in a d-dimensional action space

• Self-avoiding trajectory

• Travel quickly around the environment, depending on the parameterization

• Persistence length Lp quantifies the stiffness of the chain (see the sketch below)
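To make the chain-stiffness intuition concrete, here is a hedged toy sketch of a worm-like chain of 2D actions whose directional correlation decays over a distance of roughly Lp. It is not the authors' PolyRL algorithm, and it ignores the self-avoidance constraint; all parameter names are illustrative.

```python
import numpy as np

def stiff_chain_actions(num_steps, lp, step_size=0.1, seed=0):
    """Toy persistence-length-controlled chain of 2D actions.

    Each new direction is a small turn away from the previous one; the turn
    variance is set so the direction correlation decays as ~exp(-distance / lp).
    Large lp -> straight, ballistic trajectories; small lp -> ordinary random walk.
    """
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(2.0 * step_size / lp)       # per-step angular diffusion
    theta = rng.uniform(0.0, 2.0 * np.pi)       # initial direction
    actions = []
    for _ in range(num_steps):
        theta += sigma * rng.standard_normal()  # turn slightly, never resample from scratch
        actions.append(step_size * np.array([np.cos(theta), np.sin(theta)]))
    return np.array(actions)
```

Summing these action steps gives a path that spreads across the workspace much faster than an uncorrelated random walk, which is the "fill up the environment" behaviour described above.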

PolyRL + DDPG
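The combination can be pictured as the data-collection loop below. This is an illustrative sketch only: polyrl_action, replay.add, replay.sample, ddpg_update, and the ε-style mixing of exploratory and greedy steps are assumed interfaces and design choices, not the authors' implementation; the old Gym step signature is also assumed.

```python
import random

def collect_with_polyrl(env, actor, polyrl_action, replay, ddpg_update,
                        num_steps=10000, exploit_prob=0.1):
    """Illustrative off-policy data collection for PolyRL + DDPG: interleave
    trajectory-dependent exploratory actions with the learned policy, store
    everything in the replay buffer, and update DDPG from stored samples."""
    state, trajectory = env.reset(), []
    for _ in range(num_steps):
        if random.random() < exploit_prob:
            action = actor(state)               # exploit the current deterministic policy
        else:
            action = polyrl_action(trajectory)  # exploratory step conditioned on the chain so far
        next_state, reward, done, _ = env.step(action)
        replay.add(state, action, reward, next_state, done)
        trajectory.append(action)
        ddpg_update(replay.sample())            # off-policy DDPG update (see sketch above)
        if done:
            state, trajectory = env.reset(), []
        else:
            state = next_state
```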

PolyRL Exploration (2D Action Space)

Figure: Exploratory action trajectories, (a) Episode 1 and (b) Episode 2

PolyRL + DDPG (MuJoCo Hopper)

PolyRL + DDPG (MuJoCo Swimmer)

Policy Gradients on MuJoCo Tasks

Few Benchmark Results (Max Return)

Task          Action Dim   TRPO    DDPG
Swimmer       2D            110     150
Reacher       2D           -6.7    -6.6
Hopper        3D           2486    2604
HalfCheetah   6D           4734    7490
Walker        6D           3567    3626
Humanoid      17D           918     552

Current Benchmark - VIME (MuJoCo Walker2D, Swimmer)

Thank You

Questions...