Deep RL in continuous action spaces On-Policy set up -- Samriddhi Jain

Deep RL in continuous action spaces - ETH Z...Trust Region Policy Optimization, Schulman et al. 2015 Proximal Policy Optimization Algorithms, Schulman et al. 2017 Surrogate Loss How

Page 1:

Deep RL in continuous action spaces

On-Policy set up

-- Samriddhi Jain

Page 2:

Discrete action environments

● Atari, Board games

Page 3:

Continuous action environments

● MuJoCo (Multi-Joint dynamics with Contact) environments

Page 4:

Continuous action environments

● Simulated goal-based tasks for the Fetch and ShadowHand robots

https://openai.com/blog/ingredients-for-robotics-research/

Page 5:

Handling continuous actions

State → Policy Network → Distribution Parameters → Sampling → Control
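The pipeline above can be sketched in a few lines: a policy network maps the state to the parameters of an action distribution (here a Gaussian), and the continuous control is a sample from it. This is a minimal NumPy stand-in; a linear map plays the role of the deep network, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_policy_sample(state, W_mu, b_mu, log_std):
    """Map a state to Gaussian distribution parameters, then sample a control.

    A linear 'policy network' stands in for a deep net: it outputs the mean
    (plus a state-independent log-std), and the action is drawn from
    N(mu, sigma^2).
    """
    mu = state @ W_mu + b_mu   # distribution parameters from the network
    sigma = np.exp(log_std)    # log-parameterization keeps sigma positive
    action = mu + sigma * rng.standard_normal(mu.shape)  # sample the control
    return action, mu, sigma

state = np.array([0.5, -1.0])
W_mu = np.zeros((2, 3)); b_mu = np.zeros(3); log_std = np.zeros(3)
action, mu, sigma = gaussian_policy_sample(state, W_mu, b_mu, log_std)
```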

Page 6:

Policy Gradients: review

Pitfalls:

● Sample inefficiency
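As a reminder of the estimator being reviewed, the vanilla policy gradient averages ∇log π(a|s) weighted by the return, a sketch in NumPy (per-sample gradients are given as plain arrays for illustration):

```python
import numpy as np

def reinforce_gradient(grad_logp, returns):
    """Vanilla policy gradient: grad J = E[ grad log pi(a|s) * R ].

    Every update needs fresh on-policy trajectories, which is the
    sample-inefficiency pitfall noted above.
    """
    return np.mean(grad_logp * returns[:, None], axis=0)

grad_logp = np.array([[1.0, 0.0], [0.0, 1.0]])  # per-sample grad log pi
returns = np.array([2.0, -1.0])                 # per-sample returns
g = reinforce_gradient(grad_logp, returns)
```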

Page 7:

Policy Gradients: step size and convergence

Page 8:

Policy Gradients: pitfalls

● Relation between policy space and model parameter space?

Page 9:

Research works covered

● Approximately Optimal Approximate Reinforcement Learning, Kakade and

Langford 2002

● Trust Region Policy Optimization, Schulman et al. 2015

● Proximal Policy Optimization Algorithms, Schulman et al. 2017

Page 10:

Surrogate Loss

● How to improve sample efficiency?

○ Use trajectories from other policies with importance sampling

● The gradient is the same as the policy-gradient objective's (at the old policy)
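A minimal NumPy sketch of the importance-sampled surrogate: samples collected under the old policy are reweighted by the probability ratio to evaluate a new policy.

```python
import numpy as np

def surrogate_loss(logp_new, logp_old, advantages):
    """Importance-sampled surrogate: E[ (pi_new / pi_old) * A ].

    Trajectories sampled under pi_old are reused for a new policy by
    reweighting each sample with the probability ratio.
    """
    ratios = np.exp(logp_new - logp_old)
    return np.mean(ratios * advantages)

logp_old = np.log(np.array([0.2, 0.5, 0.3]))
adv = np.array([1.0, -0.5, 2.0])
# At theta = theta_old every ratio is 1, so the surrogate equals the
# plain objective E[A] -- and their gradients agree there too.
assert np.isclose(surrogate_loss(logp_old, logp_old, adv), adv.mean())
```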

Page 11:

Trust Region Methods

Search in a trusted region

https://reference.wolfram.com/language/tutorial/UnconstrainedOptimizationTrustRegionMethods.html
https://optimization.mccormick.northwestern.edu/index.php/Trust-region_methods

Page 12:

Line search vs Trust region search

Page 13:

New loss

● Constrained:

● Penalty:

In practice, 𝛽 is harder to tune than δ.
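The two objectives on this slide did not survive extraction; in the standard form from the TRPO paper, they are:

```latex
% Constrained form: maximize the surrogate subject to a KL trust region
\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]
\quad \text{s.t.} \quad
\mathbb{E}_t\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\Vert\,\pi_\theta(\cdot \mid s_t)\right)\right] \le \delta

% Penalty form: fold the constraint into the objective with coefficient beta
\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]
- \beta\,\mathbb{E}_t\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\Vert\,\pi_\theta(\cdot \mid s_t)\right)\right]
```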

Page 14:

Minorize-Maximization (MM)

Page 15:

By Kakade and Langford (2002)

Local approximation

Page 16:
Page 17:

Sample based estimation

Page 18:

Natural Policy Gradients

● Approximate using a Taylor expansion:

The loss reduces to its first-order term, the KL constraint to its second-order term
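With those approximations the update has a closed form: maximize gᵀΔθ subject to ½ΔθᵀFΔθ ≤ δ, whose solution is a rescaled natural gradient F⁻¹g. A small NumPy sketch (g and F are toy stand-ins for the policy gradient and Fisher matrix):

```python
import numpy as np

def natural_gradient_step(g, F, delta):
    """Solve: max g^T dtheta  s.t.  0.5 * dtheta^T F dtheta <= delta.

    With the loss Taylor-expanded to first order and the KL constraint
    to second order (F = Fisher information matrix), the optimum is the
    natural gradient F^{-1} g scaled to saturate the constraint.
    """
    nat_grad = np.linalg.solve(F, g)                  # F^{-1} g
    step_size = np.sqrt(2.0 * delta / (g @ nat_grad))
    return step_size * nat_grad

g = np.array([1.0, 0.0])
F = np.diag([4.0, 1.0])
dtheta = natural_gradient_step(g, F, delta=0.01)
# The step exactly saturates the quadratic constraint:
# 0.5 * dtheta^T F dtheta == delta
```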

Page 19:

Natural Policy Gradients

Computationally very expensive:

O(N³) to invert the Fisher information matrix

Page 20:

Truncated Natural Policy Gradient

● Solutions:

○ Use a Kronecker-factored approximation of the curvature (K-FAC): ACKTR

○ Use conjugate gradients to compute H⁻¹g without actually inverting H:

Truncated Natural Policy Gradient
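A minimal conjugate-gradient sketch: it solves Hx = g using only Hessian-vector products, so H is never formed or inverted; truncating after a few iterations gives the TNPG direction. The toy H and g below are illustrative.

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g given only a Hessian-vector product.

    `hvp(v)` returns H @ v; stopping after a fixed number of iterations
    yields the Truncated Natural Policy Gradient update direction.
    """
    x = np.zeros_like(g)
    r = g.copy()            # residual g - H x (x starts at 0)
    p = r.copy()            # search direction
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

H = np.array([[4.0, 1.0], [1.0, 3.0]])   # toy symmetric positive definite H
g = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: H @ v, g)
```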

Page 21:

Line search for TRPO

● With the quadratic approximations, the full step may violate the KL constraint

● Solution: enforce the exact KL constraint with a backtracking line search with exponential decay
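A sketch of that line search: the proposed step is shrunk exponentially until the exact (not approximated) KL constraint holds and the surrogate actually improves. The 1-D `surr`/`kl_fn` stand-ins below are hypothetical, just to make the loop runnable.

```python
import numpy as np

def backtracking_line_search(theta, full_step, surrogate, kl, delta,
                             decay=0.5, max_backtracks=10):
    """Shrink the step until the exact KL constraint is satisfied and
    the surrogate objective improves; otherwise keep the old policy."""
    base = surrogate(theta)
    for i in range(max_backtracks):
        candidate = theta + (decay ** i) * full_step   # exponential decay
        if kl(candidate) <= delta and surrogate(candidate) > base:
            return candidate
    return theta  # no acceptable step found

# Toy 1-D stand-ins for the true surrogate and KL divergence
surr = lambda th: -(th - 1.0) ** 2
kl_fn = lambda th: 0.5 * th ** 2
theta_new = backtracking_line_search(np.float64(0.0), np.float64(0.8),
                                     surr, kl_fn, delta=0.1)
# The full step (KL = 0.32) is rejected; half the step (KL = 0.08) passes.
```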

Page 22:

Trust Region Policy Optimization

Page 23:

Results

Page 24:

Issues with TRPO

● Doesn’t work well with CNNs and RNNs

● Scalability

● Complexity

Page 25:

Proximal Policy Optimization (PPO)

● The motivation is the same as TRPO's

● Uses only first-order derivatives

● Designed to be simpler to implement

● Two versions:

○ PPO Penalty

○ PPO Clip

https://openai.com/blog/openai-baselines-ppo/

Page 26:

PPO adaptive KL penalty

● Considers the unconstrained objective

● Updates 𝛽ₖ between iterations

○ If d < d_targ / 1.5, 𝛽ₖ ← 𝛽ₖ / 2

○ If d > d_targ × 1.5, 𝛽ₖ ← 𝛽ₖ × 2
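The adaptive rule above is a three-line function, sketched here directly from the slide:

```python
def update_kl_penalty(beta, d, d_targ):
    """PPO adaptive-KL rule: nudge the penalty coefficient so the
    measured KL divergence d tracks the target d_targ."""
    if d < d_targ / 1.5:
        beta = beta / 2.0   # KL too small: penalize less, move faster
    elif d > d_targ * 1.5:
        beta = beta * 2.0   # KL too large: penalize more, move slower
    return beta

assert update_kl_penalty(1.0, d=0.001, d_targ=0.01) == 0.5
assert update_kl_penalty(1.0, d=0.05, d_targ=0.01) == 2.0
assert update_kl_penalty(1.0, d=0.01, d_targ=0.01) == 1.0
```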

Page 27:

PPO Clip objective
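The objective on this slide did not survive extraction; it is the clipped surrogate from the PPO paper, sketched here in NumPy:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """L^CLIP = E[ min(r * A, clip(r, 1-eps, 1+eps) * A) ].

    Clipping removes the incentive to push the probability ratio r
    outside [1-eps, 1+eps]; taking the min makes the objective a
    pessimistic bound on the unclipped surrogate.
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))

# With A > 0, ratios above 1 + eps earn no extra objective...
assert np.isclose(ppo_clip_objective(np.array([2.0]), np.array([1.0])), 1.2)
# ...while with A < 0 the min keeps the more pessimistic unclipped term.
assert np.isclose(ppo_clip_objective(np.array([2.0]), np.array([-1.0])), -2.0)
```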

Page 28:

Surrogate objectives

Page 29:

PPO actor pseudocode

Sample-efficient version of PPO

Page 30:

Results

Page 32:

Discussion

● Implementation matters

Implementation Matters in Deep RL: A Case Study on PPO and TRPO, https://openreview.net/forum?id=r1etN1rtPB

Page 33:

References

● Trust Region Policy Optimization (https://arxiv.org/pdf/1502.05477.pdf)
● Constrained Policy Optimization, Achiam et al. 2017 (https://arxiv.org/pdf/1705.10528.pdf)
● Proximal Policy Optimization Algorithms (https://arxiv.org/pdf/1707.06347.pdf)
● Spinning Up in Deep RL (https://spinningup.openai.com/en/latest/index.html)
● UC Berkeley Deep RL course (http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf)
● Deep RL Bootcamp (https://sites.google.com/view/deep-rl-bootcamp/lectures)
● Deep RL Series (https://medium.com/@jonathan_hui/rl-trust-region-policy-optimization-trpo-part-2-f51e3b2e373a)
● https://towardsdatascience.com/the-pursuit-of-robotic-happiness-how-trpo-and-ppo-stabilize-policy-gradient-methods-545784094e3b

Page 34:

Questions?