
Bridging the Gap Between Value and Policy Based Reinforcement Learning
Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans

Topic: Q-Value Based RL
Presenter: Michael Pham-Hung

Motivation

Value Based RL (i.e. Q-Learning)
+ Data efficient
+ Learn from any trajectory

Policy Based RL (i.e. REINFORCE)
+ Stable with deep function approximators

?????

Profit.

Contributions

Problem: Combining the advantages of on-policy and off-policy learning.

Why is this problem important?
- Model-free RL with deep function approximators seems like a good idea.

Why is this problem hard?
- Value-based learning is not always stable with deep function approximators.

Limitations of prior work:
- Prior work remains potentially unstable and is not generalizable.

Key Insight: Starting from first principles rather than applying naïve approaches can be more rewarding.

Revealed: Results in a flexible algorithm.

Outline

Background

- Q-Learning Formulation

- Softmax Temporal Consistency

- Consistency between optimal value and policy

PCL Algorithm

- Basic PCL

- Unified PCL

Results

Limitations

Q-Learning Formulation

- The optimal policy is a one-hot distribution over actions.
- The optimal values satisfy the hard-max Bellman temporal consistency!
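The equations on these slides were images and did not survive extraction; as a sketch, the hard-max consistency being referenced is the standard one, assuming deterministic dynamics s' = f(s, a):

V^*(s) = \max_{a} \big[ r(s, a) + \gamma V^*(s') \big],
\qquad
\pi^*(a \mid s) = \mathbb{1}\!\left[ a = \operatorname*{arg\,max}_{a'} \; r(s, a') + \gamma V^*\!\big(f(s, a')\big) \right]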

Soft-max Temporal Consistency

- Augment the standard expected reward objective with a discounted entropy regularizer.

- This encourages exploration and helps prevent early convergence to sub-optimal policies.

The optimal policy takes the form of a Boltzmann distribution: no longer a one-hot distribution! The entropy term prefers policies with more uncertainty.

Note the log-sum-exp form of the optimal value!
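The formulas for this slide were also images; a sketch of the entropy-regularized objective and the resulting soft-max consistency, assuming deterministic dynamics s' = f(s, a) and temperature \tau:

O_{ENT}(s, \pi) = \sum_{a} \pi(a \mid s) \big[ r(s, a) - \tau \log \pi(a \mid s) + \gamma \, O_{ENT}(s', \pi) \big]

V^*(s) = \tau \log \sum_{a} \exp\!\big( (r(s, a) + \gamma V^*(s')) / \tau \big)
\qquad \text{(log-sum-exp)}

\pi^*(a \mid s) \propto \exp\!\big( (r(s, a) + \gamma V^*(s')) / \tau \big)
\qquad \text{(Boltzmann distribution)}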

Consistency Between Optimal Value & Policy

- The optimal value V*(s) plays the role of the normalization factor of the Boltzmann policy.
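The key relation on these slides was an image; as a sketch, combining the Boltzmann policy with its normalization V^*(s) gives the single-step and multi-step path consistencies:

V^*(s) - \gamma V^*(s') = r(s, a) - \tau \log \pi^*(a \mid s)
\qquad \text{for every } (s, a), \; s' = f(s, a)

V^*(s_1) - \gamma^{t-1} V^*(s_t)
= \sum_{i=1}^{t-1} \gamma^{i-1} \big( r(s_i, a_i) - \tau \log \pi^*(a_i \mid s_i) \big)
\qquad \text{along any path } s_1, a_1, \ldots, s_t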

Algorithm - Path Consistency Learning (PCL)

- Train a parameterized policy and value function to minimize the squared path inconsistency over sub-trajectories, sampled both on-policy and from a replay buffer (see the sketch below).

Algorithm - Unified PCL

- Use a single model for both the policy and the values, rather than two separate networks.
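A minimal sketch of the PCL consistency loss and the Unified PCL parameterization in Python/NumPy; the function names (pcl_consistency_error, pcl_loss, unified_values_and_policy) and the batch format are illustrative assumptions, not the authors' code:

import numpy as np

def pcl_consistency_error(values, log_pi, rewards, gamma, tau):
    """Path consistency error C for one sub-trajectory s_0, a_0, ..., s_d.
    values:  shape (d+1,), V(s_0) ... V(s_d)
    log_pi:  shape (d,),   log pi(a_i | s_i) for the actions actually taken
    rewards: shape (d,),   r(s_i, a_i)
    """
    d = len(rewards)
    discounts = gamma ** np.arange(d)
    # C = -V(s_0) + gamma^d V(s_d) + sum_i gamma^i (r_i - tau * log pi_i)
    return (-values[0]
            + gamma ** d * values[-1]
            + np.sum(discounts * (rewards - tau * log_pi)))

def pcl_loss(batch, gamma=0.99, tau=0.1):
    """Mean squared consistency error over a batch of sub-trajectories,
    drawn both on-policy and from a replay buffer."""
    errors = [pcl_consistency_error(v, lp, r, gamma, tau) for v, lp, r in batch]
    return 0.5 * np.mean(np.square(errors))

def unified_values_and_policy(q_values, tau=0.1):
    """Unified PCL: a single Q model yields both V(s) and log pi(.|s).
    q_values: shape (num_actions,) for a single state."""
    v = tau * np.log(np.sum(np.exp(q_values / tau)))   # soft-max (log-sum-exp) value
    log_pi = (q_values - v) / tau                      # Boltzmann log-policy
    return v, log_pi

In the paper the policy and value parameters are updated from the consistency error with their own learning rates; descending the squared-error form above pushes the parameters in essentially the same direction and is shown here only as a compact sketch.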

Experimental Results

PCL can consistently match or beat the performance of A3C and double Q-learning.

PCL and Unified PCL can easily incorporate expert trajectories, and expert trajectories can be prioritized in the replay buffer as well.

The results of PCL against A3C and DQN baselines. Each plot shows average reward across 5 random training runs (10 for Synthetic Tree) after choosing the best hyperparameters. A single standard deviation bar, clipped at the min and max, is shown. The x-axis is the number of training iterations. PCL exhibits comparable performance to A3C on some tasks, but clearly outperforms A3C on the more challenging tasks. Across all tasks, the performance of DQN is worse than that of PCL.

The results of PCL vs. Unified PCL. Overall, using a single model for both values and policy is not detrimental to training. Although on some of the simpler tasks PCL has an edge over Unified PCL, on the more difficult tasks Unified PCL performs better.

The results of PCL vs. PCL augmented with a small number of expert trajectories on the hardest algorithmic tasks. We find that incorporating expert trajectories greatly improves performance.

Discussion of results

- Using a single model for both values and policy is not detrimental to training.

- The ability of PCL to incorporate expert trajectories without requiring adjustment or correction is a desirable property in real-world applications.

Critique / Limitations / Open Issues

- Only implemented on simple tasks. Addressed with Trust-PCL, which enables a continuous action space.

- Requires small learning rates. Addressed with Trust-PCL, which uses trust regions.

- The proof was given for deterministic dynamics, but the approach works for stochastic dynamics as well.

Contributions

Problem: Combining the advantages of on-policy and off-policy learning.

Why is this problem important?
- Model-free RL with deep function approximators seems like a good idea.

Why is this problem hard?
- Value-based learning is not always stable with deep function approximators.

Limitations of prior work:
- Prior work remains potentially unstable and is not generalizable.

Key Insight: Starting from a theoretical approach rather than naïve approaches can be more fruitful.

Revealed: Results in a quite flexible algorithm.

Exercise Questions
