CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master students)


CSC2547 Presentation: Curiosity-driven exploration

Count-based VS Info gain-based

Sheng Jia, Tinglin Duan (First year master students)


1. PLAN (2011)

2. VIME (NeurIPS 2016)

3. CTS (NeurIPS 2016)


Outline

Motivation, Related Works and Demo

Unifying Count-Based Exploration and Intrinsic Motivation

Comparisons and Discussion

Planning to Be Surprised

Variational Information Maximizing Exploration


Outline

Unifying Count-Based Exploration and Intrinsic Motivation

Comparisons and Discussion

Planning to Be Surprised

Variational Information Maximizing Exploration

Motivation, Related Works and Demo


Background: RL + Curiosity

Next state

Extrinsic reward

Intrinsic reward/exploration bonus

action

History:


What is exploration?

Intrinsic motivation: [VIME] [PLAN]

- Reducing the agent’s uncertainty over the environment’s dynamics.

Count-based: [CTS]

- Use (pseudo) visitation counts to guide agents to unvisited states.


Why is exploration useful? Demo (our original plot & demo)

[Figure: intrinsic reward function. X-axis: states s1, s2, s3, ..., sT; Y-axis: training timestamp; Z-axis: intrinsic reward]

Sparse Reward Problem: Montezuma’s Revenge

[Figure panels: DQN; DQN + exploration bonus]


Related work (Timeline)

- 2010: Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). The notion of intrinsic motivation.
- 2011: PLAN. Bayesian optimal exploration.
- 2015: Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models. L2 prediction error using neural networks.
- 2016: VIME. Approximate “PLAN”.
- 2016: CTS. Pseudocount exploration.
- 2017: Count-Based Exploration with Neural Density Models. Pseudocount + PixelCNN.
- 2018: Exploration by Random Network Distillation. Distillation error as a quantification of uncertainty.
- 2019: On Bonus Based Exploration Methods In The Arcade Learning Environment. “Pseudocount in 2016 still achieves SOTA for Montezuma’s Revenge”.


Outline

Unifying Count-Based Exploration and Intrinsic Motivation

Comparisons and Discussion

Motivation, Related Works and Demo

Variational Information Maximizing Exploration

Planning to Be Surprised


[PLAN] contribution

Dynamics model

Bayes update for posterior distribution of the dynamics model

Optimal Bayesian Exploration based on:

Expected cumulative info gain for tau steps if performing this action

= Expected one-step info gain

+ Expected cumulative info gain for tau-1 steps if performing this next action
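
The three labels above read as a recursion. A sketch in LaTeX, assuming notation along the lines of the PLAN paper as we recall it (history $h$, action $a$, next state $s'$, expected one-step gain $\bar{g}$, and $\tau$-step curious Q-value $q_\tau$; none of these symbols survive on the extracted slide, and the max over the next action assumes the optimal exploration policy is followed):

```latex
q_{\tau}(h, a) \;=\; \bar{g}(a \mid h) \;+\; \mathbb{E}_{s' \mid h, a}\Big[\max_{a'} q_{\tau-1}(h a s', a')\Big],
\qquad q_{0}(h, a) = 0 .
```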


[PLAN] Quantify “surprise” with info gain

[Figure: density p over model parameters 𝜃]
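
A minimal sketch of "surprise as info gain", assuming the gain is the KL divergence from the prior over model parameters to the posterior after one observation. To keep it runnable with the standard library only, the parameter space is a finite set of candidate models; that restriction and all function names are our own illustration, not something taken from the slides.

```python
import math

def bayes_update(prior, likelihoods):
    """Posterior over a finite set of candidate models after one observation.

    prior[i]       = p(theta_i | h)
    likelihoods[i] = p(observation | theta_i)
    """
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def kl(p, q):
    """KL(p || q) for discrete distributions (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def info_gain(prior, likelihoods):
    """Information gained from one observation: KL(posterior || prior)."""
    return kl(bayes_update(prior, likelihoods), prior)

prior = [0.5, 0.5]
surprising = info_gain(prior, [0.9, 0.1])     # models disagree about the observation
uninformative = info_gain(prior, [0.5, 0.5])  # models agree, nothing is learned
```

An observation that both candidate models explain equally well leaves the posterior unchanged and yields zero gain; the more the models disagree, the larger the gain.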


[PLAN] 1-step expected information gain

NOTE: VIME uses this as the Intrinsic reward!

“1-step expected info gain” “expected immediate info gain”

“Mutual info between next state distribution & model parameter”
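
The equivalence stated on this slide can be checked numerically. The sketch below assumes a finite set of candidate dynamics models (our simplification) and compares the expected one-step info gain, computed as the predictive-weighted average of KL(posterior || prior), with the mutual information between the next state and the model parameter. Function names are hypothetical.

```python
import math

def expected_info_gain(prior, next_state_probs):
    """Expected KL(posterior || prior) over the next state.

    prior[i]               = p(theta_i | h)
    next_state_probs[i][s] = p(s | h, a, theta_i)
    """
    n_states = len(next_state_probs[0])
    total = 0.0
    for s in range(n_states):
        joint = [p * probs[s] for p, probs in zip(prior, next_state_probs)]
        predictive = sum(joint)  # p(s | h, a)
        if predictive == 0.0:
            continue
        posterior = [j / predictive for j in joint]
        gain = sum(q * math.log(q / p) for q, p in zip(posterior, prior) if q > 0)
        total += predictive * gain
    return total

def mutual_information(prior, next_state_probs):
    """I(S'; Theta) = sum_{i,s} p(theta_i) p(s|theta_i) log(p(s|theta_i) / p(s))."""
    n_states = len(next_state_probs[0])
    predictive = [sum(p * probs[s] for p, probs in zip(prior, next_state_probs))
                  for s in range(n_states)]
    total = 0.0
    for p, probs in zip(prior, next_state_probs):
        for s in range(n_states):
            if p > 0 and probs[s] > 0:
                total += p * probs[s] * math.log(probs[s] / predictive[s])
    return total

prior = [0.5, 0.5]
models = [[0.9, 0.1], [0.4, 0.6]]
gain = expected_info_gain(prior, models)
mi = mutual_information(prior, models)
```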


[PLAN] “Planning to be surprised”

Curious Q-value

Perform an action

Follow a policy

“Planning tau steps” because the outcomes are not actually observed yet

Cumulative steps info gain


[PLAN] Optimal Bayesian Exploration policy

[Method1] Computing optimal curiosity-Q backwards for tau steps

[Method2] Policy Iteration

Repeatedly apply:

Policy evaluation

Policy improvement
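
Method 1 can be sketched as a backward recursion over the planning horizon. The code below is a toy under stated assumptions: the dynamics posterior is over a finite set of candidate models (a simplification we introduce so the example runs with the standard library), and `curious_q` is a hypothetical name. It is not the authors' implementation.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def curious_q(posterior, models, state, action, tau):
    """tau-step curious Q-value computed by backward recursion.

    posterior[i]             = p(theta_i | h)
    models[i][state][action] = next-state probabilities under theta_i
    """
    if tau == 0:
        return 0.0
    n_states = len(models[0])
    n_actions = len(models[0][0])
    value = 0.0
    for nxt in range(n_states):
        joint = [p * m[state][action][nxt] for p, m in zip(posterior, models)]
        pred = sum(joint)  # predictive probability of the next state
        if pred == 0.0:
            continue
        new_post = [j / pred for j in joint]  # Bayes update for this outcome
        future = max(curious_q(new_post, models, nxt, a, tau - 1)
                     for a in range(n_actions))
        value += pred * (kl(new_post, posterior) + future)
    return value

# Two states, two actions. Action 0 is uninformative (both models agree),
# action 1 separates the two candidate models.
theta0 = [[[0.5, 0.5], [0.9, 0.1]], [[0.5, 0.5], [0.9, 0.1]]]
theta1 = [[[0.5, 0.5], [0.1, 0.9]], [[0.5, 0.5], [0.1, 0.9]]]
models = [theta0, theta1]
prior = [0.5, 0.5]

q1 = [curious_q(prior, models, 0, a, 1) for a in range(2)]
q2 = [curious_q(prior, models, 0, a, 2) for a in range(2)]
best_action = max(range(2), key=lambda a: q2[a])
```

With a one-step horizon the value reduces to the expected one-step info gain; with two steps, the uninformative action still has positive value because the agent can take the informative action afterwards.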


[PLAN] Non-triviality of curious Q-value

Cumulative information gain fluctuates!

Cumulative != Sum

Info gain additive in expectation!
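
A sketch of why the gain is additive only in expectation, assuming nested histories $h \subset h' \subset h''$ and writing $g(h' \,\|\, h) = D_{\mathrm{KL}}\big(p(\theta \mid h') \,\|\, p(\theta \mid h)\big)$ (our notation; the slide's own equations did not survive extraction):

```latex
\begin{aligned}
g(h'' \,\|\, h)
  &= \int p(\theta \mid h'') \log \frac{p(\theta \mid h'')}{p(\theta \mid h)} \, d\theta \\
  &= g(h'' \,\|\, h') + \int p(\theta \mid h'') \log \frac{p(\theta \mid h')}{p(\theta \mid h)} \, d\theta , \\[4pt]
\mathbb{E}_{h'' \mid h'}\big[ g(h'' \,\|\, h) \big]
  &= \mathbb{E}_{h'' \mid h'}\big[ g(h'' \,\|\, h') \big] + g(h' \,\|\, h) .
\end{aligned}
```

The last line uses $\mathbb{E}_{h'' \mid h'}\big[p(\theta \mid h'')\big] = p(\theta \mid h')$. For a single realized $h''$ the integral in the second line is generally not equal to $g(h' \,\|\, h)$, which is why the cumulative gain is not the plain sum of the step gains.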


[PLAN] Results

- Random
- Greedy w.r.t. expected one-step info gain
- Policy iteration (dynamic programming approximation to optimal Bayesian exploration)
- Q-learning using one-step info gain

[Figure: environment with 50 states]


[Plan] Results


Outline

Unifying Count-Based Exploration and Intrinsic Motivation

Comparisons and Discussion

Motivation, Related Works and Demo

Planning to Be Surprised

Variational Information Maximizing Exploration


[VIME] contribution

Dynamics model

Variational inference for posterior distribution of dynamics model

1-step exploration bonus


[VIME] Quantify the information gained

Reminder: PLAN cumulative info gain


[VIME] Variational Bayes

What’s hard?

Computing posterior for highly parameterized models (e.g. neural networks)

Minimize negative ELBO

Approximate posterior by minimizing
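
The quantity being minimized is not legible in the extracted slide. A sketch of the standard variational-Bayes identity that links the two statements, assuming a variational family $q(\theta; \phi)$, prior $p(\theta)$ and data $\mathcal{D}$ (our notation):

```latex
\begin{aligned}
D_{\mathrm{KL}}\big[ q(\theta; \phi) \,\|\, p(\theta \mid \mathcal{D}) \big]
  &= \underbrace{ D_{\mathrm{KL}}\big[ q(\theta; \phi) \,\|\, p(\theta) \big]
     - \mathbb{E}_{\theta \sim q(\cdot; \phi)}\big[ \log p(\mathcal{D} \mid \theta) \big] }_{\text{negative ELBO}}
     + \log p(\mathcal{D}) .
\end{aligned}
```

Since $\log p(\mathcal{D})$ does not depend on $\phi$, minimizing the KL divergence to the true posterior is equivalent to minimizing the negative ELBO.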


[VIME] Optimization for variational bayes

How to minimize negative ELBO?

Take an efficient single second-order (Newton) update step to minimize negative ELBO:
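
A hedged sketch of what such a single second-order step buys. It assumes a fully factorized Gaussian posterior with standard deviation sigma = log(1 + exp(rho)) and a diagonal Hessian of the KL term, which is how we recall the VIME paper setting this up; the function names and the step size are ours. Under those assumptions, the KL divergence between the updated and the old posterior is approximated by 0.5 * step^2 * grad^T H^{-1} grad, with no optimization loop required.

```python
import math

def softplus(rho):
    """sigma = log(1 + exp(rho)), the assumed parameterization of the std-dev."""
    return math.log1p(math.exp(rho))

def kl_hessian_diag(rho):
    """Diagonal Hessian entries of KL(q' || q) at q' = q, for (mu, rho)."""
    sigma = softplus(rho)
    dsigma_drho = math.exp(rho) / (1.0 + math.exp(rho))
    return 1.0 / sigma ** 2, 2.0 * dsigma_drho ** 2 / sigma ** 2

def approx_info_gain(grad_mu, grad_rho, rho, step):
    """Second-order approximation 0.5 * step^2 * g^T H^{-1} g of the KL."""
    total = 0.0
    for g_mu, g_rho, r in zip(grad_mu, grad_rho, rho):
        h_mu, h_rho = kl_hessian_diag(r)
        total += g_mu ** 2 / h_mu + g_rho ** 2 / h_rho
    return 0.5 * step ** 2 * total

def exact_kl_mean_shift(delta_mu, sigma):
    """Exact KL between N(mu + delta_mu, sigma^2) and N(mu, sigma^2)."""
    return delta_mu ** 2 / (2.0 * sigma ** 2)

# One parameter, gradient only on the mean: the approximation is exact here.
rho, g_mu, step = 0.3, 2.0, 0.01
sigma = softplus(rho)
delta_mu = step * g_mu * sigma ** 2  # Newton step: step * H^{-1} * gradient
approx = approx_info_gain([g_mu], [0.0], [rho], step)
exact = exact_kl_mean_shift(delta_mu, sigma)
```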


[VIME] Estimate 1-step expected info gain

What’s hard?

Computing the exact one-step expected info gain with high-dimensional states.

→ Monte Carlo estimation.
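
To illustrate the idea of replacing an exact expectation with samples, here is a toy sketch. It assumes a finite set of candidate models with a small discrete next-state space, so the exact value is available for comparison; VIME itself works with a variational posterior over neural-network weights, which this example does not attempt to reproduce. All names are hypothetical.

```python
import math
import random

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def posterior_given(prior, models, s):
    """Bayes update of the model posterior after observing next state s."""
    unnorm = [p * m[s] for p, m in zip(prior, models)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def exact_expected_gain(prior, models):
    """Sum over all next states, weighted by the predictive distribution."""
    total = 0.0
    for s in range(len(models[0])):
        pred = sum(p * m[s] for p, m in zip(prior, models))
        if pred > 0:
            total += pred * kl(posterior_given(prior, models, s), prior)
    return total

def monte_carlo_expected_gain(prior, models, n_samples, rng):
    """Average the gain over next states sampled from the predictive."""
    total = 0.0
    for _ in range(n_samples):
        i = rng.choices(range(len(prior)), weights=prior)[0]          # theta ~ p(theta | h)
        s = rng.choices(range(len(models[i])), weights=models[i])[0]  # s' ~ p(s' | theta)
        total += kl(posterior_given(prior, models, s), prior)
    return total / n_samples

prior = [0.5, 0.5]
models = [[0.9, 0.1], [0.4, 0.6]]
exact = exact_expected_gain(prior, models)
estimate = monte_carlo_expected_gain(prior, models, 20000, random.Random(0))
```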


[VIME] Results (Walker-2D) Average extrinsic return

Dense reward

RL algorithm: TRPO


[VIME] Results (Swimmer-Gather) Average extrinsic return

Sparse reward

RL algorithm: TRPO


Outline

Variational Information Maximizing Exploration

Comparisons and Discussion

Motivation, Related Works and Demo

Planning to Be Surprised

Unifying Count-Based Exploration and Intrinsic Motivation


[CTS] contribution

State density model

Pseudo-count

1-step exploration bonus


[CTS] Count state visitation

Empirical distribution

These two are different states!

But we want to increment visitation counts for both when visiting either one.

Pixel difference

Empirical count


[CTS] Introduce state density model

[Figure: two plots of density p over states s (s1, s2), labelled x = s1]


How to update the CTS density model?

Check the “Context Tree Switching” paper! https://arxiv.org/abs/1111.3182

This was a difficulty in reading the paper, as it only shows a Bayes-rule update for a mixture of density models (e.g. CTS).

Remark: For the PixelCNN density model in “Count-Based Exploration with Neural Density Models”, just backprop.


[CTS] Derive pseudo-count from density model

Two constraints:

Linear system

Solve linear system

Pseudo-count derived!
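
The two constraints are not legible in the extracted slide. As a sketch, assume they are the ones we recall from the pseudo-count paper: the density before observing x satisfies rho = N / n, and the "recoding" density after observing x once more satisfies rho' = (N + 1) / (n + 1), where N is the pseudo-count of x and n the pseudo-count total. Solving these two equations for N gives N = rho * (1 - rho') / (rho' - rho). The function name below is ours.

```python
def pseudo_count(rho, rho_prime):
    """Solve rho = N/n and rho_prime = (N+1)/(n+1) for the pseudo-count N.

    rho       : density of x under the model before observing x
    rho_prime : density of x under the model after observing x once more
    """
    if not rho_prime > rho:
        raise ValueError("the density of x must increase after observing x")
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

# Sanity check with an empirical-count density model: x seen 3 times out of 10.
count, total = 3, 10
rho = count / total
rho_prime = (count + 1) / (total + 1)
n_hat = pseudo_count(rho, rho_prime)  # recovers the true count
total_hat = n_hat / rho               # recovers the true total
```

When the density model is the empirical distribution, the pseudo-count coincides with the real visitation count, which is the property that makes the definition a sensible generalization.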


[CTS] Results (Montezuma’s Revenge)

State: 84x84x4

# Actions: 18

RL algorithm: Double DQN


Summary, Comparisons and Discussion

Outline

Variational Information Maximizing Exploration

Unifying Count-Based Exploration and Intrinsic Motivation

Motivation, Related Works and Demo

Planning to Be Surprised


Deriving posterior dynamics model / density model

- PLAN: Bayes rule
- VIME: Variational inference
- CTS: Bayes rule


Derive exploratory policy

[VIME] 1-step Information gain

[CTS] Pseudo-count

Policy trained with the reward augmented by intrinsic reward.

[PLAN] Directly argmax(curiosity Q)
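
The three ways of turning curiosity into behaviour can be sketched side by side. The bonus form beta / sqrt(N + 0.01) and the weights `eta` and `beta` are assumptions based on our recollection of the VIME and pseudo-count papers, not values shown on the slides; the function names are hypothetical.

```python
import math

def vime_reward(extrinsic, info_gain, eta=0.1):
    """VIME-style reward: extrinsic reward plus weighted 1-step info gain."""
    return extrinsic + eta * info_gain

def count_bonus(pseudo_count, beta=0.05, eps=0.01):
    """Count-based bonus that shrinks as a state is visited more often."""
    return beta / math.sqrt(pseudo_count + eps)

def cts_reward(extrinsic, pseudo_count, beta=0.05):
    """CTS-style reward: extrinsic reward plus pseudo-count bonus."""
    return extrinsic + count_bonus(pseudo_count, beta)

def plan_action(curiosity_q):
    """PLAN-style control: act greedily on the curiosity Q-values directly."""
    return max(curiosity_q, key=curiosity_q.get)

novel = cts_reward(0.0, pseudo_count=0.0)       # never-seen state, large bonus
familiar = cts_reward(0.0, pseudo_count=100.0)  # well-visited state, small bonus
chosen = plan_action({"left": 0.12, "right": 0.37})
```

For VIME and CTS the bonus only reshapes the reward that a standard RL algorithm (TRPO and Double DQN in the results above) is trained on, whereas PLAN selects actions from the curiosity values themselves.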


Pseudo-count VS Intrinsic Motivation

Mixture model

“Unifying Count-Based Exploration and Intrinsic Motivation”!


Limitations & Future Directions

PLAN

→ Intractable posterior & use of the dynamics model for the expectation. Difficult to scale beyond tabular RL.

VIME

→ Currently maximizes the sum of 1-step info gains.

CTS

→ Which density model leads to better generalization over states?

Learning rates of the policy network VS updating the dynamics model / density model.


Thank you!

(Appendix)


Our derivation for “Additive in expectation”

h’’ contains h’
