Unsupervised Methods For Subgoal Discovery During Intrinsic Motivation in Model-Free Hierarchical Reinforcement Learning

Jacob Rafati, University of California, Merced (co-authored with David C. Noelle)

Slides: rafati.net/slides/KEG-2019-slides.pdf · 2019-11-26


Page 1:

Unsupervised Methods For Subgoal Discovery During Intrinsic Motivation in Model-Free Hierarchical Reinforcement Learning

Jacob Rafati (http://rafati.net)
Ph.D. Candidate, Electrical Engineering and Computer Science
Computational Cognitive Neuroscience Laboratory, University of California, Merced

Co-authored with: David C. Noelle

Page 2:

Games

Page 3:

Goals & Rules

• "Key components of games are goals, rules, challenge, and interaction. Games generally involve mental or physical stimulation, and often both." (https://en.wikipedia.org/wiki/Game)

Page 4:

Reinforcement Learning

Reinforcement learning (RL) is learning how to map situations (states) to actions so as to maximize the numerical reward signal received through the experiences an artificial agent has as it interacts with its environment (Sutton and Barto, 2017).

experience: e_t = {s_t, a_t, s_{t+1}, r_{t+1}}

Objective: learn a policy π : S → A that maximizes cumulative rewards.
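The interaction loop behind this definition can be sketched in a few lines of Python. The 5-state chain environment, the random policy, and the function names below are hypothetical stand-ins (not from the talk), just to show how experience tuples e_t = (s_t, a_t, s_{t+1}, r_{t+1}) accumulate:

```python
import random

def step(state, action):
    """Toy environment dynamics: move left/right on a 5-state chain; +1 reward at state 4."""
    next_state = min(max(state + action, 0), 4)
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

def collect_episode(policy, max_steps=50):
    """Roll out one episode, recording experiences e_t = (s_t, a_t, s_{t+1}, r_{t+1})."""
    state, experiences = 0, []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward = step(state, action)
        experiences.append((state, action, next_state, reward))
        state = next_state
        if reward > 0:  # terminal: rewarding state reached
            break
    return experiences

random.seed(0)
episode = collect_episode(lambda s: random.choice([-1, 1]))  # random policy
```

The agent's only information about the task is this stream of experience tuples; everything that follows (returns, value functions, Q-learning) is computed from them.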

Page 5:

Super-Human Success (Mnih et al., 2015)

Page 6:

Failure in a complex task (Mnih et al., 2015)

Page 7:

Learning Representations in Hierarchical Reinforcement Learning

• Trade-off between exploration and exploitation in an environment with sparse feedback is a major challenge.

• Learning to operate over different levels of temporal abstraction is an important open problem in reinforcement learning.

• Exploring the state-space while learning reusable skills through intrinsic motivation.

• Discovering useful subgoals in large-scale hierarchical reinforcement learning is a major open problem.

Page 8:

Return

Return is the cumulative sum of received rewards:

G_t = Σ_{t′=t+1}^{T} γ^(t′−t−1) r_{t′}

where γ ∈ [0, 1] is the discount factor.

[Diagram: a trajectory s_0, …, s_{t−1}, s_t, s_{t+1}, …, s_T with actions a_{t−1}, a_t and rewards r_{t−1}, r_t, r_{t+1}]
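As a quick check of the formula, the return can be computed from a reward sequence by backward accumulation (a standard trick; the function name is my own):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum over t' = t+1..T of gamma^(t'-t-1) * r_{t'},
    given the reward sequence [r_{t+1}, ..., r_T]."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode
        g = r + gamma * g
    return g
```

For example, discounted_return([0, 0, 1], 0.5) gives 0.25: only the third reward contributes, discounted by γ² = 0.25.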

Page 9:

Policy Function

• Policy function: at each time step the agent implements a mapping from states to possible actions, π : S → A.
• Objective: find an optimal policy that maximizes the cumulative reward:

π* = argmax_π E[G_t | S_t = s], ∀ s ∈ S

Page 10:

Q-Function

• The state-action value function is the expected return when starting from (s, a) and following policy π thereafter:

Q^π : S × A → ℝ
Q^π(s, a) = E_π[G_t | S_t = s, A_t = a]

Page 11:

Temporal Difference

• A model-free reinforcement learning algorithm.
• State-transition probabilities and the reward function are not available.
• A powerful computational cognitive neuroscience model of learning in the brain.
• A combination of the Monte Carlo method and dynamic programming.

Q-learning update:

Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]

Q(s, a) → prediction of the return
r + γ max_{a′} Q(s′, a′) → target value
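The tabular form of this update is a one-liner. The sketch below is my own minimal version (not the talk's implementation), using a defaultdict as a zero-initialized Q-table:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)  # target value
    Q[(s, a)] += alpha * (target - Q[(s, a)])                    # move prediction toward target
    return Q[(s, a)]

Q = defaultdict(float)  # Q-table, zero-initialized
new_value = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

With an all-zero table, the target is 1.0 and the updated value is α · 1.0 = 0.1; repeated updates propagate reward information backward through the state space.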

Page 12:

State Function Approximator

[Diagram: a neural network with weights w maps the state s to state-action values q(s, a_i; w), one output per action, providing generalization across states]

Q(s, a) ≈ q(s, a; w)

Page 13:

Deep RL

Minimize the loss over the network weights:

w = argmin_w L(w)

L(w) = E_{(s,a,r,s′)∼D} [ (r + γ max_{a′} q(s′, a′; w⁻) − q(s, a; w))² ]

D = {e_t | t = 0, …, T} → experience replay memory

Stochastic gradient descent: w ← w − α ∇_w L(w)
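A minimal NumPy sketch of this loss, assuming (purely for illustration) a linear Q-function q(s, a; w) = (W s)_a; in the actual work w parameterizes a deep network, and w⁻ is the periodically frozen target-network copy of the weights:

```python
import numpy as np

def q_values(s, W):
    """Action values for state s under a linear Q-function (illustrative stand-in)."""
    return W @ s  # one entry per action

def dqn_loss(batch, W, W_target, gamma=0.99):
    """Mean squared TD error over a replay batch sampled from D."""
    errors = []
    for s, a, r, s_next in batch:
        target = r + gamma * np.max(q_values(s_next, W_target))  # uses frozen w^-
        errors.append((target - q_values(s, W)[a]) ** 2)
    return float(np.mean(errors))

# One transition with all-zero weights: target = r = 1.0, prediction = 0, loss = 1.0
batch = [(np.ones(3), 0, 1.0, np.ones(3))]
loss = dqn_loss(batch, np.zeros((2, 3)), np.zeros((2, 3)))
```

Sampling the batch uniformly from the replay memory decorrelates consecutive experiences, which is what makes the gradient steps on this loss stable.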

Page 14:

Q-Learning with experience replay memory

Page 15:

Failure: Sparse feedback

Subgoals (Botvinick et al., 2009)

Page 16:

Hierarchy in Human Behavior & Brain Structure

[Diagram: a complex task decomposes into simple tasks; major goals → minor goals → actions]

Page 17:

Hierarchical Reinforcement Learning Subproblems

• Subproblem 1: Learning a meta-policy to choose a subgoal

• Subproblem 2: Developing skills through intrinsic motivation

• Subproblem 3: Subgoal discovery

Page 18:

Meta-controller/Controller Framework (Kulkarni et al., 2016)

[Diagram: the meta-controller observes the state s_t and selects a subgoal g_t; the controller observes (s_t, g_t) and selects actions a_t in the environment; the environment returns s_{t+1} and extrinsic reward r_{t+1}; an internal critic evaluates subgoal attainment and provides intrinsic reward r̃_{t+1} to the controller]

Page 19:

Subproblem 1: Temporal Abstraction

Page 20:

Rooms Task

[Figure: a four-room grid world, Room 1 through Room 4]

Page 21:

Subproblem 2. Developing skills through Intrinsic Motivation

Page 22:

State-Goal Q Function

[Diagram: the state s_t is encoded by a distributed conjunctive representation and the subgoal g_t by a Gaussian representation; weighted gates combine the two, followed by fully connected layers, to output q(s_t, g_t, a; w)]

Page 23:

Reusing the skills

[Figure: the four-room grid world (Rooms 1–4) with a Key and a Lock]

Page 24:

Grid World Task with Key and Door: Reusing the skills

[Figure: the four-room grid world (Rooms 1–4) with the Key and the Lock; axes x, y ∈ [0, 1]]

Page 25:

Grid World Task with Key and Door: Reusing the skills

[Figure: the four-room grid world (Rooms 1–4) with the Key and a Box; axes x, y ∈ [0, 1]]

Page 26:

Grid World Task with Key and Door: Reusing the skills

[Figure: the four-room grid world (Rooms 1–4) with the Key and the Lock; axes x, y ∈ [0, 1]]

Page 27:

Grid World Task with Key and Door: Reusing the skills

[Figure: the four-room grid world (Rooms 1–4) with the Key and the Lock; axes x, y ∈ [0, 1]]

Page 28:

Grid World Task with Key and Door: Reusing the skills

[Figure: the four-room grid world (Rooms 1–4) with the Key and the Lock; axes x, y ∈ [0, 1]]

Page 29:

Subproblem 3. Subgoal Discovery

Finding a proper subgoal set G (Şimşek et al., 2005; Goel and Huber, 2003; Machado et al., 2017).

Page 30:

Subproblem 3. Subgoal Discovery

• Purpose: discovering promising states to pursue, i.e. finding the subgoal set G.
• Implementing a subgoal discovery algorithm for large-scale model-free reinforcement learning problems.
• No access to the MDP model (state-transition probabilities, the environment reward function, or the full state space).

Page 31:

Subproblem 3. Candidate Subgoals

A state is a good subgoal candidate if:

• It is close (in terms of actions) to a rewarding state.
• It represents a set of states, at least some of which tend to be along a state-transition path to a rewarding state.

Page 32:

Subproblem 3. Subgoal Discovery

• Unsupervised learning (clustering) on the limited past experience memory collected during intrinsic motivation

• Centroids of clusters are useful subgoals (e.g. rooms)

• Detecting outliers as potential subgoals (e.g. key, box)

• Boundary of two clusters can lead to subgoals (e.g. doorway between rooms)
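The centroid and outlier ideas can be sketched with a tiny k-means over (x, y) state coordinates drawn from the replay memory. Both function names and the outlier rule (distance more than two standard deviations above the mean) are my own illustrative choices, not the talk's exact algorithm:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float).copy()
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = np.argmin(np.linalg.norm(points[:, None] - centroids[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

def discover_subgoals(points, k, outlier_std=2.0):
    """Centroids are subgoal candidates (e.g. rooms); far-from-centroid
    outliers are additional candidates (e.g. a key or box)."""
    centroids, labels = kmeans(points, k)
    dists = np.linalg.norm(points - centroids[labels], axis=1)
    outliers = points[dists > dists.mean() + outlier_std * dists.std()]
    return centroids, outliers
```

On well-separated visited states the centroids land near the room centers, and anomalous states far from every centroid surface as extra candidates.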

Page 33:

Unsupervised Subgoal Discovery

Page 34:

Unsupervised Subgoal Discovery

Page 35:

Unification of Hierarchical Reinforcement Learning Subproblems

• Implementing a hierarchical reinforcement learning framework that makes it possible to simultaneously perform subgoal discovery, learn appropriate intrinsic motivation, and succeed at meta-policy learning.
• The unifying element is the shared experience replay memory D.
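A toy sketch of how the three subproblems share one replay memory D. Everything here (the scripted controller, the random meta-policy, the visited-state "discovery" step) is a hypothetical stand-in for the learned deep Q-networks and the clustering method of the actual framework:

```python
import random

def unified_hrl(num_episodes=6, discover_every=2, seed=0):
    """Meta-controller picks a subgoal g; the controller acts toward g and the
    critic pays intrinsic reward on success; every experience lands in the
    shared replay memory D; subgoal discovery periodically re-reads D."""
    rng = random.Random(seed)
    replay = []            # shared experience memory D: the unifying element
    subgoals = [2, 4]      # initial candidates on a 5-state chain starting at 0
    for episode in range(num_episodes):
        state = 0
        g = rng.choice(subgoals)                    # meta-controller (random stand-in)
        while state != g:
            action = 1 if state < g else -1         # controller (scripted stand-in)
            next_state = state + action
            intrinsic_r = 1.0 if next_state == g else 0.0  # critic
            replay.append((state, g, action, next_state, intrinsic_r))
            state = next_state
        if (episode + 1) % discover_every == 0:
            # "discovery": re-derive candidates from states observed in D
            subgoals = sorted({e[3] for e in replay})[-2:]
    return replay, subgoals
```

The point of the sketch is the data flow: the controller's experience both trains the policies and feeds subgoal discovery, so no component needs a model of the environment.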

Page 36:

Model-Free HRL

Page 37:

Rooms Task Results

[Figure: success in reaching subgoals (%) and success in solving the task (%) over 100,000 training steps; our unified model-free HRL method vs. regular RL]

[Figure: episode return over 100,000 training steps; our unified model-free HRL method vs. regular RL]

Page 38:

Montezuma's Revenge: Meta-Controller and Controller

Page 39:

Montezuma's Revenge Results

[Figure: success in reaching subgoals (%) over 2,500,000 training steps; our unified model-free HRL method vs. the DeepMind DQN algorithm (Mnih et al., 2015)]

[Figure: average return over 10 episodes across 2,500,000 training steps; our unified model-free HRL method vs. the DeepMind DQN algorithm (Mnih et al., 2015)]

Page 40:

Conclusions

• Unsupervised learning can be used to discover useful subgoals in games.
• Subgoals can be discovered using model-free methods.
• Learning at multiple levels of temporal abstraction is key to solving games with sparse, delayed feedback.
• Intrinsic motivation learning and subgoal discovery can be unified in a model-free HRL framework.

Page 41:

References

• Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
• Sutton, R. S., and Barto, A. G. (2017). Reinforcement Learning: An Introduction. MIT Press, 2nd edition.
• Botvinick, M. M., Niv, Y., and Barto, A. C. (2009). Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113(3):262–280.
• Goel, S., and Huber, M. (2003). Subgoal discovery for hierarchical reinforcement learning using learned policies. In Russell, I. and Haller, S. M., editors, FLAIRS Conference, pages 346–350. AAAI Press.
• Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. B. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems (NeurIPS 2016).
• Machado, M. C., Bellemare, M. G., and Bowling, M. H. (2017). A Laplacian framework for option discovery in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pages 2295–2304.
• Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211.

Page 42:

Slides, Paper, and Code:

http://rafati.net

Poster Session on Wednesday.