Page 1

Hierarchical Reinforcement Learning with Unlimited Recursive Subroutine Calls

2019-09-17 ICANN 2019


Page 2

Humans can set suitable subgoals recursively to achieve certain tasks

[Figure: recursive subgoal decomposition]

• Task "take an object on a high shelf": initial state → goal state "holding the object", via the subgoal "the ladder is set up"

• Subtask "set up a ladder": initial state → subgoal state "the ladder is set up", via the sub-subgoal "in the store room" (go to the store room, then carry the ladder)

• Sub-subtask: initial state → sub-subgoal state "in the store room"

The depth of this recursion is apparently unlimited.

Page 3

Hierarchical reinforcement learning architecture RGoal

• In RGoal, an agent's subgoal settings are similar to subroutine calls in programming languages.

• Each subroutine can execute primitive actions or recursively call other subroutines.

• The timing for calling another subroutine is learned by using a standard reinforcement learning method.

• Unlimited recursive subroutine calls accelerate learning because they increase the opportunity for the reuse of subroutines in multitask settings.


Page 4

Previous work: MAXQ [Dietterich 2000]

• Multi-layered hierarchical reinforcement learning architecture with a fixed number of layers.

• Accelerates learning

• Subtask sharing, temporal abstraction, and state abstraction

RGoal simplifies the MAXQ architecture and extends its features.

Page 5

RGoal architecture

[Figure: RGoal architecture]

• Agent: sensor, actuator, action-value function $Q_G(\tilde{s}, \tilde{a}) = Q(s, g, \tilde{a}) + V_G(g)$, subgoal $g$, stack

• Augmented state $\tilde{s} = (s, g)$; augmented action $\tilde{a}$, which is either a primitive action $a$ or a subroutine call $C_m$

• Environment: state $s$, action $a$, reward $r$, reward function $r(\tilde{s}, \tilde{a})$

The agent has a subgoal and a stack as its internal states.

Page 6

Augmented state-action space

$\mathcal{S} = \{(0,0), (0,1), \dots\}$, $\mathcal{A} = \{\mathit{Up}, \mathit{Down}, \mathit{Right}, \mathit{Left}, \dots\}$

Augmented space: $\tilde{\mathcal{S}} = \mathcal{S} \times \mathcal{M}$, $\tilde{\mathcal{A}} = \mathcal{A} \cup \mathcal{C}_{\mathcal{M}}$, where $\mathcal{M} = \{m_1, m_2, \dots\} \subseteq \mathcal{S}$ and $\mathcal{C}_{\mathcal{M}} = \{C_{m_1}, C_{m_2}, \dots\}$

The augmented state is a pair of a state and a goal, $\tilde{s} = (s, g)$. The augmented action $\tilde{a}$ is either a primitive action $a$ or a subroutine call $C_m$. RGoal solves the Markov decision process (MDP) in the augmented state-action space. Unlike previous hierarchical RL methods, the caller-callee relation between subroutines is not predefined but is learned within the framework of RL. (A subroutine call is just a state transition in the augmented state-action space; a minimal code sketch follows below.)

[Figure: the original state-action space (states $s$ with actions Up, Down, Left, Right) is augmented with one copy per subgoal $g = m_1, m_2, m_3, \dots$; subroutine calls $C_{m_2}, C_{m_3}, \dots$ are transitions between the copies.]

[Levy and Shimkin 2011]
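As a rough illustration only, here is a minimal Python sketch of the agent's internal state in the augmented space, consistent with the path description on the update-rule slide (a subroutine call saves the current subgoal and replaces it; reaching the current subgoal restores the saved one). The class and method names are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (assumed names) of the augmented state and action types.
from dataclasses import dataclass, field
from typing import List, Tuple, Union

State = Tuple[int, int]          # e.g. grid coordinates in the maze
Primitive = str                  # "Up", "Down", "Right", "Left", ...

@dataclass(frozen=True)
class Call:                      # subroutine call C_m, where m is a landmark state
    m: State

AugmentedAction = Union[Primitive, Call]

@dataclass
class InternalState:
    g: State                             # current subgoal
    stack: List[State] = field(default_factory=list)

    def call(self, m: State) -> None:
        """C_m: push the current subgoal and make m the new subgoal."""
        self.stack.append(self.g)
        self.g = m

    def on_state(self, s: State) -> None:
        """When the current subgoal is reached, return to the saved subgoal."""
        while s == self.g and self.stack:
            self.g = self.stack.pop()
```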

Page 7

Value function decomposition

• The action-value function $Q$ is decomposed into two parts:

$$Q_G^{\pi}((s,g), a) = Q^{\pi}(s, g, a) + V_G^{\pi}(g)$$

The first term covers the path before the subgoal is reached, and the second term the path after it, where $V_G^{\pi}(g) = \sum_a \pi((g, G), a)\, Q^{\pi}(g, G, a)$, $G$ is the original global goal, and $g$ is the current subgoal.

• $Q(s, g, a)$ can be shared between tasks because it does not depend on the original goal $G$.

– Accelerates learning (a small code sketch follows below)

[Singh 1992][Dietterich 2000]
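As a small illustration of the decomposition (names such as `Q_table` and `policy` are assumptions, not the paper's code), only the goal-independent table $Q(s, g, a)$ is stored, and the task-specific value $Q_G((s,g),a)$ is recomputed on demand:

```python
from collections import defaultdict

Q_table = defaultdict(float)   # Q[(s, g, a)]: the shared, goal-independent part

def V(G, g, policy, actions):
    """V_G(g) = sum_a pi((g, G), a) * Q(g, G, a): value of going from g to G."""
    return sum(policy((g, G), a) * Q_table[(g, G, a)] for a in actions)

def Q_G(G, s, g, a, policy, actions):
    """Task-specific action value, reconstructed from the shared table."""
    return Q_table[(s, g, a)] + V(G, g, policy, actions)
```

Because only `Q_table` is learned and it never references $G$, experience gathered while pursuing one goal improves the estimates used for every other goal.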

Page 8

Update rule of Q(s,g,a) is derived from the standard Sarsa algorithm

When the subgoal is $g$ and the stack contents are $g_1, g_2, \dots, g_n$, the agent moves through the path $s \to g \to g_1 \to g_2 \to \dots \to g_n$. If a subroutine $g'$ is called, the path changes to $s \to g' \to g \to g_1 \to g_2 \to \dots \to g_n$. Therefore, the following equation holds:

$$Q_{g_n}(\tilde{s}', \tilde{a}') - Q_{g_n}(\tilde{s}, \tilde{a}) = \bigl( Q(s', g', \tilde{a}') + V_g(g') + V_{g_1}(g) + V_{g_2}(g_1) + \dots + V_{g_n}(g_{n-1}) \bigr) - \bigl( Q(s, g, \tilde{a}) + V_{g_1}(g) + V_{g_2}(g_1) + \dots + V_{g_n}(g_{n-1}) \bigr) = Q(s', g', \tilde{a}') - Q(s, g, \tilde{a}) + V_g(g')$$

This yields the update rule:

$$Q(s, g, \tilde{a}) \leftarrow Q(s, g, \tilde{a}) + \alpha \bigl( r + Q(s', g', \tilde{a}') - Q(s, g, \tilde{a}) + V_g(g') \bigr)$$

This equation also holds when $\tilde{a}$ is not a subroutine call but a primitive action. Therefore:

$$Q(\tilde{s}, \tilde{a}) \leftarrow Q(\tilde{s}, \tilde{a}) + \alpha \bigl( r + Q(\tilde{s}', \tilde{a}') - Q(\tilde{s}, \tilde{a}) \bigr)$$

The update can be executed efficiently.
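A minimal sketch of the derived update (illustrative function names; `V(g, g_next)` stands for $V_g(g')$ and is assumed to be 0 when the subgoal is unchanged, which makes the rule reduce to the primitive-action form above):

```python
ALPHA = 0.1  # learning rate (assumed value)

def rgoal_update(Q, V, s, g, a, r, s_next, g_next, a_next):
    """Q(s,g,a) <- Q(s,g,a) + alpha * (r + Q(s',g',a') - Q(s,g,a) + V_g(g'))."""
    td_error = r + Q[(s_next, g_next, a_next)] - Q[(s, g, a)] + V(g, g_next)
    Q[(s, g, a)] += ALPHA * td_error
```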

Page 9

Maze task

Twenty landmarks are placed on the map. Landmarks may become subgoals. For each episode, the start S and the goal G are randomly selected from the landmark set. We focus on the convergence speed to suboptimal solutions rather than exact solutions.

(m: landmark)

■■■■■■■■■■■■■■■■■■■■■■■■■
■・・m・・・・・・・・・・・・・・■・・・・m■
■・・・・・■・・m・・・・・・・m■・・・・・■
■・・・・・■■■■■・・・・■■■■・・・・・■
■・・・・m・・・・■・・・・・■・・・・・・・■
■・・・・・■・・・■m・・・m■・・・・・・・■
■・・・・・■・・・■■■・・・・・m・・・・・■
■・・・・・■・m・・・■・・・・・■・・・・m■
■・・・m■■・・・・・■・・・・・■・・・・・■
■・・・・・■・・・・・■m■・・・■・・・・・■
■・・・・・■・・・■・■・■・・・・・・・・・■
■・・・・・・・・・・m■・・・・・・・■・・・■
■■■■■・・・・・・・■■■■・・・・■m・・■
■・・m■・・・・・・・・・・■・・・・■■■■■
■・・・■・・m・・・m・・・■m・・・・・・・■
■・・・・・・・・・・■・・・■■■・・・・・・■
■・・・・・・・・・・■・・・・・・・・・・・・■
■・・・■・・・m・・■・・・・・・・m・・・・■
■■■■■■■■■■■■■■■■■■■■■■■■■

Page 10

Experiment 1. Relationship between the upper limit S of the stack depth and RGoal performance

[Figure: learning curves for stack-depth limits S = 0, 1, 2, 3, 4, and 100. Horizontal axis: steps (× 100,000); vertical axis: episodes/steps. S = 0 corresponds to no subroutine calls (plain Sarsa); S = 1 corresponds to hierarchical RL without recursive calls.]

A greater upper limit results in faster convergence because it increases the opportunity for the reuse of subroutines.
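The transcript does not show how the limit S is enforced. A plausible reading, sketched below, is that subroutine calls are simply excluded from the augmented action set once the stack already holds S entries, so S = 0 reduces to plain Sarsa and S = 1 to non-recursive hierarchical RL. The function name and signature are assumptions.

```python
def available_actions(primitive_actions, landmarks, stack, S):
    """Augmented actions available in the current internal state.

    Assumption (not from the paper): when the stack already holds S entries,
    subroutine calls are masked out, leaving only primitive actions.
    """
    actions = list(primitive_actions)
    if len(stack) < S:
        actions += [("call", m) for m in landmarks]
    return actions
```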

Page 11

"Thought-mode" combines learned simple tasks to solve unknown complicated tasks

• Suppose that the optimal routes between all neighboring pairs of landmarks have already been learned.

• An approximate solution for the optimal route between distant landmarks can be obtained by connecting neighboring landmarks.

• Such solutions can be found without taking any actions within the environment [Singh 1992].

– Found by simulations within the agent's brain

• A kind of model-based RL

• A kind of deductive reasoning

[Figure: learned routes between neighboring landmarks m1, m2, m3, m4 are chained to approximate the route between distant landmarks.]
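The thought-mode procedure itself is not reproduced in this transcript. The sketch below only illustrates the idea stated above: run simulated episodes over the learned landmark-to-landmark routes, without touching the environment, to refine the estimated cost-to-go toward a distant goal. The `neighbors`, `route_cost`, and `V_hat` structures are illustrative assumptions, not the paper's algorithm.

```python
import random

def thought_mode(T, start, goal, neighbors, route_cost, V_hat,
                 alpha=0.1, max_steps=100):
    """Simulate T episodes 'in the agent's head' (no environment interaction).

    neighbors(m): landmarks reachable from m via a route learned in pre-training.
    route_cost(m, m2): estimated cost of that learned route.
    V_hat[(m, goal)]: estimated cost-to-go from landmark m to the goal.
    """
    for _ in range(T):
        m = start
        for _ in range(max_steps):                        # cap simulated episode length
            if m == goal:
                break
            m2 = random.choice(list(neighbors(m)))        # reuse a learned subroutine
            target = 0.0 if m2 == goal else V_hat.get((m2, goal), 0.0)
            # Model-based backup on simulated experience.
            V_hat[(m, goal)] = ((1 - alpha) * V_hat.get((m, goal), 0.0)
                                + alpha * (route_cost(m, m2) + target))
            m = m2
    return V_hat
```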

Page 12

Before evaluation of Thought-mode, the agent learns simple tasks

In the pre-training phase, only start-goal pairs within a Euclidean distance of eight are selected. In the evaluation phase, arbitrary pairs are selected.

(m: landmark; same maze map as on page 9.)

Page 13

Experiment 2. Relationship between the length P of the pre-training phase and RGoal performance

[Figure: learning curves for pre-training lengths P = 0, P = 1000, and P = 2000. Horizontal axis: steps (× 100,000); vertical axis: episodes/steps.]

A greater value of P results in faster convergence. If an agent learns simple tasks first, learning difficult tasks becomes faster because the learned simple tasks can be reused as subroutines.

Page 14

Experiment 3. Relationship between thought-mode length T and RGoal performance


T is the number of simulations executed prior to the actual execution of each episode. Here, we only plot the change in score after the pre-training phase.

[Figure: learning curves for T = 0, 1, 10, and 100. Horizontal axis: steps (× 1,000), from 2001 to about 2100, i.e., after the pre-training phase; vertical axis: episodes/steps.]

If the thought-mode length T is sufficiently large, approximate solutions are obtained almost zero-shot.

Page 15

Pseudocode of RGoal

This algorithm uses only simple data structures and a single simple loop. Because of its simplicity, we consider RGoal to be a promising first step toward a computational model of the planning mechanism of the human brain.
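The pseudocode listing on this slide is not reproduced here. As a rough, assumption-laden illustration of what such a main loop could look like, based only on the earlier slides (the augmented state (s, g), the stack, subroutine calls, and the Sarsa-like update), here is a Python sketch. The `env` interface, the zero cost of a subroutine call, and all names are hypothetical, not the authors' pseudocode.

```python
import random

def rgoal_episode(env, Q, V, landmarks, primitive_actions, G,
                  S=100, alpha=0.1, epsilon=0.1, max_steps=10_000):
    """One episode of an RGoal-style agent (illustrative sketch only).

    Q: defaultdict(float) keyed by (s, g, a), the shared table Q(s, g, a).
    V: function V(goal, start) standing for V_goal(start), e.g. derived from Q.
    """
    s = env.reset()
    g, stack = G, []                              # the global goal is the first subgoal

    def actions():
        acts = list(primitive_actions)
        if len(stack) < S:                        # stack-depth limit (assumed mechanism)
            acts += [("call", m) for m in landmarks]
        return acts

    def choose(s, g):
        if random.random() < epsilon:
            return random.choice(actions())
        return max(actions(), key=lambda a: Q[(s, g, a)])

    a = choose(s, g)
    for _ in range(max_steps):
        if isinstance(a, tuple) and a[0] == "call":
            stack.append(g)                       # push current subgoal, switch to new one
            s2, g2, r, done = s, a[1], 0.0, False  # a call itself is assumed cost-free
        else:
            s2, r, done = env.step(a)             # primitive action in the environment
            g2 = g
        while s2 == g2 and stack:                 # subgoal reached: return from subroutine
            g2 = stack.pop()
        a2 = choose(s2, g2)
        target_q = 0.0 if done else Q[(s2, g2, a2)]
        # Sarsa-like update with the V_g(g') term (assumed 0 when the subgoal is unchanged).
        Q[(s, g, a)] += alpha * (r + target_q - Q[(s, g, a)] + V(g, g2))
        s, g, a = s2, g2, a2
        if done:
            break
```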

Page 16

Conclusion

• We proposed a novel hierarchical RL architecture that allows unlimited recursive subroutine calls.

• We integrated several important ideas proposed before into a single, simple architecture:

– Augmented state-action space [Levy and Shimkin 2011]

– Value function decomposition [Singh 1992][Dietterich 2000]

– Model-based RL [Sutton 1990]

• A novel mechanism called Thought-mode combines learned simple tasks to solve unknown complicated tasks rapidly, sometimes in zero-shot time.

• Because the theoretical framework and the architecture of RGoal are simple, they are easy to extend.
