19
The Option-Critic Architecture Pierre-Luc Bacon, Jean Harb, Doina Precup Reasoning and Learning Lab McGill University, Montreal, Canada AAAI 2017

The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

The Option-Critic Architecture

Pierre-Luc Bacon, Jean Harb, Doina Precup

Reasoning and Learning LabMcGill University, Montreal, Canada

AAAI 2017

Page 2: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Intelligence:

the ability to generalize and adapt efficiently to new anduncertain situations

• Having good representations is key

“[...] solving a problem simply means representing it so asto make the solution transparent.” — Simon, 1969

1 / 18

Page 3: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Reinforcement Learning: a general framework for AI

Equipped with a good state representation, RL has led toimpressive results:

• Tesauro’s TD Gammon (1995),

• Watson’s Daily-Double Wagering in Jeopardy! (2013),

• Human-level video game play in the Atari games (2013),

• AlphaGo (2016)...

The ability to abstract knowledge temporally over manydifferent time scales is still missing.

2 / 18

Page 4: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Temporal abstraction

Higher level steps

Choosing the type of coffee maker, type of coffee beans

Medium level steps

Grind the beans, measure the right quantity of water, boil thewater

Lower level steps

Wrist and arm movements while adding coffee to the filter, ...

3 / 18

Page 5: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Temporal abstraction in AI

A cornerstone of AI planning since the 1970’s:

• Fikes et al. (1972), Newell (1972, Kuipers (1979), Korf(1985), Laird (1986), Iba (1989), Drescher (1991) etc.

It has been shown to :

• Generate shorter plans

• Reduce the complexity of choosing actions

• Provide robustness against model misspecification

• Improve exploration by taking shortcuts in the environment

4 / 18

Page 6: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Temporal abstraction in RL

Options (Sutton, Singh, Precup 2000) can represent courses ofaction at variable time scales:

High level

Low level

Trajectory, time

5 / 18

Page 7: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Options framework

An option ω is a triple:

1. initiation set: Iω2. internal policy: πω3. termination condition: βω

Example

Robot navigation: if there is no obstacle in front (Iω), goforward (πω) until you get too close to another object (βω)

We can derive a policy over options πΩ that maximizes theexpected discounted sum of rewards:

E

[ ∞∑t=0

γtr(st, at)

∣∣∣∣∣ s0, ω0

]6 / 18

Page 8: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Contribution of this work

The problem of constructing/discovering good options hasbeen a challenge for more than 15 years.

Option-critic is a scalable solution to this problem:

• Online, continual and model-free (but models can be usedif desired)

• Requires no a priori domain knowledge, decomposition, orhuman intervention

• Learns in a single task, at least as fast as other methodswhich do not use temporal abstraction

• Applies to general continuous state and action spaces

7 / 18

Page 9: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Actor-Critic Architecture (Sutton 1984)

Valuefunction

Environment

Policy

atst

Actor

rt

Gradient

Critic TD error

• The policy (actor) is decoupled from its value function.

• The critic provides feedback to improve the actor

• Learning is fully online

8 / 18

Page 10: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Option-Critic Architecture

πΩ

QU , AΩ

Environment

atst

πω, βω

rt

Gradients

CriticTD error

ωt

Options

Policy over options

• Parameterize internal policies and termination conditions• Policy over options is computed by a separate process

9 / 18

Page 11: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Main result: Gradient updates

• The gradient wrt. the internal policy parameters θis given by:

E[∂ log πω,θ(a|s)

∂θQU (s, ω, a)

]This has the usual interpretation: take better primitivesmore often inside the option

• The gradient wrt. the termination parameters ν isgiven by:

E[−∂βω,ν(s′)

∂νAπΩ(s′, ω)

]where AπΩ = QπΩ − VπΩ is the advantage function Thismeans that we want to lengthen options that have alarge advantage

10 / 18

Page 12: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Results: Options transfer

Hallways

Walls

Initial goalRandom goal

after 1000 episodes

11 / 18

Page 13: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Results: Options transfer

0 500 1000 1500 2000Episodes

0

100

200

300

400

500

600

Step

s

SARSA(0)AC-PGOC 4 optionsOC 8 options

Goal moves randomly

Using temporal abstractionsdiscovered by option-critic

Primitive actions

• Learning in the first task no slower than using primitives

• Learning once the goal is moved faster with the options

12 / 18

Page 14: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Results: Learned options are intuitive

Probability of terminating in a particular state, for each option:

Option 1 Option 2

Option 3 Option 4

• Terminations are more likely near hallways (although thereare no pseudo-rewards provided)

13 / 18

Page 15: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Results: Nonlinear function approximation

Policy over options

Termination functions

Internal policies

Shared representationConvolutional layersLast 4 frames

Same architecture as DQN (Mnih & al., 2013) for the 4 firstlayers but hybridized with options and the policy over them.

14 / 18

Page 16: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Performance matching or better than DQN

(a) Asterix (b) Ms. Pacman

Option-CriticDQN

Option-CriticDQN

Avg

.Sco

re

Epoch Epoch50 100 150 2000 50 100 150 2000

0

2000

4000

6000

8000

10000

500

1000

1500

2000

2500

(c) Seaquest (d) Zaxxon

Option-CriticDQN

Option-CriticDQN

Epoch Epoch50 100 150 2000 50 100 150 2000

0

2000

4000

6000

8000

10000

0

2000

4000

6000

8000

15 / 18

Page 17: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Interpretable and specialized options in Seaquest

Option 1: downward shooting sequences Option 2: upward shooting sequences

Transition from option 1 to 2

Action trajectory, time White: option 1 Black: option 2

16 / 18

Page 18: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Conclusion

Our results seem to be the first to be:

• fully end-to-end

• within a single task

• at speed comparable or better than using just primitivemethods

Using ideas from policy gradient methods, option-critic

• provides continual option construction

• can be used with nonlinear function approximators

• can incorporate regularizers or pseudo-rewards easily

17 / 18

Page 19: The Option-Critic Architecture - Pierre-Luc Baconpierrelucbacon.com/optioncritic-aaai2017-slides.pdf · Option-critic is a scalable solution to this problem: Online, continual and

Future work

• Learn initiation sets:I Would require a new notion of stochastic initiation

functions

• More empirical results !

Try our code :https://github.com/jeanharb/option_critic

18 / 18