
The Option-Critic Architecture

Pierre-Luc Bacon, Jean Harb, Doina Precup

Reasoning and Learning Lab, McGill University, Montreal, Canada

AAAI 2017

Intelligence:

the ability to generalize and adapt efficiently to new and uncertain situations

• Having good representations is key

“[...] solving a problem simply means representing it so as to make the solution transparent.” — Simon, 1969


Reinforcement Learning: a general framework for AI

Equipped with a good state representation, RL has led to impressive results:

• Tesauro’s TD Gammon (1995),

• Watson’s Daily-Double Wagering in Jeopardy! (2013),

• Human-level video game play in the Atari games (2013),

• AlphaGo (2016)...

The ability to abstract knowledge temporally over many different time scales is still missing.


Temporal abstraction

Higher level steps

Choosing the type of coffee maker, type of coffee beans

Medium level steps

Grind the beans, measure the right quantity of water, boil the water

Lower level steps

Wrist and arm movements while adding coffee to the filter, ...


Temporal abstraction in AI

A cornerstone of AI planning since the 1970’s:

• Fikes et al. (1972), Newell (1972), Kuipers (1979), Korf (1985), Laird (1986), Iba (1989), Drescher (1991), etc.

It has been shown to:

• Generate shorter plans

• Reduce the complexity of choosing actions

• Provide robustness against model misspecification

• Improve exploration by taking shortcuts in the environment


Temporal abstraction in RL

Options (Sutton, Singh, Precup 2000) can represent courses of action at variable time scales:

[Figure: a low-level action trajectory unfolding through time, with high-level options spanning extended segments of it.]


Options framework

An option ω is a triple:

1. initiation set: Iω
2. internal policy: πω
3. termination condition: βω

Example

Robot navigation: if there is no obstacle in front (Iω), go forward (πω) until you get too close to another object (βω)

We can derive a policy over options πΩ that maximizes the expected discounted sum of rewards:

\mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \;\middle|\; s_0, \omega_0 \right]
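To make the option triple concrete, here is a minimal Python sketch of the robot-navigation example above; the `Option` class, its field names, and the state attributes (`obstacle_ahead`, `too_close`) are illustrative assumptions, not taken from the paper or its released code.

```python
# Minimal sketch of an option as the triple (I_w, pi_w, beta_w).
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Option:
    initiation: Callable[[Any], bool]     # I_w: can this option start in state s?
    policy: Callable[[Any], Any]          # pi_w: action chosen by the internal policy in s
    termination: Callable[[Any], float]   # beta_w: probability of terminating in s

# The robot-navigation example from this slide (state fields are hypothetical).
go_forward = Option(
    initiation=lambda s: not s.obstacle_ahead,            # I_w: no obstacle in front
    policy=lambda s: "forward",                            # pi_w: keep going forward
    termination=lambda s: 1.0 if s.too_close else 0.0,     # beta_w: stop when too close
)
```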

Contribution of this work

The problem of constructing/discovering good options has been a challenge for more than 15 years.

Option-critic is a scalable solution to this problem:

• Online, continual and model-free (but models can be used if desired)

• Requires no a priori domain knowledge, decomposition, or human intervention

• Learns in a single task, at least as fast as other methods which do not use temporal abstraction

• Applies to general continuous state and action spaces


Actor-Critic Architecture (Sutton 1984)

[Diagram: the actor (policy) selects action at in state st; the environment returns reward rt; the critic (value function) computes a TD error that provides the gradient for improving the actor.]

• The policy (actor) is decoupled from its value function.

• The critic provides feedback to improve the actor

• Learning is fully online
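As a concrete reference point, here is a minimal one-step actor-critic update in Python (tabular states, softmax policy). The function name, learning rates, and tabular setting are assumptions chosen for illustration, not a reproduction of Sutton's original formulation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_actor=0.1, alpha_critic=0.5, gamma=0.99):
    """theta: |S| x |A| policy logits (actor); V: |S| state values (critic)."""
    # Critic: one-step TD error, the feedback signal in the diagram above.
    td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
    V[s] += alpha_critic * td_error
    # Actor: policy-gradient step using grad log pi(a|s) for a softmax policy.
    pi = softmax(theta[s])
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log_pi
    return td_error
```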


Option-Critic Architecture

[Diagram: the policy over options πΩ selects an option ωt; the option's internal policy πω emits action at in state st; the environment returns reward rt; the critic maintains QU and AΩ and produces a TD error whose gradients update πω and βω.]

• Parameterize internal policies and termination conditions
• Policy over options is computed by a separate process
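A minimal tabular sketch of the interaction loop this diagram describes, assuming an epsilon-greedy policy over options computed from the critic's option values, softmax internal policies with logits `theta`, sigmoid terminations with parameters `nu`, and a simplified `env` with `reset()`/`step()`; all names and design choices here are illustrative, not the paper's released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pick_option(s, Q_options, epsilon=0.05):
    """Policy over options pi_Omega, here epsilon-greedy over the critic's option values."""
    if rng.random() < epsilon:
        return int(rng.integers(Q_options.shape[1]))
    return int(Q_options[s].argmax())

def run_episode(env, Q_options, theta, nu):
    """theta[s, w]: logits of pi_w(.|s); nu[s, w]: parameter of beta_w(s) = sigmoid(nu)."""
    s = env.reset()
    omega = pick_option(s, Q_options)
    done = False
    while not done:
        a = rng.choice(theta.shape[2], p=softmax(theta[s, omega]))  # pi_w(a|s)
        s_next, r, done = env.step(a)
        # Critic and gradient updates would go here (see the updates on the next slide).
        if rng.random() < sigmoid(nu[s_next, omega]):   # beta_w(s'): does the option end?
            omega = pick_option(s_next, Q_options)      # pi_Omega picks the next option
        s = s_next
```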


Main result: Gradient updates

• The gradient w.r.t. the internal policy parameters θ is given by:

\mathbb{E}\left[ \frac{\partial \log \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) \right]

This has the usual interpretation: take better primitives more often inside the option.

• The gradient w.r.t. the termination parameters ν is given by:

\mathbb{E}\left[ -\frac{\partial \beta_{\omega,\nu}(s')}{\partial \nu}\, A_{\pi_\Omega}(s', \omega) \right]

where AπΩ = QπΩ − VπΩ is the advantage function. This means that we want to lengthen options that have a large advantage.
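In the tabular case, with softmax internal policies (logits `theta`) and sigmoid terminations (parameters `nu`), these two updates can be written as below; the parameterization, learning rates, and function names are assumptions for illustration, not the paper's code.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def intra_option_policy_update(theta, s, omega, a, q_u, lr=0.01):
    """theta[s, omega]: logits of pi_omega(.|s); q_u: critic estimate of Q_U(s, omega, a)."""
    pi = softmax(theta[s, omega])
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                      # d log pi(a|s) / d logits for a softmax policy
    theta[s, omega] += lr * q_u * grad_log_pi  # take better primitives more often

def termination_update(nu, s_next, omega, advantage, lr=0.01):
    """nu[s, omega] parameterizes beta_omega(s) = sigmoid(nu[s, omega]);
    advantage: A_piOmega(s', omega) = Q_piOmega(s', omega) - V_piOmega(s')."""
    beta = sigmoid(nu[s_next, omega])
    # A large positive advantage pushes beta down, i.e. the option is lengthened.
    nu[s_next, omega] -= lr * advantage * beta * (1.0 - beta)
```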


Results: Options transfer

[Figure: four-rooms gridworld with walls and hallways; the initial goal is replaced by a random goal after 1000 episodes.]


Results: Options transfer

[Plot: steps to the goal vs. episodes (0 to 2000) for SARSA(0) and AC-PG with primitive actions, and for option-critic (OC) with 4 and 8 options using the temporal abstractions it discovers; the goal moves randomly partway through training.]

• Learning in the first task is no slower than with primitive actions

• Learning once the goal is moved is faster with the options


Results: Learned options are intuitive

Probability of terminating in a particular state, for each option:

[Figure: termination probability maps for Option 1, Option 2, Option 3, and Option 4.]

• Terminations are more likely near hallways (although there are no pseudo-rewards provided)


Results: Nonlinear function approximation

[Diagram: the last 4 frames feed shared convolutional layers; on top of this shared representation sit the internal policies, the termination functions, and the policy over options.]

Same architecture as DQN (Mnih et al., 2013) for the first 4 layers, but hybridized with options and the policy over them.
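A schematic sketch of those three heads sitting on a shared feature vector phi(s); the convolutional trunk is omitted, and the sizes (512 features, 8 options, 18 actions) as well as the linear parameterization are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

n_features, n_options, n_actions = 512, 8, 18
rng = np.random.default_rng(0)

W_q    = rng.normal(size=(n_features, n_options)) * 0.01             # Q_Omega(s, .): option values
W_beta = rng.normal(size=(n_features, n_options)) * 0.01             # beta_omega(s), per option
W_pi   = rng.normal(size=(n_features, n_options, n_actions)) * 0.01  # pi_omega(.|s), per option

def heads(phi):
    """phi: shared DQN-style feature vector for the current state."""
    q_options = phi @ W_q                                     # policy over options uses these values
    betas = 1.0 / (1.0 + np.exp(-(phi @ W_beta)))             # termination probabilities
    logits = np.einsum("f,foa->oa", phi, W_pi)                # intra-option action preferences
    pis = np.exp(logits - logits.max(axis=1, keepdims=True))
    pis /= pis.sum(axis=1, keepdims=True)                     # softmax over actions, per option
    return q_options, betas, pis

# Example call: random features standing in for the convolutional output.
q, b, p = heads(rng.normal(size=n_features))
```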


Performance matching or better than DQN

[Plots: average score vs. training epoch (0 to 200) for Option-Critic and DQN on (a) Asterix, (b) Ms. Pacman, (c) Seaquest, and (d) Zaxxon.]


Interpretable and specialized options in Seaquest

[Figure: Seaquest action trajectory over time, shown in white for option 1 and black for option 2. Option 1 covers downward shooting sequences and option 2 covers upward shooting sequences, with a visible transition from option 1 to option 2.]


Conclusion

Our results seem to be the first to be:

• fully end-to-end

• within a single task

• at a speed comparable to or better than using just primitive methods

Using ideas from policy gradient methods, option-critic

• provides continual option construction

• can be used with nonlinear function approximators

• can incorporate regularizers or pseudo-rewards easily


Future work

• Learn initiation sets:
  - Would require a new notion of stochastic initiation functions

• More empirical results!

Try our code: https://github.com/jeanharb/option_critic
