
Page 1: Welcome!

NIPS 2007 Workshop

Welcome!

Hierarchical organization of behavior

•Thank you for coming

•Apologies to the skiers…

•Why we will be strict about timing

•Why we want the workshop to be interactive

Page 2: Welcome!

RL: Decision making

Goal: maximize reward (minimize punishment)

•Rewards/punishments may be delayed
•Outcomes may depend on a sequence of actions
•Credit assignment problem

Page 3: Welcome!

RL in a nutshell: formalization

states - actions - transitions - rewards - policy - long-term values

Components of an RL task

Policy: π(S,a)
State values: V(S)
State-action values: Q(S,a)

[Figure: two-step decision tree. From S1, action L leads to S2 and action R to S3; from S2, L yields r = 4 and R yields r = 0; from S3, L yields r = 2 and R yields r = 2]
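For concreteness, here is the example task written out in Python. The dictionary encoding is mine, and the zero rewards on the first step are an assumption consistent with the Q-values shown on later slides:

```python
# Deterministic two-step decision tree used throughout these slides.
# T: (state, action) -> next state (None = episode ends)
# R: (state, action) -> immediate reward
ACTIONS = ("L", "R")
T = {("S1", "L"): "S2", ("S1", "R"): "S3",
     ("S2", "L"): None, ("S2", "R"): None,
     ("S3", "L"): None, ("S3", "R"): None}
R = {("S1", "L"): 0, ("S1", "R"): 0,   # first-step rewards assumed 0
     ("S2", "L"): 4, ("S2", "R"): 0,
     ("S3", "L"): 2, ("S3", "R"): 2}
```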

Page 4: Welcome!

RL in a nutshell: forward search

[Figure: the same tree drawn as a forward search: S1 →(L) S2, S1 →(R) S3; leaf rewards r(S2,L) = 4, r(S2,R) = 0, r(S3,L) = 2, r(S3,R) = 2]

Model-based RL

•learn the model through experience (cognitive map)
•choosing actions is hard (see the search sketch below)
•goal-directed behavior; cortical

Model = T(ransitions) and R(ewards)

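On a tree this small, "choosing actions by search" can be written directly. A minimal sketch of one-step lookahead using the T and R tables defined above (the function names are mine):

```python
def value(state):
    """V(S): the value of the best action available in `state`."""
    return max(q_value(state, a) for a in ACTIONS)

def q_value(state, action):
    """Q(S,a) = r(S,a) + V(S_next), computed by searching the model."""
    next_state = T[(state, action)]
    return R[(state, action)] + (value(next_state) if next_state else 0)

# q_value("S1", "L") == 4 and q_value("S1", "R") == 2,
# so the model-based agent plans: L at S1, then L at S2.
```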

Page 5: Welcome!

Trick #1: Long-term values are recursive

Q(S,a) = r(S,a) + V(Snext)
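Worked on the example tree (with the assumed zero first-step reward): Q(S1,L) = r(S1,L) + V(S2) = 0 + max(4, 0) = 4, matching the cached values shown on the next slide.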

RL in a nutshell: cached values

Model-free RL

temporal difference learning

Q(S,a) = r(S,a) + max Q(S’,a’)

TD learning: start with initial (wrong) Q(S,a)

PE = r(S,a) + max Q(S’,a’) - Q(S,a)

Q(S,a)new = Q(S,a)old + PE

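A minimal sketch of this TD update on the same tree, reusing ACTIONS, T, and R from above. The learning rate alpha is my addition; the slide's update corresponds to alpha = 1:

```python
import random

Q = {(s, a): 0.0 for s in ("S1", "S2", "S3") for a in ACTIONS}
alpha = 0.5  # learning rate (my addition)

for episode in range(1000):
    state = "S1"
    while state is not None:
        action = random.choice(ACTIONS)          # explore at random
        next_state = T[(state, action)]
        # cached value of the best successor action (0 at episode end)
        future = max(Q[(next_state, a)] for a in ACTIONS) if next_state else 0.0
        PE = R[(state, action)] + future - Q[(state, action)]  # prediction error
        Q[(state, action)] += alpha * PE                       # Q_new = Q_old + alpha * PE
        state = next_state

# Q converges to the table on the next slide, e.g. Q[("S1", "L")] -> 4
```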

Page 6: Welcome!

RL in a nutshell: cached values

Model-free RL

•choosing actions is easy (but needs lots of practice to learn)
•habitual behavior; basal ganglia

temporal difference learning


Trick #2: Can learn values without a model

Q(S1,L) = 4   Q(S1,R) = 2
Q(S2,L) = 4   Q(S2,R) = 0
Q(S3,L) = 2   Q(S3,R) = 2

Page 7: Welcome!

RL in real-world tasks…

model-based vs. model-free learning and control

Q(S1,L) = 4   Q(S1,R) = 2
Q(S2,L) = 4   Q(S2,R) = 0
Q(S3,L) = 2   Q(S3,R) = 2

[Figure: the cached Q-table shown next to the forward-search tree, contrasting model-free and model-based control]

Scaling problem!

Page 8: Welcome!

Hierarchical RL: What is it?

Real-world behavior is hierarchical

1. set water temp

2. get wet

3. shampoo

4. soap

5. turn off water

6. dry off

[Figure: control loop for "set water temp": wait 5 sec and test the temperature; too cold → add hot, too hot → add cold, just right → success; loop until no change is needed]

simplified control, disambiguation, encapsulation

1. pour coffee

2. add sugar

3. add milk

4. stir

Page 9: Welcome!

HRL: (in)formal framework

•Termination condition = (sub)goal state
•Option policy learning: via pseudo-reward (model-based or model-free; see the sketch below)
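A sketch of the pseudo-reward idea: while learning an option's policy, the agent trains on the environment's reward plus a bonus for reaching the option's (sub)goal. The function name and bonus value are mine; R is the reward table from the earlier sketch:

```python
def pseudo_reward(state, action, next_state, subgoal, bonus=1.0):
    # Reward signal used to learn the option's policy (works with either
    # model-based or model-free learning): real reward plus a bonus
    # whenever the option's (sub)goal state is reached.
    return R[(state, action)] + (bonus if next_state == subgoal else 0.0)
```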


options - skills - macros - temporally abstract actions (Sutton, McGovern, Dietterich, Barto, Precup, Singh, Parr…)

Option: set water temperature

initiation set: S1, S2, S8, …

policy π(s,a):
S1: 0.8 / 0.1 / 0.1
S2: 0.1 / 0.1 / 0.8
S3: 0 / 1 / 0

termination conditions β(s): S1 (0.1), S2 (0.1), S3 (0.9)
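Read as a data structure, an option bundles the three components above. A minimal sketch; the action names are hypothetical, borrowed from the shower example on the previous slide:

```python
from dataclasses import dataclass

@dataclass
class Option:
    initiation_set: set   # states in which the option may be invoked
    policy: dict          # state -> distribution over actions, pi(s,a)
    termination: dict     # state -> probability of terminating, beta(s)

set_water_temp = Option(
    initiation_set={"S1", "S2", "S8"},  # "…" on the slide: the set continues
    policy={"S1": {"add_hot": 0.8, "add_cold": 0.1, "wait": 0.1},
            "S2": {"add_hot": 0.1, "add_cold": 0.1, "wait": 0.8},
            "S3": {"add_hot": 0.0, "add_cold": 1.0, "wait": 0.0}},
    termination={"S1": 0.1, "S2": 0.1, "S3": 0.9},
)
```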

Page 10: Welcome!

HRL: a toy example

S: start; G: goal
Options: going to doors
Actions: primitive moves + 2 door options

Page 11: Welcome!

Advantages of HRL

1. Faster learning (mitigates the scaling problem)


RL: no longer ‘tabula rasa’

2. Transfer of knowledge from previous tasks (generalization, shaping)

Page 12: Welcome!

Disadvantages (or: the cost) of HRL

1. Need the 'right' options - how to learn them?
2. Suboptimal behavior ("negative transfer"; habits)
3. More complex learning/control structure

no free lunches…