
Page 1: Welcome!

NIPS 2007 Workshop

Welcome!

Hierarchical organization of behavior

•Thank you for coming

•Apologies to the skiers…

•Why we will be strict about timing

•Why we want the workshop to be interactive

Page 2: Welcome!

RL: Decision making

Goal: maximize reward (minimize punishment)

•Rewards/punishments may be delayed
•Outcomes may depend on a sequence of actions
•Credit assignment problem

Page 3: Welcome!

RL in a nutshell: formalization

states - actions - transitions - rewards - policy - long-term values

Components of an RL task

Policy: π(S,a)
State values: V(S)
State-action values: Q(S,a)

[Figure: two-step decision tree. From S1, action L leads to S2 and action R to S3; from S2, L yields r = 4 and R yields r = 0; from S3, L yields r = 2 and R yields r = 2]
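For concreteness, here is the example task written out in Python. The dictionary encoding is mine, and the zero rewards on the first step are an assumption consistent with the Q-values shown on later slides:

```python
# Deterministic two-step decision tree used throughout these slides.
# T: (state, action) -> next state (None = episode ends)
# R: (state, action) -> immediate reward
ACTIONS = ("L", "R")
T = {("S1", "L"): "S2", ("S1", "R"): "S3",
     ("S2", "L"): None, ("S2", "R"): None,
     ("S3", "L"): None, ("S3", "R"): None}
R = {("S1", "L"): 0, ("S1", "R"): 0,   # first-step rewards assumed 0
     ("S2", "L"): 4, ("S2", "R"): 0,
     ("S3", "L"): 2, ("S3", "R"): 2}
```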

Page 4: Welcome!

RL in a nutshell: forward search

[Figure: the same tree drawn as a forward search: S1 →(L) S2, S1 →(R) S3; leaf rewards r(S2,L) = 4, r(S2,R) = 0, r(S3,L) = 2, r(S3,R) = 2]

Model-based RL

•learn the model through experience (cognitive map)
•choosing actions is hard (see the search sketch below)
•goal-directed behavior; cortical

Model = T(ransitions) and R(ewards)

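On a tree this small, "choosing actions by search" can be written directly. A minimal sketch of one-step lookahead using the T and R tables defined above (the function names are mine):

```python
def value(state):
    """V(S): the value of the best action available in `state`."""
    return max(q_value(state, a) for a in ACTIONS)

def q_value(state, action):
    """Q(S,a) = r(S,a) + V(S_next), computed by searching the model."""
    next_state = T[(state, action)]
    return R[(state, action)] + (value(next_state) if next_state else 0)

# q_value("S1", "L") == 4 and q_value("S1", "R") == 2,
# so the model-based agent plans: L at S1, then L at S2.
```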

Page 5: Welcome!

Trick #1: Long-term values are recursive

Q(S,a) = r(S,a) + V(Snext)
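Worked on the example tree (with the assumed zero first-step reward): Q(S1,L) = r(S1,L) + V(S2) = 0 + max(4, 0) = 4, matching the cached values shown on the next slide.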

RL in a nutshell: cached values

Model-free RL

temporal difference learning

Q(S,a) = r(S,a) + max Q(S’,a’)

TD learning: start with initial (wrong) Q(S,a)

PE = r(S,a) + max Q(S’,a’) - Q(S,a)

Q(S,a)new = Q(S,a)old + PE

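A minimal sketch of this TD update on the same tree, reusing ACTIONS, T, and R from above. The learning rate alpha is my addition; the slide's update corresponds to alpha = 1:

```python
import random

Q = {(s, a): 0.0 for s in ("S1", "S2", "S3") for a in ACTIONS}
alpha = 0.5  # learning rate (my addition)

for episode in range(1000):
    state = "S1"
    while state is not None:
        action = random.choice(ACTIONS)          # explore at random
        next_state = T[(state, action)]
        # cached value of the best successor action (0 at episode end)
        future = max(Q[(next_state, a)] for a in ACTIONS) if next_state else 0.0
        PE = R[(state, action)] + future - Q[(state, action)]  # prediction error
        Q[(state, action)] += alpha * PE                       # Q_new = Q_old + alpha * PE
        state = next_state

# Q converges to the table on the next slide, e.g. Q[("S1", "L")] -> 4
```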

Page 6: Welcome!

RL in a nutshell: cached values

Model-free RL

•choosing actions is easy (but needs lots of practice to learn)
•habitual behavior; basal ganglia

temporal difference learning


Trick #2: Can learn values without a model

Q(S1,L) = 4   Q(S1,R) = 2
Q(S2,L) = 4   Q(S2,R) = 0
Q(S3,L) = 2   Q(S3,R) = 2

Page 7: Welcome!

RL in real-world tasks…

model-based vs. model-free learning and control

Q(S1,L) = 4   Q(S1,R) = 2
Q(S2,L) = 4   Q(S2,R) = 0
Q(S3,L) = 2   Q(S3,R) = 2

[Figure: the cached Q-table shown next to the forward-search tree, contrasting model-free and model-based control]

Scaling problem!

Page 8: Welcome!

Hierarchical RL: What is it?

Real-world behavior is hierarchical

1. set water temp

2. get wet

3. shampoo

4. soap

5. turn off water

6. dry off

[Figure: control loop for "set water temp": wait 5 sec and test the temperature; too cold → add hot, too hot → add cold, just right → success; loop until no change is needed]

simplified control, disambiguation, encapsulation

1. pour coffee

2. add sugar

3. add milk

4. stir

Page 9: Welcome!

HRL: (in)formal framework

•Termination condition = (sub)goal state
•Option policy learning: via pseudo-reward (model-based or model-free; see the sketch below)
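A sketch of the pseudo-reward idea: while learning an option's policy, the agent trains on the environment's reward plus a bonus for reaching the option's (sub)goal. The function name and bonus value are mine; R is the reward table from the earlier sketch:

```python
def pseudo_reward(state, action, next_state, subgoal, bonus=1.0):
    # Reward signal used to learn the option's policy (works with either
    # model-based or model-free learning): real reward plus a bonus
    # whenever the option's (sub)goal state is reached.
    return R[(state, action)] + (bonus if next_state == subgoal else 0.0)
```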


options - skills - macros - temporally abstract actions (Sutton, McGovern, Dietterich, Barto, Precup, Singh, Parr…)

Option: set water temperature

initiation set: S1, S2, S8, …

policy π(s,a):
S1: 0.8 / 0.1 / 0.1
S2: 0.1 / 0.1 / 0.8
S3: 0 / 1 / 0

termination conditions β(s): S1 (0.1), S2 (0.1), S3 (0.9)
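Read as a data structure, an option bundles the three components above. A minimal sketch; the action names are hypothetical, borrowed from the shower example on the previous slide:

```python
from dataclasses import dataclass

@dataclass
class Option:
    initiation_set: set   # states in which the option may be invoked
    policy: dict          # state -> distribution over actions, pi(s,a)
    termination: dict     # state -> probability of terminating, beta(s)

set_water_temp = Option(
    initiation_set={"S1", "S2", "S8"},  # "…" on the slide: the set continues
    policy={"S1": {"add_hot": 0.8, "add_cold": 0.1, "wait": 0.1},
            "S2": {"add_hot": 0.1, "add_cold": 0.1, "wait": 0.8},
            "S3": {"add_hot": 0.0, "add_cold": 1.0, "wait": 0.0}},
    termination={"S1": 0.1, "S2": 0.1, "S3": 0.9},
)
```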

Page 10: Welcome!

HRL: a toy example

S: start; G: goal
Options: going to doors
Actions: primitive moves + 2 door options

Page 11: Welcome!

Advantages of HRL

1. Faster learning (mitigates the scaling problem)


RL: no longer ‘tabula rasa’

2. Transfer of knowledge from previous tasks (generalization, shaping)

Page 12: Welcome!

Disadvantages (or: the cost) of HRL

1. Need the 'right' options - how to learn them?
2. Suboptimal behavior ("negative transfer"; habits)
3. More complex learning/control structure

no free lunches…