
On Linking Reinforcement Learning with Unsupervised Learning

Cornelius Weber, FIAS

presented at Honda HRI, Offenbach, 17th March 2009

for taking action, we need only the relevant features

[figure: input features x, y, z]

unsupervised learning in cortex

reinforcement learning in basal ganglia (Doya, 1999)

[diagram: state space → actor]

a 1-layer RL model of the BG (go left? go right?) is too simple to handle complex input

complex input (cortex)

need another layer(s) to pre-process complex data

[diagram: feature detection (state space) → action selection (actor)]

models’ background:

- gradient descent methods generalize RL to several layers; Sutton & Barto, RL book (1998); Tesauro (1992; 1995)

- reward-modulated Hebb; Triesch, Neur Comp 19, 885-909 (2007); Roelfsema & Ooyen, Neur Comp 17, 2176-2214 (2005); Franz & Triesch, ICDL (2007)

- reward-modulated activity leads to input selection; Nakahara, Neur Comp 14, 819-44 (2002)

- reward-modulated STDP; Izhikevich, Cereb Cortex 17, 2443-52 (2007); Florian, Neur Comp 19/6, 1468-502 (2007); Farries & Fairhall, J Neurophysiol 98, 3648-65 (2007); ...

- RL models learn partitioning of input space, e.g. McCallum, PhD Thesis, Rochester, NY, USA (1996)

[diagram: sensory input, reward, action]

scenario: bars controlled by the actions 'up', 'down', 'left', 'right';

reward is given when the horizontal bar is at a specific position
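To make the scenario concrete, here is a minimal sketch of such a bars world in Python. The slides only state that bars are moved by the four actions and that reward depends on the horizontal bar's position; the grid size default, the vertical bar acting as a reward-irrelevant distractor, and the exact movement rules below are my assumptions.

import numpy as np

class BarsEnv:
    """Toy 'bars' world (my reading of the slides; details are assumptions):
    one horizontal and one vertical bar on an n x n grid; 'up'/'down' move the
    horizontal bar, 'left'/'right' move the vertical bar; reward is given only
    when the horizontal bar sits at a target row."""

    ACTIONS = ('up', 'down', 'left', 'right')

    def __init__(self, n=12, target_row=6, rng=None):
        self.n = n
        self.target_row = target_row
        self.rng = rng or np.random.default_rng()
        self.reset()

    def reset(self):
        self.row = self.rng.integers(self.n)   # horizontal bar position (relevant)
        self.col = self.rng.integers(self.n)   # vertical bar position (distractor)
        return self.observe()

    def observe(self):
        """Return the flattened pixel image the agent sees."""
        img = np.zeros((self.n, self.n))
        img[self.row, :] = 1.0                 # horizontal bar
        img[:, self.col] = 1.0                 # vertical (distractor) bar
        return img.ravel()

    def step(self, action):
        if action == 'up':
            self.row = max(self.row - 1, 0)
        elif action == 'down':
            self.row = min(self.row + 1, self.n - 1)
        elif action == 'left':
            self.col = max(self.col - 1, 0)
        elif action == 'right':
            self.col = min(self.col + 1, self.n - 1)
        reward = 1.0 if self.row == self.target_row else 0.0
        return self.observe(), reward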

model that learns the relevant features

top layer: SARSA RL

lower layer: winner-take-all feature learning

both layers: modulate learning by δ

[diagram: input → feature weights → RL weights → action]

SARSA with WTA input layer

note: non-negativity constraint on weights
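A minimal sketch of how such a model could be wired up, under my reading of the slides: a winner-take-all feature layer whose output serves as the state code of a linear SARSA layer, with the TD error δ modulating the weight updates in both layers and the feature weights clipped to be non-negative. The class name, learning rates, γ and ε below are assumptions, not the author's values.

import numpy as np

def wta(h):
    """Winner-take-all: a one-hot copy of the hidden activation."""
    s = np.zeros_like(h)
    s[np.argmax(h)] = 1.0
    return s

class WtaSarsaAgent:
    """Two-layer sketch: WTA feature layer feeding a linear SARSA layer,
    both trained with delta-modulated updates (hyper-parameters assumed)."""

    def __init__(self, n_in, n_hidden, n_actions,
                 gamma=0.9, eps=0.1, lr_q=0.1, lr_w=0.01, rng=None):
        self.rng = rng or np.random.default_rng()
        self.W = self.rng.random((n_hidden, n_in)) * 0.1       # feature weights (non-negative)
        self.Q = self.rng.random((n_actions, n_hidden)) * 0.1  # RL action weights
        self.gamma, self.eps, self.lr_q, self.lr_w = gamma, eps, lr_q, lr_w

    def state(self, x):
        return wta(self.W @ x)                  # WTA feature layer -> state code s

    def act(self, s):
        if self.rng.random() < self.eps:        # epsilon-greedy exploration
            return int(self.rng.integers(self.Q.shape[0]))
        return int(np.argmax(self.Q @ s))

    def update(self, x, s, a, r, s_next, a_next):
        # SARSA TD error
        delta = r + self.gamma * (self.Q[a_next] @ s_next) - (self.Q[a] @ s)
        # top layer: delta-modulated update of the action weights
        self.Q[a] += self.lr_q * delta * s
        # lower layer: delta-modulated Hebbian update of the winning feature
        k = int(np.argmax(s))
        self.W[k] += self.lr_w * delta * x
        np.clip(self.W, 0.0, None, out=self.W)  # non-negativity constraint (slide note)

A training loop would then, per step, compute s = agent.state(x), pick a = agent.act(s), apply BarsEnv.ACTIONS[a] to the environment, and call agent.update with the successor observation, state code and action.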

Energy function: estimation error of state-action value

[equations (not extracted): energy function and the identities used in the derivation; diagram: RL action weights, feature weights, data]
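The equations on this slide did not survive extraction; as a hedged reconstruction in my own notation, the estimation error of the state-action value in SARSA form, with the WTA layer providing the state code, would read:

$$ E = \tfrac{1}{2}\,\delta^{2}, \qquad \delta = r + \gamma\, Q(s', a') - Q(s, a), $$
$$ Q(s, a) = \sum_k q_{ak}\, s_k, \qquad s_k = \mathrm{WTA}_k\!\Big(\sum_j w_{kj}\, x_j\Big). $$

Gradient descent on E (treating the bootstrapped target as constant) gives \( \Delta q_{ak} \propto \delta\, s_k \) for the RL action weights and, approximately, \( \Delta w_{kj} \propto \delta\, s_k\, x_j \) for the feature weights, i.e. the δ-modulated updates used in both layers.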

learning the ‘short bars’ data

short bars in a 12x12 grid; average number of steps to goal: 11

[figure: data, RL action weights and feature weights after learning; input, reward, 2 actions (not shown)]

learning ‘long bars’ data

[figure: results for three conditions: WTA, non-negative weights; SoftMax, non-negative weights; SoftMax, no weight constraints]
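The three conditions differ only in the hidden-layer activation rule (winner-take-all vs. SoftMax) and in whether the feature weights are clipped to be non-negative. A minimal sketch of those two knobs, with function names and the β parameter being my own:

import numpy as np

def softmax(h, beta=1.0):
    """Graded activation used as the alternative to WTA (beta is an assumed parameter)."""
    e = np.exp(beta * (h - h.max()))
    return e / e.sum()

def hidden_code(W, x, rule='wta'):
    """State code of the feature layer under the two activation rules compared here."""
    h = W @ x
    if rule == 'wta':
        s = np.zeros_like(h)
        s[np.argmax(h)] = 1.0     # winner-take-all: one-hot code
        return s
    return softmax(h)             # SoftMax: graded code

def constrain(W, non_negative=True):
    """Apply (or skip) the non-negativity constraint on the feature weights after each update."""
    return np.clip(W, 0.0, None) if non_negative else W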

Discussion

- simple model: SARSA on winner-take-all network with δ-feedback

- learns only the features that are relevant for the action strategy

- underlying theory: (approximate) derivation from the estimation of the value function

- non-negative coding aids feature extraction

- link between unsupervised and reinforcement learning

- demonstration with more realistic data needed

Sponsors

Bernstein Focus Neurotechnology, BMBF grant 01GQ0840

EU project 231722 “IM-CLeVeR”, call FP7-ICT-2007-3

Frankfurt Institute for Advanced Studies, FIAS


thank you ...
