Presentation next week: cerebellum and supervised learning. Kitazawa S, Kimura T, Yin PB. Cerebellar complex spikes encode both destinations and errors in arm movements. Nature. 1998;392(6675):494-7.


Page 1

Presentation next week: cerebellum and supervised learning

Kitazawa S, Kimura T, Yin PB. Cerebellar complex spikes encode both destinations and errors in arm movements.

Nature. 1998;392(6675):494-7.

Page 2

Motor learning: learning algorithms
- network and distributed representations
- supervised learning
  - perceptrons and LMS
  - backpropagation
- reinforcement learning
- unsupervised learning
  - Hebbian networks

Page 3

Motor learning

- supervised learning - knowledge of desired behavior is specified

i.e. for every input x, we know the corresponding desired output y

Page 4

Motor learning - supervised learning

e.g. learning the mapping between joint configuration and endpoint position

Vision gives information about both values (or proprioception could be used for the joint angles)

Page 5

Motor learning - supervised learning

- limited feedback from the periphery
- just get a 'good' or 'bad' evaluation
- have to adjust behavior to maximize the 'good' evaluation

=> reinforcement learning

e.g. maze learning

A sequence of actions leads to a reward - how do we learn the appropriate sequence?

Page 6

Motor learning - unsupervised learning

- no feedback from the periphery
- rely on statistics of the inputs (or outputs) to find structure in the data

e.g. clustering of data in the (x1, x2) plane

Page 7

Motor learning - unsupervised learning

- no feedback from the periphery
- rely on statistics of the inputs (or outputs) to find structure in the data

e.g. clustering of data in the (x1, x2) plane

Develop representations based on properties of the data

Page 8

Motor learning

- supervised motor learning
  - parameterized models
  - non-parametric, 'neural network' models
  - reinforcement learning

- unsupervised learning
  - Hebbian learning
  - principal components analysis
  - independent components analysis

Page 9

Supervised motor learning - learning parameterized models

linear regression

We know the general structure of the model:

y = a*x + b

but we don’t know parameters a or b.

[figure: scatter plot of paired (x, y) data]

We want to estimate a and b based on paired data sets {xi} and {yi}

Page 10


Parameterized models

Linear regression

y = a*x + b

analytical solution (Intro stats):

a = Σ (xi - <x>)(yi - <y>) / Σ (xi - <x>)²

b = <y> - a<x>

<x> is the expected value of x, i.e. the mean

This is from Intro stats - a single step of calculation across all the data
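As a concrete check of this closed-form solution, a minimal NumPy sketch (the synthetic data, the true parameters, and the variable names are illustrative assumptions, not from the slides):

```python
import numpy as np

# synthetic paired data {x_i}, {y_i} generated from y = a*x + b plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=200)
y = 4.0 * x + 2.0 + rng.normal(0, 5.0, size=200)

# closed-form least-squares estimates: a single step over all the data
a_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_hat = y.mean() - a_hat * x.mean()

print(a_hat, b_hat)   # should be close to the true values 4.0 and 2.0
```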

Page 11

Linear regression using iterative gradient descent

y* = a* x + b*; a*, b* the correct parameters, y* is the observed data

assume initial parameters a and b, and define an error term:

E = ½(y - y*)²; y is the value predicted by the current parameters, y* is the target value

we want to find parameters which minimize this error - move the parameters to reduce the error:

a = a + da; da is the change in a to reduce the error
b = b + db; db is the change in b to reduce the error

choose da, db along the negative gradient of the error

Parameterized models

Page 12

y* = a*x + b*;  E = ½(y - y*)²

find the gradient of the error wrt the parameters:

dE/da = (y - y*) dy/da = (y - y*) x

dE/db = (y - y*)

choose: a = a - u (y - y*) x;  b = b - u (y - y*)

with 0 < u < 1 to control the speed of learning

Parameterized models
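For comparison with the closed-form solution, a minimal sketch of the same fit done iteratively with the update rule above (the synthetic data and the learning rate u are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=200)
y_star = 4.0 * x + 2.0 + rng.normal(0, 5.0, size=200)   # observed targets y*

a, b = 0.0, 0.0      # initial parameters
u = 0.005            # learning rate, 0 < u < 1

for epoch in range(200):
    for xi, yi_star in zip(x, y_star):
        y = a * xi + b           # prediction with the current parameters
        err = y - yi_star        # (y - y*)
        a -= u * err * xi        # dE/da = (y - y*) x
        b -= u * err             # dE/db = (y - y*)

print(a, b)   # converges near the true values 4.0 and 2.0
```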

Page 13


e.g. iterative gradient descent for linear regression

Parameterized models

Page 14

learn the limb parameters (link lengths l1, l2) for a 2-dof arm:

x = l1*cos(θ1) + l2*cos(θ1 + θ2)
y = l1*sin(θ1) + l2*sin(θ1 + θ2)

[figure: two-link arm with joint angles θ1, θ2 and endpoint (x, y)]

Parameterized models
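A hedged sketch of how the link lengths l1 and l2 could be fit by gradient descent on the endpoint error, in the spirit of this slide (the true lengths, the joint-angle sampling, and the learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
l1_true, l2_true = 0.30, 0.25                      # 'unknown' link lengths (m)
theta = rng.uniform(0, np.pi / 2, size=(500, 2))   # joint angles (theta1, theta2)

# observed endpoint positions from the forward kinematics
x_star = l1_true * np.cos(theta[:, 0]) + l2_true * np.cos(theta[:, 0] + theta[:, 1])
y_star = l1_true * np.sin(theta[:, 0]) + l2_true * np.sin(theta[:, 0] + theta[:, 1])

l1, l2 = 0.1, 0.1    # initial guesses
u = 0.01             # learning rate

for epoch in range(200):
    for (t1, t2), xs, ys in zip(theta, x_star, y_star):
        x = l1 * np.cos(t1) + l2 * np.cos(t1 + t2)   # predicted endpoint
        y = l1 * np.sin(t1) + l2 * np.sin(t1 + t2)
        ex, ey = x - xs, y - ys                      # endpoint errors
        # gradients of E = 1/2 (ex^2 + ey^2) wrt l1 and l2
        l1 -= u * (ex * np.cos(t1) + ey * np.sin(t1))
        l2 -= u * (ex * np.cos(t1 + t2) + ey * np.sin(t1 + t2))

print(l1, l2)   # converge toward 0.30 and 0.25
```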

Page 15

Motor learning and representations - how are properties of the limb represented by the CNS?

Distributed representations
- parameters are not explicitly fit
- both the parameters and the model structure are identified

[network diagram: joint angle -> end position]

Learn parameters and model within a distributed network

Page 16

Distributed models - network architecture

[network diagram: input units x1, x2 fully connected to output units y1, y2 through weights w11, w12, w21, w22]

y1 = w11*x1 + w21*x2
y2 = w12*x1 + w22*x2

=> y = Wx

as shown here, this is just linear regression

Page 17

Distributed network models

[network diagram: inputs x (units 1-3) -> weights W -> outputs y (units 1-2)]

simple network: one layer, linear units

y = Wx

from inputs x and corresponding outputs y*, find the W that best approximates the function

Page 18

Distributed network models

[network diagram: inputs x -> weights W -> outputs y]

simple network: one layer, linear units

To fit the network parameters:

define the error:

E = ½(y - y*)²

take the derivative wrt the weights: dE/dW = (y - y*) x^T

update the weights: W = W - u (y - y*) x^T

or, changing weight by weight: wij = wij - u (yj - yj*) xi

i.e. similar to the rule for linear regression

this is the Widrow-Hoff / adaline / LMS rule - the least mean squares rule
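A minimal NumPy sketch of this online LMS/Widrow-Hoff update for a one-layer linear network (the target mapping, the array shapes, and the learning rate u are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W_true = np.array([[1.0, -2.0, 0.5],
                   [0.3,  0.7, -1.0]])   # 'unknown' 2x3 mapping to be learned
X = rng.normal(size=(1000, 3))           # inputs x
Y_star = X @ W_true.T                    # target outputs y*

W = np.zeros((2, 3))    # network weights (outputs x inputs)
u = 0.05                # learning rate

for x, y_star in zip(X, Y_star):         # online mode: one data point at a time
    y = W @ x                            # network output y = Wx
    W -= u * np.outer(y - y_star, x)     # W = W - u (y - y*) x^T

print(np.round(W, 3))   # approaches W_true
```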

Page 19

Distributed network models - linear units, single layer networks

[network diagram: inputs x -> weights W -> outputs y]

batch mode: learn from all the data at once

W = W - u dE/dW

online mode: learn from one data point at a time

W = W - u dEi/dW; where Ei is the error on the ith data point {xi, yi}

Page 20

Distributed network models

[network diagram: inputs x -> weights W -> outputs y]

linear units, single layer networks

- essentially linear regression

- gradient descent learning rule leads to LMS update rule to change weights iteratively

Page 21

Distributed network models

- more complicated computations
- classification: learn to assign data points to the correct category

[figure: data points in the (x1, x2) plane]

Page 22

Distributed network models

- more complicated computations
- classification: learn to assign data points to the correct category

[figure: two classes of data points in the (x1, x2) plane, labelled y = 1 and y = -1]

We want to classify the inputs (x) to outputs of either y = {-1, 1}, i.e. categorize the data

Page 23

Distributed network models

- more complicated computations
- classification: learn to assign data points to the correct category

[figure: the weight vector w in the (x1, x2) plane, separating the region w*x > 0 (y = 1) from w*x < 0 (y = -1)]

The weight vector acts to project the inputs to produce the outputs - if we take y = sign(w*x), we can do classification

Page 24

Distributed network models - categorization (non-linear transformation)

Learning in nonlinear networks - the outputs are a non-linear function of their inputs:

sigmoidal 'squashing' function: g(Wx) = 1/(1 + exp(-Wx))

[network diagram: input patterns x -> weights W -> output category y in (0, 1)]

[figure: g(Wx) vs Wx - a sigmoid rising from 0 to 1]

works like a 'bistable' categorization unit; can also use g(x) = sign(x) (perceptrons)

Page 25

Distributed network models - categorization (non-linear transformation)

[network diagram: input patterns x -> weights W -> output category y in (0, 1)]

Learning in nonlinear networks

y = g(Wx) = 1/(1 + exp(-Wx))

Find the gradient:

E = ½(y - y*)²

dE/dW = (y - y*) g'(Wx) x^T

note that: g'(z) = g(z)(1 - g(z))

this is the basic neural network learning rule
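As an illustration, a small sketch of this rule for a single logistic output unit trained on a two-class problem (the data, the appended bias input, and the learning rate are assumptions made for the example):

```python
import numpy as np

def g(z):                           # sigmoid 'squashing' function
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# two clusters of 2-D patterns, labelled 0 and 1
X = np.vstack([rng.normal(-1.5, 1.0, size=(100, 2)),
               rng.normal(+1.5, 1.0, size=(100, 2))])
X = np.hstack([X, np.ones((200, 1))])          # append a constant bias input
y_star = np.concatenate([np.zeros(100), np.ones(100)])

w = np.zeros(3)
u = 0.1

for epoch in range(100):
    for x, t in zip(X, y_star):
        y = g(w @ x)                           # unit output
        # dE/dw = (y - y*) g'(wx) x, with g'(z) = g(z)(1 - g(z))
        w -= u * (y - t) * y * (1.0 - y) * x

print(w)   # a weight vector that separates the two clusters
```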

Page 26

Distributed network models

non-linear units, single layer networks

- ‘logistic’, non-linear regression

- allows learning of categorization problems

[network diagram: input patterns x -> weights W -> output category y in (0, 1)]

Page 27

Distributed network models - single layer, classification networks

[figure: the four input points in the (x1, x2) plane]

x1  x2  y
0   0   0
0   1   0
1   0   0
1   1   1

find a network to perform the logical AND function

Page 28

Distributed network models - single layer, classification networks

[figure: the four input points in the (x1, x2) plane, with a separating line]

x1  x2  y
0   0   0
0   1   0
1   0   0
1   1   1

find a network to perform the logical AND function

Page 29

Distributed network models - single layer, classification networks

logical AND

[figure: the AND problem in the (x1, x2) plane]

- choose W = [1 1]
- need an offset to the inputs to shift the origin: add (-.6, -.6) to (x1, x2)

[network diagram: offset inputs -> weights W -> threshold output y]

x1  x2  Wx    threshold(y)
0   0   -1.2  0
0   1   -.2   0
1   0   -.2   0
1   1   .8    1
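A tiny sketch checking this AND unit, with the input offset and a hard threshold as described above (the check itself is hypothetical, not from the slides):

```python
import numpy as np

W = np.array([1.0, 1.0])            # weights
offset = np.array([-0.6, -0.6])     # shift applied to the inputs

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2]) + offset
        wx = W @ x                            # = x1 + x2 - 1.2
        y = int(wx > 0)                       # hard threshold
        print(x1, x2, round(wx, 2), y)        # only (1, 1) gives y = 1
```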

Page 30

Distributed network models - single layer, classification networks

[figure: the XOR problem in the (x1, x2) plane]

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

find a network to perform the logical XOR function

What weights will make this work? There are none - single layer networks are computationally limited

Page 31

Distributed network models - multiple layer networks

[network diagram: inputs x1, x2 each connect with weight 1 to hidden units h1 (threshold -.5) and h2 (threshold -1.5); the output y receives weight 1 from h1 and weight -2 from h2, with threshold -.5]

x1  x2  h1  h2  Wh   y
0   0   0   0   0    0
0   1   1   0   1    1
1   0   1   0   1    1
1   1   1   1   -1   0

- more complicated computations can be performed with multiple layer networks - can characterize problems which are not linearly separable

XOR can be solved with multi-layered network
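A small sketch verifying this multi-layer XOR network with hard-threshold units, using the weights and thresholds shown above (the check itself is hypothetical, not from the slides):

```python
import numpy as np

def step(z):                          # hard threshold unit
    return (z > 0).astype(float)

W_h = np.array([[1.0, 1.0],           # weights from (x1, x2) to h1
                [1.0, 1.0]])           # and to h2
th_h = np.array([0.5, 1.5])            # hidden thresholds
v = np.array([1.0, -2.0])              # weights from (h1, h2) to y
th_y = 0.5                             # output threshold

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2])
        h = step(W_h @ x - th_h)               # hidden layer activities
        y = 1 if (v @ h - th_y) > 0 else 0     # thresholded output
        print(x1, x2, h, y)                    # reproduces the XOR table
```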

Page 32

Distributed network models - learning in multiple layer networks

[network diagram: inputs x -> weights W -> hidden units h -> weights V -> outputs y]

Consider a linear network: h = Wx, y = Vh

NB: there's not much point to multiple layers with linear units, since it can all be re-expressed as a single linear network:

y = VWx = W'x; i.e. just redefine your weight matrix

Page 33

Distributed network models - learning in multiple layer networks

[network diagram: inputs x -> weights W -> hidden units h -> weights V -> outputs y]

linear network: h = Wx, y = Vh

Form the error: E = ½(y - y*)²

To update the weights V, from h to y:

dE/dV = (y - y*) dy/dV = (y - y*) h^T

i.e. the same rule as for the single layer network

Page 34

Distributed network models - learning in multiple layer networks

[network diagram: inputs x -> weights W -> hidden units h -> weights V -> outputs y]

linear network: h = Wx, y = Vh

To update the weights W, from x to h, use the chain rule:

dE/dW = (y - y*) dy/dW = (y - y*) dy/dh dh/dW = V^T (y - y*) x^T - this is the gradient for the 'hidden' layer

Page 35

Distributed network models - learning in multiple layer networks

[network diagram: inputs x -> weights W -> hidden units h -> weights V -> outputs y]

non-linear network:

h = g(Wx), y = g(Vh)

Updating the weights V is the same as before:

dE/dV = (y - y*) g'(Vh) h^T

Page 36

Distributed network models - learning in multiple layer networks

[network diagram: inputs x -> weights W -> hidden units h -> weights V -> outputs y]

To update the weights W, use the chain rule:

dE/dW = (y - y*) dy/dW = (y - y*) dy/dh dh/dW = [V^T ((y - y*) g'(Vh))] g'(Wx) x^T

Essentially, we’re propagating the error backwards through the network, changing weights according to how much they affect the output

=> Backpropagation learning
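A compact NumPy sketch of backpropagation for one hidden layer of sigmoid units, trained here on XOR as a demonstration (the network size, bias handling, learning rate, and number of epochs are illustrative assumptions):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))             # sigmoid squashing function

rng = np.random.default_rng(0)
# XOR as a demonstration problem; a constant 1 is appended as a bias input
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
Y_star = np.array([[0.0], [1.0], [1.0], [0.0]])

W = rng.normal(scale=0.5, size=(4, 3))          # input (+bias) -> hidden weights
V = rng.normal(scale=0.5, size=(1, 5))          # hidden (+bias) -> output weights
u = 0.5

for epoch in range(20000):
    for x, y_star in zip(X, Y_star):
        h = np.append(g(W @ x), 1.0)            # hidden activities, plus a bias unit
        y = g(V @ h)                            # network output
        delta_y = (y - y_star) * y * (1.0 - y)  # output error, weighted by g'
        # propagate the error back through V, weighted by g' of the hidden layer
        delta_h = (V[:, :4].T @ delta_y) * h[:4] * (1.0 - h[:4])
        V -= u * np.outer(delta_y, h)           # dE/dV
        W -= u * np.outer(delta_h, x)           # dE/dW

# training may stall in a local minimum (see the slide on local minima below);
# re-running with a different random seed usually fixes that
for x in X:
    h = np.append(g(W @ x), 1.0)
    print(x[:2], round(g(V @ h)[0], 2))         # outputs approach 0, 1, 1, 0
```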

Page 37

[network diagram: inputs x -> weights W -> hidden units h -> weights V -> outputs y]

Distributed network models - backpropagation learning in multiple layer networks

1. Find out how much of the error in the output is due to V
- the responsibility will be due to the activity of h: dE/dV = (y - y*) h^T
- change V according to this responsibility

2. Find out how much of the error is due to W
- units in h which have a large output weight V will be more responsible for the error (i.e. weight the error by V): V^T (y - y*)
- values in h will be due to activities in x (i.e. weight h's responsibility by x): dE/dW = V^T (y - y*) x^T
- change W according to this 'accumulated' responsibility

linear network: h = Wx, y = Vh

Page 38

Learning in multi-layer neural networks - backpropagation learning

- allows for simple learning of arbitrarily complex input/output mappings

- with enough 'neurons', almost any mapping is possible

- results in 'distributed' representations

- knowledge of the mapping is distributed across neuronal populations, not individual cells

- changes in restricted regions of the input state space will result in restricted changes of the output

Page 39

Learning in multi-layer neural networks - backpropagation learning

- much slower than parameterized models
- the network needs to estimate the parameters and the model structure from scratch
- convergence can be slow, especially if the error surface is shallow

- speed can be increased by altering the learning rate (annealing), by using conjugate gradient descent, or with 'momentum':

W = W - u dE/dW + n·(change in W on the last step)

[figure: error surface as a function of the parameters]
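A minimal sketch of the momentum update (the learning rate u and the momentum coefficient m are illustrative assumptions):

```python
import numpy as np

def update_with_momentum(W, dE_dW, dW_prev, u=0.1, m=0.9):
    """Gradient step plus a fraction of the previous weight change."""
    dW = -u * dE_dW + m * dW_prev
    return W + dW, dW

# usage: carry dW_prev (initially zeros) from one update to the next
W = np.zeros((2, 3))
dW_prev = np.zeros_like(W)
dE_dW = np.ones_like(W)              # placeholder gradient, for illustration only
W, dW_prev = update_with_momentum(W, dE_dW, dW_prev)
```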

Page 40

Learning in multi-layer neural networks - backpropagation learning

- local minima - error surface might have small ‘basins’ which can trap the network

[figure: error vs parameters, showing a local minimum and the global minimum]

Start the network in different initial conditions to find the global minimum

Page 41

Learning in multi-layer neural networks - backpropagation learning

- Choosing the learning rate
- small values of u can take a long time for the network to converge
- large values can lead to instability

[figure: gradient descent trajectories with the learning rate too high vs a learning rate that is ok]

Page 42

Motor learning: learning algorithms
- gradient descent - change model parameters to reduce error in prediction

- parameterized models
- non-parametric models
  - single layer, linear and non-linear networks - LMS/adaline learning rules
  - multi-layer, non-linear networks - backpropagation learning

- in all of the above, we knew the correct answer and tried to match it - i.e. ‘supervised learning’

- But what if our knowledge of outcome is limited?

=> reinforcement learning

Page 43

Reinforcement learning - supervised learning, but with limited feedback

[diagram: a network mapping inputs to outputs; the environment returns an evaluation of {good, bad}]

The environment sends back a global signal saying good or bad (1 or -1) depending on system performance, e.g. move the limb and bump into things (pain as a reinforcer)

Page 44

Reinforcement learning - supervised learning, but with limited feedback

Using a global reinforcement signal to train a network

Basic idea:
- start with an initial network
- produce an output based on a given input - but add noise to the network to explore
- evaluate the output
- find those units with large activity
- change the weights so that they'll be large the next time the input is given

Page 45

Reinforcement learning - supervised learning, but with limited feedback

Using a global reinforcement signal to train a network

Associative reward-penalty algorithm (AR-P)

[network diagram: inputs x -> weights W -> stochastic outputs y]

Consider probabilistic outputs: p(y = 1) = 1/(1 + exp(-Wx))

The output produced on any given trial is therefore stochastic, with expected value determined by the sigmoid: <y> = 2·p(y = 1) - 1 = tanh(Wx/2)

We then use gradient descent to get the update rule:

dW = u+ (y - <y>) x^T, if r is a reward
dW = u- (-y - <y>) x^T, if r is a penalty

Page 46

Reinforcement learning - supervised learning, but with limited feedback

Using a global reinforcement signal to train a network

Associative reward-penalty algorithm (AR-P)

[network diagram: inputs x -> weights W -> stochastic outputs y]

dW = u+ (y - <y>) x^T, if r is a reward
dW = u- (-y - <y>) x^T, if r is a penalty

1) if expected value is close to what it actually did, then don’t change things (nothing new)

2) if expected value is different from what it did, and it was rewarded, then change W so that it will do it again

3) if expected value is different from what it did, and it was penalized, then change W so that it won’t do it again

=> trial and error learning
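A rough sketch of this AR-P update for a single stochastic ±1 output unit on a simple association task (the task, the reward rule, and the two learning rates are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# a simple association task (logical OR, coded as +/-1); a constant 1 is a bias input
X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 1]], dtype=float)
targets = np.array([1, 1, 1, -1])       # used only to decide reward vs penalty

w = np.zeros(3)
u_plus, u_minus = 0.5, 0.05             # reward rate and (smaller) penalty rate

for trial in range(2000):
    i = rng.integers(len(X))
    x, t = X[i], targets[i]
    p1 = sigmoid(w @ x)                 # p(y = +1)
    y = 1 if rng.random() < p1 else -1  # stochastic (noisy) output
    y_mean = 2 * p1 - 1                 # expected output <y>
    if y == t:                          # environment says 'good'
        w += u_plus * (y - y_mean) * x
    else:                               # environment says 'bad'
        w += u_minus * (-y - y_mean) * x

for x in X:
    print(x[:2], round(sigmoid(w @ x), 2))   # p(y = +1) approaches the OR targets
```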

Page 47

Reinforcement learning - supervised learning, but with limited feedback

Using a global reinforcement signal to train a network

Much slower than gradient descent

More biologically plausible - how is error backpropagated in supervised learning?

- more directly ethologically plausible
- based on direct reward/penalty feedback
- information about survival

Page 48

Motor learning: learning algorithms
- gradient descent - change model parameters to reduce error in prediction

- parameterized models
- non-parametric models
  - single layer, linear and non-linear networks - LMS/adaline learning rules
  - multi-layer, non-linear networks - backpropagation learning

- reinforcement learning
  - AR-P networks
  - Q learning, TD learning, dynamic programming

- unsupervised learning
