Presentation next week: cerebellum and supervised learning
Kitazawa S, Kimura T, Yin PB. Cerebellar complex spikes encode both destinations and errors in arm movements. Nature. 1998;392(6675):494-7.
Motor learning: learning algorithms
- network and distributed representations
- supervised learning
  - perceptrons and LMS
  - backpropagation
- reinforcement learning
- unsupervised learning
  - Hebbian networks
Motor learning
- supervised learning: knowledge of the desired behavior is specified
- i.e. for every input x, we know the corresponding desired output y
Motor learning - supervised learning
e.g. learning the mapping between joint configuration and end-point position
Vision gives you information about both values (or proprioception could supply the joint angles)
Motor learning - supervised learning
- limited feedback from the periphery: just get a 'good' or 'bad' evaluation
- have to adjust behavior to maximize the 'good' evaluation
=> reinforcement learning
e.g. maze learning
A sequence of actions leads to a reward - how do we learn the appropriate sequence?
Motor learning - unsupervised learning
- no feedback from the periphery - rely on statistics of the inputs (or outputs) to find structure in the data
e.g. clustering of data (figure: scatter of points forming clusters in the x1-x2 plane)
Develop representations based on properties of the data
Motor learning
- supervised motor learning
  - parameterized models
  - non-parametric, 'neural network' models
- reinforcement learning
- unsupervised learning
  - Hebbian learning
  - principal components analysis
  - independent components analysis
Supervised motor learning - learning parameterized models
linear regression
We know the general structure of the model:
y = a*x + b
but we don't know the parameters a or b.
(figure: scatter of (x, y) data with a linear trend)
We want to estimate a and b based on the paired data sets {x_i} and {y_i}
Parameterized models
Linear regression
y = a*x + b
analytical solution (Intro stats):
a = Σ(x_i - <x>)(y_i - <y>) / Σ(x_i - <x>)^2
b = <y> - a<x>
<x> is the expected value of x, i.e. the mean of x
This is from Intro stats - a single step of calculation across all the data
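As a sanity check, a minimal numpy sketch of this closed-form fit; the synthetic data and the "true" values a = 7, b = 2 are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=100)
y = 7.0 * x + 2.0 + rng.normal(0.0, 1.0, size=100)   # assumed "true" a = 7, b = 2, plus noise

# slope from centered cross- and auto-products, intercept from the means
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
print(a, b)   # close to 7 and 2
```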
Linear regression using iterative gradient descent
y* = a*·x + b*; a*, b* are the correct parameters, y* is the observed data
assume initial parameters a and b, and define an error term:
E = 1/2 (y - y*)^2; y is the value predicted by the current parameters, y* is the target value
we want to find parameters which minimize this error - move the parameters to reduce the error:
a = a + da; da is the change in a that reduces the error
b = b + db; db is the change in b that reduces the error
choose da, db along the gradient of the error
Parameterized models
y* = a*·x + b*; E = 1/2 (y - y*)^2
find the gradient of the error with respect to the parameters:
dE/da = (y - y*) dy/da = (y - y*)·x
dE/db = (y - y*)
choose: a = a - η(y - y*)·x; b = b - η(y - y*)
with 0 < η < 1 to control the speed of learning
Parameterized models
(figure: data and fitted line)
e.g. iterative gradient descent for linear regression
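A minimal sketch of the iterative version, applying the updates a ← a - η(y - y*)x and b ← b - η(y - y*) one data point at a time (the data and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.uniform(-10, 10, size=200)
ys = 7.0 * xs + 2.0 + rng.normal(0.0, 1.0, size=200)   # assumed "true" a = 7, b = 2

a, b, eta = 0.0, 0.0, 0.005          # initial parameters, learning rate 0 < eta < 1
for epoch in range(100):
    for x, y_star in zip(xs, ys):
        y = a * x + b                # prediction from the current parameters
        a -= eta * (y - y_star) * x  # move each parameter down the error gradient
        b -= eta * (y - y_star)
print(a, b)                          # converges toward 7 and 2
```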
Parameterized models
learn limb parameters for a 2-dof arm:
x = l1·cos(θ1) + l2·cos(θ1 + θ2)
y = l1·sin(θ1) + l2·sin(θ1 + θ2)
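A small sketch of this forward kinematics; the link lengths here are made-up values:

```python
import numpy as np

def endpoint(theta1, theta2, l1=0.3, l2=0.25):
    """End-point position of a planar 2-dof arm from its joint angles."""
    x = l1 * np.cos(theta1) + l2 * np.cos(theta1 + theta2)
    y = l1 * np.sin(theta1) + l2 * np.sin(theta1 + theta2)
    return x, y

print(endpoint(np.pi / 4, np.pi / 6))
```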
Parameterized models
Motor learning and representations - how are properties of the limb represented by the CNS?
Distributed representations
- parameters are not explicitly fit
- both the parameters and the model structure are identified
(diagram: network mapping joint angle to end position)
Learn parameters and model within a distributed network
Distributed models - network architecture
(diagram: input units x1, x2 connected by weights W to output units y1, y2)
y1 = w11·x1 + w21·x2
y2 = w12·x1 + w22·x2
=> y = Wx
as shown here, this is just linear regression
Distributed network models
(diagram: three inputs x, weights W, two linear output units y)
simple network: one layer, linear units
y = Wx
from inputs x and corresponding outputs y*, find W that best approximates the function
Distributed network models
(diagram: three inputs x, weights W, two linear output units y)
simple network: one layer, linear units
To fit the network parameters:
define the error: E = 1/2 (y - y*)^2
take the derivative with respect to the weights: dE/dW = (y - y*) x^T
update the weights: W = W - η(y - y*) x^T
or weight by weight: w_ij = w_ij - η(y_j - y_j*) x_i
i.e. similar to the rule for linear regression
this is the Widrow-Hoff / ADALINE / LMS rule - least mean squares rule
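A minimal sketch of the LMS rule for a one-layer linear network y = Wx; the target mapping, the dimensions, and the learning rate are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
W_true = rng.normal(size=(2, 3))        # assumed target mapping: 3 inputs -> 2 outputs
W = np.zeros((2, 3))                    # network weights, start at zero
eta = 0.05                              # learning rate

for _ in range(2000):
    x = rng.normal(size=3)
    y_star = W_true @ x                 # supervised target for this input
    y = W @ x                           # network prediction
    W -= eta * np.outer(y - y_star, x)  # W <- W - eta (y - y*) x^T
print(np.abs(W - W_true).max())         # near zero after training
```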
Distributed network models - linear units, single layer networks
(diagram: three inputs x, weights W, two output units y)
batch mode: learn from all the data at once
W = W - η dE/dW
online mode: learn from one data point at a time
W = W - η dE_i/dW; for {x_i, y_i}, the ith data point
Distributed network models
(diagram: three inputs x, weights W, two output units y)
linear units, single layer networks
- essentially linear regression
- gradient descent learning rule leads to LMS update rule to change weights iteratively
Distributed network models
- more complicated computations
- classification: learn to assign data points to the correct category
(figure: two classes of points in the x1-x2 plane, one region labeled y = 1, the other y = -1)
We want to map the inputs x to outputs y ∈ {-1, 1}, i.e. categorize the data
Distributed network models
- more complicated computations
- classification: learn to assign data points to the correct category
(figure: the weight vector w splits the x1-x2 plane into the regions y = w·x > 0 and y = w·x < 0)
The weight vector acts to project the inputs to produce the outputs - if we take y = sign(w·x), we can do classification
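A tiny sketch of classification by projection, with an assumed weight vector w and synthetic points:

```python
import numpy as np

rng = np.random.default_rng(3)
w = np.array([1.0, -2.0])    # assumed weight vector defining the boundary
X = rng.normal(size=(5, 2))  # five 2-d input points
print(np.sign(X @ w))        # +1 on one side of the line w.x = 0, -1 on the other
```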
Distributed network models - categorization (non-linear transformation)
Learning in nonlinear networks - outputs are a non-linear function of their inputs:
sigmoidal 'squashing' function g(Wx) = 1/(1 + exp(-Wx))
(diagram: three input units for the pattern x, weights W, one output unit y giving the category (0,1))
(plot: g(Wx) vs Wx; the sigmoid rises smoothly from 0 to 1)
works like a 'bistable' categorization unit; can also use g(x) = sign(x) (Perceptrons)
Distributed network models - categorization (non-linear transformation)
(diagram: three input units for the pattern x, weights W, one output unit y giving the category (0,1))
Learning in nonlinear networks
y = g(Wx) = 1/(1 + exp(-Wx))
Find the gradient:
E = 1/2 (y - y*)^2
dE/dW = (y - y*) g'(Wx) x^T
note that: g'(z) = g(z)(1 - g(z))
this is the basic neural network learning rule
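A minimal sketch of this rule for a single sigmoidal unit, using g'(z) = g(z)(1 - g(z)); the toy data and learning rate are assumptions:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)   # targets y* in {0, 1}

w, eta = np.zeros(2), 0.5
for _ in range(200):
    for x, y_star in zip(X, t):
        y = g(w @ x)
        # dE/dw = (y - y*) g'(w.x) x, with g'(w.x) = y (1 - y)
        w -= eta * (y - y_star) * y * (1.0 - y) * x
print(w)   # grows roughly along [1, 1], the direction separating the classes
```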
Distributed network models
non-linear units, single layer networks
- ‘logistic’, non-linear regression
- allows learning of categorization problems
(diagram: three input units for the pattern x, weights W, one output unit y giving the category (0,1))
Distributed network models - single layer, classification networks
(figure: the four binary input points in the x1-x2 plane)
x1 x2 | y
 0  0 | 0
 0  1 | 0
 1  0 | 0
 1  1 | 1
find a network to perform logical AND function
Distributed network models - single layer, classification networks
logical AND
(figure: AND in the x1-x2 plane; only (1,1) falls in the positive class)
- choose W = [1 1]
- need an offset to the inputs to shift the origin: subtract 0.6 from each input, so Wx = (x1 - 0.6) + (x2 - 0.6) = x1 + x2 - 1.2
x1 x2 |   Wx | threshold(y)
 0  0 | -1.2 | 0
 0  1 | -0.2 | 0
 1  0 | -0.2 | 0
 1  1 |  0.8 | 1
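A quick check of this AND network in Python (W = [1 1], inputs offset by -0.6, thresholded at zero):

```python
import numpy as np

W = np.array([1.0, 1.0])
for x1 in (0, 1):
    for x2 in (0, 1):
        s = W @ (np.array([x1, x2]) - 0.6)   # Wx with the -0.6 input offsets
        print(x1, x2, round(float(s), 1), int(s > 0))   # reproduces the table above
```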
Distributed network models - single layer, classification networks
(figure: XOR in the x1-x2 plane; no single line separates the classes)
x1 x2 | y
 0  0 | 0
 0  1 | 1
 1  0 | 1
 1  1 | 0
find a network to perform the logical XOR function
What weights will make this work? There are none: single-layer networks are computationally limited
Distributed network models - multiple layer networks
(diagram: inputs x1, x2 each feed hidden units h1 and h2 with weight 1; h1 has bias -0.5, h2 has bias -1.5; the output unit has weights 1 and -2 on h1, h2 and bias -0.5)
x1 x2 | h1 h2 | Wh |  y
 0  0 |  0  0 |  0 |  0
 0  1 |  1  0 |  1 |  1
 1  0 |  1  0 |  1 |  1
 1  1 |  1  1 | -1 |  0
- more complicated computations can be performed with multiple-layer networks
- can characterize problems which are not linearly separable
XOR can be solved with multi-layered network
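A quick check of this XOR network, with the thresholds and weights from the table above:

```python
def step(z):
    return 1.0 if z > 0 else 0.0

for x1 in (0, 1):
    for x2 in (0, 1):
        h1 = step(x1 + x2 - 0.5)      # hidden unit 1: bias -0.5
        h2 = step(x1 + x2 - 1.5)      # hidden unit 2: bias -1.5
        y = step(h1 - 2 * h2 - 0.5)   # output: weights [1, -2], bias -0.5
        print(x1, x2, int(h1), int(h2), int(y))   # reproduces the XOR table
```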
Distributed network models - learning in multiple layer networks
(diagram: inputs x, weights W, hidden units h, weights V, outputs y)
Consider a linear network:
h = Wx, y = Vh
NB: there's not much point to multiple layers with linear units, since it can all be re-expressed as a single linear network:
y = VWx = W'x; i.e. just redefine your weight matrix
Distributed network models - learning in multiple layer networks
(diagram: inputs x, weights W, hidden units h, weights V, outputs y)
linear network: h = Wx, y = Vh
Form the error: E = 1/2 (y - y*)^2
To update the weights V, from h to y:
dE/dV = (y - y*) dy/dV = (y - y*) h^T
i.e. the same rule as for the single-layer network
Distributed network models - learning in multiple layer networks
(diagram: inputs x, weights W, hidden units h, weights V, outputs y)
linear network: h = Wx, y = Vh
To update the weights W, from x to h, use the chain rule:
dE/dW = (y - y*) dy/dW = (y - y*) dy/dh dh/dW = [V^T (y - y*)] x^T
- this is the gradient for the 'hidden' layer
Distributed network models - learning in multiple layer networks
(diagram: inputs x, weights W, hidden units h, weights V, outputs y)
non-linear network:
h = g(Wx), y = g(Vh)
Updating the weights V is the same as before:
dE/dV = [(y - y*) g'(Vh)] h^T
Distributed network models - learning in multiple layer networks
(diagram: inputs x, weights W, hidden units h, weights V, outputs y)
To update the weights W, use the chain rule:
dE/dW = (y - y*) dy/dW = (y - y*) dy/dh dh/dW = [V^T ((y - y*) g'(Vh))] g'(Wx) x^T
Essentially, we’re propagating the error backwards through the network, changing weights according to how much they affect the output
=> Backpropagation learning
Distributed network models - backpropagation learning in multiple layer networks
1. Find out how much of the error in the output is due to V
- the responsibility is weighted by the activity of h: dE/dV = (y - y*) h^T
- change V according to this responsibility
2. Find out how much of the error is due to W
- units in h which have a large output weight V are more responsible for the error (i.e. weight the error by V): V^T (y - y*)
- values in h are due to the activities in x (i.e. weight the h responsibility by x): dE/dW = [V^T (y - y*)] x^T
- change W according to this 'accumulated' responsibility
linear network: h = Wx, y = Vh
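A minimal backpropagation sketch for the two-layer sigmoidal network h = g(Wx), y = g(Vh), trained here on XOR; the layer sizes, bias handling, learning rate, and task are all assumptions for illustration:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0], dtype=float)       # XOR targets

rng = np.random.default_rng(5)
W = rng.normal(size=(3, 3))                   # inputs (2 + bias) -> 3 hidden units
V = rng.normal(size=(1, 4))                   # hidden (3 + bias) -> 1 output
eta = 0.5

for _ in range(20000):
    for x, t in zip(X, T):
        xb = np.append(x, 1.0)                # input with a bias term
        h = g(W @ xb)
        hb = np.append(h, 1.0)                # hidden layer with a bias term
        y = g(V @ hb)
        delta_y = (y - t) * y * (1 - y)                  # (y - y*) g'(Vh)
        delta_h = (V[:, :3].T @ delta_y) * h * (1 - h)   # error propagated back through V
        V -= eta * np.outer(delta_y, hb)
        W -= eta * np.outer(delta_h, xb)

# may occasionally stall in a local minimum (see below); re-run with another seed
for x in X:
    print(x, g(V @ np.append(g(W @ np.append(x, 1.0)), 1.0)).round(2))
```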
Learning in multi-layer neural networks - backpropagation learning
- allows for simple learning of arbitrarily complex input/output mappings
- with enough 'neurons', almost any mapping is possible
- results in ‘distributed’ representations
- knowledge of the mapping is distributed across neuronal populations not individual cells
- changes in restricted regions of the input state space will result in restricted changes of the output
Learning in multi-layer neural networks - backpropagation learning
- much slower than parameterized models
  - the network needs to estimate the parameters and the model structure from scratch
  - convergence can be slow, especially if the error surface is shallow
- speed can be increased by altering the learning rate (annealing), by using conjugate gradient descent, or with 'momentum' (see the sketch below): ΔW = -η dE/dW + α·(change in W last time)
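A minimal sketch of the momentum idea on a toy quadratic error surface E(w) = 0.5·w^2; the eta and alpha values are illustrative:

```python
w, dw = 5.0, 0.0         # parameter and "change in w last time"
eta, alpha = 0.1, 0.9    # learning rate and momentum coefficient
for _ in range(100):
    grad = w             # dE/dw for E = 0.5 * w**2
    dw = -eta * grad + alpha * dw   # gradient step plus a fraction of the last step
    w += dw
print(w)                 # near the minimum at w = 0
```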
(figure: a shallow error surface over the parameters)
Learning in multi-layer neural networks - backpropagation learning
- local minima - error surface might have small ‘basins’ which can trap the network
(figure: error surface over the parameters, with a small local minimum beside the global minimum)
Start the network in different initial conditions to find the global minimum
Learning in multi-layer neural networks - backpropagation learning
- Choosing the learning rate
  - small values of η can take a long time for the network to converge
  - large values can lead to instability
(figure: oscillating divergence when the learning rate is too high vs smooth convergence when the learning rate is ok)
Motor learning: learning algorithms
- gradient descent: change model parameters to reduce the error in prediction
  - parameterized models
  - non-parametric models
    - single layer, linear and non-linear networks: LMS/ADALINE learning rules
    - multi-layer, non-linear networks: backpropagation learning
- in all of the above, we knew the correct answer and tried to match it, i.e. 'supervised learning'
- but what if our knowledge of the outcome is limited?
=> reinforcement learning
Reinforcement learning - supervised learning, but with limited feedback
(diagram: a network mapping inputs to outputs; the environment sends back an evaluation: {good, bad})
The environment sends back a global signal saying good or bad (1 or -1) depending on system performance, e.g. move the limb and bump into things (pain as a reinforcer)
Reinforcement learning - supervised learning, but with limited feedback
Using a global reinforcement signal to train a network
Basic idea:
- start with an initial network
- produce an output based on a given input, but add noise to the network to explore
- evaluate the output
- find those units with large activity
- change the weights so that they'll be large the next time the input is given
Reinforcement learning - supervised learning, but with limited feedback
Using a global reinforcement signal to train a network: the associative reward-penalty algorithm (AR-P)
(diagram: inputs x, weights W, a single stochastic output unit y)
Consider probabilistic outputs: p(y = 1) = 1/(1 + exp(-2Wx))
The output produced on any given trial is therefore stochastic, with expected value determined by the sigmoid: <y> = tanh(Wx)
We then use gradient descent to get the update rule:
ΔW = η+ (y - <y>) x^T, if r is reward
ΔW = η- (-y - <y>) x^T, if r is penalty
Reinforcement learning - supervised learning, but with limited feedback
Using a global reinforcement signal to train a network: the associative reward-penalty algorithm (AR-P)
(diagram: inputs x, weights W, a single stochastic output unit y)
ΔW = η+ (y - <y>) x^T, if r is reward
ΔW = η- (-y - <y>) x^T, if r is penalty
1) if expected value is close to what it actually did, then don’t change things (nothing new)
2) if expected value is different from what it did, and it was rewarded, then change W so that it will do it again
3) if expected value is different from what it did, and it was penalized, then change W so that it won’t do it again
=> trial and error learning
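A minimal AR-P sketch with a single stochastic ±1 output unit; the task (reward when the output matches the sign of x1), the learning rates, and the sigmoid scaling are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
w = np.zeros(2)
eta_plus, eta_minus = 0.1, 0.01               # asymmetric reward/penalty rates

for _ in range(5000):
    x = rng.normal(size=2)
    p = 1.0 / (1.0 + np.exp(-2.0 * w @ x))    # p(y = +1)
    y = 1.0 if rng.random() < p else -1.0     # stochastic output = exploration
    y_mean = np.tanh(w @ x)                   # expected output <y>
    r = 1 if y == np.sign(x[0]) else -1       # environment's good/bad evaluation
    if r == 1:
        w += eta_plus * (y - y_mean) * x      # rewarded: make this output more likely
    else:
        w += eta_minus * (-y - y_mean) * x    # penalized: make it less likely
print(w)   # weight on x1 grows positive; weight on x2 stays near zero
```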
Reinforcement learning - supervised learning, but with limited feedback
Using a global reinforcement signal to train a network
Much slower than gradient descent
More biologically plausible - how would the error be backpropagated in supervised learning?
- more directly ethologically plausible
  - based on direct reward/penalty feedback
  - information about survival
Motor learning: learning algorithms
- gradient descent: change model parameters to reduce the error in prediction
  - parameterized models
  - non-parametric models
    - single layer, linear and non-linear networks: LMS/ADALINE learning rules
    - multi-layer, non-linear networks: backpropagation learning
- reinforcement learning
  - AR-P networks
  - Q learning, TD learning, dynamic programming
- unsupervised learning