Presentation next week: cerebellum and supervised learning
Kitazawa S, Kimura T, Yin PB. Cerebellar complex spikes encode both destinations and errors in arm movements. Nature. 1998;392(6675):494-7.
Motor learning: learning algorithms
- network and distributed representations
- supervised learning
  - perceptrons and LMS
  - backpropagation
- reinforcement learning
- unsupervised learning
  - Hebbian networks
Motor learning
- supervised learning: knowledge of the desired behavior is specified
- i.e. for every input x, we know the corresponding desired output y
Motor learning - supervised learning
e.g. learning the mapping between joint configuration and end-point position
Vision gives you information about both values (or proprioception could supply the joint angles)
Motor learning - supervised learning
- limited feedback from the periphery: just get a 'good' or 'bad' evaluation
- have to adjust behavior to maximize the 'good' evaluation
=> reinforcement learning
e.g. maze learning
A sequence of actions leads to a reward - how do we learn the appropriate sequence?
Motor learning - unsupervised learning
- no feedback from the periphery - rely on statistics of the inputs (or outputs) to find structure in the data
e.g. clustering of data (figure: scatter of points forming clusters in the x1-x2 plane)
Develop representations based on properties of the data
Motor learning
- supervised motor learning
  - parameterized models
  - non-parametric, 'neural network' models
- reinforcement learning
- unsupervised learning
  - Hebbian learning
  - principal components analysis
  - independent components analysis
Supervised motor learning - learning parameterized models
linear regression
We know the general structure of the model:
y = a*x + b
but we don't know the parameters a or b.
(figure: scatter of (x, y) data with a linear trend)
We want to estimate a and b based on the paired data sets {x_i} and {y_i}
Parameterized models
Linear regression
y = a*x + b
analytical solution (Intro stats):
a = Σ(x_i - <x>)(y_i - <y>) / Σ(x_i - <x>)^2
b = <y> - a<x>
<x> is the expected value of x, i.e. the mean of x
This is from Intro stats - a single step of calculation across all the data
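As a sanity check, a minimal numpy sketch of this closed-form fit; the synthetic data and the "true" values a = 7, b = 2 are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=100)
y = 7.0 * x + 2.0 + rng.normal(0.0, 1.0, size=100)   # assumed "true" a = 7, b = 2, plus noise

# slope from centered cross- and auto-products, intercept from the means
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
print(a, b)   # close to 7 and 2
```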
Linear regression using iterative gradient descent
y* = a*·x + b*; a*, b* are the correct parameters, y* is the observed data
assume initial parameters a and b, and define an error term:
E = 1/2 (y - y*)^2; y is the value predicted by the current parameters, y* is the target value
we want to find parameters which minimize this error - move the parameters to reduce the error:
a = a + da; da is the change in a that reduces the error
b = b + db; db is the change in b that reduces the error
choose da, db along the gradient of the error
Parameterized models
y* = a*·x + b*; E = 1/2 (y - y*)^2
find the gradient of the error with respect to the parameters:
dE/da = (y - y*) dy/da = (y - y*)·x
dE/db = (y - y*)
choose: a = a - η(y - y*)·x; b = b - η(y - y*)
with 0 < η < 1 to control the speed of learning
Parameterized models
(figure: data and fitted line)
e.g. iterative gradient descent for linear regression
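A minimal sketch of the iterative version, applying the updates a ← a - η(y - y*)x and b ← b - η(y - y*) one data point at a time (the data and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.uniform(-10, 10, size=200)
ys = 7.0 * xs + 2.0 + rng.normal(0.0, 1.0, size=200)   # assumed "true" a = 7, b = 2

a, b, eta = 0.0, 0.0, 0.005          # initial parameters, learning rate 0 < eta < 1
for epoch in range(100):
    for x, y_star in zip(xs, ys):
        y = a * x + b                # prediction from the current parameters
        a -= eta * (y - y_star) * x  # move each parameter down the error gradient
        b -= eta * (y - y_star)
print(a, b)                          # converges toward 7 and 2
```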
Parameterized models
learn limb parameters for a 2-dof arm:
x = l1·cos(θ1) + l2·cos(θ1 + θ2)
y = l1·sin(θ1) + l2·sin(θ1 + θ2)
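A small sketch of this forward kinematics; the link lengths here are made-up values:

```python
import numpy as np

def endpoint(theta1, theta2, l1=0.3, l2=0.25):
    """End-point position of a planar 2-dof arm from its joint angles."""
    x = l1 * np.cos(theta1) + l2 * np.cos(theta1 + theta2)
    y = l1 * np.sin(theta1) + l2 * np.sin(theta1 + theta2)
    return x, y

print(endpoint(np.pi / 4, np.pi / 6))
```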
Parameterized models
Motor learning and representations - how are properties of the limb represented by the CNS?
Distributed representations
- parameters are not explicitly fit
- both the parameters and the model structure are identified
(diagram: network mapping joint angle to end position)
Learn parameters and model within a distributed network
Distributed models - network architecture
(diagram: input units x1, x2 connected by weights W to output units y1, y2)
y1 = w11·x1 + w21·x2
y2 = w12·x1 + w22·x2
=> y = Wx
as shown here, this is just linear regression
Distributed network models
(diagram: three inputs x, weights W, two linear output units y)
simple network: one layer, linear units
y = Wx
from inputs x and corresponding outputs y*, find W that best approximates the function
Distributed network models
(diagram: three inputs x, weights W, two linear output units y)
simple network: one layer, linear units
To fit the network parameters:
define the error: E = 1/2 (y - y*)^2
take the derivative with respect to the weights: dE/dW = (y - y*) x^T
update the weights: W = W - η(y - y*) x^T
or weight by weight: w_ij = w_ij - η(y_j - y_j*) x_i
i.e. similar to the rule for linear regression
this is the Widrow-Hoff / ADALINE / LMS rule - least mean squares rule
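A minimal sketch of the LMS rule for a one-layer linear network y = Wx; the target mapping, the dimensions, and the learning rate are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
W_true = rng.normal(size=(2, 3))        # assumed target mapping: 3 inputs -> 2 outputs
W = np.zeros((2, 3))                    # network weights, start at zero
eta = 0.05                              # learning rate

for _ in range(2000):
    x = rng.normal(size=3)
    y_star = W_true @ x                 # supervised target for this input
    y = W @ x                           # network prediction
    W -= eta * np.outer(y - y_star, x)  # W <- W - eta (y - y*) x^T
print(np.abs(W - W_true).max())         # near zero after training
```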
Distributed network models - linear units, single layer networks
(diagram: three inputs x, weights W, two output units y)
batch mode: learn from all the data at once
W = W - η dE/dW
online mode: learn from one data point at a time
W = W - η dE_i/dW; for {x_i, y_i}, the ith data point
Distributed network models
(diagram: three inputs x, weights W, two output units y)
linear units, single layer networks
- essentially linear regression
- gradient descent learning rule leads to LMS update rule to change weights iteratively
Distributed network models
- more complicated computations
- classification: learn to assign data points to the correct category
(figure: two classes of points in the x1-x2 plane, one region labeled y = 1, the other y = -1)
We want to map the inputs x to outputs y ∈ {-1, 1}, i.e. categorize the data
Distributed network models
- more complicated computations
- classification: learn to assign data points to the correct category
(figure: the weight vector w splits the x1-x2 plane into the regions y = w·x > 0 and y = w·x < 0)
The weight vector acts to project the inputs to produce the outputs - if we take y = sign(w·x), we can do classification
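A tiny sketch of classification by projection, with an assumed weight vector w and synthetic points:

```python
import numpy as np

rng = np.random.default_rng(3)
w = np.array([1.0, -2.0])    # assumed weight vector defining the boundary
X = rng.normal(size=(5, 2))  # five 2-d input points
print(np.sign(X @ w))        # +1 on one side of the line w.x = 0, -1 on the other
```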
Distributed network models - categorization (non-linear transformation)
Learning in nonlinear networks - outputs are a non-linear function of their inputs:
sigmoidal 'squashing' function g(Wx) = 1/(1 + exp(-Wx))
(diagram: three input units for the pattern x, weights W, one output unit y giving the category (0,1))
(plot: g(Wx) vs Wx; the sigmoid rises smoothly from 0 to 1)
works like a 'bistable' categorization unit; can also use g(x) = sign(x) (Perceptrons)
Distributed network models - categorization (non-linear transformation)
(diagram: three input units for the pattern x, weights W, one output unit y giving the category (0,1))
Learning in nonlinear networks
y = g(Wx) = 1/(1 + exp(-Wx))
Find the gradient:
E = 1/2 (y - y*)^2
dE/dW = (y - y*) g'(Wx) x^T
note that: g'(z) = g(z)(1 - g(z))
this is the basic neural network learning rule
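A minimal sketch of this rule for a single sigmoidal unit, using g'(z) = g(z)(1 - g(z)); the toy data and learning rate are assumptions:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)   # targets y* in {0, 1}

w, eta = np.zeros(2), 0.5
for _ in range(200):
    for x, y_star in zip(X, t):
        y = g(w @ x)
        # dE/dw = (y - y*) g'(w.x) x, with g'(w.x) = y (1 - y)
        w -= eta * (y - y_star) * y * (1.0 - y) * x
print(w)   # grows roughly along [1, 1], the direction separating the classes
```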
Distributed network models
non-linear units, single layer networks
- ‘logistic’, non-linear regression
- allows learning of categorization problems
(diagram: three input units for the pattern x, weights W, one output unit y giving the category (0,1))
Distributed network models - single layer, classification networks
(figure: the four binary input points in the x1-x2 plane)
x1 x2 | y
 0  0 | 0
 0  1 | 0
 1  0 | 0
 1  1 | 1
find a network to perform logical AND function
Distributed network models - single layer, classification networks
logical AND
(figure: AND in the x1-x2 plane; only (1,1) falls in the positive class)
- choose W = [1 1]
- need an offset to the inputs to shift the origin: subtract 0.6 from each input, so Wx = (x1 - 0.6) + (x2 - 0.6) = x1 + x2 - 1.2
x1 x2 |   Wx | threshold(y)
 0  0 | -1.2 | 0
 0  1 | -0.2 | 0
 1  0 | -0.2 | 0
 1  1 |  0.8 | 1
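A quick check of this AND network in Python (W = [1 1], inputs offset by -0.6, thresholded at zero):

```python
import numpy as np

W = np.array([1.0, 1.0])
for x1 in (0, 1):
    for x2 in (0, 1):
        s = W @ (np.array([x1, x2]) - 0.6)   # Wx with the -0.6 input offsets
        print(x1, x2, round(float(s), 1), int(s > 0))   # reproduces the table above
```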
Distributed network models - single layer, classification networks
(figure: XOR in the x1-x2 plane; no single line separates the classes)
x1 x2 | y
 0  0 | 0
 0  1 | 1
 1  0 | 1
 1  1 | 0
find a network to perform the logical XOR function
What weights will make this work? There are none: single-layer networks are computationally limited
Distributed network models - multiple layer networks
(diagram: inputs x1, x2 each feed hidden units h1 and h2 with weight 1; h1 has bias -0.5, h2 has bias -1.5; the output unit has weights 1 and -2 on h1, h2 and bias -0.5)
x1 x2 | h1 h2 | Wh |  y
 0  0 |  0  0 |  0 |  0
 0  1 |  1  0 |  1 |  1
 1  0 |  1  0 |  1 |  1
 1  1 |  1  1 | -1 |  0
- more complicated computations can be performed with multiple-layer networks
- can characterize problems which are not linearly separable
XOR can be solved with multi-layered network
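A quick check of this XOR network, with the thresholds and weights from the table above:

```python
def step(z):
    return 1.0 if z > 0 else 0.0

for x1 in (0, 1):
    for x2 in (0, 1):
        h1 = step(x1 + x2 - 0.5)      # hidden unit 1: bias -0.5
        h2 = step(x1 + x2 - 1.5)      # hidden unit 2: bias -1.5
        y = step(h1 - 2 * h2 - 0.5)   # output: weights [1, -2], bias -0.5
        print(x1, x2, int(h1), int(h2), int(y))   # reproduces the XOR table
```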
Distributed network models - learning in multiple layer networks
(diagram: inputs x, weights W, hidden units h, weights V, outputs y)
Consider a linear network:
h = Wx, y = Vh
NB: there's not much point to multiple layers with linear units, since it can all be re-expressed as a single linear network:
y = VWx = W'x; i.e. just redefine your weight matrix
Distributed network models - learning in multiple layer networks
(diagram: inputs x, weights W, hidden units h, weights V, outputs y)
linear network: h = Wx, y = Vh
Form the error: E = 1/2 (y - y*)^2
To update the weights V, from h to y:
dE/dV = (y - y*) dy/dV = (y - y*) h^T
i.e. the same rule as for the single-layer network
Distributed network models - learning in multiple layer networks
(diagram: inputs x, weights W, hidden units h, weights V, outputs y)
linear network: h = Wx, y = Vh
To update the weights W, from x to h, use the chain rule:
dE/dW = (y - y*) dy/dW = (y - y*) dy/dh dh/dW = [V^T (y - y*)] x^T
- this is the gradient for the 'hidden' layer
Distributed network models - learning in multiple layer networks
(diagram: inputs x, weights W, hidden units h, weights V, outputs y)
non-linear network:
h = g(Wx), y = g(Vh)
Updating the weights V is the same as before:
dE/dV = [(y - y*) g'(Vh)] h^T
Distributed network models - learning in multiple layer networks
(diagram: inputs x, weights W, hidden units h, weights V, outputs y)
To update the weights W, use the chain rule:
dE/dW = (y - y*) dy/dW = (y - y*) dy/dh dh/dW = [V^T ((y - y*) g'(Vh))] g'(Wx) x^T
Essentially, we’re propagating the error backwards through the network, changing weights according to how much they affect the output
=> Backpropagation learning
Distributed network models - backpropagation learning in multiple layer networks
1. Find out how much of the error in the output is due to V
- the responsibility is weighted by the activity of h: dE/dV = (y - y*) h^T
- change V according to this responsibility
2. Find out how much of the error is due to W
- units in h which have a large output weight V are more responsible for the error (i.e. weight the error by V): V^T (y - y*)
- values in h are due to the activities in x (i.e. weight the h responsibility by x): dE/dW = [V^T (y - y*)] x^T
- change W according to this 'accumulated' responsibility
linear network: h = Wx, y = Vh
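A minimal backpropagation sketch for the two-layer sigmoidal network h = g(Wx), y = g(Vh), trained here on XOR; the layer sizes, bias handling, learning rate, and task are all assumptions for illustration:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0], dtype=float)       # XOR targets

rng = np.random.default_rng(5)
W = rng.normal(size=(3, 3))                   # inputs (2 + bias) -> 3 hidden units
V = rng.normal(size=(1, 4))                   # hidden (3 + bias) -> 1 output
eta = 0.5

for _ in range(20000):
    for x, t in zip(X, T):
        xb = np.append(x, 1.0)                # input with a bias term
        h = g(W @ xb)
        hb = np.append(h, 1.0)                # hidden layer with a bias term
        y = g(V @ hb)
        delta_y = (y - t) * y * (1 - y)                  # (y - y*) g'(Vh)
        delta_h = (V[:, :3].T @ delta_y) * h * (1 - h)   # error propagated back through V
        V -= eta * np.outer(delta_y, hb)
        W -= eta * np.outer(delta_h, xb)

# may occasionally stall in a local minimum (see below); re-run with another seed
for x in X:
    print(x, g(V @ np.append(g(W @ np.append(x, 1.0)), 1.0)).round(2))
```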
Learning in multi-layer neural networks - backpropagation learning
- allows for simple learning of arbitrarily complex input/output mappings
- with enough 'neurons', almost any mapping is possible
- results in ‘distributed’ representations
- knowledge of the mapping is distributed across neuronal populations not individual cells
- changes in restricted regions of the input state space will result in restricted changes of the output
Learning in multi-layer neural networks - backpropagation learning
- much slower than parameterized models
  - the network needs to estimate the parameters and the model structure from scratch
  - convergence can be slow, especially if the error surface is shallow
- speed can be increased by altering the learning rate (annealing), by using conjugate gradient descent, or with 'momentum' (see the sketch below): ΔW = -η dE/dW + α·(change in W last time)
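A minimal sketch of the momentum idea on a toy quadratic error surface E(w) = 0.5·w^2; the eta and alpha values are illustrative:

```python
w, dw = 5.0, 0.0         # parameter and "change in w last time"
eta, alpha = 0.1, 0.9    # learning rate and momentum coefficient
for _ in range(100):
    grad = w             # dE/dw for E = 0.5 * w**2
    dw = -eta * grad + alpha * dw   # gradient step plus a fraction of the last step
    w += dw
print(w)                 # near the minimum at w = 0
```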
(figure: a shallow error surface over the parameters)
Learning in multi-layer neural networks - backpropagation learning
- local minima - error surface might have small ‘basins’ which can trap the network
(figure: error surface over the parameters, with a small local minimum beside the global minimum)
Start the network in different initial conditions to find the global minimum
Learning in multi-layer neural networks - backpropagation learning
- Choosing the learning rate
  - small values of η can take a long time for the network to converge
  - large values can lead to instability
(figure: oscillating divergence when the learning rate is too high vs smooth convergence when the learning rate is ok)
Motor learning: learning algorithms
- gradient descent: change model parameters to reduce the error in prediction
  - parameterized models
  - non-parametric models
    - single layer, linear and non-linear networks: LMS/ADALINE learning rules
    - multi-layer, non-linear networks: backpropagation learning
- in all of the above, we knew the correct answer and tried to match it, i.e. 'supervised learning'
- but what if our knowledge of the outcome is limited?
=> reinforcement learning
Reinforcement learning - supervised learning, but with limited feedback
(diagram: a network mapping inputs to outputs; the environment sends back an evaluation: {good, bad})
The environment sends back a global signal saying good or bad (1 or -1) depending on system performance, e.g. move the limb and bump into things (pain as a reinforcer)
Reinforcement learning - supervised learning, but with limited feedback
Using a global reinforcement signal to train a network
Basic idea:
- start with an initial network
- produce an output based on a given input, but add noise to the network to explore
- evaluate the output
- find those units with large activity
- change the weights so that they'll be large the next time the input is given
Reinforcement learning - supervised learning, but with limited feedback
Using a global reinforcement signal to train a network: the associative reward-penalty algorithm (AR-P)
(diagram: inputs x, weights W, a single stochastic output unit y)
Consider probabilistic outputs: p(y = 1) = 1/(1 + exp(-2Wx))
The output produced on any given trial is therefore stochastic, with expected value determined by the sigmoid: <y> = tanh(Wx)
We then use gradient descent to get the update rule:
ΔW = η+ (y - <y>) x^T, if r is reward
ΔW = η- (-y - <y>) x^T, if r is penalty
Reinforcement learning - supervised learning, but with limited feedback
Using a global reinforcement signal to train a network: the associative reward-penalty algorithm (AR-P)
(diagram: inputs x, weights W, a single stochastic output unit y)
ΔW = η+ (y - <y>) x^T, if r is reward
ΔW = η- (-y - <y>) x^T, if r is penalty
1) if expected value is close to what it actually did, then don’t change things (nothing new)
2) if expected value is different from what it did, and it was rewarded, then change W so that it will do it again
3) if expected value is different from what it did, and it was penalized, then change W so that it won’t do it again
=> trial and error learning
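A minimal AR-P sketch with a single stochastic ±1 output unit; the task (reward when the output matches the sign of x1), the learning rates, and the sigmoid scaling are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
w = np.zeros(2)
eta_plus, eta_minus = 0.1, 0.01               # asymmetric reward/penalty rates

for _ in range(5000):
    x = rng.normal(size=2)
    p = 1.0 / (1.0 + np.exp(-2.0 * w @ x))    # p(y = +1)
    y = 1.0 if rng.random() < p else -1.0     # stochastic output = exploration
    y_mean = np.tanh(w @ x)                   # expected output <y>
    r = 1 if y == np.sign(x[0]) else -1       # environment's good/bad evaluation
    if r == 1:
        w += eta_plus * (y - y_mean) * x      # rewarded: make this output more likely
    else:
        w += eta_minus * (-y - y_mean) * x    # penalized: make it less likely
print(w)   # weight on x1 grows positive; weight on x2 stays near zero
```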
Reinforcement learning - supervised learning, but with limited feedback
Using a global reinforcement signal to train a network
Much slower than gradient descent
More biologically plausible - how would the error be backpropagated in supervised learning?
- more directly ethologically plausible
  - based on direct reward/penalty feedback
  - information about survival
Motor learning: learning algorithms
- gradient descent: change model parameters to reduce the error in prediction
  - parameterized models
  - non-parametric models
    - single layer, linear and non-linear networks: LMS/ADALINE learning rules
    - multi-layer, non-linear networks: backpropagation learning
- reinforcement learning
  - AR-P networks
  - Q learning, TD learning, dynamic programming
- unsupervised learning