Upload
yamin
View
26
Download
0
Tags:
Embed Size (px)
DESCRIPTION
580.691 Learning Theory Reza Shadmehr State estimation theory. A. B. - PowerPoint PPT Presentation
Citation preview
580.691 Learning Theory
Reza Shadmehr
State estimation theory
A B
Subject was instructed to pull on a knob that was fixed on a rigid wall. A) EMG recordings from arm and leg muscles. Before biceps is activated, the brain activates the leg muscles to stabilize the lower body and prevent sway due to the anticipated pulling force on the upper body. B) When a rigid bar is placed on the upper body, the leg muscles are not activated when biceps is activated. (Cordo and Nashner, 1982)
A B
100
Hz
Effect of eye movement on the memory of a visual stimulus. In the top panel, the filled circle represents the fixation point, the asterisk indicates the location of the visual stimulus, and the dashed circle indicates the receptive field a cell in the LIP region of the parietal cortex. A) Discharge to the onset and offset of a visual stimulus in the cell’s receptive field. Abbreviations: H. eye, horizontal eye position; Stim, stimulus; V. eye, vertical eye position. B) Discharge during the time period in which a saccade brings the stimulus into the cell’s receptive field. The cell’s discharge increased before the saccade brought the stimulus into the cell’s receptive field. (From (Duhamel et al., 1992)
Why predict sensory consequences of motor commands?
Subject looked at a moving cursor while a group of dots appeared on the screen for 300ms. In some trials the dots would remain still (A) while in other trials they would move together left or right with a constant speed (B). Subject indicated the direction of motion of the dots. From this result, the authors estimated the speed of subjective stationarity, i.e., the speed of dots for which the subject perceived them to be stationary. C) The unfilled circles represent performance of control subjects. Regardless of the speed of the cursor, they perceived the dots to be stationary only if their speed was near zero. The filled triangles represent performance of subject RW. As the speed of the cursor increased, RW perceived the dots to be stationary if their speed was near the speed of the cursor. (Haarmeier et al., 1997)
Disorders of agency in schizophrenia relate to an inability to compensate for sensory consequences of self-generated motor commands. In a paradigm similar to that shown in the last figure, volunteers estimated whether during motion of a cursor the background moved to the right or left. By varying the background speed, at each cursor speed the experimenters estimated the speed of perceptual stationarity, i.e., the speed of background motion for which the subject saw the background to be stationary. They then computed a compensation index as the difference between speed of eye movement and speed of background when perceived to be stationary, divided by speed of eye movement. The subset of schizophrenic patients who had delusional symptoms showed a greater deficit than control in their ability to compensate for sensory consequences of self-generated motor commands. (From (Lindner et al., 2005))
A
B
C
D
Combining predictions with observations
Parameter variance depends only on input selection and noise
*( ) * ( )
( ) *( ) 2
(1) (1) (1,1) (2) (1,2) ( ) (1, )
1 (1)
(2) (1) (2,1) (2) (2,2) ( ) (2, )
1 (2)
1* 2
0,
, , , , , ,
, , , , , ,
,
i T i
i i
n n
T TML
n n
T TML
TML
y
y y N
D y y y
X X X
D y y y
X X X
N X X
w x
x x x
w y
x x x
w y
w w
A noisy process produces n data points and we form an ML estimate of w.
We run the noisy process again with the same sequence of x’s and re-estimate w:
The distribution of the resulting w will have a var-cov that depends only on the sequence of inputs, the bases that encode those inputs, and the noise sigma.
The Gaussian distribution and its var-cov matrix2
22
1 ( )( ) exp
22
xp x
11 1( ) exp ( ) ( )
2(2 ) | |
T
np C
C
x x x x x
( )( )ij i i j j i j i jc E x x x x E x x x x
21 12 1 2 1 1
212 1 2 2 2 2
21 1 2 2
n n
n n
n n n n n
C
A 1-D Gaussian distribution is defined as
In n dimensions, it generalizes to
When x is a vector, the variance is expressed in terms of a covariance matrix C, where ρij corresponds to the degree of correlation between variables xi and xj
2 2
( )( )
( ) ( )
i x i yxy xyi
x yxx yyi x i yi i
x yC C
C Cx y
-2 -1 0 1 2 3
-2
-1
0
1
2
3
-2 -1 0 1 2 3
-2
-1
0
1
2
3
21 12 1 2
212 1 2 2
,N C
C
x μ
1x
2x
0 1 0.9 2,
0 0.9 2 2N
x
-3 -2 -1 0 1 2 3 4
-3
-2
-1
0
1
2
3
4
0 1 0.1 2,
0 0.1 2 2N
x0 1 0.9 2
,0 0.9 2 2
N
x
x1 and x2 are positively correlated x1 and x2 are not correlated x1 and x2 are negatively correlated
Parameter uncertainty: Example 1
• Input history:
1 1 2 2
1 1 2
2 1 2
1* 2
2
ˆ
var cov ,[ ],
cov , var
,
0.5 0.25 0,
0.5 0 1
T
ML
T
y w x w x
w w wN E
w w w
N X X
N
x w
w w
w
1 0 0.5
1 0 0.5
1 0 0.5
1 0 0.5
0 1 0.5
1x 2x*y
x1 was “on” most of the time. I’m pretty certain about w1. However, x2 was “on” only once, so I’m uncertain about w2.
1w
2w
-0.5 0 0.5 1 1.5 2
-0.5
0
0.5
1
1.5
2
1 0
1 0
1 0
1 0
0 1
X
Parameter uncertainty: Example 2
• Input history:
1 1
1 1
1 1
1 1
1 0
X
1 1 2
2 1 2
1* 2
2
var cov ,[ ],
cov , var
,
0.5 1 1,
0.5 1 1.25
ML
T
w w wN E
w w w
N X X
N
w w
w
1 1 1
1 1 1
1 1 1
1 1 1
1 0 0.5
1x 2x*y
x1 and x2 were “on” mostly together. The weight var-cov matrix shows that what I learned is that:
I do not know individual values of w1 and w2 with much certainty.
x1 appeared slightly more often than x2, so I’m a little more certain about the value of w1.
-0.5 0 0.5 1 1.5 2
-0.5
0
0.5
1
1.5
2
1w
2w1 2 1w w
Parameter uncertainty: Example 3
• Input history:
1 1 2
2 1 2
1* 2
2
var cov ,[ ],
cov , var
,
0.5 1.25 0.25,
0.5 0.25 0.25
ML
T
w w wN E
w w w
N X X
N
w w
w
0 1 0.5
0 1 0.5
0 1 0.5
0 1 0.5
1 1 1
1x 2x*y
x2 was mostly “on”. I’m pretty certain about w2, but I am very uncertain about w1. Occasionally x1 and x2 were on together, so I have some reason to believe that:
1w
2w1 2 1w w
-0.5 0 0.5 1 1.5 2
-0.5
0
0.5
1
1.5
2
Effect of uncertainty on learning rate
• When you observe an error in trial n, the amount that you should change w should depend on how certain you are about w. The more certain you are, the less you should be influenced by the error. The less certain you are, the more you should “pay attention” to the error.
( 1) ( ) ( ) ( ) ( ) ( )n n n n n T ny w w k x w
mx1 mx1
Kalman gain
error
Rudolph E. Kalman (1960) A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82 (Series D): 35-45.
Research Institute for Advanced Study7212 Bellona Ave, Baltimore, MD
Example of the Kalman gain: running estimate of average
( )
*( ) * ( ) *( ) 2
1( ) ( )
1
1( 1) ( )
1
1( ) ( ) ( ) ( 1) ( ) ( 1) ( )
1
( ) ( 1) ( ) (
1
; 0,
1 1 1
1
1
1
1 1 1 11 1
1
i
i i i
nT
nn T T i
i
nn i
i
nn i n n n n n
i
n n n n
x
y w y y N
X
w X X X yn
w yn
w y y n w y w yn n n n
w w y wn
y
1)
Kalman gain: learning rate decreases as the number of samples increase
As n increases, we trust our past estimate w(n-1) a lot more than the new observation y(n)
Past estimate New measure
w(n) is the online estimate of the mean of y
Example of the Kalman gain: running estimate of variance
22 ( )( )
1
2( ) ( )
1
1 2 2( ) ( ) ( ) ( )
1
22 ( ) ( )( 1)
22 2 ( ) ( )( 1) ( 1)
22 2 ( ) ( ) 2( ) ( 1) ( 1)
1ˆ
1
1
1ˆ1
1 1ˆ ˆ
1ˆ ˆ ˆ
ni
ni
ni n
i
ni n n n
i
n nn
n nn n
n nn n n
y E yn
y wn
y w y wn
n y wn
y wn n
y wn
sigma_hat is the online estimate of the var of y
( ) ( ) * ( ) 2
1 ( ) ( ) ( )
* *
0,
var
n n T n
n n nn n n T
n n
Tn n n n
Tn n
n n T
n n T n
y N
y
P
E E E
E
E
trace P E
x w
w w k x w
w
w w w w
w w w w
w w
w w
We note that P is simply the var-cov matrix of our model weights. It represents the uncertainty in our estimates of the model parameters.
We want to update the weights in such a way as to minimize the trace of this variance. The trace is the sum of the squared errors between our estimates of w and the true estimates.
Some observations about variance of model parameters
Trace of parameter var-cov matrix is the sum of squared parameter errors
1 1 1 2
2 2 1 2
2( ) ( )21 11 1
1 1
( )2 ( )21 2 1 2
1
var cov ,0, ,
cov , var0
1 1var
1var var
T
n ni i
i i
ni i
i
P E
w w w wN P N
w w w w
w w E w wn n
trace P w w w wn
ww
w 0
Our objective is to find learning rate k (Kalman gain) such that we minimize the sum of the squared error in our parameter estimates. This sum is the trace of the P matrix. Therefore, given observation y(n), we want to find k such that we minimize the variance of our estimate w.
1 1( ) ( ) ( )n n n n n nn n n Ty
w w k x w
( ) ( ) * ( ) 2
1
1( ) ( )
1 1( ) ( ) ( )
0,n n T n
n n
n nn n T
n n n n n nn n n T
y N
y
y
x w
w
x w
w w k x w
Objective: adjust learning gain in order to minimize model uncertainty
1 1
1 1 1 1
var
var
n n n n
Tn n n n n n n n
n n n n
Tn n n n n n n n
P
E E E
P
E E E
w
w w w w
w
w w w w
a prior variance of parameters
a posterior variance of parameters
my estimate of w* before I see y in trial n, given that I have seen y up to n-1
error in trial n
my estimate after I see y in trial n
Hypothesis about data observation in trial n
1 1( ) ( ) ( )
1 1( ) ( ) * ( ) ( )
1 1* * ( ) ( ) * ( ) ( ) ( ) ( )
1( ) ( ) * ( ) ( )
n n n n n nn n n T
n n n n n nn n T n n T
n n n n n nn n T n n n n T
n nn n T n n
y
I
w w k x w
w w k x w x w
w w w w k x w k k x w
k x w w k
Evolution of parameter uncertainty
* *
1 1( ) ( ) * ( ) ( ) ( ) ( ) * ( ) ( )
1 1( ) ( ) * * ( ) ( ) ( ) ( ) ( ) (
Tn n n n n n
Tn n n nn n T n n n n T n n
T Tn n n nn n T n n T n n n T n
P E
E I I
E I I
w w w w
k x w w k k x w w k
k x w w w w k x k k
)
1( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )
1( ) ( ) ( ) ( ) ( ) 2 ( )
T
Tn nn n T n n T n n n T n T
Tn n n nn n T n n T n n T
I P I E
P I P I
k x k x k k
k x k x k k
1 1( ) ( ) ( )
1 1( ) ( ) * ( ) ( )
1( ) ( ) ( ) ( ) ( ) ( ) *
1 1
1( ) ( ) ( ) ( ) ( ) ( )
var
var
var
n n n n n nn n n T
n n n n n nn n T n n T
n n n nn n T n n n n T
n n n n
n n n n
Tn nn n T n n T n n
y
I
P
P
I P I
w w k x w
w w k x w x w
w k x w k k x w
w
w
k x k x k
( )
1( ) ( ) ( ) ( ) ( ) 2 ( )
n T
Tn n n nn n T n n T n n TP I P I
k
k x k x k k
Find K to minimize trace of uncertainty
1 1 1( ) ( ) ( ) ( )
1( ) ( ) ( ) 2 ( )
1 1 1( ) ( ) ( ) ( ) ( ) 2 ( )2
n n n n n n n nn n T n n T
n nn n T n n T
n n n n n nn n T n n T n n T
tr P tr P tr P tr P
tr P
tr P tr P tr P
x k k x
k x x k
x k k x x k
1( ) ( ) ( ) ( ) ( ) 2 ( )
1 1 1 1( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) 2 ( )
Tn n n nn n T n n T n n T
n n n n n n n nn n T n n T n n T n n T n n T
P I P I
P P P P
k x k x k k
x k k x k x x k k k
tr aB atr B
T
T
tr A tr A
P P
1 1( ) ( ) ( ) 2 ( ) ( ) ( ) 2 ( ) ( )
1( ) ( ) 2 ( ) ( )
1( ) ( ) 2 ( ) ( )
n n n nn n T n n T n T n n n T
n nn T n n n T
n nn T n n T n
tr P tr P
P tr
P
k x x k x x k k
x x k k
x x k k
scalar
Find K to minimize trace of uncertainty
1 1 1( ) ( ) ( ) ( ) 2 ( ) ( )
1 1 1( ) ( ) ( ) ( ) 2 ( ) ( )
1 1( ) ( ) ( ) 2( )
2
2
2 2
n n n n n n n nn n T n T n n T n
n n n n n nn T n n T n n T n
n n n n n nn n T nn
tr P tr P tr P P
tr P P P
dtr P P P
d
x k x x k k
k x x x k k
x x x kk
( )
1 ( )( )
1( ) ( ) 2
0n
n n nn
n nn T n
P
P
xk
x x
The Kalman gain
If I have a lot of uncertainty about my model, P is large compared to sigma. I will learn a lot from the current error.
If I am pretty certain about my model, P is small compared to sigma. I will tend to ignore the current error.
Update of model uncertainty
1 1 1 1( ) ( ) ( ) ( ) ( ) ( ) ( ) 2 ( )
11 1( ) ( ) ( ) ( ) 2
1 1 1 1( ) ( ) ( ) 2 ( )
11 1( ) ( ) ( ) 2
n n n n n n n n n nn n T n n T n n T n n T
n n n nn n n T n
Tn n n n n n n n n nn n T n n T
n n n nn n T n
P P P P P
P P
P P P P P
P P
x k k x k x x k
k x x x
x x x x
x x x
1( )
11 1 1( ) ( ) ( ) 2 ( ) ( ) 2
1 1( ) ( ) 2 ( )
11 1 1 1( ) ( ) ( ) 2 ( )
1( ) ( )
n nn T
n n n n n nn n T n n T n
Tn n n nn T n n T
n n n n n n n nn n T n n T
n n n nn n T
P
P P P
P P
P P P P
P I P
x
x x x x x
x x x
x x x x
k xModel uncertainty decreases with every data point that you observe.
* *( 1) ( )
( ) ( ) * ( ) 2
10 10
1 1( ) ( ) ( )
1 ( )( )
1( ) ( ) 2
1( ) ( )
1
1
0,
,
n n
n n T n
n n n n n nn n n T
n n nn
n nn T n
n n n nn n T
n n n n
n n n n
y N
P
y
P
P
P I P
P P
w w
x w
w
w w k x w
xk
x x
k x
w w
*w *w *w
y y y
In this model, we hypothesize that the hidden variables, i.e., the “true” weights, do not change from trial to trial.
Observedvariables
Hidden variable
x x x
A priori estimate of mean and variance of the hidden variable before I observe the first data point
Update of the estimate of the hidden variable after I observed the data point
Forward projection of the estimate to the next trial
* * ( )( 1) ( )
( ) ( ) * ( ) 2
10 10
1 ( )( )
1( ) ( ) 2
1 1( ) ( ) ( )
1( ) ( )
1
1
0,
0,
,
nn n w w
n n T ny y
n n nn
n nn T n
n n n n n nn n n T
n n n nn n T
n n n n
n n n n T
A N Q
y N
P
P
P
y
P I P
A
P AP A Q
w w ε ε
x w
w
xk
x x
w w k x w
k x
w w
*w *w *w
y y y
In this model, we hypothesize that the hidden variables change from trial to trial.
x x x
A priori estimate of mean and variance of the hidden variable before I observe the first data point
Update of the estimate of the hidden variable after I observed the data point
Forward projection of the estimate to the next trial
1n n n n TP AP A Q
• Learning rate is proportional to the ratio between two uncertainties: my model vs. my measurement.
• After we observe an input x, the uncertainty associated with the weight of that input decreases.
• Because of state update noise Q, uncertainty increases as we form the prior for the next trial.
Uncertainty about my model parameters
1 ( )( )
1( ) ( ) 2
n n nn
n nn T n
P
P
xk
x x
Uncertainty about my measurement
1( ) ( )n n n nn n TP I P k x
2. Bayesian state estimation
3. Causal Inference
4. The influence of priors
5. Behaviors that are not Bayesian
* *( 1) ( )
( ) ( ) * ( ) 2 0,
n n
n n T ny N
w w
x w
Comparison of Kalman gain to LMS
See derivation of this in homework
( )( )
2
1 1( ) ( ) ( )
1 1( ) ( ) ( )2
n n nn
n n nn n n T
n nn nn n T n
P
y
Py
xk
w w k x w
w x w x
In the Kalman gain approach, the P matrix depends on the history of all previous and current inputs. In LMS, the learning rate is simply a constant that does not depend on past history.
1 1( ) ( ) ( )n n nn n T ny w w x w x
With the Kalman gain, our estimate converges on a single pass over the data set. In LMS, we don’t estimate the var-cov matrix P on each trial, but we will need multiple passes before our estimate converges.
* * ( ) 2( 1) ( )
( ) ( ) * ( ) 2
0.99 0,
=1 0,
nn n w w
n n T ny y
w aw a N q
y x w x N
( )nk
1n nP
2 4 6 8 10
0.65
0.7
0.75
0.8
2 4 6 8 10
2
2.5
3
3.5
4
4.5
5
High noise in the state update model produces increased uncertainty in model parameters. This produces high learning rates.
2 4 6 8 100.5
0.55
0.6
0.65
0.7
0.75
0.8
( )nk
2 4 6 8 10
2
2.5
3
3.5
4
4.5
5
2 22, 1q
2 21, 1q 2 21, 2q
1n nP
2 22, 1q
2 21, 1q
High noise in the measurement also increases parameter uncertainty. But this increase is small relative to measurement uncertainty. Higher measurement noise leads to lower learning rates.
Effect of state and measurement noise on the Kalman gain
* * ( ) 2 2( 1) ( )
( ) ( ) * ( ) 2 2
0, 1
=1 0, 1
nn n w w
n n T ny y
w aw N q q
y x w x N
( )nk
1n nP
Learning rate is higher in a state model that has high auto-correlations (larger a). That is, if the learner assumes that the world is changing slowly (a is close to 1), then the learner will have a large learning rate.
0.99
0.50
0.10
a
a
a
2 4 6 8 10
0.5
0.55
0.6
0.65
0.7
0.75
0.8
2 4 6 8 101
2
3
4
5
Effect of state transition auto-correlation on the Kalman gain
1 ( )( )
1( ) ( ) 2
1 1( ) ( ) ( )
1( ) ( )
1
1
n n nn
n nn T n
n n n n n nn n n T
n n n nn n T
n n n n
n n n n T
P
P
y
P I P
A
P AP A Q
xk
x x
w w k x w
k x
w w
Kalman filter as a model of animal learninglight
tone
xSuppose that x represents inputs from the environment: a light and a tone.
Suppose that y represents rewards, like a food pellet.
* * ( )( 1) ( )
( ) * * ( ) 21 1 2 2
0,
0,
nn n w w
n ny y
A N Q
y x w x w N
w w ε ε
*w *w *w
y y y
x x x
Animal’s model of the experimental setup
1( ) ( )ˆ n nn n Tyx w
Animal’s expectation on trial n
Animal’s learning from trial n
Various forms of classical conditioning in animal psychology
Table from Peter Dayan
Not explained by LMS, but predicted by the Kalman filter.
Sharing Paradigm
Train: {x1,x2} -> 1
Test: x1 -> ?, x2 -> ?
Result: x1->0.5, x2->0.5
10 20 30 400
0.5
1
1.5yyhat
0 10 20 30 400
0.2
0.4
0.6
0.8
w1w2
10 20 30 400.2
0.25
0.3
0.35
0.4
0.45
0.5
P11P22
10 20 30 40
0.1
0.15
0.2
0.25
0.3
0.35
0.4
k1k2
0 10 20 30 400
1
x1x2y
10 20 30 400
0.5
1
1.5
yyhat
0 10 20 30 400
0.1
0.2
0.3
0.4
0.5
0.6
0.7
w1w2
Learning with Kalman gain LMS
Blocking
Kamin (1968) Attention-like processes in classical conditioning. In: Miami symposium on the prediction of behavior: aversive stimulation (ed. MR Jones), pp. 9-33. Univ. of Miami Press.
Kamin trained an animal to continuously press a lever to receive food. He then paired a light (conditioned stimulus) and a mild electric shock to the foot of the rat (unconditioned stimulus). In response to the shock, the animal would reduce the lever-press activity. Soon the animal learned that the light predicted the shock, and therefore reduced lever pressing in response to the light. He then paired the light with a tone when giving the electric shock. After this second stage of training, he observed than when the tone was given alone, the animal did not reduce its lever pressing. The animal had not learned anything about the tone.
Blocking Paradigm
Train: x1 -> 1, {x1,x2} -> 1
Test: x2 -> ?, x1 -> ?
Result: x2 -> 0, x1 -> 1
0 10 20 30 40
0
1
x1x2y
10 20 30 400
0.5
1
1.5
yyhat
0 10 20 30 40-0.25
0
0.25
0.5
0.75
1
1.25
w1w2
10 20 30 40
0.1
0.2
0.3
0.4
0.5
P11P22
10 20 30 400
0.1
0.2
0.3
0.4
0.5
0.6
k1k2
Learning with Kalman gain LMS
0 10 20 30 40-0.2
0
0.2
0.4
0.6
0.8
1
1.2
w1w2
10 20 30 400
0.5
1
1.5
yyhat
Backwards Blocking Paradigm
Train: {x1,x2} -> 1, x1 -> 1
Test: x2 -> ?
Result: x2 -> 0
0 10 20 30 40 50 600
1
x1x2y
0 10 20 30 40 50 600
0.5
1
1.5
yyhat
0 10 20 30 40 50 60-0.2
0
0.2
0.4
0.6
0.8
1
w1w2
0 10 20 30 40 50 60
0.1
0.2
0.3
0.4
0.5
P11P22
0 10 20 30 40 50 60-0.4
-0.2
0
0.2
0.4
k1k2
Learning with Kalman gain LMS
0 10 20 30 40 50 600
0.5
1
1.5
yyhat
0 10 20 30 40 50 600
0.2
0.4
0.6
0.8
1
w1w2
Different output models
* * ( )( 1) ( )
1* *1 1 1 2 2 2 1 2
0,nn n w w
y
A N Q
y b x w b x w b b
w w ε ε
Case 1: the animal assumes an additive model. If each stimulus predicts one reward, then if the two are present together, they predict two rewards.
light
tone
xSuppose that x represents inputs from the environment: a light and a tone.
Suppose that y represents a reward, like a food pellet.
* * ( )( 1) ( )
( ) * ( ) 2( )
0,
0,
nn n w w
n T nn y y
A N Q
y N
w w ε ε
x w
Case 2: the animal assumes a weighted average model. If each stimulus predicts one reward, then if the two are present together, they still predict one reward, but with higher confidence.
The weights b1 and b2 should be set to the inverse of the variance (uncertainty) with which each stimulus x1 and x2 predicts the reward.
General case of the Kalman filter
* * ( )( ) ( 1)
( ) ( ) * ( )( )
10 10
11 1( ) ( ) ( ) ( )
1 1( ) ( ) ( )
1( ) ( )
1
1
0,
0,
,
nn n w w
n n nn y y
T Tn n n nn n n n
n n n n n nn n n
n n n nn n
n n n n
n n n
A N Q
H N R
P
P H H P H R
y H
P I H P
A
P AP
w w ε ε
y x w ε ε
w
k x x x
w w k x w
k x
w w
n TA Q
A priori estimate of mean and variance of the hidden variable
before I observe the first data point
Update of the estimate of the hidden variable after I observed
the data point
Forward projection of the estimate to the next trial
nx1
mx1
How to set the initial var-cov matrix
* * ( )( ) ( 1)
( ) ( ) * ( )( )
10
0,
0,
?
nn n w w
n n nn y y
A N Q
H N R
P
w w ε ε
y w ε ε
1 11 1n n n n TP P H R H
In homework, we will show that in general:
Now if we have absolutely no prior information on w, then before we see the first data point P(1|0) is infinity, and therefore its inverse in zero. After we see the first data point, we will be using the above equation to update our estimate. The updated estimate will become:
111 1
111 1
T
T
P H R H
P H R H
A reasonable and conservative estimate of the initial value of P would be to set it to the above value. That is, set:
110 1TP H R H
Data fusion
Suppose that we have two sensors that independently measure something. We would like to combine their measures to form a better estimate. What should the weights be?
21
22
1
1 22 2 2 21 2 1 2
2 21 2
1 22 2 2 21 2 1 2
2 22 1
1 22 2 2 21 2 1 2
1 1 1 1ˆ
1 1
x y y
y y
y y
Suppose that we know that sensor 1 gives us measurement y1 and has Gaussian noise with variance:
And similarly, sensor 2 has gives us measurement y2 and has Gaussian noise with variance:
A good idea is to weight each sensor inversely proportional to its noise:
2 22 1
1 22 2 2 21 2 1 2
x̂ y y
To see why this makes sense, let’s put forth a generative model that describes our hypothesis about how the data that we are observing is generated:
*x
2y1y
*x
2y1y
( )* * 2( 1) ( ) **
21( ) * ( )
( ) 22
0,
01 ,
1 0
nn n xx
n nn y y
x ax N q
x N R R
y ε ε 0
Observed variables
Hidden variable
Data fusion via Kalman filter
21( ) * ( )
( ) 22
10
10
(1)(1) 1
(1)2
1 111 1 0 1
2 21 2
2 211 1 2
2 21 2
2 211(1) 1 1 2
2 21 2
01 1 ,
1 10
ˆ 0
1 1
1
1 1
n nn y y
T
T
x N R R H
x
P
y
y
P P H R H
P
P H R
y ε ε 0
y
k
22
2 2 2 22 21 1 1 21 2
2 2 21 2 1
2 2 2 22 2 1 2
2 211 1 0 10 (1) (1)(1) (1) 2 1
1 22 2 2 21 2 1 2
10
1 10
ˆ ˆ ˆx x Hx y y
k y
See homework for this
priors
our first observation
variance of our posterior estimate
Notice that after we make our first observation, the variance of our posterior is better than the variance of either sensor.
*x
2y1y
2 211 1 2
2 21 2
2 22 2 2 2 2 21 21 1 1 2 1 22 2
1 22 2
2 2 2 2 2 21 22 2 1 2 2 12 2
1 2
because
because
P
What our sensors tell us
The real world
-2 0 2 4 6 8 100
0.1
0.2
0.3
0.4
0.5
-2.5 0 2.5 5 7.5 10 12.5 150
0.1
0.2
0.3
0.4
Sensor 1 Sensor 2Combined
Sensor 1
Sensor 2
Combined
Combining equally noisy sensors Combining sensors with unequal noise
(1) 21 11 ,y N y (1) 2
2 22 ,y N y
2 2 2 2(1) (2)2 1 1 21 12 2 2 2 2 2
1 2 1 2 1 2
ˆ ,x N y y
Mean of the posterior, and its variance
pro
ba
bil
ity
Puzzling results: Savings and memory despite “washout”
Gain=eye displacement divided by target displacement
Result 1: After changes in gain, monkeys exhibit recall despite behavioral evidence for washout.
Kojima et al. (2004) Memory of learning facilitates saccade adaptation in the monkey. J Neurosci 24:7531.
1
Result 2: Following changes in gain and a long period of washout, monkeys exhibit no recall.
Result 3: Following changes in gain and a period of darkness, monkeys exhibit a “jump” in memory.
Puzzling results: Improvements in performance without error feedback
Kojima et al. (2004) J Neurosci 24:7531.
The learner’s hypothesis about the structure of the world
*w *w
y y
x x
A
1. The world has many hidden states. What I observe is a linear combination of these states.
2. The hidden states change from trial to trial. Some change slowly, others change fast.
3. The states that change fast have larger noise than states that change slow.
*1**2
*( ) *( 1) ( )
( ) ( ) *( ) ( ) 2
0.99 0
0 0.50
0,
0,
n n nw w
n n T n ny y
w
w
A
A N Q
y N
w
w w ε ε
x w ε ε
slow system
fast system
state transition equation
output equation
0 50 100 150 200 250 300-1.5
-1
-0.5
0
0.5
1
1.5
yyhat
0 50 100 150 200 250 300-1
0
1
*1**2
*( ) *( 1) ( )
( ) ( ) *( ) ( ) 2
2
0.99 0
0 0.50
0,
0.00004 0
0 0.01
0,
0.04
n n nw w
n n T n ny y
w
w
A
A N Q
Q
y N
w
w w ε ε
x w ε ε
0 50 100 150 200 250 300-0.4
-0.2
0
0.2
0.4
0.6
w1w2
Simulations for savingsx1x2y
0 50 100 150 200 250 300-0.5
-0.25
0
0.25
0.5
0.75
1
1.25
k1k2
The critical assumption is that in the fast system, there is much more noise than in the slow system. This produces larger learning rate in the fast system.
0 50 100 150 200 250 300-1
-0.5
0
0.5
1
x1x2y
0 50 100 150 200 250 300-1.5
-1
-0.5
0
0.5
1
1.5
yyhat
0 50 100 150 200 250 300-0.4
-0.2
0
0.2
0.4
w1w2
Simulations for spontaneous recovery despite zero error feedback
1 1( ) ( ) ( )
1
n n n n n nn n n T
n n n n
y
A
w w k x w
w w
error clamp
In the error clamp period, estimates are made yet the weight update equation does not see any error. Therefore, the effect of Kalman gain in the error-clamp period is zero. Nevertheless, weights continue to change because of the state update equations. The fast weights rapidly rebound to zero, while the slow weights slowly decline. The sum of these two changes produces a “spontaneous recovery” after washout.
Mean gain at start of recovery = 0.83
Mean gain at start of recovery = 0.86
Mean gain at end of recovery = 0.87
% gain change = 1.2%% gain change = 14.4%
Mean gain at end of recovery = 0.95
Target extinguished during recoveryTarget visible during recovery
Changes in representation without error feedback
Seeberger et al. (2002) Brain Research 956:374-379.
Massed vs. Spaced training: effect of changing the inter-trial interval
Learning reaching in a force field
ITI = 8min
ITI = 1min
Dis
cri
min
ati
on
pe
rfo
rma
nc
e (
se
c)
Rats were trained on an operant conditional discrimination in which an ambiguous stimulus (X) indicated both the occasions on which responding in the presence of a second cue (A) would be reinforced and the occasions on which responding in the presence of a third cue (B) would not be reinforced (X --> A+, A-, X --> B-, B+). Both rats with lesions of the hippocampus and control rats learned this discrimination more rapidly when the training trials were widely spaced (intertrial interval of 8 min) than when they were massed (intertrial interval of 1 min). With spaced practice, lesioned and control rats learned this discrimination equally well. But when the training trials were massed, lesioned rats learned more rapidly than controls.
Han
, J.S
., G
alla
gh
er, M
. & H
olla
nd
, P.
Hip
po
cam
pu
s 8:
138-
46 (
1998
)
Performance in a water maze
Sisti, Glass, Shors (2007) Neurogenesis and the spacing effect: learning over time enhances memory and the survival of new neurons. Learning and Memory 14:368.
4 trials a day for 4 days16 trials in one day
Massed vs. Spaced training: effect of changing the inter-trial interval
*w *w
y
x
A *w
y
x
A *w A *w AA
The learner’s hypothesis about the structure of the world
1. The world has many hidden states. What I observe is a linear combination of these states.
2. The hidden states change from trial to trial. Some change slowly, others change fast.
3. The states that change fast have larger noise than states that change slow.4. The state changes can occur more frequently than I can make observations.
Inter-trial interval
0 50 100 150 200 250 300
0
0.5
1
1.5
yyhat
0 500 1000 1500 2000 2500 3000
0
0.5
1
1.5
yyhat
ITI=2 ITI=20
*1**2
*( ) *( 1) ( )
( ) ( ) *( ) ( ) 2
2
0.999 0
0 0.40
0,
0.00008 0
0 0.1
0,
0.04
n n nw w
n n T n ny y
w
w
A
A N Q
Q
y N
w
w w ε ε
x w ε ε
1( ) ( )
1
211 11 11 22 12
211 22 12 22 11
n n n nn n T
n n n n T
P I P
P AP A Q
a P a a PQ
a a P a P
k xWhen there is an observation, the uncertainty for each hidden variable decreases proportional to its Kalman gain.
When there are no observations, the uncertainty decreases in proportion to A squared, but increases in proportion to state noise Q.
1000 1020 1040 1060 1080 11000.106
0.108
0.11
0.112
0.114
0.116
0.118
P22
1000 1020 1040 1060 1080 11000.0126
0.0128
0.013
0.0132
0.0134
0.0136
P11
Uncertainty for the slow state Uncertainty for the fast stateITI=20
Beyond a minimum ITI, increased ITI continues to increase the uncertainty of the slow state but has little effect on the fast state uncertainty. The longer ITI increases the total learning by increasing the slow state’s sensitivity to error.
0 20 40 60 80 100 120 140
0
0.5
1
1.5
y
yhatspaced
yhatmassed
0 20 40 60 80 100 120 140-0.2
0
0.2
0.4
0.6
0.8
w1massed
w1spaced
w2massed
w2spaced
0 20 40 60 80 100 120 140
0.2
0.4
0.6
0.8
k1massed
k1spaced
k2massed
k2spaced
0 20 40 60 80 100 120 140
-0.05
0
0.05
0.1
0.15
0.2
0.25
0.3
P11massed
P11spaced
P22massed
P22spaced
P12massed
P12spaced
Observation number
Performance in spaced training depends largely on the slow state. Therefore, spaced training produces memories that decay little with passage of time.
ITI=14 ITI=2
ITI=98
Performance during training
Test at 1 week
ITI=14
ITI=2
ITI=98
Testing at 1 day or 1 week (averaged together)
Pav
lik, P
. I. a
nd
An
der
son
, J. R
. ( 2
005)
. Pra
ctic
e an
d f
org
etti
ng
eff
ects
on
vo
cab
ula
ry m
emo
ry:
An
act
ivat
ion
-bas
ed m
od
el o
f th
e sp
acin
g e
ffec
t.
Co
gn
itiv
e S
cien
ce, 2
9, 5
59-5
86.
Spaced training results in better retention in learning a second language
On Day 1, subjects learned to translate written Japanese words into English. They were given a Japanese word (written phonetically), and then given the English translation. This “study trial” was repeated twice. Afterwards, the were given the Japanese word and had to write the translation. If their translation was incorrect, the correct translation was given.
The ITI between word repetition was either 2, 14, or 98 trials.
Performance during training was better when the ITI was short. However, retention was much better for words that were observed with longer ITI. (The retention test involved two groups; one at 1 day and other at 7 days. Performance was slightly better for the 1 day group but the results were averaged in this figure.)
DM Wolpert et al. (1995) Science 269:1880
x
y
u u
x
y
Motor command
Sensory measurement
State of our body x
y
Application of Kalman filter to problems in sensorimotor control
When we move our arm in darkness, we may estimate the position of our hand based on three sources of information:
• proprioceptive feedback.
• a forward model of how the motor commands have moved our arm.
• by combining our prediction from the forward model with actual proprioceptive feedback.
Experimental procedures:
Subject holds a robotic arm in total darkness. The hand is briefly illuminated. An arrow is displayed to left or right, showing which way to move the hand. In some cases, the robot produces a constant force that assists or resists the movement. The subject slowly moves the hand until a tone is sounded. They use the other hand to move a mouse cursor to show where they think their hand is located.
DM Wolpert et al. (1995) Science 269:1880
DM Wolpert et al. (1995) Science 269:1880
assistive
resistive
( 1) ( ) ( ) ( )
( ) ( ) ( )
0,
0,
n n n nx x
n n ny y
A Bu N Q
C N R
x x ε ε
y x ε ε
x
y
u u
x
y
Motorcommand
Sensory measurement
State of the body
x
y
A
B
C
The generative model, describing actual dynamics of the limb
The model for estimation of sensory state from sensory feedback
( 1) ( ) ( ) ( )
( ) ( ) ( )
ˆ ˆ 0,
ˆ 0,
ˆ 1.4
n n n nx x
n n ny y
A Bu N Q
C N R
B B
x x ε ε
y x ε ε
For whatever reason, the brain has an incorrect model of the arm. It overestimates the effect of motor commands on changes in limb position.
0 0 0
10
10 0 0 1
101
(1) (1)
110 10(1)
11 1 0 1(1) (1)
11 1 0(1)
21 11 2
21 11
ˆ
ˆ ˆˆ ˆ
ˆˆ ˆ
ˆ ˆ ˆ
ˆ ˆ ˆ
ˆ
ˆ ˆˆ ˆ
ˆ ˆ
T T
T
T
P
A Bu
C
C
P C CP C R
P I C P
A Bu
P AP A Q
x x
0
x x
y x
y x
k
x x k y y
k
x x
Initial conditions: the subject can see the hand and has no uncertainty regarding its position and velocity
Forward model of state change and feedback
Actual observation
Estimate of state incorporates the prior and the observation
Forward model to establish the prior and the uncertainty for the
next state
0 0.2 0.4 0.6 0.8 1 1.2 1.40
5
10
15
20
x t ˆ ˆx SD x
0 0.2 0.4 0.6 0.8 1 1.2 1.4Time sec
0
0.002
0.004
0.006
0.008
0.01
0.012
0.014
Actual and estimated position
Kalman gain
0 0.5 1 1.5 2
0.8
1
1.2
1.4
1.6
1.8
Bias at end of movement (cm)
Variance at end of movement (cm^2)
Total movement time (sec)
x̂ x
P
0 0.2 0.4 0.6 0.8 1 1.2 1.4-1.5
-1
-0.5
0
0.5
1
1.5
Motor command u
Time of “beep”
For movements of various length
A single movement
Po
s (c
m)