LEARNING REPRESENTATIONS BY RECIRCULATION
Geoffrey E. Hinton
Computer Science and Psychology Departments, University of Toronto, Toronto M5S 1A4, Canada
James L. McClelland
Psychology and Computer Science Departments, Carnegie-Mellon University, Pittsburgh, PA 15213
ABSTRACT
We describe a new learning procedure for networks that contain groups of non-linear units arranged in a closed loop. The aim of the learning is to discover codes that allow the activity vectors in a "visible" group to be represented by activity vectors in a "hidden" group. One way to test whether a code is an accurate representation is to try to reconstruct the visible vector from the hidden vector. The difference between the original and the reconstructed visible vectors is called the reconstruction error, and the learning procedure aims to minimize this error. The learning procedure has two passes. On the first pass, the original visible vector is passed around the loop, and on the second pass an average of the original vector and the reconstructed vector is passed around the loop. The learning procedure changes each weight by an amount proportional to the product of the "presynaptic" activity and the difference in the post-synaptic activity on the two passes. This procedure is much simpler to implement than methods like back-propagation. Simulations in simple networks show that it usually converges rapidly on a good set of codes, and analysis shows that in certain restricted cases it performs gradient descent in the squared reconstruction error.
INTRODUCTION
Supervised gradient-descent learning procedures such as back-propagation have been shown to construct interesting internal representations in "hidden" units that are not part of the input or output of a connectionist network. One criticism of back-propagation is that it requires a teacher to specify the desired output vectors. It is possible to dispense with the teacher in the case of "encoder" networks [2] in which the desired output vector is identical with the input vector (see Fig. 1). The purpose of an encoder network is to learn good "codes" in the intermediate, hidden units. If, for example, there are fewer hidden units than input units, an encoder network will perform data-compression. It is also possible to introduce other kinds of constraints on the hidden units, so we can view an encoder network as a way of ensuring that the input can be reconstructed from the activity in the hidden units whilst also making the hidden units satisfy some other constraint.

This research was supported by contract N00014-86-00167 from the Office of Naval Research and a grant from the Canadian National Science and Engineering Research Council. Geoffrey Hinton is a fellow of the Canadian Institute for Advanced Research. We thank Mike Franzini, Conrad Galland and Geoffrey Goodhill for helpful discussions and help with the simulations.
© American Institute of Physics 1988
A second criticism of back-propagation is that it is neurally implausible (and hard to implement in hardware) because it requires all the connections to be used backwards and it requires the units to use different input-output functions for the forward and backward passes. Recirculation is designed to overcome this second criticism in the special case of encoder networks.
Fig. 1. A diagram of a three layer encoder network that learns good codes using back-propagation. On the forward pass, activity flows from the input units in the bottom layer to the output units in the top layer. On the backward pass, error-derivatives flow from the top layer to the bottom layer.
Instead of using a separate group of units for the input and output we use the very same group of "visible" units, so the input vector is the initial state of this group and the output vector is the state after information has passed around the loop. The difference between the activity of a visible unit before and after sending activity around the loop is the derivative of the squared reconstruction error. So, if the visible units are linear, we can perform gradient descent in the squared error by changing each of a visible unit's incoming weights by an amount proportional to the product of this difference and the activity of the hidden unit from which the connection emanates. So learning the weights from the hidden units to the output units is simple. The harder problem is to learn the weights on connections coming into hidden units because there is no direct specification of the desired states of these units. Back-propagation solves this problem by back-propagating error-derivatives from the output units to generate error-derivatives for the hidden units. Recirculation solves the problem in a quite different way that is easier to implement but much harder to analyse.
THE RECIRCULATION PROCEDURE
We introduce the recirculation procedure by considering a very simple architecture in which there is just one group of hidden units. Each visible unit has a directed connection to every hidden unit, and each hidden unit has a directed connection to every visible unit. The total input received by a unit is

$$x_j = \sum_i y_i w_{ji} - \theta_j \tag{1}$$

where $y_i$ is the state of the $i$th unit, $w_{ji}$ is the weight on the connection from the $i$th to the $j$th unit and $\theta_j$ is the threshold of the $j$th unit. The threshold term can be eliminated by giving every unit an extra input connection whose activity level is fixed at 1. The weight on this special connection is the negative of the threshold, and it can be learned in just the same way as the other weights. This method of implementing thresholds will be assumed throughout the paper.
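As a small illustration of Eq 1 and of the threshold trick, the following sketch computes the total input of one unit both ways. The function names and the numerical values are made up for the example and do not come from the paper.

```python
def total_input(states, weights, threshold):
    """Eq 1 for one receiving unit j: x_j = sum_i y_i * w_ji - theta_j."""
    return sum(y * w for y, w in zip(states, weights)) - threshold


def total_input_with_bias(states, weights_with_bias):
    """Same quantity with the threshold folded into an extra weight.

    The last entry of weights_with_bias is the weight on an extra input
    whose activity is fixed at 1; it equals the negative of the threshold.
    """
    extended_states = list(states) + [1.0]
    return sum(y * w for y, w in zip(extended_states, weights_with_bias))


# Hypothetical values, chosen only for illustration.
states = [0.2, 0.9, 0.5]
weights = [1.0, -2.0, 0.5]
theta = 0.3

x_explicit = total_input(states, weights, theta)
x_folded = total_input_with_bias(states, weights + [-theta])
```

Both calls give the same total input, which is why the threshold can be learned like any other weight.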
The functions relating inputs to outputs of visible and hidden units are smooth monotonic functions with bounded derivatives. For hidden units we use the logistic function:

$$y_j = \sigma(x_j) = \frac{1}{1 + e^{-x_j}} \tag{2}$$

Other smooth monotonic functions would serve as well. For visible units, our mathematical analysis focuses on the linear case in which the output equals the total input, though in simulations we use the logistic function.

We have already given a verbal description of the learning rule for the hidden-to-visible connections. The weight, $w_{ij}$, from the $j$th hidden unit to the $i$th visible unit is changed as follows:

$$\Delta w_{ij} = \epsilon \, y_j^{(1)} \left[ y_i^{(0)} - y_i^{(2)} \right] \tag{3}$$

where $y_i^{(0)}$ is the state of the $i$th visible unit at time 0 and $y_i^{(2)}$ is its state at time 2 after activity has passed around the loop once. The rule for the visible-to-hidden connections is identical:

$$\Delta w_{ji} = \epsilon \, y_i^{(2)} \left[ y_j^{(1)} - y_j^{(3)} \right] \tag{4}$$

where $y_j^{(1)}$ is the state of the $j$th hidden unit at time 1 (on the first pass around the loop) and $y_j^{(3)}$ is its state at time 3 (on the second pass around the loop). Fig. 2 shows the network exploded in time.
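The two learning rules can be sketched as follows. This is a minimal illustration, assuming the unit states at times 0 to 3 have already been computed; the function names, the nested-list weight layout and the example states are our own choices, not part of the paper.

```python
def hidden_to_visible_updates(eps, y_vis0, y_hid1, y_vis2):
    """Eq 3: dw[i][j] = eps * y_j(1) * (y_i(0) - y_i(2))."""
    return [[eps * yj1 * (yi0 - yi2) for yj1 in y_hid1]
            for yi0, yi2 in zip(y_vis0, y_vis2)]


def visible_to_hidden_updates(eps, y_vis2, y_hid1, y_hid3):
    """Eq 4: dw[j][i] = eps * y_i(2) * (y_j(1) - y_j(3))."""
    return [[eps * yi2 * (yj1 - yj3) for yi2 in y_vis2]
            for yj1, yj3 in zip(y_hid1, y_hid3)]


# Hypothetical states for two visible units and one hidden unit.
y_vis0 = [1.0, 0.0]   # visible states at time 0
y_hid1 = [0.5]        # hidden state at time 1
y_vis2 = [0.8, 0.1]   # visible states at time 2
y_hid3 = [0.4]        # hidden state at time 3

dW_hv = hidden_to_visible_updates(1.0, y_vis0, y_hid1, y_vis2)
dW_vh = visible_to_hidden_updates(1.0, y_vis2, y_hid1, y_hid3)
```

Both rules have the same form: the activity of the sending unit multiplied by the difference in the receiving unit's activity between the two passes.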
In general, this rule for changing the visible-to-hidden connections does not perform steepest descent in the squared reconstruction error, so it behaves differently from back-propagation. This raises two issues: Under what conditions does it work, and under what conditions does it approximate steepest descent?
Fig. 2. A diagram showing the states of the visible and hidden units exploded in time. The visible units are at the bottom and the hidden units are at the top. Time goes from left to right.
CONDITIONS UNDER WHICH RECIRCULATION APPROXIMATES GRADIENT DESCENT
For the simple architecture shown in Fig. 2, the recirculation learning procedure changes the visible-to-hidden weights in the direction of steepest descent in the squared reconstruction error provided the following conditions hold:
1. The visible units are linear.
2. The weights are symmetrical (i.e. $w_{ji} = w_{ij}$ for all $i, j$).
3. The visible units have high regression.
"Regression" means that, after one pass around the loop, instead of setting the activity of a visible unit, $i$, to be equal to its current total input, $x_i^{(2)}$, as determined by Eq 1, we set its activity to be

$$y_i^{(2)} = \lambda \, y_i^{(0)} + (1 - \lambda) \, x_i^{(2)} \tag{5}$$

where the regression, $\lambda$, is close to 1. Using high regression ensures that the visible units only change state slightly so that when the new visible vector is sent around the loop again on the second pass, it has very similar effects to the first pass. In order to make the learning rule for the hidden units as similar as possible to the rule for the visible units, we also use regression in computing the activity of the hidden units on the second pass:

$$y_j^{(3)} = \lambda \, y_j^{(1)} + (1 - \lambda) \, \sigma\!\left(x_j^{(3)}\right) \tag{6}$$
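The two passes with regression can be sketched as below, for linear visible units and logistic hidden units as in the analysis. This is an illustrative sketch under our own assumptions: thresholds are omitted for brevity, and the names `recirculate`, `W_vh` and `W_hv` and all numerical values are hypothetical.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Total inputs (Eq 1 without thresholds): one row of W per receiving unit."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def recirculate(y_vis0, W_vh, W_hv, lam):
    """Return (y_hid1, y_vis2, y_hid3) for regression lam."""
    y_hid1 = [logistic(x) for x in matvec(W_vh, y_vis0)]           # time 1
    x_vis2 = matvec(W_hv, y_hid1)
    y_vis2 = [lam * y0 + (1 - lam) * x                             # Eq 5
              for y0, x in zip(y_vis0, x_vis2)]
    x_hid3 = matvec(W_vh, y_vis2)
    y_hid3 = [lam * y1 + (1 - lam) * logistic(x)                   # Eq 6
              for y1, x in zip(y_hid1, x_hid3)]
    return y_hid1, y_vis2, y_hid3


# Hypothetical weights: 2 visible units, 2 hidden units.
W_vh = [[0.3, -0.2], [0.1, 0.4]]   # W_vh[j][i]: visible i -> hidden j
W_hv = [[0.5, -0.1], [0.2, 0.3]]   # W_hv[i][j]: hidden j -> visible i
y0 = [1.0, 0.0]

h1, v2, h3 = recirculate(y0, W_vh, W_hv, 0.75)
```

With a regression of 0.75 each visible unit moves only a quarter of the way from its original state towards its reconstructed value, so the second pass differs only slightly from the first.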
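For a given input vector, the squared reconstruction error is

$$E = \frac{1}{2} \sum_k \left[ y_k^{(2)} - y_k^{(0)} \right]^2$$

For a hidden unit, $j$,

$$\frac{\partial E}{\partial y_j^{(1)}} = \sum_k \frac{\partial E}{\partial y_k^{(2)}} \, \frac{d y_k^{(2)}}{d x_k^{(2)}} \, \frac{\partial x_k^{(2)}}{\partial y_j^{(1)}} = \sum_k \left[ y_k^{(2)} - y_k^{(0)} \right] y_k'^{(2)} \, w_{kj} \tag{7}$$

where

$$y_k'^{(2)} = \frac{d y_k^{(2)}}{d x_k^{(2)}}$$

For a visible-to-hidden weight $w_{ji}$,

$$\frac{\partial E}{\partial w_{ji}} = y_j'^{(1)} \, y_i^{(0)} \, \frac{\partial E}{\partial y_j^{(1)}}$$

So, using Eq 7 and the assumption that $w_{kj} = w_{jk}$ for all $k, j$,

$$\frac{\partial E}{\partial w_{ji}} = y_j'^{(1)} \, y_i^{(0)} \left[ \sum_k y_k^{(2)} \, y_k'^{(2)} \, w_{jk} - \sum_k y_k^{(0)} \, y_k'^{(2)} \, w_{jk} \right]$$

The assumption that the visible units are linear (with a gradient of 1) means that for all $k$, $y_k'^{(2)} = 1$. So using Eq 1 we have

$$\frac{\partial E}{\partial w_{ji}} = y_j'^{(1)} \, y_i^{(0)} \left[ x_j^{(3)} - x_j^{(1)} \right] \tag{8}$$

Now, with sufficiently high regression, we can assume that the states of units only change slightly with time so that

$$y_j'^{(1)} \left[ x_j^{(3)} - x_j^{(1)} \right] \approx \sigma\!\left(x_j^{(3)}\right) - \sigma\!\left(x_j^{(1)}\right) = \frac{1}{1 - \lambda} \left[ y_j^{(3)} - y_j^{(1)} \right]$$

and

$$y_i^{(0)} \approx y_i^{(2)}$$

So by substituting in Eq 8 we get

$$\frac{\partial E}{\partial w_{ji}} \approx \frac{1}{1 - \lambda} \, y_i^{(2)} \left[ y_j^{(3)} - y_j^{(1)} \right] \tag{9}$$

An interesting property of Eq 9 is that it does not contain a term for the gradient of the input-output function of unit $j$, so recirculation learning can be applied even when unit $j$ uses an unknown non-linearity. To do back-propagation it is necessary to know the gradient of the non-linearity, but recirculation measures the gradient by measuring the effect of a small difference in input, so the term $y_j^{(3)} - y_j^{(1)}$ implicitly contains the gradient.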
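The claim that the state difference implicitly contains the gradient can be checked numerically for a single hidden unit. The sketch below compares the measured quantity $[y_j^{(3)} - y_j^{(1)}]/(1-\lambda)$, which needs only the outputs of the unit, with the explicit product of the derivative and the change in total input. The function names and the chosen inputs are illustrative assumptions.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def logistic_prime(x):
    s = logistic(x)
    return s * (1.0 - s)


def implicit_term(f, x1, x3, lam):
    """(y(3) - y(1)) / (1 - lam), using only the unit's outputs (Eq 6)."""
    y1 = f(x1)
    y3 = lam * y1 + (1 - lam) * f(x3)
    return (y3 - y1) / (1 - lam)


def explicit_term(f_prime, x1, x3):
    """y'(1) * (x(3) - x(1)), which requires knowing the derivative of f."""
    return f_prime(x1) * (x3 - x1)


# Hypothetical total inputs that differ only slightly, as under high regression.
x1, x3, lam = 0.4, 0.41, 0.75

measured = implicit_term(logistic, x1, x3, lam)
computed = explicit_term(logistic_prime, x1, x3)

# The same measurement works for another smooth non-linearity, here tanh,
# without implicit_term ever being told its derivative.
measured_tanh = implicit_term(math.tanh, x1, x3, lam)
computed_tanh = explicit_term(lambda x: 1.0 - math.tanh(x) ** 2, x1, x3)
```

The two quantities agree to first order in the input difference, and the agreement degrades as the difference between the two passes grows, which is why high regression matters for the analysis.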
A SIMULATION OF RECIRCULATION
From a biological standpoint, the symmetry requirement that $w_{ij} = w_{ji}$ is unrealistic unless it can be shown that this symmetry of the weights can be learned. To investigate what would happen if symmetry was not enforced (and if the visible units used the same non-linearity as the hidden units), we applied the recirculation learning procedure to a network with 4 visible units and 2 hidden units. The visible vectors were 1000, 0100, 0010 and 0001, so the 2 hidden units had to learn 4 different codes to represent these four visible vectors. All the weights and biases in the network were started at small random values uniformly distributed in the range -0.5 to +0.5. We used regression in the hidden units, even though this is not strictly necessary, but we ignored the term $1/(1-\lambda)$ in Eq 9.

Using an $\epsilon$ of 20 and a $\lambda$ of 0.75 for both the visible and the hidden units, the network learned to produce a reconstruction error of less than 0.1 on every unit in an average of 48 weight updates (with a maximum of 202 in 100 simulations). Each weight update was performed after trying all four training cases and the change was the sum of the four changes prescribed by Eq 3 or 4 as appropriate. The final reconstruction error was measured using a regression of 0, even though high regression was used during the learning. The learning speed is comparable with back-propagation, though a precise comparison is hard because the optimal values of $\epsilon$ are different in the two cases. Also, the fact that we ignored the term $1/(1-\lambda)$ when modifying the visible-to-hidden weights means that recirculation tends to change the visible-to-hidden weights more slowly than the hidden-to-visible weights and this would also help back-propagation.
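The simulation described above can be sketched in a few dozen lines. This is our own reconstruction from the description, not the authors' code: the treatment of biases as weights on an input fixed at 1, the stopping test, the seed and all names are assumptions. Whether and how quickly this sketch reaches the error criterion depends on the seed; the figures reported above come from the authors' simulations, not from this code.

```python
import math
import random


def logistic(x):
    # Numerically stable form, so large weights cannot overflow exp().
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def forward(W, b, v):
    """Logistic outputs of a group with weights W[j][i] and biases b[j]."""
    return [logistic(sum(w * a for w, a in zip(row, v)) + bj)
            for row, bj in zip(W, b)]


def max_error(cases, W_vh, b_h, W_hv, b_v):
    """Largest per-unit reconstruction error, measured with regression 0."""
    worst = 0.0
    for v0 in cases:
        v2 = forward(W_hv, b_v, forward(W_vh, b_h, v0))
        worst = max(worst, max(abs(p - q) for p, q in zip(v0, v2)))
    return worst


def train(seed=0, eps=20.0, lam=0.75, max_updates=1000, tol=0.1):
    rng = random.Random(seed)
    n_vis, n_hid = 4, 2
    cases = [[1.0 if i == k else 0.0 for i in range(n_vis)] for k in range(n_vis)]
    W_vh = [[rng.uniform(-0.5, 0.5) for _ in range(n_vis)] for _ in range(n_hid)]
    b_h = [rng.uniform(-0.5, 0.5) for _ in range(n_hid)]
    W_hv = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_vis)]
    b_v = [rng.uniform(-0.5, 0.5) for _ in range(n_vis)]
    for update in range(max_updates):
        if max_error(cases, W_vh, b_h, W_hv, b_v) < tol:
            return update, W_vh, W_hv
        dW_vh = [[0.0] * n_vis for _ in range(n_hid)]
        db_h = [0.0] * n_hid
        dW_hv = [[0.0] * n_hid for _ in range(n_vis)]
        db_v = [0.0] * n_vis
        for v0 in cases:  # sum the changes over all four training cases
            h1 = forward(W_vh, b_h, v0)
            v2 = [lam * a + (1 - lam) * r
                  for a, r in zip(v0, forward(W_hv, b_v, h1))]
            h3 = [lam * a + (1 - lam) * r
                  for a, r in zip(h1, forward(W_vh, b_h, v2))]
            for i in range(n_vis):
                db_v[i] += eps * (v0[i] - v2[i])
                for j in range(n_hid):
                    dW_hv[i][j] += eps * h1[j] * (v0[i] - v2[i])   # Eq 3
            for j in range(n_hid):
                db_h[j] += eps * (h1[j] - h3[j])
                for i in range(n_vis):
                    dW_vh[j][i] += eps * v2[i] * (h1[j] - h3[j])   # Eq 4
        for i in range(n_vis):
            b_v[i] += db_v[i]
            for j in range(n_hid):
                W_hv[i][j] += dW_hv[i][j]
        for j in range(n_hid):
            b_h[j] += db_h[j]
            for i in range(n_vis):
                W_vh[j][i] += dW_vh[j][i]
    return max_updates, W_vh, W_hv
```

Calling `train(seed=0)` returns the number of weight updates performed together with the two weight matrices; the factor $1/(1-\lambda)$ is ignored in both rules, as in the text.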
It is not immediately obvious why the recirculation learning procedure works when the weights are not constrained to be symmetrical, so we compared the weight changes prescribed by the recirculation procedure with the weight changes that would cause steepest descent in the sum squared reconstruction error (i.e. the weight changes prescribed by back-propagation). As expected, recirculation and back-propagation agree on the weight changes for the hidden-to-visible connections, even though the gradient of the logistic function is not taken into account in weight adjustments under recirculation. (Conrad Galland has observed that this agreement is only slightly affected by using visible units that have the non-linear input-output function shown in Eq 2 because at any stage of the learning, all the visible units tend to have similar slopes for their input-output functions, so the non-linearity scales all the weight changes by approximately the same amount.)

For the visible-to-hidden connections, recirculation initially prescribes weight changes that are only randomly related to the direction of steepest descent, so these changes do not help to improve the performance of the system. As the learning proceeds, however, these changes come to agree with the direction of steepest descent. The crucial observation is that this agreement occurs after the hidden-to-visible weights have changed in such a way that they are approximately aligned (symmetrical up to a constant factor) with the visible-to-hidden weights. So it appears that changing the hidden-to-visible weights in the direction of steepest descent creates the conditions that are necessary for the recirculation procedure to cause changes in the visible-to-hidden weights that follow the direction of steepest descent.

It is not hard to see why this happens if we start with random, zero-mean visible-to-hidden weights. If the visible-to-hidden weight $w_{ji}$ is positive, hidden unit $j$ will tend to have a higher than average activity level when the $i$th visible unit has a higher than average activity. So $y_j$ will tend to be higher than average when the reconstructed value of $y_i$ should be higher than average, i.e. when the term $[y_i^{(0)} - y_i^{(2)}]$ in Eq 3 is positive. It will also be lower than average when this term is negative. These relationships will be reversed if $w_{ji}$ is negative, so $w_{ij}$ will grow faster when $w_{ji}$ is positive than it will when $w_{ji}$ is negative. Smolensky [4] presents a mathematical analysis that shows why a similar learning procedure creates symmetrical weights in a purely linear system. Williams [5] also analyses a related learning rule for linear systems which he calls the "symmetric error correction" procedure and he shows that it performs principal components analysis. In our simulations of recirculation, the visible-to-hidden weights become aligned with the corresponding hidden-to-visible weights, though the hidden-to-visible weights are generally of larger magnitude.
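One way to quantify "symmetrical up to a constant factor" is the cosine of the angle between the visible-to-hidden weights and the transposed hidden-to-visible weights. The paper does not say how alignment was measured, so the measure below, the function name and the example weights are assumptions made for illustration.

```python
import math


def alignment(W_vh, W_hv):
    """Cosine between W_vh[j][i] and the transposed W_hv[i][j].

    Returns 1.0 when the two sets of weights are symmetrical up to a
    positive constant factor, and -1.0 when they are anti-aligned.
    """
    n_hid, n_vis = len(W_vh), len(W_vh[0])
    a = [W_vh[j][i] for j in range(n_hid) for i in range(n_vis)]
    b = [W_hv[i][j] for j in range(n_hid) for i in range(n_vis)]
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm


# Hypothetical weights: W_hv is 3 times the transpose of W_vh, so the two
# sets are aligned even though the hidden-to-visible weights are larger.
W_vh = [[1.0, -2.0], [0.5, 3.0]]
W_hv = [[3.0, 1.5], [-6.0, 9.0]]
aligned = alignment(W_vh, W_hv)
```

Because the measure ignores overall scale, it reports full alignment in the situation described above, where the hidden-to-visible weights are of larger magnitude.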
A PICTURE OF RECIRCULATION
To gain more insight into the conditions under which recirculation learning produces the appropriate changes in the visible-to-hidden weights, we introduce the pictorial representation shown in Fig. 3. The initial visible vector, A, is mapped into the reconstructed vector, C, so the error vector is AC. Using high regression, the visible vector that is sent around the loop on the second pass is P, where the difference vector AP is a small fraction of the error vector AC. If the regression is sufficiently high and all the non-linearities in the system have bounded derivatives and the weights have bounded magnitudes, the difference vectors AP, BQ, and CR will be very small and we can assume that, to first order, the system behaves linearly in these difference vectors. If, for example, we moved P so as to double the length of AP we would also double the length of BQ and CR.
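The first-order linearity in the difference vectors can be checked numerically. The sketch below builds A, B, C and P, Q, R for a small network with logistic units and fixed, made-up weights (thresholds omitted), then doubles the length of AP by doubling $1-\lambda$. All names and values are illustrative assumptions.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(W, v):
    return [logistic(sum(w * a for w, a in zip(row, v))) for row in W]


def length(u, v):
    """Euclidean length of the difference vector between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def difference_lengths(A, W_vh, W_hv, step):
    """Lengths of AP, BQ and CR when P = A + step * (C - A), step = 1 - lambda."""
    B = layer(W_vh, A)      # hidden image of A
    C = layer(W_hv, B)      # visible image of A
    P = [a + step * (c - a) for a, c in zip(A, C)]
    Q = layer(W_vh, P)      # hidden image of P
    R = layer(W_hv, Q)      # visible image of P
    return length(A, P), length(B, Q), length(C, R)


# Hypothetical network: 3 visible units, 2 hidden units.
W_vh = [[0.3, -0.2, 0.5], [-0.4, 0.1, 0.2]]
W_hv = [[0.2, -0.3], [0.4, 0.1], [-0.1, 0.5]]
A = [1.0, 0.0, 0.5]

ap1, bq1, cr1 = difference_lengths(A, W_vh, W_hv, 0.001)
ap2, bq2, cr2 = difference_lengths(A, W_vh, W_hv, 0.002)
ratios = (ap2 / ap1, bq2 / bq1, cr2 / cr1)
```

For such small steps all three ratios are close to 2; with larger steps the curvature of the logistic makes the ratios for BQ and CR drift away from 2.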
Fig. 3. A diagram showing some vectors (A, P) over the visible units, their "hidden" images (B, Q) over the hidden units, and their "visible" images (C, R) over the visible units. The vectors B' and C' are the hidden and visible images of A after the visible-to-hidden weights have been changed by the learning procedure.
Suppose we change the visible-to-hidden weights in the manner prescribed by Eq 4, using a very small value of $\epsilon$. Let Q' be the hidden image of P (i.e. the image of P in the hidden units) after the weight changes. To first order, Q' will lie between B and Q on the line BQ. This follows from the observation that Eq 4 has the effect of moving each $y_j^{(3)}$ towards $y_j^{(1)}$ by an amount proportional to their difference. Since B is close to Q, a weight change that moves the hidden image of P from Q to Q' will move the hidden image of A from B to B', where B' lies on the extension of the line BQ as shown in Fig. 3. If the hidden-to-visible weights are not changed, the visible image of A will move from C to C', where C' lies on the extension of the line CR as shown in Fig. 3. So the visible-to-hidden weight changes will reduce the squared reconstruction error provided the vector CR is approximately parallel to the vector AP.
But why should we expect the vector CR to be aligned with the vector AP? In general we should not, except when the visible-to-hidden and hidden-to-visible weights are approximately aligned. The learning in the hidden-to-visible connections has a tendency to cause this alignment. In addition, it is easy to modify the recirculation learning procedure so as to increase the tendency for the learning in the hidden-to-visible connections to cause alignment. Eq 3 has the effect of moving the visible image of A closer to A by an amount proportional to the magnitude of the error vector AC. If we apply the same rule on the next pass around the loop, we move the visible image of P closer to P by an amount proportional to the magnitude of PR. If the vector CR is anti-aligned with the vector AP, the magnitude of AC will exceed the magnitude of PR, so the result of these two movements will be to improve the alignment between AP and CR. We have not yet tested this modified procedure through simulations, however.
This is only an informal argument and much work remains to be done in establishing the precise conditions under which the recirculation learning procedure approximates steepest descent. The informal argument applies equally well to systems that contain longer loops which have several groups of hidden units arranged in series. At each stage in the loop, the same learning procedure can be applied, and the weight changes will approximate gradient descent provided the difference of the two visible vectors that are sent around the loop aligns with the difference of their images. We have not yet done enough simulations to develop a clear picture of the conditions under which the changes in the hidden-to-visible weights produce the required alignment.
USING A HIERARCHY OF CLOSED LOOPS
Instead of using a single loop that contains many hidden layers in series, it is possible to use a more modular system. Each module consists of one "visible" group and one "hidden" group connected in a closed loop, but the visible group for one module is actually composed of the hidden groups of several lower level modules, as shown in Fig. 4. Since the same learning rule is used for both visible and hidden units, there is no problem in applying it to systems in which some units are the visible units of one module and the hidden units of another. Ballard [6] has experimented with back-propagation in this kind of system, and we have run some simulations of recirculation using the architecture shown in Fig. 4. The network learned to encode a set of vectors specified over the bottom layer. After learning, each of the vectors became an attractor and the network was capable of completing a partial vector, even though this involved passing information through several layers.
Fig. 4. A network in which the hidden units of the bottom two modules are the visible units of the top module.
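The wiring of such a hierarchy can be sketched as a simple data structure in which the visible group of the top module is the concatenation of the hidden groups of the lower modules. The class, the function names and the group sizes below are hypothetical; the paper does not give the sizes used in its simulations.

```python
from dataclasses import dataclass


@dataclass
class Module:
    n_visible: int   # size of the module's "visible" group
    n_hidden: int    # size of the module's "hidden" group


def stack_module(lower_modules, n_hidden_top):
    """Build a top module whose visible group is made of the lower hidden groups."""
    n_visible_top = sum(m.n_hidden for m in lower_modules)
    return Module(n_visible_top, n_hidden_top)


def top_visible_vector(lower_hidden_states):
    """Concatenate the hidden state vectors of the lower modules."""
    return [y for hidden in lower_hidden_states for y in hidden]


# Two bottom modules and one top module, as in Fig. 4 (sizes are made up).
bottom = [Module(4, 2), Module(4, 2)]
top = stack_module(bottom, 3)
visible_for_top = top_visible_vector([[0.1, 0.9], [0.8, 0.2]])
```

Because the same learning rule applies to visible and hidden units, each module can in principle be trained with the same update functions regardless of its level.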
CONCLUSION
We have described a simple learning procedure that is capable of forming representations in non-linear hidden units whose input-output functions have bounded derivatives. The procedure is easy to implement in hardware, even if the non-linearity is unknown. Given some strong assumptions, the procedure performs gradient descent in the reconstruction error. If the symmetry assumption is violated, the learning procedure still works because the changes in the hidden-to-visible weights produce symmetry. If the assumption about the linearity of the visible units is violated, the procedure still works in the cases we have simulated. For the general case of a loop with many non-linear stages, we have an informal picture of a condition that must hold for the procedure to approximate gradient descent, but we do not have a formal analysis, and we do not have sufficient experience with simulations to give an empirical description of the general conditions under which the learning procedure works.
REFERENCES
1. D. E. Rumelhart, G. E. Hinton and R. J. Williams, Nature 323, 533-536 (1986).
2. D. H. Ackley, G. E. Hinton and T. J. Sejnowski, Cognitive Science 9, 147-169 (1985).
3. G. Cottrell, J. L. Elman and D. Zipser, Proc. Cognitive Science Society, Seattle, WA (1987).
4. P. Smolensky, Technical Report CU-CS-355-87, University of Colorado at Boulder (1986).
5. R. J. Williams, Technical Report 8501, Institute of Cognitive Science, University of California, San Diego (1985).
6. D. H. Ballard, Proc. American Association for Artificial Intelligence, Seattle, WA (1987).
SCHEMA FOR MOTOR CONTROL UTILIZING A NETWORK MODEL OF THE CEREBELLUM
James C. Houk, Ph.D.
Northwestern University Medical School, Chicago, Illinois 60201
ABSTRACT
This paper outlines a schema for movement control based on two stages of signal processing. The higher stage is a neural network model that treats the cerebellum as an array of adjustable motor pattern generators. This network uses sensory input to preset and to trigger elemental pattern generators and to evaluate their performance. The actual patterned outputs, however, are produced by intrinsic circuitry that includes recurrent loops and is thus capable of self-sustained activity. These patterned outputs are sent as motor commands to local feedback systems called motor servos. The latter control the forces and lengths of individual muscles. Overall control is thus achieved in two stages: (1) an adaptive cerebellar network generates an array of feedforward motor commands and (2) a set of local feedback systems translates these commands into actual movements.
INTRODUCTION
There is considerable evidence that the cerebellum is involved in the adaptive control of movement [1], although the manner in which this control is achieved is not well understood. As a means of probing these cerebellar mechanisms, my colleagues and I have been conducting microelectrode studies of the neural messages that flow through the intermediate division of the cerebellum and onward to limb muscles via the rubrospinal tract. We regard this cerebellorubrospinal pathway as a useful model system for studying general problems of sensorimotor integration and adaptive brain function. A summary of our findings has been published as a book chapter.

On the basis of these and other neurophysiological results, I recently hypothesized that the cerebellum functions as an array of adjustable motor pattern generators. The outputs from these pattern generators are assumed to function as motor commands, i.e., as neural control signals that are sent to lower-level motor systems where they produce movements. According to this hypothesis, the cerebellum uses its extensive sensory input to preset the