


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 35, NO. 4, JULY 1989

On the Capacity of Associative Memories with Linear Threshold Functions

AMIR DEMBO

Abstract—For associative memories composed of $N$ linear threshold functions without self-connections, even when the Hamming distances between the desired memories are within $\alpha N$ and $(1-\alpha)N$, there are sets of size $\lceil 1/(1-2\alpha)\rceil + 1$ (for $\alpha < 1/2$) the elements of which cannot be simultaneously stable. A similar phenomenon holds for the sum-of-outer-products connection matrix. We characterize the programming rules which overcome this difficulty by allowing any set of linearly independent vectors as stable memories. This class extends the spectral schemes suggested previously. For spectral schemes with randomly chosen $O[N/\ln N]$ memories, we show that almost all of the Hamming sphere around each memory is directly attracted, similarly to what was shown by McEliece et al. for the sum-of-outer-products scheme.

I. INTRODUCTION

In 1982, Hopfield [1] introduced a new description of associative or content-addressable memory, based on his studies of collective computation in neural networks (cf. also earlier works of Anderson, Kohonen, Grossberg, and Little on the same subject). Hopfield and others have claimed that this memory is attractive for many applications, such as the solution of complex minimization problems [2], [3], and nonlinear classifications [4].

The memory is implemented via a network of $N$ pairwise connected neurons. Each neuron can be in one of two states: off (denoted by $-1$) and on (denoted by $+1$). Thus the state of the network is defined by an $N$-dimensional vector $u$, composed of $\pm 1$'s, with the $i$th component corresponding to the state of the $i$th neuron. The synaptic connections between the neurons are modeled by an $N \times N$ real-valued matrix $W$, with the $(i,j)$ element $w_{ij}$ representing the strength of the synaptic connection from neuron $j$ to neuron $i$. Given a state vector $u$, the new input vector to the neurons is $Wu$, and each neuron simulates a threshold function, i.e., its new state $u_i$ will be $+1$ if and only if (iff) $(Wu)_i \ge t_i$, and $-1$ iff $(Wu)_i < t_i$, with $t$ a real-valued $N$-dimensional threshold vector. Thus the network is characterized by the pair $(W, t)$.

For every pair of $N$-dimensional vectors $v, u$, with $v$ real valued and $u$ composed of $\pm 1$'s, let $v \to_t u$ denote the relations $v_i \ge t_i$ iff $u_i = +1$, and $v_i < t_i$ iff $u_i = -1$ (where we sometimes omit the $t$ under the arrow for convenience).

Manuscript received April 10, 1987; revised September 1988. The author was with Bell Laboratories, Murray Hill, NJ 07974. He is now with the Information Systems Laboratory, Stanford University, Stanford, CA 94305. IEEE Log Number 8920034.

The mode of operation of the network defines the order in which the components are updated. However, all the modes share the same set of stable states, since a vector $u$ of $\pm 1$'s is a stable state of the network iff $Wu \to_t u$.

Given an arbitrary input vector $u^{(0)}$, the network generates an infinite sequence of states $u^{(0)}, u^{(1)}, \ldots, u^{(n)}, \ldots$, with the following relations:

a) in the parallel mode of operation, $Wu^{(n)} \to_t u^{(n+1)}$;
b) in the serial modes, an index $N \ge i_0(n) \ge 1$ is chosen (either deterministically or randomly), and the vector $u^{(n+1)}$ is generated by $(Wu^{(n)})_{i_0} \to_{t_{i_0}} u^{(n+1)}_{i_0}$ and $u^{(n+1)}_i = u^{(n)}_i$ for all $i \ne i_0$.

Since the number of possible states is finite ($2^N$), every sequence of states becomes periodic after a finite number of iterations. When the period is one, the network is in a stable state $u$ to which it converged from $u^{(0)}$, i.e., $u^{(0)}$ is in the domain of attraction of $u$. However, there may be memories $(W, t)$ for which not all inputs converge to stable states, but rather some of the inputs are attracted by loops of period two and above. A simple example of this phenomenon is $W = -I$, $t = 0$, in which every $u^{(0)}$ belongs to the loop $u^{(0)} \to_t -u^{(0)} \to_t u^{(0)}$, which is of period two in the parallel mode and $2N$ in the serial (deterministic) mode.
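As a concrete illustration of these dynamics, here is a minimal sketch (ours, in Python with numpy; the function names are our own and not from the paper) of one parallel step and one serial step, together with the period-two example just mentioned:

```python
import numpy as np

def parallel_step(W, t, u):
    """One parallel update: every component is refreshed at once;
    the convention (Wu)_i >= t_i maps component i to +1."""
    return np.where(W @ u >= t, 1, -1)

def serial_step(W, t, u, i0):
    """One serial update: only component i0 is refreshed."""
    v = u.copy()
    v[i0] = 1 if (W @ u)[i0] >= t[i0] else -1
    return v

# the W = -I, t = 0 network: period-two loop in the parallel mode
N = 5
u = np.random.choice([-1, 1], size=N)
W, t = -np.eye(N), np.zeros(N)
u1 = parallel_step(W, t, u)
u2 = parallel_step(W, t, u1)
assert np.array_equal(u1, -u) and np.array_equal(u2, u)
```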

When one desires to use this network as an associative memory, one seeks a programming rule for determining the value of $(W, t)$. In this work we analyze the performance of various programming rules in terms of the following two properties:

A) given an arbitrary set of desired memories $u^1, u^2, \ldots, u^K$, these vectors should indeed be stable vectors, i.e., $Wu^i \to_t u^i$, $i = 1, \ldots, K$;
B) probe vectors that are close (in the Hamming distance) to $u^i$ ought to be in its domain of attraction.

The next section concentrates on property A, while Section III deals with property B. The last section contains our concluding remarks.

The capacity $C$ of the subclass of networks without self-connections (i.e., for all $i$, $w_{ii} = 0$) was defined in [5] as the maximal number $K$ such that for any $u^1, u^2, \ldots, u^K$ it is possible to find $(W, t)$ having property A. The hyperplane counting arguments of [7], [18] are used in [5] to show that $C \le N$.

We note that if self-connections are allowed, then the capacity is $2^N$, by always using the trivial system $(I, 0)$, for which every $\pm 1$ vector is stable. This is not the only trivial system. For example, every $W = I + E$ with the elements of $E$ satisfying $\sum_{j=1}^N |e_{ij}| < 1$ for all $i$ is a trivial system for $t = 0$ (as $u + Eu \to_0 u$ for every $\pm 1$ vector $u$). Thus, forbidding self-connections is one plausible systematic way to exclude trivial systems from our discussion.
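A quick numeric check of this family of trivial systems (our own sketch, assuming numpy) confirms that every $\pm 1$ vector stays fixed once the rows of $E$ have absolute sum below one:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 16
E = rng.uniform(-1, 1, size=(N, N))
E /= np.abs(E).sum(axis=1, keepdims=True) * 1.01  # force sum_j |e_ij| < 1
W = np.eye(N) + E                                 # trivial system, t = 0

for _ in range(100):
    u = rng.choice([-1.0, 1.0], size=N)
    # |(Eu)_i| < 1, so u_i + (Eu)_i keeps the sign of u_i: u is stable
    assert np.array_equal(np.where(W @ u >= 0, 1, -1), u)
```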

In [19] and [20] it was shown that essentially $C = 1$ when self-connections are forbidden. Thus the upper bound of [5] is somewhat misleading. A similar result ($C = 2$) is obtained here for Hopfield's sum-of-outer-products scheme [$W = \sum_{i=1}^K u^iu^{iT}$, $t = 0$], which was investigated in [6]. Since the sets that force this low capacity are composed of vectors with extremely small (or extremely large) Hamming distances, we investigate here $C(\alpha)$, $1/2 > \alpha > 0$, which is the capacity $C$ when the desired memories satisfy $|(u^i, u^j)| \le (1-2\alpha)N$ for every $K \ge i \ne j \ge 1$ (i.e., the Hamming distances between pairs are within $\alpha N$ and $(1-\alpha)N$). We show that asymptotically $C(\alpha) = \lceil 1/(1-2\alpha)\rceil$, independent of $N$, for both structures. Thus only for almost-orthogonal sets of vectors (i.e., $|(u^i, u^j)| \le o(N)$) does the capacity indeed grow with $N$. We then characterize the class of sign-preserving systems (SPS's), which are the programming rules for which property A holds for any set of linearly independent desired memories. To exclude trivial systems, we restrict the discussion to singular connection matrices, and conclude that the capacity of SPS's is at least $N$. We also show that the SPS is a natural generalization of the structures just mentioned, as well as of the spectral algorithms studied in [21]-[23].

While the worst-case capacity (as defined in [5]) of the former structures is poor, when the desired memories are randomly chosen and only average performance is of interest the picture changes. We prove that for the subclass of networks without self-connections, with $K \le N$, and for the sum-of-outer-products structure with $K \le (N-1)/4\ln N$ (cf. [6]), the probability that property A fails approaches zero as $N \to \infty$. Intuitively, the reason for this phenomenon is that almost all sets of random $\pm 1$ vectors are almost orthogonal.

Another definition of memory capacity (cf. [6]) is the maximal number of desired memories $K(\rho)$, $1/2 > \rho > 0$, such that each of them attracts elements in the Hamming sphere of radius $\rho N$ around it with probability that approaches one for $N \to \infty$. Upper bounds on the probability of no attraction for given networks are presented in Section III. It is first demonstrated (by an example) that if a network is designed (by the spectral method) for a set of desired memories having a subset with small Hamming distances, the constraint that each one of the vectors in this subset should be a stable state affects the shapes of the domains of attraction of the other memories as well. This undesired property is typical of networks based on linear threshold functions, because of the "mixing" inherent to these systems. We also obtain upper bounds on the probability of no attraction similar to [6] (using simpler arguments) for an extended class of sum-of-outer-products constructions and for a class of spectral algorithms.

II. WORST-CASE CAPACITY AND SIGN-PRESERVING SYSTEMS

We first show that the conventional constructions have limited capacity.

Proposition 1 ([19, Theorem 1], [20, Theorem 2]): The worst-case capacity of the subclass of networks with $w_{i_0i_0} = 0$ for a given $N \ge i_0 \ge 1$ is $C = 1$.

Proposition 2: The worst-case capacity of the sum-of-outer-products networks ($t = 0$, $W = \sum_{i=1}^K u^iu^{iT}$, so $w_{ii} \ne 0$ for all $i$) is $C = 2$ for $N \ge 6$.

Proof: For $K = 2$, $Wu^1 = Nu^1 + (u^2, u^1)u^2$ and $Wu^2 = Nu^2 + (u^2, u^1)u^1$. Since $u^2 \ne \pm u^1$, $|(u^2, u^1)| < N$, and thus $Wu^1 \to_0 u^1$, $Wu^2 \to_0 u^2$ (the antipodal case $u^2 = -u^1$ is immediate), i.e., $C \ge 2$. Choose now $u^1, u^2, u^3$ having the same $(N-2)$ last components, and first two components $(-1,-1)$ in $u^1$, $(+1,-1)$ in $u^2$, and $(+1,+1)$ in $u^3$. Then the first component of $Wu^1$ is $-N + (N-2) + (N-4) = N-6$, so that for every $N \ge 6$, $Wu^1 \nrightarrow_0 u^1$, and $C < 3$.
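This counterexample is easy to verify numerically; the following sketch (ours, assuming numpy) reproduces the computation of the first component of $Wu^1$:

```python
import numpy as np

N = 10  # any N >= 6 exhibits the failure
tail = np.random.choice([-1, 1], size=N - 2)   # shared last N-2 components
u1 = np.concatenate(([-1, -1], tail))
u2 = np.concatenate(([+1, -1], tail))
u3 = np.concatenate(([+1, +1], tail))

W = np.outer(u1, u1) + np.outer(u2, u2) + np.outer(u3, u3)  # t = 0

# (W u1)[0] = -N + (N-2) + (N-4) = N - 6 >= 0, but u1[0] = -1: not stable
assert (W @ u1)[0] == N - 6
```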

The sets that achieve the capacity in both structures contain a pair of vectors with Hamming distance one. Similar "extremal sets" can be constructed with a pair of vectors with Hamming distance $(N-1)$. Thus in [19] and [20] the feeling is that by excluding these examples the capacity may become as large as $N$. We will find the worst-case capacity $C(\alpha)$, with
$$\alpha = \left[1 - \max_{i \ne j}|(u^i, u^j)|/N\right]/2 \qquad (0 < \alpha < 1/2)$$
reflecting the normalized minimal Hamming distance in the set $\{u^1, \ldots, u^K\}$. When $N \to \infty$, while $\alpha < 1/2$ is held constant, $C(\alpha)$ remains constant and does not grow with $N$.

Proposition 3: For the sum-of-outer-products networks ($t = 0$, $W = \sum_{i=1}^K u^iu^{iT}$),
$$C(\alpha) = \left\lceil \frac{1}{1-2\alpha}\right\rceil$$
for large enough $N$, and $1/2 > \alpha > 0$.

Proof: We have that $Wu^i = Nu^i + \sum_{j\ne i}(u^j, u^i)u^j$; thus $Wu^i \to_0 u^i$ whenever $\sum_{j\ne i}|(u^j, u^i)| < N$. However, $\sum_{j\ne i}|(u^j, u^i)| \le (K-1)(1-2\alpha)N$ (by the definition of $\alpha$). Therefore, $C(\alpha) \ge \lceil 1/(1-2\alpha)\rceil$. Now choose $\{u^1, \ldots, u^K\}$ so that the first component of $u^1$ is $-1$, whereas for $u^2, \ldots, u^K$ it is $+1$. The next $(1-2\alpha)N - 1$ components are equal for $u^1, \ldots, u^K$, whereas the suffix of $2\alpha N$ components is composed of $K$ orthogonal vectors (each of size $2\alpha N$). For finite $K$ and $\alpha > 0$ this is always possible using large enough $N$ (choose $2\alpha N = 2^L \ge 2K$ and use $K$ columns of a Hadamard matrix of size $2^L \times 2^L$). Thus for all $i \ne j$, $|(u^i, u^j)| \le (1-2\alpha)N$, and the first component of $Wu^1$ equals $-N + (K-1)[(1-2\alpha)N - 2]$. For
$$K \ge 1 + \frac{1}{(1-2\alpha) - \frac{2}{N}}, \qquad Wu^1 \nrightarrow_0 u^1,$$


so that as $N \to \infty$, $C(\alpha) < \dfrac{1}{1-2\alpha} + 1$.

Proposition 4: For any subclass of networks with $w_{i_0i_0} = 0$ for a given $N \ge i_0 \ge 1$, $C(\alpha) \le \lceil 1/(1-2\alpha)\rceil$. Furthermore, for $N \to \infty$, $C(\alpha) = \lceil 1/(1-2\alpha)\rceil$.

Proof: a) Upper bound on $C(\alpha)$: Without loss of generality, choose $i_0 = N$. Now take $K = 2^L$ and consider the standard $K \times K$ Hadamard matrix. Generate an $N \times K$ matrix by repeating each row, except the last one, $(N-1)/(K-1)$ times, and let $\{u^1, \ldots, u^K\}$ be the columns of this matrix. An example of this set for $K = 4$, $N = 10$, $\alpha = 0.4$ is

u^1 = ( 1  1  1  1  1  1  1  1  1  1)
u^2 = ( 1  1  1 -1 -1 -1  1  1  1 -1)
u^3 = ( 1  1  1  1  1  1 -1 -1 -1 -1)
u^4 = ( 1  1  1 -1 -1 -1 -1 -1 -1  1).
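The construction is easy to reproduce; this small sketch (ours, assuming numpy and scipy) builds the $N \times K$ matrix above and checks that all pairwise correlations equal $(N-K)/(K-1)$:

```python
import numpy as np
from scipy.linalg import hadamard

K, N = 4, 10
H = hadamard(K)                                  # K x K Hadamard, K a power of 2
reps = [(N - 1) // (K - 1)] * (K - 1) + [1]      # repeat every row but the last
M = np.repeat(H, reps, axis=0)                   # N x K; columns are u^1..u^K

for i in range(K):
    for j in range(i + 1, K):
        assert abs(M[:, i] @ M[:, j]) == (N - K) // (K - 1)  # = 2 = (1-2a)N here
```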

Since the columns of the original $K \times K$ Hadamard matrix are orthogonal, it is easily verified that
$$|(u^i, u^j)| = \frac{N-K}{K-1} = (1-2\alpha)N, \qquad i \ne j,$$
or that
$$K = 1 + \frac{1 - \frac{1}{N}}{(1-2\alpha) + \frac{1}{N}}.$$

Thus, to complete the proof, it is sufficient to show that there is no $(W, t)$ with this set $\{u^1, \ldots, u^K\}$ as its stable states (independently of the specific values of $L$, $N$).

The requirements imposed by $Wu^i \to_t u^i$, $i = 1, \ldots, K$, on the last row of $W$ and on the last element of $t$ can be rewritten as $[u^1, \ldots, u^K]^T\tilde{w} \to_0 [u^1_N, \ldots, u^K_N]^T$, where the vector $\tilde{w}$ is the transpose of the last row of $W$, with the last element of $t$ subtracted from its first component (since $u^1$ is known to be the vector of all 1's). Since the matrix $[u^1, \ldots, u^K]$ was generated from a Hadamard matrix by repeating the various rows, we can rewrite our conditions on $\tilde{w}$ as $H\tilde{v} \to_0 [u^1_N, \ldots, u^K_N]$, with $H = H^T$ a $K \times K$ Hadamard matrix and $\tilde{v}$ a $K$-dimensional real-valued vector of unknowns, with $\tilde{v}_K = \tilde{w}_N = w_{NN} = 0$ (the last row appeared only once). Let $p = H\tilde{v}$. Then it is well known that $\tilde{v} = (1/K)Hp$, so $\tilde{v}_K = 0 = \sum_{i=1}^K H_{Ki}p_i = \sum_{i=1}^K u^i_Np_i$ (since the last row of $H$ is composed of the last components of $u^1, \ldots, u^K$). However, since $p \to_0 [u^1_N, \ldots, u^K_N]$, $u^i_Np_i \ge 0$ for every $K \ge i \ge 1$. Therefore, $0 = \sum_{i=1}^K u^i_Np_i$ iff $p = 0$, which implies that $u^i_N = +1$ for $K \ge i \ge 1$. A contradiction results (since the last row of $H$ is never all ones), so that there is no solution to $Wu^i \to_t u^i$ with $w_{NN} = 0$.

b) Asymptotic tightness of the upper bound: To show that for $N \to \infty$, $C(\alpha) = \lceil 1/(1-2\alpha)\rceil$, we consider the particular construction $W = \sum_{i=1}^K u^iu^{iT} - KI$ (for which $w_{ii} = 0$ for all $i$) and $t = 0$. Using the same approach as in the proof of Proposition 3, we get that stability can fail only when $(K-1)(1-2\alpha) \ge (1 - (K/N))$; thus for $N \to \infty$, $C(\alpha) = \lceil 1/(1-2\alpha)\rceil$.

As we have seen, both conventional structures admit a very limited worst-case capacity. This motivates the characterization of sign-preserving systems (SPS's), i.e., the programming rules for which any given set of linearly independent vectors can be made stable simultaneously. This characterization is an extension of the spectral algorithms studied in [21]-[23].

Proposition 5: Given $u^1, u^2, \ldots, u^K$ linearly independent $\pm 1$ vectors, the SPS's are exactly all the matrices $W = \sum_{i=1}^N x^iv^{iT}$, with $x^i \to_t u^i$ for $K \ge i \ge 1$, $x^{K+1}, \ldots, x^N$ arbitrarily chosen, and $\{v^1, \ldots, v^N\}$ a basis of $\mathbb{R}^N$ biorthogonal to $\{u^1, \ldots, u^K\}$, i.e., $(v^j, u^i) = \delta_{ij}$ for $N \ge j \ge 1$, $K \ge i \ge 1$, while $(v^j, v^i) = \delta_{ij}$ for $N \ge j, i \ge K+1$.

Proof: From the construction it is obvious that $Wu^i = x^i \to_t u^i$. Conversely, since any SPS satisfies $Wu^i \to_t u^i$, we define $x^i \triangleq Wu^i$ for $K \ge i \ge 1$ and choose arbitrarily $x^{K+1} \triangleq Wv^{K+1}, \ldots, x^N \triangleq Wv^N$. Classical results on matrix decomposition guarantee that $W$ is now uniquely determined by $\sum_{l=1}^N x^lv^{lT}$.

Remarks:
1) The assumption that the vectors $u^1, \ldots, u^K$ are linearly independent holds with probability larger than $1 - O(1/\sqrt{N})$ for $K \le N$ (cf. [8]). Furthermore, for linear dependencies of the form $u^i = -u^j$, $i \ne j$ (which are conjectured in [8] to account for most of the dependencies between $\pm 1$ vectors for $N \to \infty$), one can ignore $u^j$ and still obtain an SPS, provided that $t = 0$ and $x^i$ has nonzero components. This is due to the fact that $x^i = Wu^i \to_0 u^i$ implies that $Wu^j = -x^i \to_0 u^j$ since $x^i$ has nonzero components.

2) Singular SPS's can be obtained by choosing $x^N = 0$, and thus trivial systems with nonsingular matrices are eliminated. The worst-case capacity of singular SPS's is, therefore, at least $N$ (since we can choose $t$ such that $0 \to_t u^N$, which will guarantee that $x^N = 0 \to_t u^N$). Results in this spirit appear also in [21] and [22].

3) Whenever the sum-of-outer-products construction has $\{u^1, \ldots, u^K\}$ as its stable states (which it often does, as we shall see later on), it is also an SPS with $x^i \triangleq \sum_{j=1}^K u^j(u^j, u^i)$.

4) For $x^{K+1} = \cdots = x^N = 0$, $t = 0$, $x^i = u^i$, $K \ge i \ge 1$, a particular SPS with $W = \sum_{i=1}^K u^iv^{iT}$ results. Here $W$ is the projection operator onto the subspace $\operatorname{span}\{u^1, \ldots, u^K\}$. In this case $W = W^T$ is a nonnegative definite (NND) matrix with $K$ eigenvalues that are 1, and the rest zeros. This is one of the spectral algorithms (cf. [4], [15], [21]-[23]) in which $x^{K+1} = \cdots = x^N = 0$, and $x^i$ is (positively) proportional to $u^i$ for $i \le K$.
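In code, this projection rule amounts to a single pseudo-inverse; the sketch below (ours, assuming numpy) builds $W = U(U^TU)^{-1}U^T$ from the memories stacked as columns of $U$ and verifies property A:

```python
import numpy as np

def projection_sps(U):
    """Projection operator onto span{u^1, ..., u^K}, the columns of U:
    W = U (U^T U)^{-1} U^T, i.e., the spectral rule with x^i = u^i."""
    return U @ np.linalg.solve(U.T @ U, U.T)

N, K = 64, 8
rng = np.random.default_rng(0)
U = rng.choice([-1.0, 1.0], size=(N, K))   # almost surely linearly independent

W = projection_sps(U)
assert np.array_equal(np.where(W @ U >= 0, 1, -1), U)  # W u^i ->_0 u^i for all i
```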

The SPS concept is a natural generalization of the sum-of-outer-products construction. We shall now prove that the SPS concept is also a generalization of the class of rules with forbidden self-connections, when we allow a data-dependent location of the zero element in the main diagonal of $W$.


Proposition 6: Given $u^1, \ldots, u^K$, $K < N$, which are linearly independent $\pm 1$ vectors, an SPS (with $t = 0$) exists for $\{u^1, \ldots, u^K\}$ with at least one zero main-diagonal element.

Proof: Consider the SPS having $x^i = u^i$, $i = 1, \ldots, K$, and $t = 0$. Every matrix $W$ satisfying
$$[u^1, \ldots, u^K]^Tw_i = [u^1_i, \ldots, u^K_i]^T$$
for $i = 1, \ldots, N$ (where $w_i$ denotes the $i$th row of $W$) represents such an SPS. A matrix $W$ with $w_{ii} = 0$ which is such an SPS exists iff the $i$th column of $[u^1, \ldots, u^K]^T$ is in the subspace generated by the other $(N-1)$ columns of the matrix. Since there are $N$ columns in this matrix, whose rank is $K < N$, at least one column is indeed in the subspace generated by the other $(N-1)$ columns, and the proof is thus complete.

Note: This particular construction may imply $x^i \ne 0$ for $i > K$.

We note that a trivial generalization of the SPS concept allows us to generate networks with desired periodic attractors, an issue addressed in [4] and [14]. In the parallel mode we simply choose the nonsymmetric matrix $W$ by $Wu^i = x^i$, $i = 1, \ldots, L$, with $x^i \to_t u^{i+1}$, $i = 1, \ldots, L-1$, and $x^L \to_t u^1$, to create the limit cycle $u^1 \to u^2 \to \cdots \to u^L \to u^1$ (with $L \le N$). In the deterministic serial mode the limit cycle is composed of adjacent states with Hamming distance one between them. These states are ordered according to the order in which the components are updated. The constraints imposed on $W$ are of the form $(Wu^i)_j \to_{t_j} u^{i+1}_j$ for the one component ($j$) which is updated when the network is in state $u^i$, and they allow for the design of limit cycles up to length $N^2$.
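A parallel-mode cycle can be programmed with the same dual-basis machinery; here is a minimal sketch (ours, assuming numpy) in which each memory is mapped to its successor, so the parallel dynamics traverse the stored sequence:

```python
import numpy as np

def cycle_matrix(U):
    """Nonsymmetric W with W u^i = u^{(i+1) mod L} for the columns of U;
    the target x^i = u^{i+1} trivially satisfies x^i ->_0 u^{i+1}."""
    X = np.roll(U, -1, axis=1)                 # successor of each memory
    return X @ np.linalg.solve(U.T @ U, U.T)   # X V^T, V the dual basis

N, L = 32, 4
rng = np.random.default_rng(1)
U = rng.choice([-1.0, 1.0], size=(N, L))
W = cycle_matrix(U)

u = U[:, 0]
for step in range(2 * L):                      # goes around the cycle twice
    u = np.where(W @ u >= 0, 1, -1)
    assert np.array_equal(u, U[:, (step + 1) % L])
```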

As we have deduced from Propositions 1-4, the worst-case capacity of the two structures just studied is quite small. However, when the memories $u^i$ are randomly chosen, the capacity is usually much larger, as we now show.

Proposition 7: Choose independently $(N+1) \times N$ random variables (rv's) $u^i_j = \pm 1$, with $\Pr(u^i_j = 1) = \Pr(u^i_j = -1) = 1/2$, and let the set $\{u^1, \ldots, u^{N+1}\}$ be composed of the values of these rv's. Define $P(N+1, N) \triangleq 2^{-(N+1)N} \cdot \#\{\text{sets which are stable for at least one } (W, t) \text{ pair with } w_{ii} = 0,\ N \ge i \ge 1\}$. Then
$$P(N+1, N) \ge 1 - \frac{2N}{2^{N+1}} - O\!\left(\frac{1}{\sqrt{N}}\right);$$
i.e., as $N \to \infty$, the probability that networks without self-connections can store at least $(N+1)$ random vectors approaches 1.

Proof: Let
$$U = \begin{bmatrix} u^1_1 & \cdots & u^1_N & 1\\ \vdots & & \vdots & \vdots\\ u^{N+1}_1 & \cdots & u^{N+1}_N & 1 \end{bmatrix}$$
be an $(N+1) \times (N+1)$ matrix of $\pm 1$'s. Define $\tilde{w}^i = [w_{i1}, \ldots, w_{iN}, -t_i]^T$, an (unknown) $(N+1)$-dimensional real-valued vector. In general, $u^1, \ldots, u^{N+1}$ are stable states of every $(W, t)$ pair for which $U\tilde{w}^i \to_0 [u^1_i, \ldots, u^{N+1}_i]^T$, $N \ge i \ge 1$. The constraint $w_{ii} = 0$ is equivalent to $\tilde{w}^i_i = 0$ for all $i$. However, since $U(\operatorname{Adj}U) = \det U \cdot I$, if $\det U \ne 0$ and $0 = \sum_{l=1}^{N+1}(\operatorname{Adj}U)_{il}u^l_ip_l$ for some positive $\{p_l\}_{l=1}^{N+1}$, then
$$\tilde{w}^i = \frac{1}{\det U}(\operatorname{Adj}U)\left[u^1_ip_1, \ldots, u^{N+1}_ip_{N+1}\right]^T$$
has the required properties. Therefore,
$$1 - P(N+1, N) \le \Pr\left(\bigcup_{i=1}^{N+1}A_i\right),$$
with $A_{N+1}$ the event $\{\det U = 0\}$, and $A_i$, $N \ge i \ge 1$, the events
$$\left\{\sum_{l=1}^{N+1}(\operatorname{Adj}U)_{il}u^l_ip_l \ne 0 \text{ for every positive } p_l,\ N+1 \ge l \ge 1\right\}.$$
Using a specific instance of Bonferroni's inequalities, also known as the union bound,
$$\Pr\left(\bigcup_{i=1}^{N+1}A_i\right) \le \sum_{i=1}^{N+1}\Pr(A_i),$$
the proposition follows from the estimates
$$P(A_{N+1}) = O\!\left(\frac{1}{\sqrt{N}}\right)$$
and
$$P(A_i) \le \frac{2}{2^{N+1}}, \qquad i = 1, \ldots, N.$$
The bound on $P(A_{N+1})$ was obtained in [8]. Note that the fixed column of $+1$'s does not affect the probability that $\det U = 0$, which is the same as for a random $\pm 1$ matrix. To bound $P(A_i)$, note that the $(\operatorname{Adj}U)_{il}$, $N+1 \ge l \ge 1$, are independent of $\{u^l_i\}_{l=1}^{N+1}$, and that the event $A_i$ occurs iff the $(\operatorname{Adj}U)_{il}u^l_i$ are either all positive (for $N+1 \ge l \ge 1$) or all negative. Therefore, there are at most two out of the $2^{N+1}$ possible values of $\{u^1_i, \ldots, u^{N+1}_i\}$ for which $A_i$ can occur.

Note: Proposition 7 implies that $\lim_{N\to\infty,\,K\le N+1}P(K, N) = 1$. It is easy to obtain the converse result, $\lim_{N\to\infty,\,K\ge(2+\epsilon)N,\,\epsilon>0}P(K, N) = 0$, using the hyperplane counting arguments of [16]. The proof proceeds along the same lines as in [5]. Choose $\{u^1, \ldots, u^K\}$ except for their first component. Now since $w_{11} = 0$, $(Wu^l - t)_1$ is independent of $u^l_1$, $1 \le l \le K$, and can be written as the inner product between $\tilde{w}^1 \triangleq [-t_1, w_{12}, \ldots, w_{1N}]$ and $\tilde{u}^l$. Here $\tilde{u}^l$ is $u^l$ except that $\tilde{u}^l_1 = 1$ independently of the original value of $u^l_1$. The $2^K$ possible values of $u^1_1, \ldots, u^K_1$ are selected with equal probability. Thus
$$P(K, N) \le 2^{-(K-1)} \cdot \#\{\text{homogeneously linearly separable dichotomies of } \tilde{u}^1, \ldots, \tilde{u}^K\} \le 2^{-(K-1)}\sum_{l=0}^{N-1}\binom{K-1}{l}$$
(cf. [16, Sect. IV]). Since for $K \ge (2+\epsilon)N$ we have $\lim_{N\to\infty}2^{-(K-1)}\sum_{l=0}^{N-1}\binom{K-1}{l} = 0$, the proof is complete.

For the sum-of-outer-products construction, a similar result can be

obtained as follows.

Proposition 8: Choose the set $\{u^1, \ldots, u^K\}$ at random as in Proposition 7, and define $P(K, N) = 2^{-KN} \cdot \#\{\text{sets which are stable for } W = \sum_{i=1}^K u^iu^{iT},\ t = 0\}$. Then
$$P(K, N) \ge 1 - NKe^{-(N+K-1)^2/2(N-1)(K-1)};$$
i.e., for
$$(K-1) \le \frac{(N-1)}{4\ln N},$$
$\lim_{N\to\infty}P(K, N) = 1$.

Remark: This result coincides with the more detailed results on this subject in [6], which also include the converse.

Proof: The proof follows the ideas in [9]. $P(K, N) = 1 - \Pr(\bigcup_{i=1}^K\bigcup_{j=1}^N A_{ij})$, where $A_{ij}$ is the event that $(Wu^i)_j$ does not have the sign of $u^i_j$. Using the union bound, the proposition will follow from the bound $P(A_{ij}) \le e^{-(N+K-1)^2/2(N-1)(K-1)}$. Without loss of generality, take $i = j = 1$, and rewrite $(Wu^1)_1 = u^1_1(N+K-1) + \sum_{i=2}^K u^i_1(\tilde{u}^i, \tilde{u}^1)$, where $\tilde{u}^i$ is the suffix of $u^i$ with the first component omitted. Therefore, given $u^1_1$, $(Wu^1)_1$ is the sum of $u^1_1(N+K-1)$ and $(K-1)(N-1)$ independent $\pm 1$ rv's. Therefore, $P(A_{11})$ does not depend on $u^1_1$, and $P(A_{11}) \le \Pr[\{\text{sum of } (N-1)(K-1)\ \pm 1 \text{ rv's}\} \ge (N+K-1)] \le e^{-(N+K-1)^2/2(N-1)(K-1)}$ (cf. [10, p. 58]). This is essentially Hoeffding's inequality.
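The probability in Proposition 8 is easy to probe empirically; this Monte Carlo sketch (ours, assuming numpy) estimates the fraction of random memory sets satisfying property A under the sum-of-outer-products rule in the $K \approx (N-1)/4\ln N$ regime:

```python
import numpy as np

def all_stable(U):
    """Property A for W = sum of outer products, t = 0."""
    W = U @ U.T
    return np.array_equal(np.where(W @ U >= 0, 1, -1), U)

N = 64
K = max(2, round((N - 1) / (4 * np.log(N))))
rng = np.random.default_rng(2)
trials = 200
hits = sum(all_stable(rng.choice([-1.0, 1.0], size=(N, K))) for _ in range(trials))
print(f"estimated P(K={K}, N={N}) ~ {hits / trials:.2f}")  # tends to 1 as N grows
```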

Although Propositions 7 and 8 imply that, as $N \to \infty$, a reasonable storage capacity is achieved with probability approaching one, this does not diminish the importance of the SPS concept, since the number of exceptional sets grows exponentially with $N$ (with a smaller exponent, however, than the overall number of sets).

III. CAPACITY AND ERROR CORRECTION

In the preceding section we studied the capacity when there are no errors in the input probe. We now investigate the error-correction potential of the network; i.e., we initiate it with a randomly chosen state that is close to a specific desired memory and compute the probability of direct convergence to this vector.

We shall consider throughout a specific subclass of SPS's in which $W$ is of rank $K$. We denote by $e_{ij}$ the event $(WQ)_j \nrightarrow_{t_j} u^i_j$ (where $Q$ is a distorted version of $u^i$), i.e., the "error" in updating the $j$th component with respect to the $i$th desired memory.

Proposition 9: For SPS's with $x^{K+1} = \cdots = x^N = 0$, and initial state components $Q_1, \ldots, Q_N$ that are independent rv's with $\Pr(Q_j \ne u^i_j) = \rho_i < 1/2$,
$$\Pr(e_{ij}) \le \exp\left\{-\frac{\left[\left((1-2\rho_i)x^i_j - t_j\right)u^i_j\right]^2}{2\sum_{r=1}^K\sum_{m=1}^K x^r_j(A^{-1})_{rm}x^m_j}\right\},$$
where the $(r, m)$th element of the $K \times K$ matrix $A$ is $(u^r, u^m)$, and we assumed that $[(1-2\rho_i)x^i_j - t_j]u^i_j > 0$.

Proof: $e_{ij}$ occurs only if $\sum_{l=1}^N(Q_lu^i_l)(-w_{jl}u^i_lu^i_j) \ge -t_ju^i_j$. Let $Y_l$ be the rv $(Q_lu^i_l)(-w_{jl}u^i_lu^i_j)$. Then $E(Y_l) = -(1-2\rho_i)w_{jl}u^i_lu^i_j$ and $|Y_l| = |w_{jl}|$. Thus, applying the results in [10, p. 58], we obtain the upper bound
$$\Pr(e_{ij}) \le \exp\left\{-\frac{\left[\left((1-2\rho_i)x^i_j - t_j\right)u^i_j\right]^2}{2\sum_{l=1}^N w_{jl}^2}\right\}.$$
However,
$$\sum_{l=1}^N w_{jl}^2 = (WW^T)_{jj} = \sum_{r=1}^K\sum_{m=1}^K x^r_j(v^m, v^r)x^m_j,$$
where, from the definition of $\{v^1, \ldots, v^K\}$, we have $(v^m, v^r) = (A^{-1})_{rm}$.

Example 1: $\{u^1, \ldots, u^K\}$ are orthogonal, i.e., $A^{-1} = (1/N)I$. Then, for $t = 0$ and $x^i \triangleq u^i/(1-2\rho_i)$, one obtains
$$\Pr(e_{ij}) \le \exp\left[-\frac{N}{2\sum_{r=1}^K(1-2\rho_r)^{-2}}\right],$$
and from the union bound
$$\Pr(WQ \nrightarrow_0 u^i) \le N\exp\left[-\frac{N}{2\sum_{r=1}^K(1-2\rho_r)^{-2}}\right]$$
(as $WQ \to_0 u^i$ is the complement of $\bigcup_{j=1}^N e_{ij}$). Thus, as long as
$$\sum_{r=1}^K\frac{1}{(1-2\rho_r)^2} < \frac{N}{2\ln N},$$
$\Pr(WQ \to_0 u^i) \to_{N\to\infty} 1$. The case of all the $\rho_i$ being equal coincides with the results in [6].


Example 2: Let $u^1 = (-1, +1, +1, \ldots, +1)$, $u^2 = (1, 1, 1, \ldots, 1)$, and add to these vectors an almost orthogonal third vector $u^3 = (-1, 1, -1, 1, -1, \ldots, 1, -1)$ (we assume that $N = 2L+1$ is odd). It is easy to verify that for any SPS with $t = x^4 = \cdots = x^N = 0$, $WQ$ will be a function of only the values of $Q_1$, $(1/L)(Q_2 + Q_4 + \cdots + Q_{N-1})$, and $(1/L)(Q_3 + Q_5 + \cdots + Q_N)$. Let these three values be denoted by $\hat{Q}_1, \hat{Q}_2, \hat{Q}_3$. We have thus obtained an "error-correcting code" with respect to (w.r.t.) all the components except the first one. This "correcting" capability by majority voting is possible since all three stable vectors contain two large sets of identical components. Evaluating $WQ$ in the basis $\{u^1, u^2, u^3\}$, we obtain $2WQ = (\hat{Q}_3 - \hat{Q}_1)x^1 + (\hat{Q}_2 + \hat{Q}_1)x^2 + (\hat{Q}_2 - \hat{Q}_3)x^3$. Now the projection rule ($x^1 = u^1$, $x^2 = u^2$, $x^3 = u^3$) will yield $WQ = [\hat{Q}_1, \hat{Q}_2, \hat{Q}_3, \hat{Q}_2, \ldots, \hat{Q}_3]^T$, i.e., there is no correction of the first component even around $u^3$, which is "far" from the problematic pair $u^1, u^2$. This is an example of the "mixing" property of this memory. Here a cluster of desired memories having small Hamming radius affects the shapes and sizes of the domains of attraction of the other, almost orthogonal, desired memory vectors. Note, however, that by taking $x^3 = \mu u^3$ with $\mu > 2$ we can correct an error in the first component for vectors which are close to $u^3$. Let $(u, Q)_+$ denote a modified inner product, which ignores the first component and is normalized through division by $N - 1 = 2L$. For $(u^3, Q)_+ > 0$, the first component is corrected as long as $\mu > 1 + 1/(u^3, Q)_+$. On the other hand, if $(u^2, Q)_+ > 0$, the vector $Q$ is directly attracted by $u^1$ or $u^2$ only when $\mu < (u^2, Q)_+/|(u^3, Q)_+|$. Fig. 1 illustrates the domains of direct attraction of $\{u^1, u^2\}$ and $u^3$ in the $(u^2, Q)_+$ versus $(u^3, Q)_+$ plane. As $\mu$ increases, the area occupied by $u^3$ increases as the area occupied by $\{u^1, u^2\}$ decreases (these areas are equal for $\mu = 1 + \sqrt{2}$). In the unmarked area between these domains of attraction there is indirect convergence to $\pm u^3$. In a serial mode of operation, part of the domain of direct attraction to $\{u^1, u^2\}$ may lead instead to indirect convergence to $\pm u^3$, depending on the order in which the components are updated.

[Fig. 1. Domains of direct attraction for Example 2 in the reduced state space, $(u^3, Q)_+$ versus $(u^2, Q)_+$.]

As we have noted in the previous example, it may be hard in an SPS to obtain a complete Hamming sphere of attraction even around an isolated desired memory. Therefore, to measure the performance of the SPS quantitatively, we shall determine the fraction of components for which convergence occurs. We consider infinite sequences of desired Hamming-sphere radii $\rho = \{\rho_1, \ldots, \rho_K, \ldots\}$ ($\rho_i < 1/2$) for which there exists $\rho^* < 1/2$ such that (s.t.) for every $K$ large enough, $\infty > 1/(1-2\rho^*)^2 \ge E(z_K^2)$, with $z_K$ being the rv taking the $K$ equally probable values $1/(1-2\rho_i)$, $i = 1, \ldots, K$. For example, when all the memory vectors have the same desired radius of attraction, $\rho^* = \rho_i$. We hereafter concentrate on the subclass of spectral algorithms $t = x^{K+1} = \cdots = x^N = 0$, $x^i = (1/(1-2\rho_i))u^i$, $K \ge i \ge 1$, and bound the maximal fraction of components for which $\Pr(e_{ij}) \ge \epsilon$.
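Operationally this subclass is one line of linear algebra; the sketch below (ours, assuming numpy) builds the spectral $W$ for a common radius $\rho$ and estimates $\Pr(e_{ij})$ by sampling probes around one memory:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, rho = 128, 6, 0.1

U = rng.choice([-1.0, 1.0], size=(N, K))
X = U / (1 - 2 * rho)                        # x^i = u^i / (1 - 2 rho_i)
W = X @ np.linalg.solve(U.T @ U, U.T)        # spectral rule, t = x^{K+1..N} = 0

errs, trials = np.zeros(N), 500
for _ in range(trials):
    flips = rng.random(N) < rho              # independent component errors
    Q = np.where(flips, -U[:, 0], U[:, 0])   # distorted version of u^1
    errs += np.where(W @ Q >= 0, 1, -1) != U[:, 0]
print("max_j Pr(e_1j) ~", errs.max() / trials)
```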

Proposition 10: For the projection rule (i.e., $\rho_i = \rho^*$ for all $i$),
$$\gamma_1 \triangleq \frac{1}{N}\#\{j : \Pr(e_{ij}) \ge \epsilon\} \le \frac{2K\ln(1/\epsilon)}{N(1-2\rho^*)^2}$$
and
$$\gamma_2 \triangleq \frac{1}{N}\#\{j : \Pr(\exists i,\ e_{ij}) \ge \epsilon\} \le \frac{2K\ln(K/\epsilon)}{N(1-2\rho^*)^2}$$
for any choice of $\{u^1, \ldots, u^K\}$ (which are linearly independent).

Proof: According to Proposition 9,
$$\Pr(e_{ij}) \le \exp\left[-\frac{1}{2(WW^T)_{jj}}\right].$$
By the union bound,
$$\Pr(\exists i,\ e_{ij}) \le K\exp\left[-\frac{1}{2(WW^T)_{jj}}\right].$$
Thus the foregoing proposition follows from
$$\#\{j : (WW^T)_{jj} \ge a\} \le \frac{K}{a(1-2\rho^*)^2}, \qquad a > 0.$$

Since $WW^T$ is a nonnegative definite (NND) matrix, $(WW^T)_{jj} \ge 0$, and
$$\sum_{j=1}^N(WW^T)_{jj} = \operatorname{tr}(WW^T) = \frac{1}{(1-2\rho^*)^2}\operatorname{tr}\left(\sum_{l=1}^K\sum_{r=1}^K u^lv^{lT}v^ru^{rT}\right) = \frac{K}{(1-2\rho^*)^2},$$
where the last equality is one of the properties of the biorthogonal basis, and the others follow by the properties of the trace operator. Now, given that $(WW^T)_{jj} \ge 0$ and $\sum_{j=1}^N(WW^T)_{jj} = K/(1-2\rho^*)^2$, at most $K/a(1-2\rho^*)^2$ of these $N$ values can be above $a$, for every $a > 0$.

Therefore, at least for constant $\rho < 1/2$, as long as $(K/N) \to_{N\to\infty} 0$, the fraction of components with "error probability" above $\epsilon > 0$ approaches zero as $N \to \infty$ (results of the same spirit for the sum-of-outer-products structure appear in [25]). Whenever $(K\ln K/N) \to_{N\to\infty} 0$, even the fraction of components with larger than $\epsilon > 0$ "error probability" for at least one memory vector approaches zero as $N \to \infty$.

Assuming that the vectors $\{u^1, \ldots, u^K\}$ are composed of $N \times K$ independent $\pm 1$ random variables (as done in [6]), we can obtain sharper upper bounds on $\gamma_1$ and $\gamma_2$ (which are now rv's).

We will base part of our proof on a well-known yet unproved conjecture on the limiting distribution of eigenvalues of random matrices.

Conjecture: Let $(u^i, u^l)$ be the $(i, l)$th element of the $K \times K$ matrix $A$; then
$$\lim_{N\to\infty}\lambda_{\min}\left(\frac{A}{N}\right) = (1-\sqrt{\gamma})^2$$
almost surely, provided that $\lim_{N\to\infty}(K/N) = \gamma$, $0 \le \gamma < 1$.

This conjecture is easily proved when
$$\lim_{N\to\infty}\frac{K^2\ln K}{N} = 0$$
(for any distribution of the $u^i_j$ rv's). The complete proof is known only for Gaussian $u^i_j$ rv's (see [26]). Survey papers on this issue can be found in [27].

Proposition 11: For
$$K \le \frac{N(1-2\rho^*)^2}{2\psi(N)},$$
$\{u^1, \ldots, u^K\}$ chosen at random (and $\rho_i$ not necessarily equal), the following hold:

a) if $\psi(N) \ge (1+\delta)\ln(K/\epsilon)$, $\epsilon > 0$, $\delta > 0$ arbitrarily small, then $\lim_{N\to\infty}[\gamma_2] = 0$, almost surely;
b) if $\psi(N) \ge (1+\delta)\ln N$, $\delta > 0$, then $\lim_{N\to\infty}\Pr(WQ \to_0 u^i) = 1$ for every fixed $1 \le i \le K$;
c) if $\psi(N) \ge (1+\delta)\ln NK$, $\delta > 0$, then $\lim_{N\to\infty}\Pr(\text{for all } i,\ WQ \to_0 u^i) = 1$.

Remark: We give here a short proof based on the previous conjecture; a longer version, which does not use this conjecture but is limited to a weaker statement of part a), is given in Appendix II.

Proof: From Proposition 9, the upper bound on $\Pr_u(e_{ij})$ (i.e., the probability of the event $e_{ij}$ given $\{u^1, \ldots, u^K\}$) is in our case
$$\Pr_u(e_{ij}) \le \exp\left[-\frac{1}{2x_j^TA^{-1}x_j}\right],$$
with $x_j \in \mathbb{R}^K$ the vector whose $i$th component is $u^i_j/(1-2\rho_i)$. Now,
$$x_j^TA^{-1}x_j \le (x_j, x_j)\lambda_{\max}(A^{-1}) \le \frac{K}{(1-2\rho^*)^2N\lambda_{\min}(A/N)},$$
and using our assumption on $K$, we get
$$\Pr_u(e_{ij}) \le \exp\left[-\psi(N)\lambda_{\min}\left(\frac{A}{N}\right)\right].$$
a) Thus our lower bound on $\psi(N)$ and the conjecture imply that, almost surely,
$$\max_{1\le i\le K,\ 1\le j\le N}\left[\Pr_u(e_{ij})\right] < \epsilon/K$$
for all $N$ large enough, which implies $\lim_{N\to\infty}\gamma_2 = 0$, almost surely.
b) By the union bound,
$$\Pr_u(WQ \nrightarrow_0 u^i) \le N\exp\left[-\psi(N)\lambda_{\min}\left(\frac{A}{N}\right)\right].$$
The lower bound (we assumed) on $\psi(N)$, and the previously stated conjecture (for $\gamma = 0$), now imply that
$$\lim_{N\to\infty}\Pr_u(WQ \to_0 u^i) = 1, \qquad \text{almost surely}.$$
However, as $0 \le \Pr_u(WQ \to_0 u^i) \le 1$, we also have
$$\lim_{N\to\infty}E\left[\Pr_u(WQ \to_0 u^i)\right] \triangleq \lim_{N\to\infty}\left[\Pr(WQ \to_0 u^i)\right] = 1.$$
c) In this case, our assumption and the union bound give
$$\Pr_u(\exists i : WQ \nrightarrow_0 u^i) \le NK\exp\left[-\psi(N)\lambda_{\min}\left(\frac{A}{N}\right)\right].$$
Following the lines of the proof of part b) we get the desired conclusion.

Remarks:
1) The probabilistic model for the initial vector $Q$ was chosen in order to simplify the proofs of the various propositions. Nevertheless, the value of $(Q, u^i)$ is a binomial rv, with mean $(1-2\rho_i)N$ and standard deviation less than $\sqrt{N}$. Thus for $N \to \infty$ the normalized correlation of almost all initial vectors with $u^i$ is $(1-2\rho_i)$, and, therefore, this model is asymptotically equivalent to choosing $Q$ having Hamming distance $\rho_iN$ from $u^i$.

2) Parts b) and c) of the foregoing proposition are comparable with the capacity results of [6], whereas part a) corresponds to the $\epsilon$-capacity of [25] and to the statistical-mechanics type of results (cf., for example, [28]).

3) For completeness we investigate an extended sum-of-outer-products scheme, i.e.,
$$W = \sum_{i=1}^K\frac{1}{1-2\rho_i}u^iu^{iT},$$
and obtain very similar results; i.e., we have the following.

Page 8: On the capacity of associative memories with linear threshold functions

716 IEEE TRANSACTIONS ON INFORMATION THEORY. VOL. 35. NO. 4. JULY 19x9

Proposition 12: Let
$$W = \sum_{i=1}^K\frac{1}{1-2\rho_i}u^iu^{iT},$$
$\{u^1, \ldots, u^K\}$ chosen at random, and
$$K \le \frac{N(1-2\rho^*)^2}{2\psi(N)}.$$
Then b) and c) of Proposition 11 hold. Part a) of Proposition 11 is replaced by the following: a) for $K \le \epsilon N$, $\max_{i,j}[\Pr(e_{ij})] \le \exp[-(1-2\rho^*)^2/2\epsilon(1+\epsilon)]$.

Proof: Without loss of generality, consider $(WQ)_ju^1_j$. Define $\hat{Q}_l \triangleq u^1_lQ_l$, and $\hat{u}^i_l \triangleq u^1_ju^i_ju^i_lu^1_l$ for $l \ne j$, $i \ne 1$; then
$$(WQ)_ju^1_j = \frac{(u^1, Q)}{1-2\rho_1} + \sum_{i=2}^K\frac{1}{1-2\rho_i}\left(u^1_jQ_j + \sum_{l\ne j}\hat{u}^i_l\hat{Q}_l\right).$$
Here the $\{u^i_l\}_{i\ne 1}$ are independent $\pm 1$ rv's with zero mean, and the $\{\hat{Q}_l\}_{l=1}^N$ are (likewise) $N$ independent rv's, which are independent of $u^i$, $i \ne 1$. Thus the $\hat{u}^i_l$, $l \ne j$, $i \ne 1$, are also a set of $(N-1) \times (K-1)$ independent $\pm 1$ rv's with zero mean, which are independent of $\{\hat{Q}_l\}_{l=1}^N$. Using the fact that $E(\hat{Q}_l) = (1-2\rho_1)$ and the bound of [10, p. 58] (used also in Proposition 8), we obtain
$$\Pr(e_{1j}) \le \exp\left[-\frac{N^2(1-2\rho^*)^2}{2K(N-1+K)}\right].$$
When we substitute the given bound on $K$, we get $\Pr(e_{ij}) \le \exp\{-\psi(N)/[1+(K-1)/N]\}$. Since this bound holds for any $e_{ij}$, part a) follows immediately, whereas b) and c) are obtained by applying the union bound (consider the proof of Proposition 11).

Remarks:
1) Proposition 12 also holds for the probabilistic model with the Hamming distance of $Q$ and $u^i$ being exactly $\rho_iN$ when estimating $e_{ij}$ (the model used in [6]). The main lines of the proof are the same, with some details changed to be more in the spirit of the proof of Proposition 8.

2) Bounds as in the previous proposition can be derived also for the more general case $W = \sum_{i=1}^K\beta_iu^iu^{iT}$, with $\beta_i > 0$ arbitrary positive constants. However, the choice $\beta_i = 1/(1-2\rho_i)$ gives the "best" set of upper bounds on the error probabilities.

3) Proposition 12, like the results of [6], guarantees that for almost every set $\{u^1, \ldots, u^K\}$ almost all the Hamming sphere of radius $\rho_iN$ is directly attracted by $u^i$. However, in general, not all the Hamming sphere is directly attracted, and in essence only when
$$(1-2\rho_i)N \ge \sum_{j\ne i}\left(N - \left||(u^j, u^i)| - (1-2\rho_i)N\right|\right) \tag{3.1}$$
is the whole Hamming sphere directly attracted (a similar result is stated in [21] and proved in [24]). Thus for $K \to \infty$, $\rho_i$ drops to zero as $N \to \infty$. The proof that (3.1) is a sufficient condition follows from the fact that $|(u^j, Q)| \le N - \left||(u^j, u^i)| - |(u^i, Q)|\right|$, and that for $(u^i, Q) = N(1-2\rho_i)$ we have $WQ \to_0 u^i$ provided that $(N - 2\rho_iN) \ge \sum_{j\ne i}|(u^j, Q)|$.

The proof that it is also a necessary condition is based on the following construction: $u^1$ is all ones, and $u^j$, $j = 2, \ldots, K$, contains $(1/2)[N + (u^1, u^j)]$ consecutive $+1$'s and $(1/2)[N - (u^1, u^j)]$ consecutive $-1$'s. If $(u^1, u^j) > 0$, then the $-1$'s precede the $+1$'s; otherwise, the $+1$'s precede the $-1$'s. The probe vector $Q$ contains $\rho_1N$ consecutive $-1$'s followed by $N - \rho_1N$ $+1$'s. Thus its distance from $u^1$ is exactly $\rho_1N$. Now $WQ \to_0 u^1$ only if $(WQ)_1 \ge 0$; careful investigation reveals that this is equivalent to (3.1). We assumed that it is possible to choose $(u^j, u^k)$ for $j \ne k \ne 1$ as in the previous construction while the $(u^1, u^j)$, $j = 1, \ldots, K$, consist of $K$ distinct numbers (so that $u^j \ne u^k$ whenever $j \ne k$). If this is not the case, the corresponding vectors are perturbed at the boundary between the $-1$ and $+1$ blocks to guarantee that they are distinct. This perturbation has a minor effect on (3.1).

IV. CONCLUSION AND OPEN PROBLEMS

In this work we analyzed some of the important features of various known constructions of associative memories based on linear threshold functions. We dealt with two important features:

a) the ability to select an arbitrary set of desired memory vectors and design a network for this set (provided that its size, denoted throughout by $K$, is less than some critical value, which we shall denote as the static capacity $C$);
b) the sizes and shapes of the domains of attraction of the desired memory vectors and their relation to various "design parameters".

Previous research addressed some of these issues (cf. [1], [5], [6], [19]-[25], [28]), but most of our results are new or generalize known work in an important way.

The worst-case static capacity of two common structures of connectivity matrix in those memories is quite poor ($C = 1, 2$; cf. [19], [20]), and far below the upper bound presented in [5]. We have shown that even when the desired memories are constrained to have proper Hamming distances (in the range $(\alpha N, (1-\alpha)N)$ with $0 < \alpha < 1/2$ fixed), the worst-case static capacity of these structures is limited to $C(\alpha) = \lceil 1/(1-2\alpha)\rceil$, i.e., independent of the size $N$ of the network. This property carries over to higher order threshold functions, as considered in [17]. Therefore hyperplane counting arguments should be avoided in this context.

To overcome this drawback, a class of spectral algorithms was considered in [4], [21]-[23], with static capacity at least $O(N)$. We characterized all the programming rules with this property and termed them SPS's (sign-preserving systems). The converse of Proposition 5 (upper bounding the static capacity of an SPS) was not proved, but can be deduced once the trivial identity systems are excluded.

In spite of the poor worst-case static capacity, previous experimental use of the common constructions proved to be successful (cf. [2]-[4]). To explain this, we analyzed the static capacity for randomly chosen desired memories. Although the exceptional set of "bad memory sets" grows exponentially with the size of the network ($N$), its size becomes negligible w.r.t. $2^{KN}$, which is the number of all possible sets of $K$ $\pm 1$ vectors of length $N$. Therefore, with probability that approaches 1 as $N \to \infty$, the static capacity is of order $(N/4\ln N)$ for the sum-of-outer-products (where the converse of Proposition 8 appears in [6]), and of order $\beta N$, $1 \le \beta \le 2$ (yet unknown), for networks with zero self-connections.

The dynamic capacity is related to the existence of domains of attraction around each desired memory. As with static capacity, there are two different ways to analyze the dynamic capacity. The first approach is to investigate the behavior of the network designed for an arbitrary set of memories. Since these vectors might be close to each other, it is not possible, in general, to guarantee that there will be a complete Hamming sphere of direct convergence around each one of them. However, choosing an initial vector $Q$ at random near a desired memory $u^i$, we derived for a subclass of SPS's an exponential upper bound on the probability of "error" $e_{ij}$ in the $j$th component at the first step.

Two extremal examples of sets of desired memories were then analyzed in detail. The first example is that of orthogonal vectors, in which case the "optimal" SPS is an extended sum-of-outer-products construction. Our upper bound then leads to some of the results reported in [6] regarding the Hamming spheres of attraction in this construction. In the second example we considered three desired memories. Two of them differ only in the first coordinate, whereas the third one is almost orthogonal to this pair. As expected, the domains of attraction of the first two vectors are mixed together. However, the existence of this pair of vectors also affects the shape of the domain of attraction of the third one, although it is far from them. In view of this example, we were only able to bound the relative fraction of exceptional components (denoted by $\gamma_1$ and $\gamma_2$, depending on the type of convergence desired).

Better performance was obtained when the average dynamic capacity was evaluated. In this case the desired memories were randomly chosen, and the probability of systems having prescribed spheres of attraction was determined. We compared the performance of the spectral algorithms with the extended sum-of-outer-products scheme via Propositions 11 and 12. In both schemes, for spheres of attraction with (average) radius $\rho^*N$ and $(K/N)$ small enough (not necessarily zero), the relative fraction of exceptional components (in which $\Pr_u(e_{ij}) \ge \epsilon$) is arbitrarily small with probability one as $N \to \infty$. Stronger results were obtained for
$$K \le \frac{N(1-2\rho^*)^2}{4\ln N},$$
namely, that convergence of all the components is guaranteed with probability one as $N \to \infty$ (similar to the results in [6] for the sum of outer products). A lengthy derivation of the converse of this latter result for the sum-of-outer-products construction was presented in [6]. Questions which are still open are to prove the converse of Proposition 11, to supply a simple and elegant proof for the converse of Proposition 12, to prove the conjecture on which the current proof of Proposition 11 is based, and to supply a "randomization" machinery to extend the $\epsilon$-capacity results (i.e., part a) of Propositions 11 and 12) towards analysis of nondirect convergence (note that this section in [6] is wrong).

ACKNOWLEDGMENT

The author wishes to acknowledge Prof. J. Salz, who introduced him to these exciting problems. The proofs of Propositions 8 and 12 follow his ideas. Many discussions with J. Kilian, A. Odlyzko, A. Wyner, and J. Ziv helped in the research. The comments of the anonymous referees significantly improved the presentation of the material.

APPENDIX I

We prove here a modified version of the union bound, given by the following lemma.

Lemma: Let $A_1, \ldots, A_N$ be events in a common probability space with $\Pr(A_l) \le p_l$, $l = 1, \ldots, N$. For every integer $1 \le L \le N$,
$$\Pr(\text{among } A_1, \ldots, A_N \text{ at least } L \text{ occur}) \le \frac{1}{L}\sum_{l=1}^N p_l.$$

Note: The case $L = 1$ is the union bound.

Proof: Let $X$ be a binary vector of length $N$, composed of $x_1, \ldots, x_N$ with $x_l$ the indicator function of $A_l$. Denote by $\|X\|$ the Hamming norm of $X$, and by $P(X = x)$ the probability that $X$ equals a specific binary vector $x$. We know that
$$\sum_{\{x :\ x_l = 1\}}P(X = x) \le p_l,$$
and seek an upper bound on $q \triangleq \sum_{\{x :\ \|x\| \ge L\}}P(X = x)$. However,
$$\sum_{l=1}^N p_l \ge \sum_{l=1}^N\sum_{\{x :\ x_l = 1\}}P(X = x) = \sum_x\|x\|P(X = x) \ge L\sum_{\{x :\ \|x\| \ge L\}}P(X = x) = Lq,$$
which is exactly our lemma. The tightness of this bound is demonstrated by the following example:
$$P(X = x) = \begin{cases} q, & x = x^0 \text{ with } \|x^0\| = L\\ 1-q, & x = 0\\ 0, & \text{otherwise}. \end{cases}$$
This leads to $p_l = q$ for the $L$ indices with $x^0_l = 1$ (and $p_l = 0$ otherwise), so that $(1/L)\sum_{l=1}^N p_l = q$, which is exactly the probability that at least $L$ of the events occur.
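The lemma and its tightness example are easy to simulate; a tiny sketch (ours, assuming numpy) follows:

```python
import numpy as np

rng = np.random.default_rng(4)
Nev, L, q, trials = 10, 3, 0.2, 100_000

# tightness example: with probability q exactly the first L events occur,
# otherwise none do; so Pr(A_l) = q for l < L and 0 for the rest
fires = rng.random(trials) < q
lhs = fires.mean()                      # Pr(at least L of the events occur)
p = np.zeros(Nev); p[:L] = q
assert lhs <= p.sum() / L + 3 / np.sqrt(trials)   # bound (1/L) sum p_l = q
print(lhs, "vs bound", p.sum() / L)
```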

APPENDIX II

Proposition 11b: Assume there exists $\rho^* < 1/2$ s.t. for every $K$ large enough, $\infty > 1/(1-2\rho^*)^2 \ge E(z_K^2)$, and furthermore assume that the fourth moment of $z_K/E(z_K)$ is bounded uniformly in $K$. Then, for
$$K \le \frac{N(1-2\rho^*)^2(1-a)}{2\ln N},$$
$\lim_{N\to\infty}\Pr(\gamma_2 > \delta) = 0$ for every $a > 0$, $\delta > 0$, $\epsilon > 0$ arbitrarily small.

Remark: This weaker statement of Proposition 11 is independent of the conjecture mentioned there.

Proof: As shown in the proof of Proposition 10,
$$\Pr_u(e_{ij}) \le \exp\left[-\frac{1}{2x_j^TA^{-1}x_j}\right],$$
with $x_j$ as in the proof of Proposition 11. Define $Y \triangleq \det A$ and $X_j \triangleq x_j^T\operatorname{Adj}(A)x_j$; thus $x_j^TA^{-1}x_j = X_j/Y$. The matrix $A$ is generated by inner products of $K$ random vectors, each composed of $N$ $\pm 1$ rv's. In [12] and [13], the first two moments of $\det A$ are calculated for $K = N$. In [11] we extended the results to $K < N$ and to the elements of $\operatorname{Adj}A$, and obtained
$$E[Y] \triangleq E(\det A) = \frac{N!}{(N-K)!} \tag{II.1}$$
$$E[Y^2] \triangleq E[(\det A)^2] = \frac{N!}{(N-K)!}\cdot\frac{K!}{(N-K+2)!}\sum_{i=0}^K(-2)^i\frac{(N+2-i)!}{i!(K-i)!} \tag{II.2}$$
(we shall denote the right side of (II.2) by $h(K)$),
$$E\left[u^i_Nu^l_N\operatorname{Adj}(A)_{il}\right] = \begin{cases}\dfrac{N!}{(N-K+1)!}, & i = l\\[1mm] \dfrac{(N-1)!}{(N-K+1)!}, & i \ne l,\end{cases} \tag{II.3}$$
and the second moments $E[u^i_Nu^l_Nu^r_Nu^m_N\operatorname{Adj}(A)_{il}\operatorname{Adj}(A)_{rm}]$ for the seven cases
$$i=l=r=m,\ \ i=l=r\ne m,\ \ i=l\ne r=m,\ \ i=r\ne l=m,\ \ i=l\ne r\ne m,\ \ i=r\ne l\ne m,\ \ i\ne l\ne r\ne m, \tag{II.4a-g}$$
each an explicit combination of $h(K-1)$ and $h(K-2)$ (the first case equals $h(K-1)$, and terms such as $(N-K+2)h(K-2)$ and $(N-K+2)^2h(K-2)$ appear in the others). These are the seven different types of second moments needed to evaluate $\operatorname{var}(X_j)$, and we denote them by $\Phi_a, \Phi_b, \Phi_c, \Phi_d, \Phi_e, \Phi_f$, and $\Phi_g$, respectively. Further, let $\mu_i$, $i = 1, 2, 3, 4$, denote the first four moments of the random variable $z_K$ mentioned earlier. Then
$$\frac{E(X_j)}{E(Y)} \le \frac{K\mu_2}{N-K+1} + \frac{K^2\mu_1^2}{N(N-K+1)} \le (1-a)[2\ln N]^{-1}, \tag{II.5}$$
where the last inequality follows from the proposition conditions. Some algebra on (II.4) leads to
$$E[X_j^2] = \mu_2^2\Psi_1 + \mu_1^2\mu_2\Psi_2 + \mu_1\mu_3\Psi_3 + \mu_4\Psi_4 + \mu_1^4\Psi_5, \tag{II.6}$$


where
$$\Psi_1 = K(K-1)(K-2)(K-3)\Phi_g + K(K-1)(K-2)(4\Phi_f + 2\Phi_e) + K(K-1)(2\Phi_d + \Phi_c) + 4K(K-1)\Phi_b + K\Phi_a \tag{II.7a}$$
$$\Psi_2 = -6K(K-2)(K-3)\Phi_g + K(K-6)(K-2)(4\Phi_f + 2\Phi_e) + 2K(K-3)(2\Phi_d + \Phi_c) + 12K(K-2)\Phi_b + 6K\Phi_a \tag{II.7b}$$
$$\Psi_3 = 8K(K-3)\Phi_g - 2K(K-4)(4\Phi_f + 2\Phi_e) - 4K(2\Phi_d + \Phi_c) + 4K(K-4)\Phi_b + 4K\Phi_a \tag{II.7c}$$
$$\Psi_4 = -6K\Phi_g + 2K(4\Phi_f + 2\Phi_e) - K(2\Phi_d + \Phi_c) - 4K\Phi_b + K\Phi_a \tag{II.7d}$$
$$\Psi_5 = 3K^2\Phi_g - K^2(4\Phi_f + 2\Phi_e) + K^2(2\Phi_d + \Phi_c). \tag{II.7e}$$
We shall now show that
$$\lim_{N\to\infty}\frac{\operatorname{var}Y}{E(Y)^2} = \lim_{N\to\infty}\frac{\operatorname{var}X_j}{E(X_j)^2} = 0$$
and use these results together with (II.5) to conclude the proof. Starting with $Y$,

and use these results together with (11.5) to conclude the proof Starting with Y,

1 ( N - K + 2 ) ( N - K + 1 )

var Y / E ( Y)' =

K ~ ( N + 2 - i ) ! c ( - 2 Y j N ! i = O

K ( K - ~ ) \ - <

( N - K + 2 ) ( N - K + 1 )

3 K ( K - 1 ) ( N - K + 2)( N - K + 1)

%

i.e., lirn h ( K ) ( 7) ( N - K ) ! ' =l.

N-CC

By (US), (11.6), and the boundness of the pi ,

provided that

1, i=1,5 2, i = 2 . 0, i=3 ,4

'P, lirn

N - w ( XE(Y))'

Manipulating (II.7) and (II.4), we obtain
$$6\Psi_1 + (K-1)\Psi_2 = \frac{2K}{N}h(K-1)\left\{(N-K+4)(K^2-3K+5) - 3K(K+2)\right\} \tag{II.11a}$$
$$4\Psi_2 + 3(K-2)\Psi_3 = \frac{8K^2}{N}h(K-1)(N-K+1), \tag{II.11b}$$
together with two similar identities, (II.11c) and (II.11d), expressing $3\Psi_3 + 4(K-3)\Psi_4$ and $\Psi_5$ through $h(K-1)$ and $h(K-2)$. Dividing (II.11) by $\Psi_1$ and taking the limit as $K, N \to \infty$ with $K/N \to 0$ leads to (II.10), provided that
$$\lim_{N\to\infty}\frac{\Psi_1}{K^2N^2h(K-2)} = \lim_{N\to\infty}\frac{(N-K)!^2\,\Psi_1}{(N-1)!^2K^2} = 1. \tag{II.12}$$
By (II.9), it is enough to prove one of these equalities. However, from (II.4) and (II.7), $\Psi_1$ is an explicit combination of $h(K)$ and $h(K+1)$ (through terms such as $(N-K)^2h(K) - h(K+1)$).

( N - K)!2'P, = I - ( K - 1)( 2 N + K( K - 2 ) ) (N-1 ) ! 'K2 K ( N - K + l ) ( N - K + 2 )

K

j = 2

j ( N - 2 K ) + K ( K + 2 ) K 2 ( N - K + l ) ( N - K + 2 )

2 N + K ( K - 2 ) ( N - K + l ) ( N - K + 2 )

+ 4K (1+ ) K . (11.13) ( N - K + 1 ) N - K + l

The last inequality holds for N 2 2 K (derived similarly to (11.8)). Thus,

( N - K ) ! ~ ' P ~ lim <1.

N + W ( N - ~ ) ! K ' -

However, since the left side of (11.13) is E( X : ) / E ( X , ) 2 (for p* = p, ) , it must be at least 1, and (11.12) follows.

We now use (11.8) to get

P r ( ~ s E ( Y ) ( I - ~ ) ) I ( v ~ ~ Y ) / ( E Y ) ' / ~ ' 3 K ( K - 1 )

< - q 2 ( N - K + 1 ) ( N - K + 2 )

(Chebyshev bound for Y ) . Further, if Y 2 E( Y)(1- q ) , then

Denote by X , / E ( Y ) ( l - 1); then Ek/ I (1 - a)/ (1 - q)21n N. For 11 I a and K I EN we obtain the result

1 E x , < -

21n( :)


and therefore the Chebyshev bound for $\hat{X}_j$ yields
$$\Pr\left(\hat{X}_j \ge \frac{1}{2\ln N}\right) \le \frac{\operatorname{var}X_j}{E(Y)^2}(4\ln N)^2\left(\frac{1}{a}-1\right)^2. \tag{II.15}$$
Using the modified union bound (proved in Appendix I) w.r.t. the events $\{\hat{X}_j \ge 1/(2\ln N)\}$, $j = 1, \ldots, N$, and $L = \delta N$, we obtain that the fraction of components whose conditional "error probability" exceeds $\epsilon$ is larger than $\delta$ with probability at most $\Pr(Y \le E(Y)(1-\eta))$ plus $(1/\delta)$ times the right side of (II.15). Since $K/N \to 0$, we can choose $\eta^2 = (K-1)/(N-K+1)$; then for a large enough value of $N$ we have $K \le \epsilon N$ and $\eta \le a/2$ (for every $\epsilon > 0$, $a > 0$). Thus $\lim_{N\to\infty}\Pr(\gamma_2 > \delta) = 0$, since $\lim_{N\to\infty}(\operatorname{var}X_j)/E(X_j)^2 = 0$.

REFERENCES

[1] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proc. Nat. Acad. Sci. USA, vol. 79, pp. 2554-2558, 1982.
[2] J. J. Hopfield and D. W. Tank, "'Neural' computation of decisions in optimization problems," Biol. Cybern., vol. 52, pp. 141-152, 1985.
[3] D. W. Tank and J. J. Hopfield, "Simple 'neural' optimization networks: An A/D converter and a linear programming circuit," IEEE Trans. Circuits Syst., vol. CAS-33, pp. 533-541, 1986.
[4] J. S. Denker, "Neural network models of learning and adaptation," Phys. D, vol. 22, pp. 216-232, 1986.
[5] Y. S. Abu-Mostafa and J. St. Jacques, "Information capacity of the Hopfield model," IEEE Trans. Inform. Theory, vol. IT-31, pp. 461-464, 1985.
[6] R. J. McEliece et al., "The capacity of the Hopfield associative memory," IEEE Trans. Inform. Theory, vol. IT-33, pp. 461-482, 1987.
[7] S. H. Cameron, "An estimate of the complexity requisite in a universal decision network," in Proc. Bionics Symp., Wright Air Force Dev. Div. (WADD) Rep. 60-600, 1960, pp. 197-212.
[8] J. Komlós, "On the determinant of (0,1) matrices," Stud. Sci. Math. Hungarica, vol. 2, pp. 7-21, 1967.
[9] J. Salz, private communication.
[10] V. V. Petrov, Sums of Independent Random Variables. Berlin: Springer-Verlag, 1975.
[11] A. Dembo, "On random determinants," Quart. Appl. Math., vol. XLVII, no. 2, June 1989.
[12] H. Nyquist, S. O. Rice, and J. Riordan, "The distribution of random determinants," Quart. Appl. Math., vol. XII, no. 2, pp. 97-104, July 1954.
[13] A. Prékopa, "On random determinants I," Stud. Sci. Math. Hungarica, vol. 2, pp. 125-132, 1967.
[14] H. Sompolinsky and I. Kanter, "Temporal association in asymmetric neural networks," Phys. Rev. Lett., vol. 57, pp. 2861-2864, 1986.
[15] T. N. E. Greville, "Some applications of the pseudoinverse of a matrix," SIAM Rev., vol. 2, pp. 15-22, Jan. 1960.
[16] T. Cover, "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition," IEEE Trans. Electron. Comput., vol. EC-14, pp. 326-334, 1965.
[17] P. Baldi, "Neural networks, orientations of the hypercube and algebraic threshold functions," IEEE Trans. Inform. Theory, vol. IT-34, no. 3, pp. 523-530, May 1988.
[18] L. Schläfli, Gesammelte Mathematische Abhandlungen I. Basel, Switzerland: Verlag Birkhäuser, 1950, pp. 209-212.
[19] B. L. Montgomery and B. V. K. Vijaya Kumar, "Evaluation of the use of the Hopfield neural network model as a nearest-neighbor algorithm," Appl. Opt., vol. 25, pp. 3759-3766, 1986.
[20] J. Bruck and J. Sanz, "A study on neural networks," J. Intell. Syst., Feb. 1988.
[21] L. Personnaz et al., "Information storage and retrieval in spin-glass like neural networks," J. Physique Lett., vol. 46, pp. L359-L365, 1985.
[22] S. S. Venkatesh and D. Psaltis, "Linear and logarithmic capacities in associative neural networks," IEEE Trans. Inform. Theory, to be published.
[23] A. D. Maruani and G. Y. Sirat, "Information retrieval in neural networks," preprint.
[24] S. Dasgupta et al., "Convergence in neural memories," preprint.
[25] S. S. Venkatesh, "Epsilon capacity of neural networks," in AIP Conf. Proc., vol. 151, 1986, pp. 440-445.
[26] J. Silverstein, "The smallest eigenvalue of a large dimensional Wishart matrix," Ann. Prob., vol. 13, no. 4, pp. 1364-1368, 1985.
[27] J. Cohen, H. Kesten, and C. Newman, Eds., Random Matrices and Their Applications, AMS Contemporary Mathematics, vol. 50, 1986.
[28] D. J. Amit et al., "Statistical mechanics of neural networks near saturation," Phys. Rev. Lett., vol. 55, p. 1530, 1985; also in Phys. Rev. A, vol. 32, p. 1007, 1985.