Biol Cybern (2012) 106:271–281
DOI 10.1007/s00422-012-0494-6

ORIGINAL PAPER

Efficient and robust associative memory from a generalized Bloom filter

P. Sterne

Received: 12 August 2011 / Accepted: 27 April 2012 / Published online: 23 June 2012
© Springer-Verlag 2012

Abstract We develop a variant of a Bloom filter that is robust to hardware failure and show how it can be used as an efficient associative memory. We define a measure of the information recall and show that our new associative memory is able to recall more than twice as much information as a Hopfield network. The extra efficiency of our associative memory is all the more remarkable as it uses only bits while the Hopfield network uses integers.

Keywords Associative memory · Randomized storage · Bloom filter

1 Introduction

A Bloom filter (Bloom 1970) is a data structure that provides a probabilistic test of whether a new item belongs to a set (Broder and Mitzenmacher 2004). Bloom filters are unfortunately little-known outside of randomized algorithms in computer science. However, they have found extensive practical use in industry, in web caching (Fan et al. 2000), peer-to-peer networks (Broder and Mitzenmacher 2004), and spell-checking in resource-constrained environments (Bloom 1970). In this paper, we generalize the Bloom filter in such a way that we can perform inference over the set of stored items. This inference allows the filter to serve as an efficient and robust associative memory.

P. Sterne
Cavendish Laboratory, Cambridge University, Cambridge, UK

Present Address:
P. Sterne (B)
Frankfurt Institute for Advanced Studies, Johann Wolfgang Goethe University, Ruth-Moufang-Str. 1, 60438 Frankfurt am Main, Germany
e-mail: [email protected]

The approach described here is neurally inspired: Biological neural networks are able to perform useful computation using (seemingly) randomly connected simple hardware elements. They are also exceptionally robust and can lose many neurons before losing functionality. Analogously, our approach is characterised by storing the results of simple random functions of the input and performing distributed statistical inference on the stored values. Each individual part of our Bloom filter works in parallel, independently of the others. Since the parts are independent, the inference is able to continue even if some of the parts fail.

This paper defines an architecture as ‘neurally-inspired’ if it satisfies two important design goals:

1. Inability to specify exact connectivity: At the gross anatomical level there exists detailed structure in neural connectivity. As an example, the neocortex has distinct layers of neurons which send dendrites out to other distinct layers (Bear et al. 2007). However, at the level of individual cells we believe that it is impractical to specify precisely which other cells should be connected. It is at this level in vertebrates that connections are somewhat random.

2. Robustness to hardware failure: Neurons are not only very noisy pieces of hardware, they are also liable to die at any time. However, we do not expect this failure to catastrophically affect the computations carried out by other neurons. This implies that we need to use distributed representations with no single points of failure (Hinton et al. 1984).

These two design criteria favour architectures that use simple distributed pieces of hardware which collectively solve a problem through communication. The associative memory in this paper uses the statistical evidence from past observation to perform inference. The inference is performed in a randomly connected factor graph and performance is not dependent on the exact connectivity of the graph. Removing any individual component corresponds directly to the non-observation of such data and is a principled solution to the problem of hardware failure.

Our Bloom filter is created from many identical parts. Each part will have a different random boolean function and a single bit of storage associated with it. In this approach, storage and processing are closely related. Our network is similar to Willshaw nets (Willshaw et al. 1969) and Sigma-Pi nets (Plate 2000), which both store items using simple logical functions. Our network, however, uses principled statistical inference similar in spirit to Sommer and Dayan (1998), but for a much more general architecture.

2 Neurally inspired Bloom filter

A standard Bloom filter is a representation of a set of items. The algorithm has a small (but controllable) probability of a false positive. This means there is a small probability of the Bloom filter reporting that an item has been stored when in fact it has not. Accepting a small probability of error allows far less memory to be used (compared with direct storage of all the items) as Bloom filters can store hashes of the items.

In the next section, we briefly describe a standard Bloom filter before moving on to the neurally inspired version. A good introduction to Bloom filters is given in Mitzenmacher and Upfal (2005).

2.1 The standard Bloom filter

A standard Bloom filter consists of a bit vector z, and K hash functions {h_1, . . . , h_K}. Each hash function is a map from the object space onto indices in z. To represent the set we initialise all elements of z to zero. For every object x that is added to the set we calculate {h_1(x), . . . , h_K(x)}, using the results as indices, and set the bit at each index. If a bit has already been set, then it remains on.

To test whether or not x is an element of the set we calculate h_1(x), . . . , h_K(x); if any of the locations h_k(x) in z contains 0 then we can be certain the item is not in the set. If all indices are 1, then x is probably part of the set, although it is possible that it is not and all K bits were turned on by storing other elements.
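As an illustration, the following is a minimal Python sketch of this store/query logic. The use of Python's built-in hash salted with a per-function value is our own stand-in for the K independent hash functions; it is not the construction used later in the experiments.

```python
import random

class BloomFilter:
    def __init__(self, M, K, seed=0):
        self.z = [0] * M                  # the bit vector z
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(K)]

    def _indices(self, x):
        # h_1(x), ..., h_K(x): salted hashes mapped onto indices of z
        return [hash((salt, x)) % len(self.z) for salt in self.salts]

    def add(self, x):
        for i in self._indices(x):
            self.z[i] = 1                 # set the bit; already-set bits stay on

    def __contains__(self, x):
        # a single zero proves x was never stored; all ones means "probably stored"
        return all(self.z[i] for i in self._indices(x))

bf = BloomFilter(M=1024, K=7)
bf.add("item-1")
print("item-1" in bf, "item-2" in bf)     # True, almost certainly False
```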

The calculations for a Bloom filter can also be thought of as occurring at each individual bit in z. Bit z_m calculates all the hash functions and, if any of them is equal to m, then z_m is set. To literally wire a Bloom filter in this way would be wasteful, as the same computation is being repeated for each bit. However, this view suggests a generalization of the Bloom filter. Rather than calculate a complicated hash function for each bit, a neurally inspired Bloom filter calculates a random boolean function of a subset of the bits in x.

2.2 Neurally inspired modifications

A Bloom filter can be turned into a neurally inspired Bloom filter (as shown in Fig. 1) by replacing the K hash functions with a simple boolean function for each bit in z. We denote these functions h_m to indicate that they serve a similar function to the original Bloom filter hashes, and we use the m to indicate that there are no longer K of them, but M.

To initialise the network we set the storage z to zero. We store information by presenting the network with the input; for whichever boolean functions return true, the associated storage bit z_m is turned on. To query whether an x has been stored, we evaluate h(x); if any of the boolean functions returns true but the corresponding bit in z is false, then we can be sure that x has never been stored. Otherwise we assume that x has been stored, although there is a small probability that we are incorrect and the storage bits have been turned on by other data.

For our boolean function, we work with individual bits in x. Our function ‘OR’s lots of randomly negated bits that have been ‘AND’ed together—also known as a Sigma-Pi function1 (Plate 2000):

h_m(x) = x_1 · (¬x_6) · x_3 ∨ (¬x_3) · x_9 · (¬x_2) ∨ (¬x_5) · x_6 · x_4.  (1)

The variables that are ‘AND’ed together are chosen at random, and varied for each bit of storage. These functions have the advantage that they are simple to evaluate, and they have a tunable probability of returning true in the absence of knowledge about x.
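A sketch of how such a filter might be implemented is given below. The parameter names a (variables per ‘AND’ clause) and b (number of clauses ‘OR’ed together) anticipate Eq. 5; the class structure itself is our own illustration rather than the code used for the experiments.

```python
import random

def random_sigma_pi(N, a, b, rng):
    """Build one h_m: an 'OR' of b clauses, each an 'AND' of a distinct,
    randomly negated bits of x (cf. Eq. 1)."""
    clauses = []
    for _ in range(b):
        idxs = rng.sample(range(N), a)               # distinct variables
        negs = [rng.random() < 0.5 for _ in idxs]    # random negations
        clauses.append(list(zip(idxs, negs)))
    def h(x):
        return any(all(bool(x[i]) != neg for i, neg in clause)
                   for clause in clauses)
    return h

class NeuralBloomFilter:
    def __init__(self, M, N, a, b, seed=0):
        rng = random.Random(seed)
        self.h = [random_sigma_pi(N, a, b, rng) for _ in range(M)]
        self.z = [0] * M

    def store(self, x):
        # set z_m wherever h_m(x) returns true
        for m, hm in enumerate(self.h):
            if hm(x):
                self.z[m] = 1

    def query(self, x):
        # any true h_m whose bit z_m is still zero rules x out
        return all(self.z[m] for m, hm in enumerate(self.h) if hm(x))
```

Querying a stored pattern always returns true, while an unstored pattern is rejected unless every function it activates has already had its bit set by other patterns (the failure mode analysed below).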

Given the number R of items to store, we can find the form of the boolean functions that minimizes the false positive probability. If we denote by p the probability that a single function returns true, then the probability p* of a bit being true after all R memories have been stored is:

p* = 1 − (1 − p)^R.  (2)

We can estimate the false positive probability by noting that there are M(1 − p)^R zeros and the probability of all of these returning false is:

Pr(failure) = (1 − p)^{M(1−p)^R} ≈ e^{−Mp(1−p)^R},  (3)

which has a minimum when:

p = 1/(R + 1)  (4)

and results in p* ≈ 1 − e^{−1} of the bits being on after all memories have been stored. Therefore, the storage does not look like a random bit string after all the patterns have been stored.

1 We dislike the sigma-pi terminology and find the term ‘sum-product’ function to be much more descriptive. However, this terminology is unfortunately confusing as belief propagation is also known as the sum-product algorithm.

Fig. 1 In a neurally inspired Bloom filter we associate a boolean hash function with each bit in z. To store x in the set each hash function calculates its output for x and if true then the associated bit is set to one. In d we can be sure that x has never been stored before, as two elements of h(x) return true, but the associated bits in z are false. a Each bit has a function which “AND”s many variables and “OR”s the result. b We initialise all elements in z to zero. c Adding an item. d Testing for membership

If p is the target probability, we have to choose how many terms to ‘AND’ together and then how many such terms to ‘OR’. Ideally we would like:

p = 1 − (1 − 2^{−a})^b,  (5)

where a is the number of “AND”s and b is the number of “OR”s. For large R, combining this with Eq. 4 gives the approximate relation:

b = 2^a / (R + 1).  (6)
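As a small illustration of Eqs. 4–6, one way to pick the number of ‘OR’ed clauses for a given a is to round Eq. 6 to the nearest integer (the rounding convention is ours):

```python
def choose_or_terms(R, a):
    """Pick b so that p = 1 - (1 - 2**-a)**b is close to the optimum 1/(R + 1)."""
    b = max(1, round(2 ** a / (R + 1)))      # Eq. 6
    p = 1 - (1 - 2 ** -a) ** b               # resulting activation probability, Eq. 5
    return b, p

b, p = choose_or_terms(R=100, a=10)
print(b, p, 1 / 101)    # b = 10; p is close to the target 1/(R + 1)
```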

Note that the form of the functions does not depend on M: the neurally inspired Bloom filter scales with any amount of storage, while optimal performance in a traditional Bloom filter requires that the number of hash functions be changed. This property makes the neurally inspired Bloom filter robust to hardware failure, since losing storage bits does not require re-tuning the functions. To store R patterns in a traditional Bloom filter with a probability of false positive failure no greater than p_f requires a storage size of at least:

M = R(− ln p_f) / ln² 2,  (7)

while a neurally inspired Bloom filter would require a storage size:

M = (R + 1)(− ln p_f) e.  (8)

For an equal false positive probability the neurally inspired version requires 31 % more storage (a factor of e ln² 2 ≈ 1.31). This is the price for choosing bits in parallel, independently of all others. If we could ensure that exactly Mp functions were activated every time, then the optimal saturation would again be 0.5 and the performance would be identical to the traditional Bloom filter.
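For concreteness, the two storage requirements can be compared by evaluating Eqs. 7 and 8 directly; the sketch below does nothing more than that.

```python
import math

def storage_traditional(R, p_f):
    return R * (-math.log(p_f)) / math.log(2) ** 2    # Eq. 7

def storage_neural(R, p_f):
    return (R + 1) * (-math.log(p_f)) * math.e        # Eq. 8

R, p_f = 100, 0.01
M_trad = storage_traditional(R, p_f)
M_neural = storage_neural(R, p_f)
# for large R the ratio approaches e * ln(2)**2, roughly 1.31 (about 31 % more storage)
print(round(M_trad), round(M_neural), M_neural / M_trad)
```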

In Fig. 2a we compare the theoretical performance with the empirical performance of both the Bloom filter and the neurally inspired Bloom filter. The dotted lines represent the theoretical performance of each, and the empirical performance of both is very close to theory. The empirical performance was obtained by averaging many runs together, using different boolean hash functions. In general, we found that practically all neurally inspired Bloom filters were well-behaved and the performance was close to the mean.

Fig. 2 The probability of error for a neurally inspired Bloom filter for completely random binary vectors and vectors that only differ in a single bit. The grey lines give the expected theoretical values (as derived from Eqs. 8 and 9 respectively), which are closely matched by the experiments. a The false positive probability of the Bloom filter and its neurally inspired equivalent falls off exponentially as more storage is added. b The false positive probability for an input which differs in only one bit from a stored pattern. Increasing the number of ‘AND’s decreases the probability of error, but requires exponentially more ‘OR’s

To obtain these plots we stored R = 100 random patterns of N = 32 bits. We let M range from M = 100 to M = 2500. Note that even with the maximum storage we tested, we still used fewer bits than naively storing the patterns would require (NR = 3200). The h_m functions were constructed by ‘AND’ing a = 10 distinct variables together and ‘OR’ing b = 10 such terms together. The many hash functions of a traditional Bloom filter were constructed using a linear composition of two hash functions, as suggested by Kirsch and Mitzenmacher (2008); the two hash functions themselves were linear multiplicative hash functions taken from Knuth (1973).

2.3 Sensitivity to a bit flip

Since the boolean hash functions only use a small number of the bits in x, we might expect that the probability of error for an unstored x which is very similar to a stored pattern will be much higher than for a completely random binary vector. For concreteness, we will estimate the probability of error for an unstored binary vector that differs in only a single bit from a stored pattern.

For a conjunction in h_m the probability that it contains the flipped bit is a/N. Therefore, the probability of a hash function returning true given a single bit flip is p · a/N. After R patterns have been stored there will be approximately M(1 − p)^R zeros in z. The network fails to detect a bit flip when none of the zeros in z are turned on:

Pr(failure | single bit flip) = [1 − ap/N]^{M(1−p)^R}
                              ≈ e^{−Mp(1−p)^R · a/N}
                              = [Pr(failure)]^{a/N}.  (9)

(This approximation is only accurate when the total number of possible boolean hash functions is much larger than M, the number actually used.)

Equation 9 also shows that as more bits are ‘AND’ed together the performance for neighboring binary vectors becomes more independent, and in the case a = N the performance is completely independent. Practically, though, only a small number of bits can be ‘AND’ed together, as the number of ‘OR’s required increases exponentially (Eq. 6). Our results in Fig. 2b show that the approximations are accurate. For this plot we set M = 2000 and N = 32, stored R = 50 patterns, and let a range from 5 to 10, while keeping the probability of h(x) returning true as close to optimal as possible. It is clear that increasing a decreases the probability of a false positive from a single bit flip.

A neurally inspired Bloom filter underperforms a standard Bloom filter as it requires more storage and has a higher rate of errors for items that are similar to already-stored items. While such negative results might cause one to discard the neurally inspired Bloom filter, we show that there is still a setting in which it outperforms the alternatives. The neurally inspired Bloom filter's simple hash functions allow us to perform inference over all possible x vectors. This allows us to create an associative memory with high capacity, which we show in the next section.

2.4 Related work

2.4.1 Familiarity discrimination

Particularly relevant work to this article is Bogacz et al. (1999). Although they describe the task as familiarity discrimination, they essentially represent a set of previously seen elements using a Hopfield network. After carefully examining the statistics for the number of bits that flip when performing recall of a random element, they then determine a threshold for declaring an element unseen or not. They define ‘reliable behaviour’ as ‘better than a 1 % error rate for a random, unseen element’. Their method can generate both false positives and negatives. From Eq. 8 our Bloom filter requires 13 bits per pattern for a 1 % error rate and gives only false positive errors. We can phrase this result in their terminology. If we scale our storage size M similarly to a Hopfield network with increasing input size N (M = N(N − 1)/2), then we can store:

R = N(N − 1) / (2 × 13)  (10)

patterns, which to leading order is R = 0.038N² and compares favourably to their reported result of 0.023N². It is worth emphasizing that we achieve superior performance using only binary storage, while Hopfield networks use integers. We can also invert the calculation to find that a Hopfield network needs approximately 22 integers per pattern for a 1 % recognition error rate. Since our representation of a set is more efficient than a Hopfield network's, we might suspect that we can create a more efficient associative memory as well.
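The arithmetic behind these figures is easy to reproduce; the snippet below simply re-evaluates Eq. 8 at p_f = 0.01, Eq. 10, and the 0.023N² result quoted above.

```python
import math

bits_per_pattern = math.e * -math.log(0.01)   # Eq. 8 per stored pattern: ~12.5, so 13 bits
leading_coeff = 1 / (2 * 13)                  # Eq. 10 to leading order: ~0.038 N^2
hopfield_ints = 0.5 / 0.023                   # integers per pattern at 0.023 N^2: ~22
print(bits_per_pattern, leading_coeff, hopfield_ints)
```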

2.4.2 The WISARD architecture

Igor Aleksander's WISARD architecture is also relevant to our study (Aleksander et al. 1984). In WISARD, a small number of randomly chosen bits in x are grouped together to form an index function which indexes z. To initialise the network we create many such index functions and set z to all zeros. When we store a pattern we calculate the indexes into z and set those entries to 1. So far the WISARD architecture sounds very similar to the Bloom filter (both ours and the traditional form); however, the WISARD architecture is designed to recognise similar patterns. This forces the hash functions to be very simple, and a typical hash function would take the form:

h_k(x) = 16x_22 + 8x_43 + 4x_6 + 2x_65 + x_75.  (11)

This partitions z into blocks of 32 bits and we would have a different partition for each function. If we want to test whether a similar pattern has already been stored, we again calculate the hash functions for the pattern but now we count how many bits have been set at the specified locations. If the pattern is similar to one already stored then the number of bits set will be correspondingly higher. We could emulate the WISARD functionality by turning Eq. 11 into the following set of sigma-pi functions:

Fig. 3 The Willshaw associative network stores pairs of patterns {(x, y)}. To store a pattern, z_ij is updated according to the following rule: z_ij = z_ij ∨ x_i y_j. The Willshaw network is designed to store sparse binary patterns. Given a correct x we can then use z to recall the associated y

h_0(x) = (¬x_22)(¬x_43)(¬x_6)(¬x_65)(¬x_75)
h_1(x) = (¬x_22)(¬x_43)(¬x_6)(¬x_65)(x_75)
h_2(x) = (¬x_22)(¬x_43)(¬x_6)(x_65)(¬x_75)
. . .
h_31(x) = (x_22)(x_43)(x_6)(x_65)(x_75)  (12)
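A minimal sketch of this kind of block lookup is shown below, with five bits per group as in Eq. 11 (so each block of z holds 32 entries). The class layout and parameter names are our own illustration, not Aleksander's implementation.

```python
import random

class Wisard:
    def __init__(self, N, n_blocks, bits_per_block=5, seed=0):
        rng = random.Random(seed)
        # each block watches a few randomly chosen input bits, as in Eq. 11
        self.groups = [rng.sample(range(N), bits_per_block) for _ in range(n_blocks)]
        self.z = [[0] * (2 ** bits_per_block) for _ in range(n_blocks)]

    def _index(self, group, x):
        # interpret the selected bits as a binary number indexing into the block
        return sum(bit << k for k, bit in enumerate(x[i] for i in group))

    def store(self, x):
        for group, block in zip(self.groups, self.z):
            block[self._index(group, x)] = 1

    def score(self, x):
        # number of blocks whose indexed entry is set: higher for similar patterns
        return sum(block[self._index(group, x)]
                   for group, block in zip(self.groups, self.z))
```

Storing sets one entry per block, and the score for a query counts how many blocks recognise it, which is high for patterns similar to a stored one.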

However, we have a different aim, in that we want to be sensitive to different patterns, even if they are similar. In Sect. 2.3 we analysed our Bloom filter's sensitivity to a single changed bit from a known pattern; this motivates the different form of our hash functions from the WISARD architecture.

2.4.3 The Willshaw network

The Willshaw network associates a sparse binary input together with a sparse binary output in a binary matrix which describes potential all-to-all connectivity between all possible pairs of inputs and outputs. Initially all the associations between the input and output vectors are off. When an (x, y) pair is stored, the connections between active inputs and active outputs are turned on (see Fig. 3). During recall just an x is presented and the task is to use the synaptic connectivity to estimate what y should be. While many recall rules have been proposed (Graham and Willshaw 1995, 1996, 1997), the most relevant performs proper statistical inference (Sommer and Dayan 1998) to retrieve the maximum likelihood estimate.

A Willshaw network can represent a set of items in a straightforward manner. If the network is queried with a pair of binary vectors (x, y) and one can find z_ij = 0 yet x_i = 1 and y_j = 1, then one knows that the pattern has not been stored before. This assumes that the synaptic connectivity is completely reliable, but does allow a simple estimate of the capacity. As a further simplification we will consider the case that the dimensionalities of x and y are equal (N). Since Willshaw networks are efficient when the patterns are sparse (∑_i x_i = log_2 N), if we store approximately:

R = (N / log_2 N)²  (13)

patterns, then the probability of a bit in z being on is roughly 50 %, and the probability of getting a false positive for a randomly chosen pair of binary vectors x′ and y′ is:

Pr(failure) = 2^{−(log_2 N)²} = N^{−log_2 N}.  (14)

This gives the rate at which false positives decrease for increasing pattern sizes. Alternatively, we can think about a maximum probability of error p_f and store more patterns as the pattern sizes increase:

R ≈ −(N / log_2 N)² · log(1 − p_f^{(1/log_2 N)²}).  (15)

Notice that a Willshaw network is not able to store dense binary vectors very well. After R = 20 patterns have been stored, the probability that a particular connection has not been turned on is on the order of one in a million. This is a very uninformative state and suggests that one of the key insights of the neurally inspired Bloom filter is to use functions which are able to create sparse representations (for an arbitrary level of sparsity).
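To make the set-membership test of Sect. 2.4.3 concrete, here is a minimal sketch using plain Python lists; it is our own simplification and assumes reliable synapses, as in the text.

```python
class WillshawNet:
    def __init__(self, N):
        self.z = [[0] * N for _ in range(N)]      # binary connectivity matrix

    def store(self, x, y):
        # z_ij = z_ij OR x_i * y_j (Fig. 3)
        for i, xi in enumerate(x):
            if xi:
                for j, yj in enumerate(y):
                    if yj:
                        self.z[i][j] = 1

    def maybe_stored(self, x, y):
        # a single z_ij == 0 with x_i = y_j = 1 proves (x, y) was never stored
        return all(self.z[i][j]
                   for i, xi in enumerate(x) if xi
                   for j, yj in enumerate(y) if yj)
```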

2.4.4 Sparse distributed memory

Kanerva's sparse distributed memory (Kanerva 1988) has a lot of similarity with the work presented here. Where Kanerva refers to neurons as ‘address decoders’, here they are described as hash functions. Kanerva also fixes the synaptic weights with random values, corresponding to our randomly chosen hash functions. Kanerva's address decoders and our binary hash functions are designed to activate for only a small proportion of inputs (i.e. they are sparse). Kanerva's work has an unusual level of mathematical rigor: the behaviour of the network is described with great clarity using intuitive geometric descriptions, even if the results are sometimes far from intuitive. Kanerva (1988) also makes interesting (but not compelling) comparisons between the sparse distributed memory and the cerebellum.

However, the recognition functions chosen by Kanerva involve thresholding a linear sum involving the entire input pattern. Our study suggests that it is much more informative to use functions which use fewer variables. The dynamics of recognition are also just a simple linear sum for each output bit, which is not the correct Bayesian decoding. The easiest observation is that prior information (from the cue) is not retained, and if too many patterns are stored then the memory is overloaded and recalls random patterns unrelated to any stored patterns. A major difference is that while the Hopfield network will converge to an unrelated pattern, the sparse distributed memory will not converge but will sample seemingly unrelated binary vectors. This allows one to determine whether the network is performing reliably.

A lack of proper statistical decoding means that the sparse distributed memory has a poor capacity. As an example, Kanerva (1988) describes the performance of a network with M = 10^9 units (10^6 units for each bit in a pattern of N = 10^3 bits). The sensitivity is set such that 10^3 locations per bit are set for each pattern that is stored, and it has a critical distance of 380 bits for the most recently stored pattern (in our terminology, it can clean up a cue with noise level 38 %). However, if 2,300 more patterns are stored then the pattern has a 50 % chance of converging, even if the cue is noiseless. To describe this performance using our measure would require extensive simulations. However, we can get a simple upper bound on the performance by assuming that the maximally noisy cue (p_c = 0.38) will be successfully cleaned up with zero probability of error for all R = 2,300 patterns. This gives:

R N · H_2(p_c) / M = 2.2 × 10^{−3},  (16)

which suggests the associative network presented here is at least 168 times more efficient (and the true figure is probably much higher, since the critical distance rapidly decreases with more stored patterns).
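The bound in Eq. 16 is a direct calculation, reproduced here as a check:

```python
import math

def H2(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

R, N, M, p_c = 2300, 1000, 10 ** 9, 0.38
bound = R * N * H2(p_c) / M      # Eq. 16: about 2.2e-3 bits per bit of storage
print(bound, 0.37 / bound)       # our 0.37 efficiency (Sect. 3.5) is ~168 times larger
```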

3 Neurally inspired binary associative memory

An associative memory stores many patterns, and is then given an approximate pattern as a starting cue. The associative memory must then return the closest matching pattern that has been stored in the memory. A Bloom filter could implement a very inefficient associative memory as follows. Once the Bloom filter has stored all the patterns in the standard manner, it can test all possible set elements for membership. If the elements are binary vectors then this step will take exponential time (so it is inefficient and impractical); however, a small list of elements will be identified as members of the set. Now, given a new pattern as the starting cue, it can be compared with all the members of the set and the closest member can be returned.

If we measure performance on a bit-wise level then the optimal recall procedure is slightly different. In this case it will be better to return a pattern that has the smallest expected number of bit errors, even if that pattern is not a member of the Bloom filter. To find the smallest number of bit errors for an N-bit pattern, we must make N decisions, one for each bit. The decision for each single bit will come from Bayes' theorem; we will estimate the probability of a member of the set given the initial cue. This probability will provide statistical evidence for the bit, and we will marginalize out over all members of the set.

Fig. 4 The associative memory task and the factor graph that achieves that task. a After many patterns have been stored in the neurally inspired Bloom filter we would like to recall a stored pattern that is similar to a given cue c; we calculate the bitwise marginal posteriors Pr(x_n | c, z). b The factor graph for inference during recall. We use φ, ψ, and θ as shorthand for the various stages in the message algorithm. The φ's are calculated once while the updates for ψ's and θ's are alternately calculated until convergence or a maximum number of iterations has been reached

The method as described above is incredibly inefficient. It will require calculations to be carried out for every possible binary vector, and for a moderately sized problem this could require more time than the age of the universe. Even once the Bloom filter has discarded most of the vectors, the remaining members of the set could still be infeasibly many. However, this is not a problem for the neurally inspired Bloom filter. The neurally inspired Bloom filter uses simple hash functions, and this allows the inference to be calculated efficiently. Instead of explicitly summing over all binary vectors, we can do this implicitly using the sum-product algorithm (Kschischang et al. 2001).

The most relevant literature for this section is Kirsch and Mitzenmacher (2006), who consider the task of determining whether an item is similar to any item already stored, or significantly different from all other items. They do this by creating a partitioned Bloom filter with a distance-sensitive hash function. If a similar item has already been stored, then there will be a significant number of ones in the locations specified by the hash functions. Their approach requires a large amount of storage and they can only tell the difference between items very near or very far (in the paper ‘near’ is described as an average bit-flip probability < 0.1 and ‘far’ is an average bit-flip probability > 0.4).

3.1 Recall

Once all the patterns are stored in the network we would like to retrieve one, given an input cue in the form of the marginal probabilities of the bits in x being on: Pr(x_n) = c_n. To decode the noisy redundant vector we use loopy belief propagation (MacKay 2003; Bishop 2006).

If we cared about recalling the entire x correctly, then we would want to find the most likely x. However, in this paper we tolerate a small number of bit errors in x. To minimize the probability of bit error for bit x_n we should predict the most likely bit by marginalising out everything else that we believe about the rest of x. Our posterior over all possible binary strings for x after observing z and the noisy cue c is given by Bayes' rule:

Pr(x | c, z) = Pr(x | c) · Pr(z | x) / Pr(z),  (17)

where Pr(x | c) represents the prior probability of x given an initial cue. We assume that noise in x is independent and all bits are flipped with identical probability. To make a prediction for an individual bit x_n we need to marginalise out over all other bits, Pr(x_n | c, z) = ∑_{x_−n} Pr(x | c, z) (where the sum over x_−n runs over all possible values of every component of x except the nth).

The marginal calculation can be simplified as the distributive law applies to the sum of many products. In general, the problem is broken down into message passing along the edges of the graph shown in Fig. 4a. In Fig. 4b the equivalent factor graph is shown. For a more detailed explanation of the sum-product algorithm (which is a slightly more general version of belief propagation) see Kschischang et al. (2001).

Unfortunately the sum-product algorithm is only exact for graphs which form a tree, and the graph considered here is not a tree. However, the algorithm can be viewed as an approximation to the real posterior (MacKay 2003) and empirically gives good performance.

3.2 Calculation of messages

We use the index m to refer to functions since we have M encoding functions, and use the index n to refer to the variables (although there are more than N variables, e.g. the intermediate calculations for each boolean hash function). All of the messages communicated here are for binary variables and consist of two parts. As a shorthand we denote the message m(x_n = 0) by x̲_n and m(x_n = 1) by x̄_n. The simplest function to consider is negation:

¬ [x̲_n ; x̄_n] = [x̄_n ; x̲_n].  (18)

Effectively a negation interchanges the two messages, and is symmetric in either direction. If we want to consider the ‘AND’ of many variables {x_n}, the up message is:

∧↑(x) = [1 − ∏_i x̄_i ; ∏_i x̄_i].  (19)

However, the ‘AND’ message is not symmetric in either direction and has a different form for the down message:

∧↓(x_−i, y) = [y̲ ; y̲ + (ȳ − y̲) ∏_{j≠i} x̄_j].  (20)

The ‘OR’ message can be derived from the observation that:

x_1 ∨ x_2 = y  (21)

is logically equivalent to:

(¬x_1) ∧ (¬x_2) = ¬y,  (22)

which gives:

∨↑(x) = [∏_i x̲_i ; 1 − ∏_i x̲_i]  (23)

and

∨↓(x_−i, y) = [ȳ + (y̲ − ȳ) ∏_{j≠i} x̲_j ; ȳ].  (24)

It is important to note that all these messages are calculated in a time that is linear in the number of variables. Linear complexity is far better than might be naively expected from a method that requires calculating the sum of exponentially many terms. Fortunately, the structure in ‘AND’ and ‘OR’ provides computational shortcuts.
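A sketch of these message computations is given below. Each message is represented as a pair (m(x = 0), m(x = 1)), assumed to be normalized so the entries sum to one; the function names are ours and the code illustrates Eqs. 18–24 rather than reproducing the author's implementation.

```python
from math import prod

# a message is a pair (m(x = 0), m(x = 1)), normalized to sum to one

def neg(msg):                                   # Eq. 18: negation swaps the two entries
    return (msg[1], msg[0])

def and_up(msgs):                               # Eq. 19: up message out of an AND node
    all_on = prod(m[1] for m in msgs)
    return (1 - all_on, all_on)

def and_down(other_msgs, y_msg):                # Eq. 20: down message to one AND input
    rest_on = prod(m[1] for m in other_msgs)
    return (y_msg[0], y_msg[0] + (y_msg[1] - y_msg[0]) * rest_on)

def or_up(msgs):                                # Eq. 23: up message out of an OR node
    all_off = prod(m[0] for m in msgs)
    return (all_off, 1 - all_off)

def or_down(other_msgs, y_msg):                 # Eq. 24: down message to one OR input
    rest_off = prod(m[0] for m in other_msgs)
    return (y_msg[1] + (y_msg[0] - y_msg[1]) * rest_off, y_msg[1])
```

Each function touches every incoming message exactly once, which is where the linear-time behaviour noted above comes from.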

In Fig. 4b the factor graph for the associative memory is shown. The down message for φ is:

[φ̲_m ; φ̄_m] = [1 − z_m (1 − p)^{R−1} ; z_m],  (25)

and only needs to be calculated once. We initialise θ_m = 1 and ψ_m = 1, and then alternate calculating the θ's and the ψ's until the updates have converged. Since this is loopy belief propagation there is no convergence guarantee, so we also quit if a maximum number of iterations has been exceeded. The final marginals are calculated by:

Pr(x_n | z, c) ∝ c_n · ∏_m θ_m↓n.  (26)

Here, we return a binary vector as this allows us to compare our method with other non-probabilistic methods. We do this by maximum likelihood thresholding:

x_n = 1[Pr(x_n = 1 | z, c) > 0.5].  (27)
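Combining the cue with the incoming θ messages and thresholding (Eqs. 26–27) might look as follows; theta_down is assumed to hold the pair-messages from every factor connected to bit n, and c_n is the cue probability of that bit being on.

```python
def recall_bit(c_n, theta_down):
    """Recall one bit x_n from its cue probability and the down messages
    of all factors connected to it (Eqs. 26-27)."""
    on, off = c_n, 1.0 - c_n
    for t0, t1 in theta_down:          # product over factors m of theta_{m down-to n}
        on, off = on * t1, off * t0
    p_on = on / (on + off)             # normalized marginal Pr(x_n = 1 | z, c)
    return 1 if p_on > 0.5 else 0      # Eq. 27: threshold at 0.5
```

For large numbers of factors the running products should be kept in the log domain to avoid underflow.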

We now provide a metric of the recall performance that can be used for both probabilistic associative memories and non-probabilistic ones.

3.3 Performance measure

To evaluate how well the memory is performing we use results from the binary symmetric channel to measure the information in the recall and define the performance as:

s(p) = N · (1 − H_2(p)),  (28)

where H_2 is the standard binary entropy function:

H_2(p) = −p log_2 p − (1 − p) log_2(1 − p).  (29)

It is also important to take into account the information in the cue. In general, networks that suffer catastrophic failure simply define capacity as the average number of patterns that can be stored before failure occurs. Other networks look at the number of patterns that can be stored before the error rate is worse than a threshold. However, this threshold is arbitrary and the performance also depends on the noise present in the cue. Most associative memory work avoids this dependence using noiseless cues. We take the view that an associative memory is only useful when it is able to improve an initial noisy cue. We measure the information provided by the network as the increase of information from the cue to the final recall:

I(p_c) = (Recalled information) − (Cue information)
       = s(p_x) − s(p_c)
       = N · (H_2(p_c) − H_2(p_x)).  (30)
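These measures translate directly into code; the following is a sketch of Eqs. 28–30.

```python
import math

def H2(p):
    # Eq. 29: binary entropy
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(N, p_c, p_x):
    # Eq. 30: recalled information minus cue information
    return N * (H2(p_c) - H2(p_x))

print(information_gain(N=100, p_c=0.1, p_x=0.01))   # bits gained in a single recall
```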

This approach is very similar to the one taken by Palm and Sommer (1996). In Fig. 5a we show the information returned from a single recall as more patterns are stored in the network. (On the right-hand side of Fig. 5a we show the associated probability of error.) In Fig. 5b, we show how much information in total can be recalled by the network. To correctly set the parameters a and b we used some theoretical insight into the performance of the memory.

3.4 Theoretical prediction of the bit error rate

The probability of error during recall in the neurally inspired associative memory is affected by several possible errors (see Fig. 6 for a diagram of all the conditions). From Eq. 3 we know the probability that the Bloom filter erroneously accepts a randomly chosen binary vector as part of the set. The uncertainty in the cue means that we expect to see the answer with approximately Np_c bits flipped, and there are approximately (N choose p_c N) such binary vectors. The probability of a false positive in the entire set of possible vectors is:

ξ_1: Pr(false positive) = 1 − (1 − e^{−Mp(1−p)^R})^{2^{N H_2(p_c)}},  (31)

where we used the approximation N H_2(p_c) ≈ log_2 (N choose p_c N).

There is also an error if another pattern falls within the expected radius:

ξ_2: Pr(other pattern) = 1 − (1 − (R − 1) 2^{−N})^{2^{N H_2(p_c)}}.  (32)

The final error condition (ξ_3) occurs in the region around the desired pattern. A false positive is more likely for binary vectors that differ only by a single bit from a known pattern. This probability of error was derived in Eq. 9. If the initial cue differs from the stored pattern in a bit that is also accepted as a false positive, then it is likely that the bit will be incorrect in the final recall.

We estimate the information transmitted by assuming that if any of the three failure conditions occurs, then the probability of bit error reverts back to the prior probability:

Pr(final bit error) = p_c · Pr(ξ_1 ∨ ξ_2 ∨ ξ_3).  (33)
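A sketch of this prediction is given below. Treating the three conditions as independent when combining them is our own assumption; the paper only specifies the union in Eq. 33.

```python
import math

def H2(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def predicted_bit_error(N, M, R, p, p_c, a):
    ball = 2.0 ** (N * H2(p_c))                     # number of vectors within the cue radius
    pr_fp = math.exp(-M * p * (1 - p) ** R)         # Bloom-filter failure rate, Eq. 3
    xi1 = 1 - (1 - pr_fp) ** ball                   # Eq. 31: a false positive inside the ball
    xi2 = 1 - (1 - (R - 1) * 2.0 ** -N) ** ball     # Eq. 32: another stored pattern inside it
    xi3 = pr_fp ** (a / N)                          # Eq. 9: false positive one bit from the pattern
    pr_any = 1 - (1 - xi1) * (1 - xi2) * (1 - xi3)  # assume the three conditions are independent
    return p_c * pr_any                             # Eq. 33
```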

Fig. 5 The performance of the associative memory as more patterns are stored. For these plots p_c = 0.1, N = 100 and M = 4,950. a The recalled information as more patterns are stored, plotted for different numbers of terms “AND”ed together in the boolean hash functions. The dashed line represents simply returning the input cue (p_c = 0.1). b The total amount of information recallable by the network is plotted as more patterns are stored

There are other sources of error; for example, there is no guarantee that performing inference on the marginals of each x_n is as informative as considering the exponentially many x vectors in their entirety. Moreover, the inference is only approximate as the factor graph is loopy. This simplification gives a poor prediction of the error rate, but still predicts accurately the transition from reliable to unreliable behaviour.

Another reason the error rate is poorly predicted is the assumption that any error condition will revert the probability of error to the prior probability; in truth we can encounter several errors and still have weak statistical evidence for the correct answer. As an example of this, false positives are far more likely to have a small number of h_m(x) return true, while a stored pattern will typically have many more h_m(x) return true. This difference provides weak statistical evidence, so even if there is a pattern and a false positive equidistant from the initial cue, the recall procedure will more heavily weight the correct pattern. These considerations are entirely absent from our theoretical analysis, but do not affect the reliable regime. At this transition the associative memory is providing the most information, so the approximation makes acceptable predictions for the best parameter settings. We call the optimal performance the capacity of the associative memory and examine it next.

Fig. 6 The uncertainty in the input cue determines a ball of possible binary vectors to recall. When N is large almost all of the mass of this high-dimensional ball will be at the edges. The Bloom filter is able to rule out almost all of the binary vectors. In this diagram we show all possibilities that could remain: the desired pattern x_r, another pattern x_r′, and any false positives the Bloom filter is unable to rule out. Loosely speaking, the returned estimate x̂ will ideally be the average of the correct recall with any false positives. Since the Bloom filter operates in a bit-wise fashion, if a binary vector is an element of the Bloom filter, then its neighbours are more likely to be false positives, so we should expect to find clumps of false positives around stored patterns

3.5 Capacity of the neurally inspired associative memory

In Fig. 7a we show that the capacity of the network scales linearly as M is increased. The rate at which the capacity scales is the efficiency of the network. In Fig. 7b we can see that our definition of the capacity depends on the reliability of the cue. At the optimal cue reliability the neurally inspired associative memory has an efficiency of 0.37 bits per bit of storage.

For comparison, the efficiency of a traditional Hopfield network (Hopfield 1982) is approximately 0.14 bits per unit of storage, which with smart decoding (Lengyel et al. 2005) can be enhanced to 0.17 bits per stored integer (see Sterne (2011) for more details). The comparison is not even fair, as Hopfield networks are able to store integers rather than a single bit, which makes our performance even more impressive. This performance is also better than the 0.19 reported in Schwenker et al. (1996) for iterative binary Hebbian learning. Palm and Sommer (1992) arrive at an information capacity for a clipped recurrent McCulloch–Pitts network of (1/2) ln 2 ≈ 0.35. However, they measure the information between the patterns stored and the network fixed points, so it is not entirely clear how the different information measures compare. They also find the capacity of an unclipped recurrent McCulloch–Pitts network to be 1/(4 ln 2) ≈ 0.36.

Fig. 7 The network performance as the redundancy is scaled and the level of noise in the cue is changed. a The capacity of the network is linear in the redundancy. The grey line gives a fit to the best capacity; the network efficiency is the slope of the line. b The information efficiency as a function of the cue noise. The grey lines give the equivalent performance of a Hopfield network under different recall strategies

4 Discussion

This paper presented a neurally inspired generalization of the Bloom filter in which each bit is calculated independently of all others. This independence allows computation to be carried out in parallel. The independence also makes the network robust in the face of hardware failure, allowing it to operate with an optimal probability of error for the degraded storage.

However, the neurally inspired Bloom filter requires extra storage to achieve the same probability of a false positive as a traditional Bloom filter. It also has a higher rate of errors for items that are similar to already-stored items. In this sense it is not a good Bloom filter, but it can be turned into a great associative memory. This modification allows a noisy cue to be cleaned up with a small probability of error. We also showed that the performance of the network can be measured as the improvement it is able to make in a noisy cue. Our neurally inspired associative memory has more than twice the efficiency of a Hopfield network using probabilistic decoding.

Massively parallel, fault-tolerant associative memory is a task that the brain solves effortlessly, yet one we still do not fully understand. While the work presented here uses simple computational elements that are crude and not directly comparable with biological neurons, we hope that it provides insight into how such memories might be created in the brain.

Acknowledgments I would like to thank David MacKay and members of the Inference group for very productive discussions of the ideas in this paper.

References

Aleksander I, Thomas W, Bowden P (1984) WISARD, a radical step forward in image recognition. Sensor Review 4(3):120–124

Bear M, Connors B, Paradiso M (2007) Neuroscience: exploring the brain. Lippincott Williams & Wilkins, Baltimore

Bishop C (2006) Pattern recognition and machine learning. Springer, New York

Bloom B (1970) Space/time trade-offs in hash coding with allowable errors. Commun ACM 13(7):422–426

Bogacz R, Brown M, Giraud-Carrier C (1999) High capacity neural networks for familiarity discrimination. In: Proceedings of ICANN'99. ICANN, Edinburgh, pp 773–778

Broder A, Mitzenmacher M (2004) Network applications of Bloom filters: a survey. Internet Math 1(4):485–509

Fan L, Cao P, Almeida J, Broder A (2000) Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans Netw (TON) 8(3):293

Graham B, Willshaw D (1995) Improving recall from an associative memory. Biol Cybern 72:337–346. doi:10.1007/BF00202789

Graham B, Willshaw D (1996) Information efficiency of the associative net at arbitrary coding rates. Artif Neural Netw ICANN 96:35–40

Graham B, Willshaw D (1997) Capacity and information efficiency of the associative net. Netw Comput Neural Syst 8(1):35–54

Hinton G, McClelland J, Rumelhart D (1984) Distributed representations. Computer Science Department, Carnegie Mellon University, Pittsburgh

Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci 79(8):2554

Kanerva P (1988) Sparse distributed memory. The MIT Press, Cambridge

Kirsch A, Mitzenmacher M (2006) Distance-sensitive Bloom filters. In: Proceedings of the eighth workshop on algorithm engineering and experiments and the third workshop on analytic algorithmics and combinatorics. ALENEX, Miami

Kirsch A, Mitzenmacher M (2008) Less hashing, same performance: building a better Bloom filter. Random Struct Algor 33(2):187–218

Knuth D (1973) The art of computer programming: sorting and searching, vol. 3. Addison Wesley, Reading

Kschischang F, Frey B, Loeliger H (2001) Factor graphs and the sum–product algorithm. IEEE Trans Inform Theory 47(2):498–519

Lengyel M, Kwag J, Paulsen O, Dayan P (2005) Matching storage and recall: hippocampal spike timing-dependent plasticity and phase response curves. Nature Neurosci 8(12):1677–1683

MacKay DJC (2003) Information theory, inference, and learning algorithms. Cambridge University Press, Cambridge

Mitzenmacher M, Upfal E (2005) Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press, Cambridge

Palm G, Sommer F (1992) Information capacity in recurrent McCulloch–Pitts networks with sparsely coded memory states. Netw Comput Neural Syst 3(2):177–186

Palm G, Sommer F (1996) Associative data storage and retrieval in neural networks. In: Domany E, van Hemmen J, Schulten K (eds) Models of neural networks III: association, generalization and representation. Springer, Berlin, pp 79–118

Plate T (2000) Randomly connected sigma-pi neurons can form associator networks. Netw Comput Neural Syst 11(4):321–332

Schwenker F, Sommer F, Palm G (1996) Iterative retrieval of sparsely coded associative memory patterns. Neural Netw 9(3):445–455

Sommer F, Dayan P (1998) Bayesian retrieval in associative memories with storage errors. IEEE Trans Neural Netw 9(4):705–713

Sterne P (2011) Distributed associative memory. PhD thesis, Cambridge University, Cambridge

Willshaw D, Buneman O, Longuet-Higgins H (1969) Non-holographic associative memory. Nature 222(5197):960–962
