
2502 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 54, NO. 11, NOVEMBER 2007

Cortical Models Onto CMOL and CMOS—Architectures and Performance/Price

Changjian Gao and Dan Hammerstrom, Senior Member, IEEE

Abstract—Here we introduce a highly simplified model of the neocortex based on spiking neurons, and then investigate various mappings of this model onto the CMOL CrossNet nanogrid nanoarchitecture. The performance/price is estimated for several architectural configurations, both with and without nanoscale circuits. In this analysis we explore the time multiplexing of computational hardware for a pulse-based variation of the model. Our analysis demonstrates that the mixed-signal CMOL implementation has the best performance/price for both the nonspiking and the spiking neural models. However, these circuits also have serious power density issues when interfacing the nanowire crossbars to analog CMOS circuits. Although the results presented here are based on biologically based computation, the use of pulse-based data representation for nanoscale circuits has much potential as a general architectural technique for a range of nanocircuit implementations.

Index Terms—Architecture performance and price, hierarchical distributed memory (HDM), multiplexing circuit design, nanoelectronics.

I. INTRODUCTION

THERE ARE a number of challenges facing the semiconductor industry, and, in fact, computer engineering as a whole. For metal-oxide-semiconductor field-effect transistors (MOSFETs), the gate voltage threshold sensitivity over gate length grows exponentially, especially for gate lengths below 10 nm [1]–[3]. The precision required of manufacturing lithography to overcome this exponentially growing parameter sensitivity is currently beyond the industry's projections [4].

Other challenges include parameter variation, design complexity, and severe power density constraints. Nanoelectronic circuits have been touted as the next step for Moore's law, yet these circuits aggravate most existing problems and create a few of their own, such as a radical increase in the levels of faults and defects. Borkar [5] has indicated that there is currently no candidate emerging nanoelectronic technology that can replace CMOS in the next ten to fifteen years. Chau et al. [6] proposed four metrics for benchmarking nanoelectronics and showed a promising future for the field, although further gains in performance and scalability remain to be demonstrated.

In recent years, nanoelectronics has made tremendous progress, with advances in novel nanodevices [7], nanocircuits [8], [9], nanocrossbar arrays [10]–[12], manufacture by nanoimprint lithography [13], [14], CMOS/nano co-design

Manuscript received December 27, 2006; revised May 13, 2007. This work was supported by the National Science Foundation under Grants ECS-0408170 and CCF-0508533. This paper was recommended by Guest Editor C. Lau.

The authors are with the Department of Electrical and Computer Engineering, Portland State University, Portland, OR 97207-0751 USA (e-mail: cgao@cecs.pdx.edu; [email protected]).

Digital Object Identifier 10.1109/TCSI.2007.907830

architectures [2], [15]–[17], and applications [18]–[20]. Although a two-terminal nanowire crossbar array does not have the functionality of FET-based circuits, it has the potential for incredible density and low fabrication costs [2]. In addition, unlike spintronics and other proposed nanoelectronic devices that use quantum mechanical state to compute [21], crossbar arrays use a charge accumulation model that is more compatible with existing CMOS circuitry.

Rückert et al. [22]–[24] have demonstrated digital and mixed-signal circuit designs for nonspiking and spiking neural associative memories, but they did not fully explore time-multiplexing in their physical designs. Also, there is no universal benchmark for evaluating different hardware designs with different neural computational models. We believe that the unique combination of hybrid CMOS/nanogrids and biologically inspired models has the potential for creating exciting new computational capabilities. In our research we are taking the first few tentative steps in architecting such structures. Consequently, the goal of the research described here is to investigate the possible architecture and performance/price options in implementing cortical models taken from computational neuroscience with molecular grid-based nanoelectronics [2].

We first introduce the computational models in Section II, and the CMOL concepts and their price and performance measures in Section III. In Section IV, we explain the details of the architectures and implementation methods for the nonspiking and spiking cortical models. We present an analytical method to estimate the power, speed, and silicon area costs of the different designs in Section V. Finally, we discuss the results in Section VI and conclude in Section VII.

II. COMPUTATIONAL MODELS

The ultimate cognitive processor is the cerebral cortex, and consequently it is the focus of significant research. Mammalian neocortex is remarkably uniform, not only across the different parts of the mammalian brain, but across almost all mammalian species. Many believe that cortex represents knowledge in a sparse, distributed, hierarchical manner, and performs Bayesian inference over this knowledge base, which it does with considerable efficiency.

The fundamental unit of computation appears to be the cortical minicolumn [25], a vertically organized group of about 80–100 neurons that traverses the thickness of the gray matter (about 3 mm) and is about 50 µm in diameter. The cortex also has a distinct layer organization. Neurons in a minicolumn tend to communicate vertically with other neurons on different layers in the same minicolumn.

Mountcastle [25] proposed that minicolumns in turn are grouped into larger units variously referred to as columns, macrocolumns, or hypercolumns. The existence of this larger-level column is controversial in the neuroscience community. Braitenberg and Schüz [26] have shown that there are geographically close groups of neurons that are tightly connected with each other and then are sparsely, and more randomly, connected to other groups. For convenience we loosely use the term “column” for these tightly connected groups, but do not necessarily imply a true column in the Mountcastle sense.

In the early days of neural networks, simple associative memory models were considered a first step toward modeling cortex. It is now clear that the early models fell far short [27], but they are still useful as models for smaller cortical “modules,” such as the cortical column. A number of advanced models [28]–[30] have been developed that create cortex-like structures by loosely connecting such modules into larger arrays. These modules (columns in our terminology) can be modeled effectively as associative networks. Since the majority of connections and computation are within a column, we begin our analysis there; the hardware implementation of a single column is the focus of this paper.

The next step would then be to connect the cortical columns together into a large array, which we call a hierarchical distributed memory (HDM). In many of these models, the columns are configured into a two-dimensional grid. Connectivity is typically nearest neighbor, with a few random, longer-range point-to-point connections. The entire structure creates a higher-order, scalable, large-capacity associative memory. Analysis of such large, sparsely connected structures is more complex and is not addressed here, but there are several successful approaches, including the work of Lansner [31], Fulvi-Mari [32], Granger [33], Hecht-Nielsen [28], and Anderson [29], as well as the related work of George and Hawkins [34].

A. Traditional Nonspiking Associative Memory Model

The column associative memory model that we have used is based on that of Palm [35] and Willshaw [36]. When an input is supplied to such a memory, it selects the “trained” vector with the closest match to the given input under some metric, and that closest-matching vector is the output. In an auto-associative model, the set of mappings from input to output vectors stored in the associative network is given by {x^μ → y^μ}, μ = 1, …, M. There are M mappings, and both x^μ and y^μ are sparsely encoded, with Σ_i x_i^μ = a ≪ n and Σ_i y_i^μ = k ≪ n, where a and k are the numbers of active (i.e., nonzero) nodes in the input and output vectors, respectively.

For the analysis presented here, we do not include circuitry for dynamic learning, which will be required for real-world systems and which will be addressed in future papers. For the current associative column model, the synapse strengths or weights are set by a simple, “clipped” Hebbian learning rule. A binary weight matrix is formed by w_ij = min(1, Σ_μ y_i^μ x_j^μ), or a multivalue weight matrix is formed by w_ij = Σ_μ y_i^μ x_j^μ. During recall, a noisy or incomplete input vector x̃ is applied to the network, and the network output is computed by ŷ_i = Θ(Σ_j w_ij x̃_j − θ), where θ is a global threshold and Θ(·) is the Heaviside step function: an output node will be 1 (active) if its dendritic sum Σ_j w_ij x̃_j (an inner-product operation) is greater than the threshold θ; otherwise it is 0. To set the threshold, the “k-winners-take-all (k-WTA) rule” is used, where k is the number of active nodes in an output vector. The threshold is set so that only the k nodes with the maximum dendritic sums are set to “1,” and the remaining nodes are set to “0.” The k-WTA rule leads to a “sparse” distributed representation. It is possible to derive an incremental learning version of this network, such as the one developed by Lansner et al. [37].
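The following sketch (ours, not the authors' code) illustrates the clipped Hebbian storage and k-WTA recall just described; the network size, pattern count, and function names are illustrative assumptions.

```python
import numpy as np

def train_clipped(patterns):
    """Clipped Hebbian rule: w_ij = min(1, sum_mu y_i^mu * x_j^mu)."""
    n = patterns.shape[1]
    W = np.zeros((n, n), dtype=np.uint8)
    for x in patterns:               # auto-association: y^mu = x^mu
        W |= np.outer(x, x)          # binary ("clipped") weight update
    return W

def recall_kwta(W, x_noisy, k):
    """Dendritic sums followed by the k-winners-take-all rule."""
    sums = W @ x_noisy.astype(np.int32)      # inner-product operation
    theta = np.sort(sums)[-k]                # global threshold from k-WTA
    return (sums >= theta).astype(np.uint8)  # ties may admit >k winners

rng = np.random.default_rng(0)
n, k, M = 256, 8, 20                 # network size, active nodes, stored patterns
patterns = np.zeros((M, n), dtype=np.uint8)
for mu in range(M):
    patterns[mu, rng.choice(n, k, replace=False)] = 1

W = train_clipped(patterns)
probe = patterns[0].copy()
probe[np.flatnonzero(probe)[:2]] = 0  # noisy cue: delete two active bits
print("overlap with stored vector:", int(recall_kwta(W, probe, k) @ patterns[0]))
```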

B. Spiking Model

Our preliminary analysis [38] showed significant power density problems in a mixed-signal CMOL implementation of a nonspiking auto-associative module. In addition, it is becoming increasingly clear that cortical-like models leverage the time domain as a fundamental organizing principle [33], [34]. Consequently, we have moved to more complex spiking models that operate in the time domain. An additional benefit is that these models also have a limited duty cycle, which leads to a reduction in estimated power consumption.

Spiking or pulse-based models lead to an important principle: computation proceeds by incremental changes, in response to spikes, to a baseline state, where the incremental data are represented by the inter-pulse timing. Traditional signal processing and neural models generally consist of sums of products. With pulse-based models, the entire sum need not be computed at any one time; rather, only sparse incremental updates are processed. In our approach, then, the somatic membrane potential (MP) of the neuron is updated by the sparse arrival of spikes. This characteristic leads to significantly increased implementation efficiency through resource multiplexing, as we will show.

Consequently, for the analysis performed here, we expand our associative memory to use neurons based on spiking neuron models. Suri [39] proved that all information in the spiking neuron model is determined by the time of the spike's occurrence, not by its shape. Hence, this gives us the freedom to choose the spiking neuron models that favor our hardware implementations. For the all-digital implementations studied here, we use the Gerstner spiking neuron model [40], which satisfies our criteria: it represents the time-domain, spiking, limited-duty-cycle behavior; is fairly simple; has good mathematical descriptions; and is widely used in the computational neuroscience community. In this model the somatic membrane potential (MP) u_i(t) of neuron i at time t is given by

  u_i(t) = Σ_j Σ_f w_ij ε_ij(t − t_j^(f)) + η_i(t − t_i^(f))   (1)

where w_ij is the efficacy of the connection from neuron j to neuron i; ε_ij is the postsynaptic potential (PSP) of neuron j contributing to neuron i; and η_i is the refractory function, which, in our model, is a negative contribution that reduces the likelihood of additional output for some period of time as soon as the MP reaches the threshold value θ. The threshold value can be static or dynamic.

The PSP function is

  ε_ij(s) = [exp(−(s − Δ_ax)/τ_m) − exp(−(s − Δ_ax)/τ_s)] Θ(s − Δ_ax)   (2)

where τ_m and τ_s are time constants; Θ is the Heaviside function; and Δ_ax is the axonal transmission delay.
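As a concreteness check, here is a small evaluation of (1)–(2) as reconstructed above; the difference-of-exponentials kernel form and all constants are our assumptions, not values from the paper.

```python
import numpy as np

def psp(s, tau_m=10.0, tau_s=2.0, d_ax=1.0):
    """PSP kernel of (2): difference of exponentials gated by the Heaviside step."""
    t = s - d_ax
    return np.where(t > 0, np.exp(-t / tau_m) - np.exp(-t / tau_s), 0.0)

def refractory(s, eta0=-5.0, tau_r=4.0):
    """Negative refractory contribution after the neuron's own last spike."""
    return eta0 * np.exp(-s / tau_r) if s > 0 else 0.0

def membrane_potential(t, spike_times, weights, last_own_spike):
    """u_i(t) of (1): weighted PSPs from presynaptic spikes plus refractory term."""
    u = refractory(t - last_own_spike)
    for w_ij, times_j in zip(weights, spike_times):
        u += w_ij * psp(t - np.asarray(times_j)).sum()
    return float(u)

# Two presynaptic neurons; neuron i itself last fired at t = 0.
print(membrane_potential(t=12.0, spike_times=[[3.0, 9.0], [7.0]],
                         weights=[0.8, 0.5], last_own_spike=0.0))
```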

A related spiking model is the leaky integrate-and-fire (I&F) neuron, which can be represented by a first-order linear differential equation, τ du/dt = −u(t) + R i(t), where τ = RC is the time constant of the current leaky integrator, with the neuron's equivalent resistance R and capacitance C. As soon as the MP reaches the threshold θ, the MP decays to zero and is held at zero for a refractory time constant.

We use the I&F model in the mixed-signal CMOL design, because the I&F model is easier to implement with analog circuits and also satisfies our criteria for spiking neuron models. The Gerstner spiking neuron model is used as the base model for the digital implementations.

A number of learning schemes exist for the spiking neuron model, such as competitive Hebbian learning through spike-timing-dependent synaptic plasticity (STDP) [41]; however, this paper does not address learning.

III. CMOL AND ITS PERFORMANCE/PRICE MODELING

For the nanogrid model used in this analysis, we use CMOL, a hybrid CMOS/molecular architecture developed by Likharev et al. [2]. Although nanoelectronics allows much denser circuits, it has a number of limitations, perhaps the biggest being that it is a faulty computation platform. In CMOL circuits, both static (permanent) defects and transient faults are possible in the nanodevices, the nanowires, and the CMOS-to-nanowire contacts. Strukov and Likharev [20] have demonstrated two methods of fault tolerance for CMOL memory. For associative algorithms, Rückert et al. [42] showed that stuck-at-0 connection errors have a greater impact on network performance than stuck-at-1 connection errors. Sommer et al. [43] used iterative retrieval by probabilistic inference to improve the network's information capacity in the presence of weight matrix errors. The fundamental fault tolerance of our target algorithms, coupled with Strukov and Likharev's results [20], leads us to believe that the extra overhead for effecting fault tolerance will be minimal (5%–10%), and so it is not factored into this analysis.

Likharev et al. [2] developed the concept of CMOL (CMOS/nanowire/MOLecular hybrid) as a likely implementation technology for charge-based nanoelectronic devices. Examples include the neuromorphic CrossNet, a field-programmable gate array (FPGA), and memory [18]–[20]. The nanodevice in CMOL is a binary “latching switch” based on molecules with two metastable internal states. Fig. 1 shows the schematic I–V curve of this two-terminal nanodevice. Qualitatively, if the drain-to-source voltage is kept low during programming, the nanodevice stays in the “off” state with a high resistance R_off; if the applied voltage is greater than the threshold voltage V_t, the nanodevice switches to the “on” state with a lower resistance R_on.

In this analysis we develop the performance/price of various CMOL configurations when emulating an auto-associative cortical column model. The components that affect the performance of the circuit include the nanodevice itself, the nanowire, and the pin-to-nanowire contact (pins interface CMOS and nanowires; see [2, Fig. 3(a)]), as shown in Fig. 2. In CMOL, we assume that each latching switch is implemented

Fig. 1. Schematic I–V curve of a two-terminal nanodevice (adapted from [2]).

Fig. 2. Current (the arrowed line) flows from the input pin via an input nanowire through the nanodevice and output nanowire to the output pin.

as a parallel connection of single-electron devices. The molecule capacitance is typically negligible in comparison with the capacitance between the wires; what changes is the device resistance. Theoretically the on-resistance grows as the nanowire half-pitch shrinks; however, it is highly dependent on manufacturing precision. For nanowire capacitance and resistance, refer to [19, Fig. 13 and (5)]. Size issues also need to be considered because of the very high resistance of the nanowire. We assume the pin-to-nanowire contact is ohmic, with a contact resistance set by the contact resistivity of the doped interface.

Fig. 2 shows a signal current flowing through a nanowire crossbar. With values for the resistance and capacitance of the basic components in CMOL, following the classic Elmore delay model [44], we estimate the time delay from the input pin to the output pin through the nanowires and nanodevices as

  τ_delay ≈ (2R_pin + 2R_nw + R_on) C_nw   (3)

where R_pin is the pin-to-nanowire contact resistance; R_nw is the nanowire resistance; C_nw is the nanowire capacitance; and R_on is the on-resistance of the nanodevice.
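A back-of-the-envelope evaluation of the reconstructed delay estimate (3); the component values below are placeholders, not figures from the paper.

```python
def cmol_delay(r_pin, r_nw, c_nw, r_on):
    """Elmore RC sum along the Fig. 2 path: input wire, nanodevice, output wire."""
    tau_in = (r_pin + 0.5 * r_nw) * c_nw                 # distributed input nanowire
    tau_out = (r_pin + r_nw + r_on + 0.5 * r_nw) * c_nw  # output wire sees the whole path
    return tau_in + tau_out      # equals (2*r_pin + 2*r_nw + r_on) * c_nw

# Hypothetical values: 1 Mohm contact, 100 kohm wire, 1 fF wire, 10 Mohm device.
print(f"pin-to-pin delay: {cmol_delay(1e6, 1e5, 1e-15, 1e7):.3e} s")
```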

For CMOL crossbar arrays, the static power consumption includes both the working power and the leakage power. The working “on” power is due to the “on” nanodevices, and is given by

  P_on = p q N_h N_v V_dd² / R_on   (4)

where p is the average probability that the driving voltage to the input nanowire is high (the voltage on the nanodevice is over V_t); q is the probability that the nanodevices are “on”; and N_h and N_v are the horizontal and vertical nanowire counts, respectively.

Due to the current leakage through the “off” nanodevices, the leakage power is given by

  P_off = p (1 − q) N_h N_v V_dd² / R_off   (5)

If we know the average current I_avg for each output nanowire, or for each bundle of output nanowires [Fig. 9(b)], the average power that the CMOL CrossNet dissipates is given by

  P_avg = N_out V_dd I_avg   (6)

where N_out is the number of output nanowires or the number of bundles of output nanowires, depending on the application [Fig. 9(b)]. The dynamic power due to the dynamic charging of the nanowires is

  P_dyn = α (N_h + N_v) C_nw V_dd² / T_cycle   (7)

where α is the average probability that the nanowires are charged during the cycle time T_cycle. The area of a CMOL crossbar array is

  A_CMOL = 4 F_nano² N_h N_v   (8)

where F_nano is the nanowire half-pitch.
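The power and area relations (4)–(8) as reconstructed above can be bundled into a small budget calculator; every numeric value below is an illustrative assumption.

```python
def crossbar_budget(Nh, Nv, Vdd, R_on, R_off, C_nw,
                    p=0.5, q=0.5, alpha=0.5, T_cycle=1e-9, F_nano=4.5e-9):
    P_on = p * q * Nh * Nv * Vdd**2 / R_on               # (4) working "on" power
    P_off = p * (1 - q) * Nh * Nv * Vdd**2 / R_off       # (5) leakage power
    P_dyn = alpha * (Nh + Nv) * C_nw * Vdd**2 / T_cycle  # (7) nanowire charging
    area = 4 * F_nano**2 * Nh * Nv                       # (8) crossbar footprint
    return {"P_on": P_on, "P_off": P_off, "P_dyn": P_dyn, "area_m2": area}

for name, val in crossbar_budget(Nh=1024, Nv=1024, Vdd=0.3,
                                 R_on=1e7, R_off=1e10, C_nw=1e-15).items():
    print(f"{name}: {val:.3e}")
```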

IV. SYSTEM ARCHITECTURES AND IMPLEMENTATIONS

There are a variety of ways in which a CMOL-based hardware platform can be used to implement an auto-associative column. A full-custom design based on traditional CMOS, though with the same hypothetical 22-nm process, is used as the baseline for the four comparisons. We chose this feature size because many believe that, due to lithographic limitations, there will not be much additional scaling beyond it [4].

Before presenting the latest analysis, we first discuss the key principle of our architecture, virtualization. We then briefly summarize our previous analysis of the implementation of nonspiking models before presenting the current spiking model analysis.

A. Virtualization

In this context we define virtualization to be the degree of time-multiplexing of neural computations onto hardware resources. Neural algorithms, like many other kinds of signal processing algorithms, have naturally massive parallelism, which allows a wide range of possible parallel implementations.

One way to conceptualize the implementation options for those algorithms is to imagine a “virtualization” spectrum. At one end of the spectrum we have a single processor that emulates all components, computation, and communication

Fig. 3. Hardware spectrum for artificial neural networks. Finer-grained processing (less virtualization) means more structural parallelism, but less efficiency and flexibility.

of the model in a mostly sequential fashion [45]. At the other end of the spectrum, we literally implement all features of the algorithm in silicon. This can be thought of as having a processor for each parallel component, which at the finest grain is the individual synapse. Obviously, minimizing virtualization increases performance. However, it can also introduce significant inefficiency, and minimal virtualization tends to involve more hardwiring and inflexibility. Fig. 3 shows an approximate hardware virtualization spectrum.

In this paper, we use the term processing node (PN) somewhat loosely. In general it is a simple digital processor that may do some simple arithmetic and has a simple control structure. It implements anywhere from some part of the neuron algorithm to all of it. Generally a PN is a digital processor, though some mixed-signal computation is often used. For our analysis, the minimum level of virtualization is assumed to be a PN for each neuron within a column processor, and the maximum level of virtualization is assumed to be one PN that emulates all the neurons assigned to a column. Lesser and greater levels of virtualization are possible, but for the cortical model and the model parameters we are using, we have found that these are not cost-effective.

Fig. 4 shows a range of degrees of time-multiplexing of neural computations onto PNs, from the coarsest-grained PN (multiplexing all computations) to the finest-grained PN (no multiplexing, the most parallel architecture). Each column processor can have a single PN or multiple PNs to emulate a single column. Many column processors, in turn, emulate a much more complex cortical function. This hierarchical architecture is like the “network of networks” neural network model of Anderson and Sutton [46].

A significant amount of neural hardware research involves implementing most, if not all, of an algorithm directly in silicon, and over the years many groups have done so [47]–[50]. What we see more often is the multiplexing of communication resources with the address event representation (AER) [51], though with no multiplexing of computational structures. In our analysis, we assume that computation can be multiplexed as well, leading to a broader definition of virtualization.

Another implementation option concerns the representation of the data. Our spike models use timing to represent data.


Fig. 4. A PN time multiplexes the neural computations. The lower right neuron illustrates the computations around the post-synapses and in the soma. The finest-grained PN computes a single PSP and does not multiplex other PSPs. A coarser-grained PN time multiplexes computations from multiple neurons. The coarsest PN time multiplexes all the computations required by the network.

But during actual computation we have other options besides spike timing, including digital and analog data representations, which can use voltage and/or current encodings. Analog circuits can be multiplexed as well, although this is trickier. Consequently, signal representation is actually somewhat orthogonal to virtualization.

The traditional view of neural emulation was that a small number of transistors was dedicated to an analog, nonmultiplexed implementation of each synapse. However, the sparse communication and sparse activation of our models appear to compromise the effectiveness of such an approach. That is, with sparse activation, dedicated, nonmultiplexed compute hardware, whether analog or digital, does not appear to be the most efficient use of silicon area.

Although learning is not addressed here, multiplexed computational hardware looks to be an even more efficient way to utilize silicon real estate when dynamic, incremental learning is added to the model.

B. Nonspiking Model Analysis

Although the focus of this analysis is the spiking model, we present here some of the hardware issues involved in the nonspiking model implementation, which is then used in the spiking model analysis. In the final results, we present both spiking and nonspiking performance/price numbers.

For the nonspiking model analysis, we assumed four basic configurations: all-digital CMOS, mixed-signal CMOS, all-digital CMOL, and mixed-signal CMOL. The primary computations in the column processor are the input vector/weight matrix inner product and the k-WTA. Fig. 5 shows the four basic designs.

Nonspiking Digital CMOS Design [see Fig. 5(a)]: The weight matrix is stored in CMOS memory (MEM), which could be realized with SRAM or embedded DRAM (eDRAM [52]). The inner-product and k-WTA computations are performed by arithmetic logic in the digital CMOS platform. Because of the sparse activation of the input vectors (on the order of k active nodes out of n), we only retrieve the weight columns whose column indices correspond to those of the active nodes, and sum them. This column-wise inner product, which is borrowed from sparse matrix computation techniques, saves time and power over the traditional row-wise inner product (roughly k·n additions compared to n² additions [53]).
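The saving of the column-wise inner product is easy to verify numerically; the sizes below are shrunk from the paper's 16,384-neuron column for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 4096, 64
W = (rng.random((n, n)) < 0.1).astype(np.uint8)  # 10% connectivity
active = rng.choice(n, k, replace=False)          # indices of the k active inputs

# Column-wise: sum only the k columns of the active nodes (~k*n additions).
sums_colwise = W[:, active].sum(axis=1)

# Row-wise reference: dense product against the full {0,1} vector (~n^2 additions).
x = np.zeros(n, dtype=np.int64)
x[active] = 1
assert np.array_equal(sums_colwise, W @ x)
```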

Fig. 5. Functional partitioning of the four configurations. (a) Digital CMOS design. (b) Mixed-signal CMOS design. (c) Digital CMOL design. (d) Mixed-signal CMOL design. The different computation tasks are partitioned onto different hardware.

Nonspiking Mixed-Signal CMOS Design [see Fig. 5(b)]: In this option, because the inner-product operation does not scale with the network size (i.e., the number of neurons), the weight matrix is still stored in CMOS memory and the inner product is computed digitally. We could also implement the inner product in mixed-signal circuits, using capacitors (requiring regular refresh) or floating-gate transistors to store nonbinary weights. This idea has appeared in a number of neural-network chips over the years; one of the best known was the Intel ETANN [54]. However, the floating-gate transistor implementation of the network connections with analog inner-product operations is not cost-effective compared to a more virtualized approach, due to the low duty cycle of the sparse activation. The time-multiplexed digital inner-product unit realizes the circuit with O(n) complexity, while the analog inner product approaches O(n²) complexity with finer-grained PNs (fewer neurons multiplexed per PN).

With the help of the time-multiplexed digital inner-product circuits, we can use an analog k-WTA of the same O(n) complexity. The k-WTA analog circuits use analog currents to generate the highest voltages according to the largest currents [55]. The column processor then converts those highest voltages to the addresses of the output neurons. Fig. 6 shows a simple k-WTA analog circuit with O(n) complexity, where the k largest injection currents drive their outputs high, and the others low.

However, since the k-WTA is implemented in analog CMOS, we need parallel digital-to-analog (D/A) converters to convert the digital inner-product results into the analog inputs of the k-WTA circuit [55], [56].

It is not clear from biology whether a column simulation need only be winner-take-all (WTA) or whether k-WTA (k > 1) is required. Obviously the single-winner WTA is simpler, but it also reduces capacity. We used the more complex “soft-max” k-WTA for the analysis since it is more generic and requires more hardware, making it a more conservative comparison.


Fig. 6. Schematic view of the k-WTA circuit (adapted from [55]).

Fig. 7. Structural view of the mixed-signal CMOL design. The denser crossbar arrays in the center are CMOL nanogrids (nanowire crossbar arrays). Beneath the CMOL nanogrids are the CMOS driving circuits and programming circuits for the nanodevices. The larger square blocks are analog CMOS circuits for each output neuron.

Nonspiking Digital CMOL Design [see Fig. 5(c)]: Here, CMOL is used only as a very dense (and somewhat slow) memory to replace the CMOS weight memory of the all-digital CMOS design. The inner-product and k-WTA computations are still in digital CMOS and use the same circuits as in the digital CMOS design.

Nonspiking Mixed-Signal CMOL Design [see Fig. 5(d)]: In this configuration, we borrow the idea of the CMOL CrossNet to represent the network connections (i.e., the weight matrix). The application here is a variation of the neuromorphic CMOL CrossNet [18], with somewhat different CMOS cells and network topology. Because the CMOL nanowires represent the network connections, we refer to this configuration as CMOL nanogrids. With the active nodes in the CMOS driving the nanowires, the output nanowires connect to the inputs of the analog k-WTA circuits, i.e., replacing the “Load” in Fig. 10(b) with “Ik” in Fig. 6 directly or via a current mirror.

Fig. 7 shows the structure of the mixed-signal CMOL design. In this figure, the CMOL nanogrids sit in the center of the layout. The nanogrids are fabricated on top of the CMOS

Fig. 8. (a) Single-bit CMOL nanogrids and pin connection diagram, showing the driving pins from CMOS to nanowires and the pins connecting output nanowires to the analog CMOS neuron circuits. (b) Multibit CMOL nanogrids and pin connection diagram. Each driving signal and each output signal connects three nanowires in this diagram. The dark circles represent the pins connecting CMOS signals and horizontal nanowires. The hollow circles represent the pins for the vertical nanowires.

Fig. 9. (a) Single-bit CMOL CrossNet schematic diagram. (b) Multibit CMOL CrossNet schematic diagram. Here, for example, each input signal and each output signal connects a bundle of three nanowires, which can satisfy a 3-bit precision requirement.

circuits, which are used for driving, programming, and reading the outputs of the nanodevices. The nanowires connect to the CMOS using the CMOL self-aligning architecture. Each input block of the analog k-WTA circuit represents a competing neuron. Because the analog circuits are assumed to scale only to 250 nm, instead of to 22 nm, the area for each neuron is about 12.5 µm² (a conservative estimate for the circuit in Fig. 6), which is much larger than the nanowire cells. The advantage of using CMOL is that the CMOS circuits need only the pins connecting to the nanowires within that area, which imposes a constraint involving the CMOS half-pitch F_CMOS.

Fig. 8 shows a schematic diagram of the CMOL nanogrids of Fig. 7; only the layout of pins and nanowires is displayed. The dark circles represent pins connecting horizontal nanowires, which are the inputs to the nanogrid, to the top level of metal of the underlying CMOS. The hollow circles represent pins connecting vertical nanowires, which are the outputs from the nanogrid. In Figs. 8 and 9, the horizontal wires represent input nanowires and the vertical wires represent output nanowires.

Fig. 8 does not show the nanogrid molecular connections. Fig. 9 is a schematic that includes these inter-grid devices. The small black dots at the cross-points of the nanowires are “on”


Fig. 10. (a) Programming nanodevices with multibits. (b) Operation of CMOLnanogrids with multibits. (Adapted from [18].)

nanodevices. The “off” nanodevices are not shown in the diagram. The positions of the “on” nanodevices are used to illustrate the current flow.

During operation, for single-bit-weight computation, the active input nodes pull their nanowires to the active input voltage (“high”); all output neurons pull their nanowires to voltage “low.” If there is a connection between an input neuron and an output neuron (i.e., the synapse value is “1,” meaning that the nanodevice is in the “on” state), an “on” current will flow through the connection from the input neuron to the output neuron. The currents from different input neurons sum together to form a single output. As illustrated in Fig. 9(a), the output nanowire sums three units of current.

Although auto-associative models work quite well with binary weights, we would like a few bits of precision, as this appears to increase the dynamic learning capacity of the network. Because the nanodevices at the wire cross-points can take only two states, we need multiple nanodevices to represent a b-bit weight. For example, if the weight has three bits, we need at least eight nanodevices to represent all values. This is illustrated in Fig. 9(b), where each input neuron and each output neuron connects to three nanowires, so each input/output pair has nine nanodevices connecting their nanowires. These nanodevices can then be programmed to represent the different values.
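One plausible encoding, sketched below, uses the count of “on” devices in the bundle junction as a thermometer code for the weight; the mapping is our illustration, not necessarily the paper's.

```python
def program_bundle(weight, bits=3, bundle=3):
    """Map a b-bit weight (0 .. 2^b - 1) to on/off states of a bundle x bundle junction."""
    assert 0 <= weight < 2**bits and bundle * bundle >= 2**bits - 1
    flat = [1] * weight + [0] * (bundle * bundle - weight)  # 'weight' devices "on"
    return [flat[r * bundle:(r + 1) * bundle] for r in range(bundle)]

# Output current through the junction is proportional to the number of "on" devices.
for w in (0, 5, 7):
    print(w, program_bundle(w))
```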

As mentioned by Türel et al. [18], Fig. 10 shows one way to program multibit CMOL nanogrids. During programming of the nanodevices, voltage differences are applied through the metallic resistors connecting to the horizontal nanowires and vertical nanowires, respectively. As shown in the picture, these voltages define a sloped “boundary” across the array, located at the points where the voltage across a device equals the threshold voltage V_t. However, in order to be able to program each of the nanodevices individually, the boundary should avoid crossing two or more nanodevices simultaneously. Thus, we have the constraint that the boundary's crossing coordinates are not both integers at the same time.

A big advantage of the CMOL nanogrids is that they do not require the line encoding and decoding circuits of a memory. They not only provide memories for the synapses, but also implement the inner-product computations naturally. Furthermore, the CMOL nanogrids convert the digital data (voltages) to analog data (currents). This saves the space of the D/A converters required in the mixed-signal CMOS design, and is why we only need to perform one computation (i.e., the k-WTA) inside CMOS.

C. Spiking Model Hardware

When emulating the spiking HDM models, the hardware is assumed to operate in real time. Usually, an analog-circuit system has a dedicated circuit for each computation. The real-time requirement sets constraints on each analog circuit. This in turn determines the signal processing rate for the analog circuits, and the power consumption in terms of response time or spiking rate. For digital circuits, computational resources are generally multiplexed. Therefore, there can be jitter noise, which needs to be minimized. One potential disadvantage of multiplexing computational hardware is that the more sharing there is, the more unpredictable the processing time, and the more jitter noise is added to the signals. In digital systems, it is possible to keep a virtual system clock, which is updated as needed and eliminates jitter noise. However, it adds significant complexity to the system and is not assumed here.

For the spiking model analysis, we have the same basic configurations we saw in the nonspiking case. For each design, because the computations and operations of the nonspiking and spiking HDM models differ, the spiking HDM implementations differ considerably in architecture, complexity, and underlying circuit components, although they share some circuit components with the nonspiking HDM implementations. Furthermore, because of the spiking nature of the spiking HDM, we studied how to leverage virtualization in the digital designs with CMOS and CMOL technologies. For the nonspiking digital implementations, by contrast, we used a constant parallelism (64 neurons per PN) without considering implementation efficiency issues, since the efficiency does not change appreciably with the level of virtualization. For the mixed-signal CMOL implementations, the CMOL nanogrids play the same role, performing the inner-product operations for both the nonspiking and spiking HDM models; the difference is in the CMOS cells, where the k-WTA and the I&F neuron circuits are implemented.

Spiking Digital CMOS Design: In the all-digital, all-CMOS design, we use a PN to emulate some part of the network. The virtualization (degree of multiplexing) chosen depends on the specific dynamic characteristics of the model being emulated. The column processor, as shown in Fig. 11, consists of a single PN or multiple PNs that perform the calculations, and a memory that stores the weight values. The column consists of some number of neurons, typically several thousand, which are fairly tightly connected with each other.

When implementing such a computation in a set of processors, the sparse activation of input spikes motivates the use of a sender-oriented method to improve computational efficiency [57]. That is, the PN reads the sparse presynaptic events from the input neurons (the senders), computes the weighted PSPs for the connected output neurons according to the connection list and stored weights, and updates their somatic MPs.

Fig. 11 shows the block diagram of each PN, with the weight memory, in the column processor system. Each PN time multiplexes one or more neurons' computations. For example, if a PN multiplexes four neurons in a 32-neuron network, the total system needs eight PNs running in parallel; we call this a mux-4 PN system.

There are eight major operations that are performed by a spikePN:


Fig. 11. Spike-timing-dependent computation structure. Each column processor system has one weight memory for all PNs, or several weight memories distributed among the PNs.

1) Read SE: The column processor system has a dispenser to distribute the presynaptic events, from the intracolumn spike events or the AER-based inter-column communication channel, to each PN, and to put those events' indices and a countdown time into the PreSynaptic Events Memory (PSEM), shown in Fig. 11. The PN reads the presynaptic events from the PSEM and captures each event's time. This time is used to fetch the PSP from the PSP-LUT (look-up table). When the time record reaches zero, the event no longer affects the computation, and the PN invalidates its record. The PSEM could be implemented with an SRAM; it holds the records of synaptic events, with a record width set by the event index and the countdown time.

The PSP-LUT stores the PSPs in terms of elapsed time. We could instead calculate the PSP value according to (2); however, such a computation requires at least two dividers and two exponential arithmetic units, which consume either time or silicon area beyond the “multiplier” and “adder” already in the PN (Fig. 11). If the look-up table has a small number of entries, it can be faster. A possible circuit for the LUT is a content-addressable memory (CAM) with SRAM [58].
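A toy model of this first pipeline stage, with an invented record layout and illustrative LUT contents, may make the bookkeeping concrete.

```python
# PSP values indexed by elapsed ticks since the presynaptic spike (illustrative).
PSP_LUT = [0.00, 0.42, 0.61, 0.55, 0.41, 0.28, 0.18, 0.11]

def read_se(psem):
    """psem: list of [input_index, countdown] records; returns (index, psp) pairs."""
    out = []
    for rec in psem:
        idx, countdown = rec
        elapsed = len(PSP_LUT) - countdown   # ticks since the event arrived
        out.append((idx, PSP_LUT[elapsed]))  # LUT fetch replaces evaluating (2)
        rec[1] -= 1                          # count down toward expiry
    psem[:] = [r for r in psem if r[1] > 0]  # invalidate expired records
    return out

psem = [[17, 7], [902, 3]]
print(read_se(psem), psem)
```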

2) Read the Weight Values From Weight Memory: The weight memory stores the weights in n records, where n is the network size. Because this is generally the largest component of the column processor, we have assumed eDRAM technology (we assume that the eDRAM processing does not add considerable cost to the chip [52], so that it does not significantly impact cost). When the PN receives a new synaptic event, it reads the corresponding column of weight data from the weight memory into the weight cache. If the event is not new, the weight information is already inside the weight cache, and the PN skips this stage.

Fig. 12. (a) The presynaptic events memory (PSEM) stores each valid event's index and time offset. (b) The weight cache stores the weights with a consecutive arrangement of the synaptic-event index and the output-neuron index as the row and column addresses, respectively. (c) The output neuron MP memory stores each output neuron's somatic MP and remaining refractory time.

3) Read the Weight From Weight Cache: The weight cache is implemented in SRAM and has lower latency and higher bandwidth than the weight memory. The weight cache stores at most the same number of record rows as the number of valid (i.e., active) synaptic events. The number of valid synaptic events is roughly the number of active nodes, which reduces the capacity requirement of the weight cache as compared to the weight memory. Because of this sparse activation, and because of the elapsed time [t − t_j^(f) in (1)] of postsynaptic events in the PN, we can store the weight in the weight cache for the duration of the synaptic event and guarantee the weight is in the weight cache during the event's life, except for the first cycle of a new synaptic event. The weight cache block diagram is illustrated in Fig. 12(b). The size of the weight cache is the number of valid events times n times the bit width of each weight. Since not all connections exist, the weight matrix can be sparse. We store only the nonzero weights in the weight cache when the resulting list is smaller than the full array; the crossover point depends on the probability of a nonzero weight. A disadvantage of this sparse representation is that the nonzero weights are stored as a list, and we would need to traverse the entries in the weight cache to fetch a weight for a random request. Though not assumed here, it is possible to use a CAM to store the nonzero weights.

In order to leverage the sparse connectivity with the full representation of the weight cache (i.e., storing all zero and nonzero weights at their sequential addresses), we can read multiple weights at once instead of a single weight per clock cycle, and OR them together to see if the result is zero. If so, then there is no connection between those neurons and the driving synaptic event. If the result is not zero, then we must test each connection sequentially. The multiple-weight read only works for PNs with multiplexed neurons; for nonmultiplexing PNs, this option is not possible.

4) Multiply Weight and PSP: This operation uses the “Multiplier” unit, with the weight and PSP value as inputs, and the weighted PSP as output. This assumes multibit weight values, since the PN does not need a multiplier for single-bit weight representations.

5) Update Neuron's Somatic MP: The PN first checks whether the neuron is still in the refractory period by examining whether the record in the MP memory is zero. If it is not zero, the PN ignores the new weighted PSP input and decreases the neuron's refractory time by a single time unit. Otherwise, the PN adds the new weighted PSP value to the neuron's last saved MP value. The structure of the MP memory is shown in Fig. 12(c).

6) Compare MP With Threshold: If a new MP is generated, the PN compares the new MP with a stored threshold θ via the “≥ threshold” unit, which asserts a “yes” signal when the new MP ≥ θ, and a “no” signal otherwise.

7) Write Back MP if Needed: When a new MP value or a new refractory time (from the Counter unit) is available, the PN writes the updated value into the MP memory.

8) Write to Spike Event Memory: When the “≥ threshold” unit outputs a “yes” signal, the PN writes the neuron's index into the “Spike Events Memory,” which feeds either the column processor's dispenser directly, or other chips via an AER transmitter.

These eight stages can be pipelined reasonably well to improve the PN's performance and reduce the possibility of idle hardware. The overall performance is determined by the slowest of the eight pipe stages. When the weight read from the weight cache is zero, the following pipe stages are idle, which lowers the PN's computational efficiency while improving its power efficiency.

In Fig. 11, the AT units are address translators. Because the PN stores the weights and MPs consecutively (there is a known relation between the memory address and the stored items), the address translators can encode the address from the current synaptic event index and the neuron index. This simplified encoding allows us to omit the speed and silicon area of the address translators from the analysis. A Boolean operator generates a “next neuron enable” signal to the Neuron Counter unit to advance the current neuron index to the next neuron.

In the digital circuit design, the PSEM stores the presynaptic events. The size of this PSEM affects the maximum waiting time for the computation of each event. Assume there are three clock cycle times: T_ch, T_col, and T_PN, for the channel (the intracolumn communication channel or the inter-column AER communication channel), the column processor system clock, and the PN clock, respectively. We assume synaptic events are independent, identically distributed, and generated as a Poisson approximation, P(k) = e^(−λ) λ^k / k!, where λ is the expected spiking (or firing) rate in the channel to the column processor. As Boahen summarized [51], the average waiting time (in channel cycles) is

  w̄ = λ / [2(1 − λ)]   (9)
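Under the M/D/1-style form assumed for (9), the waiting time stays small until the channel load approaches capacity; the formula here is our reconstruction.

```python
def avg_wait(lam):
    """Average waiting time, in channel cycles, for channel load 0 <= lam < 1."""
    assert 0 <= lam < 1, "load must stay below channel capacity"
    return lam / (2 * (1 - lam))

for lam in (0.1, 0.5, 0.9, 0.97):
    print(f"load {lam:.2f}: wait {avg_wait(lam):.2f} channel cycles")
```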

Fig. 13 shows the average waiting time (in units of T_ch) as a function of the spiking rate λ. In our system, assume there is a fixed number of entries in the PSEM of each PN, and that each postsynaptic event spreads over a fixed number of PN cycles; the maximum average waiting time is then bounded by the product of the two. That is to say, if the average waiting time is w̄, each spike has to wait w̄ channel cycles, and the maximum spiking rate a PN can sustain is only a fraction of the channel speed.

For example, in our performance estimate, for a typical network size of 16,384, the waiting-time bound yields a maximum spiking rate

Fig. 13. Average waiting time as a function of firing rate λ, according to (9).

Fig. 14. Normalized time of the multiple-weight read for a network size of 16,384. The horizontal axis is the number of multiplexed neurons per PN, with the same number of weights read; the vertical axis gives the time normalized to the longest time (the single-weight read). The three curves represent three different probabilities of memory connectivity. As shown in this figure, for 0.1 connectivity the 4-weight read has the optimal normalized time of 0.5; for 0.01 connectivity, the 8-weight read, with 0.2 normalized time; and for 0.001 connectivity, the 32-weight read, with 0.06 normalized time.

, which means the maximum spiking rate each PN can achieve is about 97% of the maximum channel speed 1/T_ch. We also define the column processor's clock cycle time T_col = m T_norm T_PN, where m is the number of multiplexed neurons per PN and T_norm is the PN's synaptic-potential calculation time normalized to the full-connection calculation time, which is explained in the next paragraph. If the PN's clock cycle time is T_PN = 0.2 ns (i.e., 5 GHz), then the postsynaptic event's spread time and the degree of multiplexing determine T_col and, through it, the channel spiking rate.

For example, in Fig. 14, with 0.001 connectivity, a mux-32 PN, and T_norm = 0.06, these relations set the column processor's final maximum input spiking rate.

Because of the sparse activation and sparse connectivity, there is an opportunity to multiplex the computational hardware without violating real-time constraints. In our current association memory models, 0.1 (10%) connectivity is typical. However, as the columns scale, and as they are interconnected into a large HDM array, it is less clear how sparse the local, intracolumn connectivity will be. For the sake of our analysis, we start with 0.1, and then go down to very sparse connectivity to demonstrate the effectiveness of virtualization. The inefficiency of not multiplexing costs in terms of idle silicon area, and puts the digital system's performance/price far behind a coarser-grained PN system. As explained in the “Read the Weight From Weight Cache” paragraph, a multiple-weight read coupled with a multiple-neurons-per-PN design can save time compared to a single-weight read or a nonvirtualized design. We use the term normalized time for the time of a multiple-weight read divided by the time of a single-weight read.

The normalized time is T_norm = t_m / (m t_1), where t_m is the time for reading m connections in a cycle, and t_1 is the time for reading one connection in each PN cycle; here m is the number of multiplexed neurons per PN. If p is the weight connectivity, the probability that m consecutive connections are all zero is P_0 = (1 − p)^m (treating connections as independent, consistent with the Poisson arrival and service assumptions of queuing theory [59]). An m-weight read then costs one test cycle when the group is all zero, and one test cycle plus m per-connection cycles otherwise, giving the normalized time

  T_norm = [P_0 + (1 − P_0)(1 + m)] / m.

Fig. 14 shows the normalized memory reading time for three different levels of connectivity, for a network size of 16,384 neurons.
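The closed form above is our reconstruction, but it reproduces the trend of Fig. 14: evaluating it at the three quoted operating points gives roughly 0.59, 0.20, and 0.06, against the figure's 0.5, 0.2, and 0.06.

```python
def normalized_time(p, m):
    """Reconstructed cost of an m-weight read relative to m single-weight reads."""
    p0 = (1 - p) ** m                     # probability all m weights are zero
    return (p0 + (1 - p0) * (1 + m)) / m  # one test cycle, plus m cycles if nonzero

for p, m in ((0.1, 4), (0.01, 8), (0.001, 32)):
    print(f"connectivity {p}: mux-{m} read -> T_norm = {normalized_time(p, m):.2f}")
```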

Spiking Mixed-Signal CMOS Design: The spike-based mixed-signal CMOS design is not as simple as the nonspiking mixed-signal CMOS design, which time-multiplexes the inner-product operations in the digital regime and replaces the time- and silicon-consuming digital k-WTA circuits with analog k-WTA circuits. For the spiking models, it would not make sense to use multiplexed digital circuits for the weighted PSP computations and analog circuits for the I&F neuron model, because of the real-time requirement and the continuous operation of the analog circuits. Even if we implemented these analog circuits, they could only replace the “Adder” and “≥ threshold” units of the digital counterparts, which are fairly simple and fast. The PN would also need a D/A converter for each I&F neuron. Thus, the mixed-signal CMOS approach would not improve the performance/price by much, and it is not included in the performance/price comparisons in Section VI.

Spiking Digital CMOL Design: This design is similar to the spiking all-digital CMOS implementation, except that CMOL memory is used to hold the weight values instead of the eDRAM used in the spiking digital CMOS design.

Spiking Mixed-Signal CMOL Design: Like the nonspiking mixed-signal CMOL design, we use CMOL nanogrids (Fig. 7) to represent the network connections (i.e., the weight matrix). Pulses (current spikes) from the CMOS circuitry drive the CMOL output nanowires, which connect to the inputs of the analog I&F neuron circuits. Indiveri's circuit [47] implements the leaky I&F neuron, with adaptation to the output firing rate. Fig. 15 shows the schematic view of this analog I&F neuron circuit.

Each CMOL nanogrid output nanowire connects to the input of the I&F neuron circuit shown in Fig. 15. The current from the CMOL nanogrid output nanowire charges the membrane capacitor. When the capacitor's voltage reaches the threshold, the circuit generates an output spike, which discharges the capacitor.

Fig. 15. Schematic view of an analog I&F neuron circuit (adapted from [47]).

TABLE I. COMPONENTS FOR DIFFERENT SYSTEMS OF NONSPIKING HDM MODEL

TABLE II. COMPONENTS FOR DIFFERENT SYSTEMS OF SPIKING HDM MODEL

As with real neurons, the circuit oscillates under a continuous injection current.
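A minimal discrete-time sketch of this charge/fire/discharge behavior follows; the parameter values are our placeholders, and Indiveri's actual circuit [47] additionally includes leakage and spike-rate adaptation, which are omitted here:

```python
def ifire_sim(i_inj=10e-12, c_mem=1e-12, v_th=1.0, dt=1e-3, t_end=1.0):
    """Ideal integrate-and-fire: charge, threshold, spike, discharge.

    Returns the list of spike times (in seconds) over the interval t_end.
    """
    v, spikes = 0.0, []
    for k in range(int(t_end / dt)):
        v += i_inj * dt / c_mem   # injected current charges the capacitor
        if v >= v_th:             # threshold reached: emit an output spike
            spikes.append(k * dt)
            v = 0.0               # the spike discharges the capacitor
    return spikes

print(len(ifire_sim()), "spikes in 1 s")  # 10 spikes with these placeholders
```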

V. PERFORMANCE/PRICE ANALYSIS

For nonspiking implementations, the components used by each of the four designs are shown in Table I, in which a "Y" indicates that the target system uses the component. Table II shows the components used by the three designs for the spiking HDM model.

The designs are evaluated according to performance/price, where performance is measured by speed: connections per second (CPS) for the nonspiking model, or maximum input spiking rate for the spiking model. CPS is a traditional performance measure when emulating neural networks. Unfortunately, it is not as precise with the incremental, spike-based models presented here, but the maximum spike processing rate still gives a reasonably good predictor of hardware performance. Price is measured by silicon area and power (relative to the total chip size of 858 mm², the maximum reticle field size expected at 22 nm [4]).



TABLE III. CIRCUITS PERFORMANCE/PRICE SCALINGS

Table III lists the equations used to estimate the performance/price for each component in Tables I and II. For the CMOL circuit performance/price estimates, we refer to Section III, and estimate the typical design density for a number of circuits using examples from the literature: the digital $k$-WTA [60], the D/A converter [56], the CAM [58], the multiplier [61], and the adder [58, p. 678]. We then scale these circuits down to our hypothetical 22-nm technology according to the ITRS projections [4], using the first-order constant-field scaling principle [58], with $S$ as the scaling factor: current scales as $1/S$, resistance as $1$, gate capacitance as $1/S$, gate delay as $1/S$, frequency as $S$, chip area as $1/S^2$, and dynamic power dissipation as $1/S^2$. Analog circuits do not scale at the same pace as digital circuits, so we conservatively scaled the analog circuits to 250 nm.
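As an illustration, a small Python helper (ours; the 350-nm example values are placeholders, not figures from the cited designs) applies these first-order rules to rescale a circuit's area, delay, and dynamic power to the 22-nm node:

```python
def constant_field_scale(area_um2, delay_ns, power_mw,
                         node_from_nm, node_to_nm=22.0):
    """First-order constant-field (Dennard) scaling.

    With scaling factor S = node_from / node_to (> 1 when shrinking):
    chip area scales as 1/S^2, gate delay as 1/S, dynamic power as 1/S^2.
    """
    S = node_from_nm / node_to_nm
    return {
        "area_um2": area_um2 / S**2,
        "delay_ns": delay_ns / S,
        "power_mw": power_mw / S**2,
    }

# Hypothetical block reported at 350 nm, rescaled to the 22-nm target node.
print(constant_field_scale(area_um2=1.0e5, delay_ns=5.0, power_mw=50.0,
                           node_from_nm=350.0))
```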

Table III shows the area, power, and time-delay scaling estimates for the different components. Our performance/price estimates cover a range of parallelism ("virtualization"), from a single PN for each neuron to a single PN multiplexing all the neurons in the column. The estimates also explore variations in model parameters, such as network size, weight data precision, and sparseness of connections.

VI. RESULTS AND DISCUSSION

The resulting performance/price estimates are presented in two parts. Table IV contains the comparisons for the nonspiking model, while Table V contains the comparisons for the spiking model.

TABLE IV. PERFORMANCE/PRICE COMPARISON FOR NONSPIKING HDM MODEL

TABLE V. PERFORMANCE/PRICE COMPARISON FOR SPIKING HDM MODEL

MS stands for mixed-signal.

In Table IV, the estimates are based on a model size (for a single column) of 16 384 neurons, with 4-bit weight resolution, 256 PNs per column processor, and eDRAM technology for the CMOS designs. The total chip size is 858 mm². "CPS" denotes the connection computations per second.

Table IV shows that the CMOL designs have lower power consumption (by one to two orders of magnitude) than the CMOS designs, due to greatly reduced charging power. Because the digital $k$-WTA circuit is at least ten times slower and ten times more costly in area than its analog counterpart, the mixed-signal CMOS and CMOL designs have roughly a two-order-of-magnitude CPS advantage over their digital counterparts. We also estimated the performance/price with different algorithm parameters, for example with a network size of 1024 and single-bit weights; the relative performance/price comparisons above remain valid.

For the spiking CMOL and CMOS designs, we compared the input spiking rate (i.e., the maximum input spiking rate that the chip can process), power, and the number of column processors on a chip for the digital CMOS, digital CMOL, and mixed-signal CMOL designs. The performance/price here means the spiking rate for a chip of size 858 mm².

Figs. 16 and 17 show the input spiking rates per chip for digital CMOS and digital CMOL, respectively. With lower connectivity, the PN can multiplex more neurons (total connections tends to be a more important indicator than number of neurons), and the whole chip can process a higher input spiking rate. For example, in Fig. 16, for 0.1 connectivity, the highest input spiking rate occurs when four neurons are multiplexed by each PN; for 0.01 connectivity, it occurs when 32 neurons are multiplexed by each PN. With more multiplexed neurons in each PN, the weight memory (eDRAM or CMOL memory) occupies a greater proportion of the chip area as fewer PNs are needed. This is an issue in the CMOS design, where the eDRAM area approaches 90% of the whole chip at maximum neuron multiplexing (all neurons being emulated by one PN).



Fig. 16. The input spiking rate (log scale) of the digital CMOS design for an 858 mm² chip with three connectivity scenarios. The diamond-marked curve shows the area percentage of the eDRAM.

Fig. 17. The input spiking rate (log scale) of the digital CMOL design for an 858 mm² chip with three connectivity scenarios. The diamond-marked curve shows the area percentage of the CMOL memory.

CMOL memory is slower than eDRAM, but it occupies much less silicon area. Fig. 17 shows the improved performance/price of the digital CMOL design over the digital CMOS design (about a 50% improvement).

Table V shows the performance/price comparisons of the spiking HDM models for the digital CMOS and mixed-signal CMOL designs, assuming the same benchmark input spiking rate for both designs. The benchmark input spiking rate is the maximum input spiking rate the digital CMOS design can process under the three different connectivity values used in Fig. 16. Although the mixed-signal CMOL power consumption increases with the input spiking rate, it shows at least a two-order-of-magnitude advantage over the digital CMOS designs under the same network conditions. On the other hand, we also notice a much narrower performance/price gap between the digital CMOS and mixed-signal CMOL implementations for the spiking model than for the nonspiking model. This is due to hardware virtualization.

The dynamic power dissipated by the CMOL memory in the nanowire/nanodevice crossbars is given by (7). For the nanowire counts, connectivity, nanogrid half pitch, and applied voltage assumed here, satisfying the chip's power density budget imposes a constraint on the nanogrid operating rate; increasing the nanogrid size by 1000 times correspondingly relaxes this constraint, since the same switched energy is spread over a larger area. These are practical constraints.

On the other hand, the time delay given by (3) degrades as the nanowire length increases. This means that as the CMOL nanogrid footprint grows, the dynamic power density decreases while the time delay increases.
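The following qualitative sketch illustrates that trade-off under the stated proportionalities; the unit-length resistance and capacitance, voltage, and frequency below are placeholder values, since (3) and (7) themselves are not reproduced on this page:

```python
def nanogrid_tradeoff(L_rel, r=1.0, c=1.0, v=0.3, f=1.0e6):
    """Qualitative CMOL nanogrid scaling with relative wire length L_rel.

    The Elmore delay of a distributed RC wire grows as 0.5*r*c*L^2 [44],
    while the switched energy C*V^2*f is spread over an area growing as
    L^2, so the dynamic power density falls as 1/L.
    """
    delay = 0.5 * r * c * L_rel**2
    power_density = (c * L_rel) * v**2 * f / L_rel**2
    return delay, power_density

d0, p0 = nanogrid_tradeoff(1)
for L in (1, 10, 100):
    d, p = nanogrid_tradeoff(L)
    print(f"L x{L:>3}: delay x{d/d0:g}, power density x{p/p0:g}")
```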

Digital CMOS circuits need D/A converters to interface with analog CMOS circuits, and these converters are expensive in both area and power. The mixed-signal CMOL design does not require converters: currents from the CMOL nanogrids can feed directly into analog circuits, such as the $k$-WTA (see Fig. 6) and the I&F neuron (see Fig. 15). The average injection current determines the analog circuit's dynamic response. For example, the I&F circuit requires at least 10 pA of injection current to spike at 10 Hz, and the nanowire connecting the CMOL output to the input node of the I&F neuron circuit can provide such a current. The resulting CMOL power density constraint, which depends on the weight precision, the applied voltage, and the nanogrid half pitch, is easily satisfied by CMOL nanogrids. However, with sparse connectivity the power density concentrates in hot spots (i.e., where the "on" nanodevices are located) and grows as the connectivity decreases, which gives a correspondingly tighter constraint.
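A hedged sketch of the hot-spot effect, where the 1/c scaling is our reading of the constraint above: if only a fraction c of the nanodevices are "on," roughly the same dissipated power concentrates into a fraction c of the array area:

```python
def hotspot_density(p_avg_w_per_mm2, connectivity):
    """Hot-spot power density when only a fraction of devices are 'on'.

    Assumes the dissipation concentrates in the 'on' devices, so the
    local density scales as the average density divided by connectivity.
    """
    return p_avg_w_per_mm2 / connectivity

for c in (0.1, 0.01, 0.001):
    print(f"c = {c}: hot-spot density is x{1 / c:g} the array average")
```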

Another average nanodevice power density constraint, derived from CMOL nanogrid operation, involves the duty cycle of the nanodevices, i.e., the fraction of time a nanodevice is actively driven. For the duty cycle assumed here, the resulting device requirement might be met by single-electron molecules [62]. However, the duty cycle should not be too high, or it will degrade the dynamic response of the CMOL nanogrids given by (3).

VII. CONCLUSION

The possibilities created by hybrid CMOS/nanogrid electronics are very exciting, especially in the area of neural model emulation. To give a sense of the scale of the CMOL mixed-signal configuration (see Table IV), we are able to implement 1716 column processors, each having 16 thousand nodes with 16 thousand connections each, and with each connection consisting of a 4-bit weight, for a total of about 2 tera-connection-bits. Furthermore, we can update the entire network once every microsecond. These figures approach biological densities and speeds, though with significantly less functionality. However, the conflict between the CMOL nanogrids' power density and dynamic response is a reminder that system architects and circuit engineers need to balance their designs carefully when working with these new technologies.
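The connection-bit arithmetic can be checked directly (a one-line Python verification; "tera" is taken as the decimal 10^12):

```python
columns, nodes, conns_per_node, bits_per_conn = 1716, 16384, 16384, 4
total_bits = columns * nodes * conns_per_node * bits_per_conn
print(f"{total_bits:.2e} connection bits")  # ~1.84e12, i.e., about 2 terabits
```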

Another key point of the architectural trade-offs presented here is the value of leveraging sparse activation and connectivity to multiplex scarce resources. We demonstrated that, because of the sparse activation and sparse connectivity of our models, at very sparse (0.1%) connectivity rates a simple time-multiplexing scheme for digital CMOS can achieve a spiking rate comparable to that of the mixed-signal CMOL configuration while using the same silicon area (see Table V), although this approach does consume more power.



We have demonstrated a path to scalable hardware implementation for a family of biologically inspired algorithms and have uncovered a number of interesting nanoarchitecture research problems along the way. The next steps for this research are, first, to add dynamic learning to the implementation and, second, to add the larger, more complex multicolumn architecture.

ACKNOWLEDGMENT

The authors are very grateful to Prof. K. K. Likharev, Dr. D. B. Strukov, and Prof. G. Indiveri for helpful discussions, and to the anonymous reviewers for their valuable comments and suggestions.

REFERENCES

[1] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, "Parameter variations and impact on circuits and microarchitecture," presented at DAC 2003, Anaheim, CA, 2003, pp. 338–342.

[2] K. K. Likharev and D. B. Strukov, "CMOL: Devices, circuits, and architectures," in Introducing Molecular Electronics. Berlin, Germany: Springer, 2005, pp. 447–477.

[3] Q. Chen and J. D. Meindl, "Nanoscale metal-oxide-semiconductor field-effect transistors: Scaling limits and opportunities," Nanotechnol., vol. 15, pp. S549–S555, 2004.

[4] International Technology Roadmap for Semiconductors, 2005 ed., SEMATECH, 2005. [Online]. Available: http://public.itrs.net

[5] S. Borkar, "Electronics beyond nanoscale CMOS," presented at DAC 2006, San Francisco, CA, 2006.

[6] R. Chau, S. Datta, M. Doczy, B. Doyle, B. Jin, J. Kavalieros, A. Majumdar, M. Metz, and M. Radosavljevic, "Benchmarking nanotechnology for high-performance and low-power logic transistor applications," IEEE Trans. Nanotechnol., vol. 4, no. 2, pp. 153–158, Mar. 2005.

[7] J. Xiang et al., "Ge/Si nanowire heterostructures as high performance field-effect transistors," Nature, vol. 441, no. 7092, pp. 489–493, May 2006.

[8] A. Bachtold, P. Hadley, T. Nakanishi, and C. Dekker, "Logic circuits with carbon nanotube transistors," Science, vol. 294, no. 5545, pp. 1317–1320, Nov. 9, 2001.

[9] R. S. Friedman, M. C. McAlpine, D. S. Ricketts, D. Ham, and C. M. Lieber, "Nanotechnology: High-speed integrated nanowire circuits," Nature, vol. 434, no. 7037, p. 1085, Apr. 28, 2005.

[10] Y. Chen, G.-Y. Jung, D. A. A. Ohlberg, X. Li, D. R. Stewart, J. O. Jeppesen, K. A. Nielsen, J. F. Stoddart, and R. S. Williams, "Nanoscale molecular-switch crossbar circuits," Nanotechnol., vol. 14, no. 4, pp. 462–468, Apr. 1, 2003.

[11] P. J. Kuekes, D. R. Stewart, and R. S. Williams, "The crossbar latch: Logic value storage, restoration, and inversion in crossbar circuits," J. Appl. Phys., vol. 97, pp. 034301-1–034301-5, 2005.

[12] G. S. Snider, P. J. Kuekes, and R. S. Williams, "CMOS-like logic in defective, nanoscale crossbars," Nanotechnol., vol. 15, no. 8, pp. 881–891, Aug. 1, 2004.

[13] S. Zankovych, T. Hoffmann, J. Seekamp, J.-U. Bruch, and C. M. S. Torres, "Nanoimprint lithography: Challenges and prospects," Nanotechnol., vol. 12, no. 2, pp. 91–95, Jun. 1, 2001.

[14] D. J. Resnick, W. J. Dauksher, D. Mancini, K. J. Nordquist, T. C. Bailey, S. Johnson, N. Stacey, J. G. Ekerdt, C. G. Willson, and S. V. Sreenivasan, "Imprint lithography for integrated circuit fabrication," J. Vacuum Sci. Technol. B, vol. 21, p. 2624, 2003.

[15] A. DeHon, P. Lincoln, and J. E. Savage, "Stochastic assembly of sublithographic nanoscale interfaces," IEEE Trans. Nanotechnol., vol. 2, no. 3, pp. 165–174, Sep. 2003.

[16] M. M. Ziegler and M. R. Stan, "CMOS/nano co-design for crossbar-based molecular electronic systems," IEEE Trans. Nanotechnol., vol. 2, no. 4, pp. 217–230, Dec. 2003.

[17] G. Snider and R. Williams, "Nano/CMOS architectures using a field-programmable nanowire interconnect," Nanotechnol., vol. 18, pp. 1–11, 2007.

[18] Ö. Türel, J. H. Lee, X. Ma, and K. K. Likharev, "Architectures for nanoelectronic implementation of artificial neural networks: New results," Neurocomput., vol. 64, pp. 271–283, 2005.

[19] D. B. Strukov and K. K. Likharev, "CMOL FPGA: A reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices," Nanotechnol., vol. 16, no. 6, pp. 888–900, Jun. 1, 2005.

[20] D. B. Strukov and K. K. Likharev, "Prospects for terabit-scale nanoelectronic memories," Nanotechnol., vol. 16, no. 1, pp. 137–148, Jan. 1, 2005.

[21] V. Cerletti, W. A. Coish, O. Gywat, and D. Loss, "Recipes for spin-based quantum computing," Nanotechnol., vol. 16, pp. R27–R49, 2005.

[22] U. Rückert, "An associative memory with neural architecture and its VLSI implementation," presented at HICSS-24, Koloa, HI, 1990.

[23] A. Heittmann and U. Rückert, "Mixed mode VLSI implementation of a neural associative memory," in Proc. MicroNeuro '99, 1999, pp. 299–306.

[24] U. Rückert, "VLSI design of an associative memory based on distributed storage of information," in VLSI Design of Neural Networks, U. Ramacher and U. Rückert, Eds. Boston, MA: Kluwer, 1991, pp. 153–168.

[25] V. Mountcastle, Perceptual Neuroscience: The Cerebral Cortex. Cambridge, MA: Harvard Univ. Press, 1998.

[26] V. Braitenberg and A. Schüz, Cortex: Statistics and Geometry of Neuronal Connectivity. New York: Springer-Verlag, 1998.

[27] D. O'Kane and A. Treves, "Why the simplest notion of neocortex as an auto-associative memory would not work," Network, vol. 3, pp. 379–384, 1992.

[28] R. Hecht-Nielsen, "A theory of thalamocortex," in Computational Models for Neuroscience—Human Cortical Information Processing, R. Hecht-Nielsen and T. McKenna, Eds. New York: Springer, 2003.

[29] J. A. Anderson, "Programming considerations for a brain-like computer," Dept. of Cognitive and Linguistic Sciences, Brown Univ., Providence, RI, Jun. 14, 2005.

[30] C. Johansson and A. Lansner, "Towards cortex sized artificial nervous systems," presented at Knowledge-Based Intelligent Inf. Eng. Syst. (KES'04), Wellington, New Zealand, 2004.

[31] C. Johansson, M. Rehn, and A. Lansner, "Attractor neural networks with patchy connectivity," Neurocomput., vol. 69, pp. 627–633, 2006.

[32] C. Fulvi Mari, "Extremely dilute modular neuronal networks: Neocortical memory retrieval dynamics," J. Comput. Neurosci., vol. 17, pp. 57–79, 2004.

[33] R. Granger, "Brain circuit implementation: High-precision computation from low-precision components," in Replacement Parts for the Brain, T. Berger and D. Glanzman, Eds. Cambridge, MA: MIT Press, 2005, pp. 277–294.

[34] D. George and J. Hawkins, "A hierarchical Bayesian model of invariant pattern recognition in the visual cortex," presented at IJCNN '05, 2005.

[35] G. Palm, F. Schwenker, F. T. Sommer, and A. Strey, "Neural associative memories," in Associative Processing and Processors. Los Alamitos, CA: IEEE Computer Society, 1997, pp. 284–306.

[36] D. Willshaw, "Tolerance of a self-organizing neural network," Neural Comput., pp. 911–936, 1997.

[37] A. Sandberg, A. Lansner, K.-M. Petersson, and Ö. Ekeberg, "Bayesian attractor networks with incremental learning," Network: Comput. Neural Syst., vol. 13, pp. 179–194, 2002.

[38] C. Gao and D. Hammerstrom, "CMOL-based cortical models," in Emerging Brain-Inspired Nano-Architectures, V. Beiu and U. Rückert, Eds. Singapore: World Scientific, 2008, to be published.

[39] R. E. Suri, "A computational framework for cortical learning," Biol. Cybern., vol. 90, pp. 400–409, 2004.

[40] W. Gerstner, "Spiking neurons," in Pulsed Neural Networks, W. Maass and C. M. Bishop, Eds. Cambridge, MA: MIT Press, 1998, pp. 3–53.

[41] S. Song, K. D. Miller, and L. F. Abbott, "Competitive Hebbian learning through spike-timing-dependent synaptic plasticity," Nature Neurosci., vol. 3, no. 9, pp. 919–926, 2000.

[42] U. Rückert and H. Surmann, "Tolerance of a binary associative memory toward stuck-at-faults," presented at the Int. Conf. Artificial Neural Networks (ICANN-91), Espoo, Finland, 1991.

[43] F. T. Sommer and P. Dayan, "Bayesian retrieval in associative memories with storage errors," IEEE Trans. Neural Networks, vol. 9, pp. 705–713, Jul. 1998.

[44] W. C. Elmore, "The transient response of damped linear networks," J. Appl. Phys., vol. 19, pp. 55–63, Jan. 1948.

[45] R. Figueiredo, P. A. Dinda, and J. Fortes, "Guest editors' introduction: Resource virtualization renaissance," Computer, vol. 38, no. 5, pp. 28–31, May 2005.



[46] J. A. Anderson and J. P. Sutton, "If we compute faster, do we understand better?," Behav. Res. Methods, Instrum., Comput., vol. 29, pp. 67–77, 1997.

[47] G. Indiveri, E. Chicca, and R. Douglas, "A VLSI array of low-power spiking neurons and bistable synapses with spike-timing dependent plasticity," IEEE Trans. Neural Networks, vol. 17, pp. 211–221, 2006.

[48] U. Rückert, "ULSI architectures for artificial neural networks," IEEE Micro, vol. 22, no. 3, pp. 10–19, May 2002.

[49] T. Schoenauer, S. Atasoy, N. Mehrtash, and H. Klar, "NeuroPipe-chip: A digital neuro-processor for spiking neural networks," IEEE Trans. Neural Netw., vol. 13, no. 1, pp. 205–213, Jan. 2002.

[50] A. Bofill-i-Petit and A. F. Murray, "Synchrony detection by analogue VLSI neurons with bimodal STDP synapses," presented at NIPS 2003, 2003.

[51] K. A. Boahen, "Point-to-point connectivity between neuromorphic chips using address-events," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 47, pp. 416–434, May 2000.

[52] S. S. Iyer, J. E. Barth, Jr., P. C. Parries, J. P. Norum, J. P. Rice, L. R. Logan, and D. Hoyniak, "Embedded DRAM: Technology platform for the Blue Gene/L chip," IBM J. Res. Dev., vol. 49, pp. 333–350, 2005.

[53] D. Hammerstrom, C. Gao, S. Zhu, and M. Butts, "FPGA implementation of very large associative memories—Scaling issues," in FPGA Implementations of Neural Networks, A. Omondi, Ed. Boston, MA: Kluwer Academic Publishers, 2003.

[54] M. Holler, S. Tam, H. Castro, and R. Benson, "An electrically trainable artificial neural network (ETANN) with 10 240 'floating gate' synapses," in Proc. Int. Joint Conf. Neural Networks, Jun. 1989, pp. 191–196.

[55] J. Lazzaro, S. Ryckebusch, M. A. Mahowald, and C. A. Mead, "Winner-take-all networks of O(n) complexity," Comput. Sci. Dept., California Institute of Technology, Pasadena, CA, CALTECH-CS-TR-21-88, 1989.

[56] S.-Y. Chin and C.-Y. Wu, "A 10-bit 125-MHz CMOS digital-to-analog converter (DAC) with threshold-voltage compensated current sources," IEEE J. Solid-State Circuits, vol. 29, no. 11, pp. 1374–1380, Nov. 1994.

[57] M. Schäfer and G. Hartmann, "A flexible hardware architecture for online Hebbian learning in the sender-oriented PCNN-neurocomputer Spike 128K," in Proc. MicroNeuro '99, 1999, pp. 316–323.

[58] N. Weste and D. Harris, CMOS VLSI Design—A Circuits and Systems Perspective, 3rd ed. Addison-Wesley, 2004.

[59] L. Kleinrock, Queueing Systems. New York: Wiley, 1976.

[60] C. S. Lin, S. H. Ou, and B. D. Liu, "Design of k-WTA/sorting network using maskable WTA/MAX circuit," in Proc. Int. Symp. VLSI Technology, Systems, and Applications, 2001, pp. 69–72.

[61] R. K. Kolagotla, H. R. Srinivas, and G. F. Burns, "VLSI implementation of a 200-MHz 16 × 16 left-to-right carry-free multiplier in 0.35-μm CMOS technology for next-generation DSPs," in Proc. IEEE 1997 Custom Integrated Circuits Conf., 1997, pp. 469–472.

[62] J. C. Ellenbogen and J. C. Love, "Architectures for molecular electronic computers: Logic structures and an adder designed from molecular electronic diodes," Proc. IEEE, vol. 88, pp. 386–426, Mar. 2000.

Changjian Gao received the B.S. degree in electrical engineering from the Beijing Institute of Technology, Beijing, China, the M.S. degree in circuits and systems from the Beijing Institute of Radio Measurement, Beijing, China, and the M.S. degree in electrical and computer engineering from the Oregon Graduate Institute, Oregon Health and Science University (OGI/OHSU), Beaverton, in 1995, 1998, and 2005, respectively. He is working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, Portland State University, Portland, OR.

His research interests include biologically inspired circuit design, CMOS, field-programmable gate arrays, computer architecture, and nanoelectronic architectures and circuit design.

Dan Hammerstrom (SM'04) received the B.S. degree from Montana State University, Bozeman, the M.S. degree from Stanford University, Stanford, CA, and the Ph.D. degree from the University of Illinois, Urbana, in 1971, 1972, and 1977, respectively, all in electrical engineering.

He was a Computer Systems Design Officer in the U.S. Air Force from 1972 to 1975, and was an Assistant Professor in the Electrical Engineering Department at Cornell University, Ithaca, NY, from 1977 to 1980. In 1980, he joined Intel, Hillsboro, OR, where he participated in the development and implementation of the iAPX-432, the i960, and iWarp. He joined the faculty of the Computer Science and Engineering Department at the Oregon Graduate Institute (OGI) in 1985 as an Associate Professor. In 1988, he founded Adaptive Solutions, Inc., which specialized in high-performance silicon technology (the CNAPS chip set) for image processing and pattern recognition. He returned to OGI in 1997, where he was the Doug Strain Professor in the Computer Science and Engineering Department until 2004. He is currently a Professor in the Electrical and Computer Engineering Department and Associate Dean for Research in the Maseeh College of Engineering and Computer Science at Portland State University, Portland, OR. He is also an Adjunct Professor in the Information, Computation, and Electronics (IDE) Department at Halmstad University, Halmstad, Sweden.