
A 65-nm 738k-Synapse/mm² Quad-Core Binary-Weight Digital Neuromorphic Processor with Stochastic Spike-Driven Online Learning

Charlotte Frenkel, Jean-Didier Legat and David Bol
ICTEAM Institute, Université catholique de Louvain, Louvain-la-Neuve, Belgium
Email: {charlotte.frenkel, jean-didier.legat, david.bol}@uclouvain.be

Abstract—Recent trends in the field of artificial neural networks (ANNs) and convolutional neural networks (CNNs) investigate weight binarization for full on-chip weight storage to minimize circuit resources and to avoid the high energy cost of off-chip memory accesses. In parallel, spiking neural network (SNN) architectures are explored to further reduce power when processing sparse event-based data streams, while on-chip spike-based online learning targets applications constrained in power and resources during the training phase. However, leveraging high-density on-chip online learning in binary-weight SNNs is still an open challenge. In this work, we demonstrate MorphIC, a quad-core binary-weight digital neuromorphic processor embedding a stochastic version of the spike-driven synaptic plasticity (S-SDSP) learning rule and a hierarchical routing fabric for large-scale chip interconnection. The MorphIC SNN processor embeds a total of 2k leaky integrate-and-fire (LIF) neurons and more than two million plastic synapses for an active silicon area of 2.86mm² in 65nm CMOS, achieving a high density of 738k synapses/mm².

I. INTRODUCTION

The massive deployment of neural network accelerators as inference devices is currently hindered by the memory footprint and power consumption required for high-accuracy classification [1]. Two trends are currently being explored to solve this issue. The first trend consists of optimizing current artificial neural network (ANN) and convolutional neural network (CNN) architectures. Weight quantization down to binarization is a promising approach as it simplifies the operations and minimizes the memory footprint, thus avoiding the high energy cost of off-chip memory accesses if all the weights can be stored in on-chip memory [2]. The accuracy drop induced by quantization can be mitigated to acceptable levels for many applications with the use of quantization-aware training techniques that propagate binary weights during the forward pass and keep full-resolution weights for backpropagation updates [3]. The associated off-chip learning setup for quantization-aware training is shown in Fig. 1(a): this strategy allows binary-weight on-chip neural networks to perform inference with a favorable energy-accuracy tradeoff, as recently demonstrated by binary CNN chips (e.g., [4]–[6]).

The second trend consists of changing the neural network architecture and data representation, which is currently being explored with bio-inspired spiking neural networks (SNNs) as a power-efficient processing alternative for sparse event-based data streams [7]. Embedded online learning is a key feature in SNNs as it enables on-the-fly adaptation to the environment [8]. Moreover, by avoiding the use of an off-chip optimizer, on-chip online learning allows SNNs to target applications that are power- and resource-constrained during both the training and the inference phases, as shown in Fig. 1(b).

C. Frenkel is with Université catholique de Louvain as a Research Fellow from the National Foundation for Scientific Research (FNRS) of Belgium.


Fig. 1: Learning strategies for binary-weight neural networks. (a) Quantization-aware off-chip learning setup: binary weights are used during the forward pass while full-resolution weights are kept for backpropagation updates [3]. Training is carried out in an off-chip high-performance optimizer, while inference is carried out in the power- and resource-constrained device. (b) On-chip online learning setup, where data-driven weight updates are carried out in parallel with inference in the power- and resource-constrained device.

Spike-based online learning is an active research area, both in the development of new rules for high-accuracy learning in multi-layer networks (e.g., [9]–[12]) and in the demonstration of silicon implementations in applications such as unsupervised learning for image denoising and reconstruction [13], [14]. However, these approaches currently rely on multi-bit weights.

These two trends mostly evolve in parallel, as only three chips have previously been proposed to leverage the density and power advantages of binary weights with SNNs. First, the IBM TrueNorth is the largest-scale neuromorphic chip with 1M neurons and 256M 1-bit synapses; however, it does not embed online learning [15]. Second, the recently-proposed Intel Loihi has a configurable synaptic resolution that can be reduced to 1 bit and embeds a programmable co-processor for on-chip learning, though it has not been demonstrated with a binary synaptic resolution [16]. Finally, Seo et al. propose a stochastic version of the spike-timing-dependent plasticity (S-STDP) rule for online learning in binary synapses [17]. However, S-STDP requires the design of a custom SRAM memory with both row and column accesses, which severely degrades the density advantage of their approach.

It has been suggested in [18] that the spike-dependent synaptic plasticity (SDSP) learning rule proposed by Brader et al. in [19] allows for a more efficient resource usage than STDP: all the information necessary for learning is available in the post-synaptic neuron at pre-synaptic spike time. SDSP requires neither an expensive local synaptic storage of spike timings nor a custom SRAM with both row and column accesses. Therefore, in this work, we propose an efficient stochastic implementation of SDSP compatible with a standard high-density foundry SRAM in order to leverage embedded online learning in binary-weight SNNs.



Fig. 2: Block diagram of the MorphIC quad-core neuromorphic processor.

We demonstrate this approach with MorphIC, a quad-core digital neuromorphic processor with a hierarchical routing fabric for large-scale chip interconnection. MorphIC was prototyped in 65nm CMOS and embeds 2k neurons and more than 2M synapses in an active silicon area of 2.86mm², therefore achieving a high density of 738k 1-bit online-learning synapses per mm². It results in an order-of-magnitude density improvement compared to the binary-weight online-learning SNN processor from Seo et al.

The remainder of this paper is structured as follows. The architecture and implementation of the MorphIC SNN processor are provided in Section II. The specifications, measurements and benchmarking results are then presented in Section III.

II. ARCHITECTURE AND IMPLEMENTATION

A block diagram of the MorphIC quad-core spiking neuromorphic processor is shown in Fig. 2, illustrating its hierarchical routing fabric for large-scale chip interconnection: level-0 (L0) routers handle intra-core connectivity, level-1 (L1) routers handle inter-core connectivity and level-2 (L2) routers handle inter-chip connectivity (Section II-A). A block diagram of the MorphIC core is shown in Fig. 3: each core embeds 512 leaky integrate-and-fire (LIF) neurons, 256k L0 synapses configured as a crossbar array, 256k L1 synapses and 16k L2 synapses. Time multiplexing is used to increase the neuron and synapse densities by using shared update circuits and offloading neuron and synapse states to local SRAM memory, based on the strategy we previously proposed for the ODIN SNN in [20]. Each synapse embeds online learning with a stochastic implementation of the spike-dependent synaptic plasticity (S-SDSP) learning rule (Section II-B).
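The following Python sketch illustrates the time-multiplexing principle in behavioral form; it is not the MorphIC RTL, and the state fields, parameter values and function names are illustrative assumptions only.

```python
# Behavioral sketch of time-multiplexed neuron updates: a single shared update
# circuit sweeps the neuron states held in SRAM instead of replicating one
# physical neuron circuit per neuron. All numeric parameters are placeholders.
NUM_NEURONS = 512  # neurons per core

# Neuron state offloaded to (modeled) local SRAM: membrane potential + Calcium.
neuron_sram = [{"v_mem": 0, "calcium": 0} for _ in range(NUM_NEURONS)]

def lif_step(state, syn_input, leak=1, threshold=256):
    """Hypothetical LIF update: integrate the input, leak, fire and reset."""
    state["v_mem"] = max(0, state["v_mem"] + syn_input - leak)
    if state["v_mem"] >= threshold:
        state["v_mem"] = 0
        state["calcium"] += 1   # track recent firing activity
        return True             # spike emitted
    return False

def process_event(weights):
    """On each input event, the shared circuit visits every post-synaptic
    neuron in turn: read its state, update it, write it back."""
    spikes = []
    for idx in range(NUM_NEURONS):
        state = neuron_sram[idx]            # SRAM read
        if lif_step(state, weights[idx]):
            spikes.append(idx)
        neuron_sram[idx] = state            # SRAM write-back
    return spikes
```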

A. Hierarchical event routing

Clustering groups of neurons with dense local and sparse long-range connectivity allows minimizing memory requirements while keeping flexibility and scalability [21]. This organization is found in the brain and is known as small-world networks [22]. Hierarchy is therefore a key concept in SNN event routing infrastructures for large-scale networks [15], [16], [21], [23], [24]. MorphIC uses a heterogeneous hierarchical routing fabric with different router types at each level.


Fig. 3: Block diagram of the MorphIC core. Each core features 512 LIF neurons and 528k binary-weight synapses with embedded online learning.

1) L0 – A crossbar architecture ensures dense local all-to-all connectivity of the 512 neurons inside each core. 256k L0 synapses are thus available per core.

2) L1 – A star router ensures low-latency local broadcasting of events inside MorphIC chips. L1 connectivity information is stored in the source neuron: a 3-bit word allows independently enabling source-based routing to each of the three other local cores. 256k L1 synapses are thus available per core.

3) L2 – A mesh router with dimension-ordered routing (i.e., x direction before y direction) manages inter-chip connectivity. Each neuron has 32 L2 synapses, for a total of 16k per core. L2 connectivity information is stored in the source neuron and is destination-based: each neuron can target a specific L2 synapse address in any combination of cores in another MorphIC chip, at a distance of up to 4 chips in any direction. Address-event representation (AER) transactions [25] are used for L2 event routing.

This combination of routers is deadlock-free and allows reaching a fan-in of 512 (L0) + 512 (L1) + 32 (L2) and a fan-out of 512 (L0) + 3×512 (L1) + 4 (L2). A key feature of MorphIC is that synapses at all routing levels embed online learning (Section II-B): distance information is contained in event packets and modulates the probability of synaptic weight update as a function of distance, in accordance with small-world network models. To the best of our knowledge, no other previously-proposed SNN features online hierarchical learning.
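As a reading aid, the sketch below models the L1 and L2 routing decisions described above in plain Python; the packet fields and function names are hypothetical and do not reflect the actual MorphIC event format.

```python
# Illustrative model of the hierarchical routing decisions (hypothetical
# packet fields, not the MorphIC event format).

def l1_targets(enable_bits, source_core):
    """Source-based L1 routing: a 3-bit word independently enables
    broadcasting to each of the three other cores of the local chip."""
    other_cores = [c for c in range(4) if c != source_core]
    return [core for core, en in zip(other_cores, enable_bits) if en]

def l2_next_hop(dx, dy):
    """Dimension-ordered (x-before-y) mesh routing for inter-chip events;
    dx and dy are signed chip offsets to the destination (up to +/-4)."""
    if dx > 0:
        return "EAST"
    if dx < 0:
        return "WEST"
    if dy > 0:
        return "NORTH"
    if dy < 0:
        return "SOUTH"
    return "LOCAL"  # destination chip reached

# A spike from core 2 with L1 enable bits (1, 0, 1) targets cores 0 and 3,
# and an event bound two chips east and one chip north first travels east.
print(l1_targets((1, 0, 1), source_core=2))  # -> [0, 3]
print(l2_next_hop(dx=2, dy=1))               # -> 'EAST'
```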

B. Stochastic spike-dependent synaptic plasticity (S-SDSP)

As the spike-timing-dependent plasticity (STDP) learning rule relies on the relative timing between pre- and post-synaptic spikes, it requires a local synaptic buffering of spike timings, which leads to critical overheads as buffering circuitry has to be replicated inside each synapse [18]. In order to avoid this problem, the stochastic binary approach proposed by Seo et al. in [17] involves the design of a custom SRAM with both row and column accesses to carry out STDP updates each time pre- and post-synaptic spikes occur. However, beyond increasing the design time, custom SRAMs do not benefit from DRC-pushed rules for foundry bitcells and induce a strong area penalty compared to single-port high-density foundry SRAMs [18]. Therefore, STDP cannot be implemented efficiently in silicon.


Fig. 5: (a) Circuit diagram of a Galois 17-bit LFSR with characteristic polynomial x^17 + x^3 + 1. (b) Equivalent compact representation. (c) 9-unfolded 17-bit Galois LFSR for single-cycle 9-bit pseudo-random word generation for the S-SDSP online learning rule.


Fig. 4: Time-multiplexed S-SDSP update logic. The up and down signals represent the value of the S-SDSP update conditions in Eq. (2).

The spike-dependent synaptic plasticity (SDSP) learning rule [19] avoids this drawback: the synaptic weight w is updated each time a pre-synaptic event occurs, according to Eq. (1). The update depends solely on the state of the post-synaptic neuron at the time of the pre-synaptic spike, i.e. the membrane potential Vmem compared to the threshold θm and the Calcium concentration Ca compared to the thresholds θ1, θ2 and θ3. The Calcium concentration represents an image of the recent firing activity of the neuron; it disables SDSP updates for high and low post-synaptic neuron activities and helps prevent overfitting [19]. A single-port high-density foundry SRAM can therefore be used for high-density time-multiplexed implementations. However, as SDSP relies on discrete positive and negative steps, it cannot be applied directly to binary weights.

$$
\begin{cases}
w \rightarrow w+1 & \text{if } V_{mem}(t_{pre}) \geq \theta_m \text{ and } \theta_1 \leq Ca(t_{pre}) < \theta_3 \\
w \rightarrow w-1 & \text{if } V_{mem}(t_{pre}) < \theta_m \text{ and } \theta_1 \leq Ca(t_{pre}) < \theta_2
\end{cases}
\tag{1}
$$

Senn and Fusi proposed a stochastic learning rule for binary synapses in [26]. However, its update conditions rely on the total synaptic input of the post-synaptic neuron at the time of the pre-synaptic spike, which is difficult to extract in time-multiplexed implementations. Therefore, we propose a stochastic spike-dependent synaptic plasticity (S-SDSP) learning rule suitable for binary weights, as formulated in Eq. (2). It results from the fusion of the stochastic mechanism proposed in [26] with the SDSP update conditions. ζ+ and ζ– are binary random variables with probabilities q+ and q– of being 1, respectively. The synaptic weight wb therefore goes from 0 to 1 (resp. 1 to 0) with probability q+ (resp. q–), depending on the update conditions.

$$
\begin{cases}
w_b \rightarrow w_b + \zeta^{+}(1-w_b) & \text{if } V_{mem}(t_{pre}) \geq \theta_m \text{ and } \theta_1 \leq Ca(t_{pre}) < \theta_3 \\
w_b \rightarrow w_b - \zeta^{-} w_b & \text{if } V_{mem}(t_{pre}) < \theta_m \text{ and } \theta_1 \leq Ca(t_{pre}) < \theta_2
\end{cases}
\tag{2}
$$
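A minimal behavioral sketch of Eq. (2) is given below; the threshold and probability values are placeholder assumptions, and the hardware draws its random variables from an LFSR (Fig. 4) rather than from a software generator.

```python
import random

def s_sdsp_update(w_b, v_mem, ca, q_pot=0.1, q_dep=0.1,
                  theta_m=100, theta_1=2, theta_2=8, theta_3=12):
    """Stochastic binary-weight update of Eq. (2) on a pre-synaptic spike.

    w_b   : current binary weight (0 or 1)
    v_mem : post-synaptic membrane potential at the pre-synaptic spike time
    ca    : post-synaptic Calcium concentration (recent firing activity)
    All numeric parameters are illustrative placeholders.
    """
    if v_mem >= theta_m and theta_1 <= ca < theta_3:
        # Potentiation condition: 0 -> 1 with probability q_pot (zeta+ draw).
        if random.random() < q_pot:
            return 1
    elif v_mem < theta_m and theta_1 <= ca < theta_2:
        # Depression condition: 1 -> 0 with probability q_dep (zeta- draw).
        if random.random() < q_dep:
            return 0
    return w_b  # stop-learning region or unsuccessful draw: weight unchanged
```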

The proposed S-SDSP update logic is shown in Fig. 4. The binary random variables ζ± can be generated with probabilities q± using linear feedback shift register (LFSR)-based pseudo-random number generation. In order to generate the q± probabilities with a resolution similar to those used in [26] and to leave sufficient margin for distance-based modulation of q± (Section II-A), a 9-bit resolution is required.

Fig. 6: (a) Chip microphotograph (1.87mm × 1.87mm die, showing the per-core synapse and neuron SRAMs and the shared logic). (b) Specifications and measurements:

  Technology                 65nm LP CMOS
  Implementation             Digital
  Area                       3.5mm² (incl. pads), 2.86mm² (excl. pads)
  Number of cores            4
  Total # neurons (type)     2048 (LIF)
  Total # synapses (hier.)   1M (L0), 1M (L1), 64k (L2)
  Fan-in (hier.)             512 (L0), 512 (L1), 32 (L2)
  Fan-out (hier.)            512 (L0), 3×512 (L1), 4 (L2)
  Online learning            Stochastic SDSP, 1-bit weight
  Time constant              Biological to accelerated
  Supply voltage             0.8V – 1.2V
  Max. clock frequency       55MHz (0.8V) – 210MHz (1.2V)
  Leakage power (Pleak)      45µW @0.8V
  Idle power (Pidle)         41.3µW/MHz @0.8V
  Energy/SOP (ESOP)          30pJ @0.8V

As S-SDSP updates must be computed in a single clock cycle, it is possible to parallelize 9 successive iterations of a given LFSR by using the unfolding algorithm from [27] with an unfolding factor of 9, as suggested in [28], to avoid instantiating parallel LFSRs and save switching power. The unfolding process and the resulting unfolded LFSR are illustrated in Fig. 5. Unfolding multiplies the combinational logic resources (here, a single XOR gate) by the unfolding factor, while the LFSR period is divided by the unfolding factor. In order to avoid inducing correlation between synapses, the unfolded LFSR period must be one order of magnitude higher than the number of synapses per neuron; a 17-bit LFSR depth has thus been selected. The overhead incurred by the resulting S-SDSP update logic is negligible as it is shared with time multiplexing for all the L0, L1 and L2 synapses in a MorphIC core.
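For illustration, the following sketch models the 17-bit Galois LFSR of Fig. 5 (characteristic polynomial x^17 + x^3 + 1) and its 9-unfolded use, in which nine single-bit iterations are collapsed into one 9-bit word per call, mirroring what the unfolded hardware computes in a single clock cycle; the seed and the 9-bit probability threshold are arbitrary assumptions.

```python
# Software model of the 17-bit Galois LFSR (x^17 + x^3 + 1) and of its
# 9-unfolded equivalent. In the right-shifting Galois convention, the tap
# mask sets bit (k-1) for each non-constant term x^k of the polynomial.
POLY_MASK = (1 << 16) | (1 << 2)

def lfsr_step(state):
    """One single-bit Galois LFSR iteration; returns (output_bit, next_state)."""
    out = state & 1
    state >>= 1
    if out:
        state ^= POLY_MASK
    return out, state

def lfsr_word(state, bits=9):
    """Unfolded equivalent: produce a 'bits'-wide pseudo-random word at once,
    as the 9-unfolded hardware does in one clock cycle."""
    word = 0
    for i in range(bits):
        out, state = lfsr_step(state)
        word |= out << i
    return word, state

# Example: draw a 9-bit number and compare it to a 9-bit probability threshold,
# as done for the zeta+/zeta- draws of Eq. (2) (threshold value is arbitrary).
state = 0x1ACE5                # any non-zero 17-bit seed
rnd, state = lfsr_word(state)
q_plus = 51                    # hypothetical 9-bit threshold (~0.1 probability)
do_potentiation = rnd <= q_plus
```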

III. MEASUREMENTS AND BENCHMARKING RESULTS

MorphIC was prototyped in an 8-metal 65-nm low-power (LP) CMOS process. A chip microphotograph is presented together with specifications and die measurement results at 25°C in Figs. 6(a) and 6(b), respectively. The power consumption P of digital SNNs can be modeled by Eq. (3) [20]:

$$
P = P_{leak} + P_{idle} \times f_{clk} + E_{SOP} \times r_{SOP},
\tag{3}
$$

where Pleak is the leakage power, Pidle is the idle power (i.e. active clock, without network activity), ESOP is the energy per synaptic operation (SOP), fclk is the clock frequency and rSOP is the SOP processing rate. ESOP is an incremental definition of the energy per SOP as it does not include contributions from leakage and idle power. At 0.8V, when including the leakage and idle power contributions at maximum fclk and rSOP, i.e. 55MHz and 110MSOP/s (27.5MSOP/s per core) [20], MorphIC consumes a total energy of 51pJ per SOP.
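As a quick sanity check of that figure, the snippet below plugs the measured values of Fig. 6(b) into Eq. (3) at maximum clock frequency and SOP rate; it simply reproduces the arithmetic behind the reported 51pJ.

```python
# Total energy per SOP at 0.8 V from Eq. (3) and the Fig. 6(b) measurements.
P_leak = 45e-6      # W       leakage power at 0.8 V
P_idle = 41.3e-6    # W/MHz   idle power at 0.8 V
E_sop  = 30e-12     # J/SOP   incremental energy per synaptic operation
f_clk  = 55         # MHz     maximum clock frequency at 0.8 V
r_sop  = 110e6      # SOP/s   maximum processing rate (27.5 MSOP/s per core)

total_energy_per_sop = E_sop + (P_leak + P_idle * f_clk) / r_sop
print(f"{total_energy_per_sop * 1e12:.1f} pJ/SOP")  # ~51 pJ/SOP
```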

Offline learning performance with quantization-aware training can be demonstrated with the MNIST dataset of handwritten digits [29]. Using the four cores of MorphIC and all the available neuron resources with the network topology shown in Fig. 7, an accuracy of 97.8% can be reached.

[Fig. 7 diagram: each MorphIC core receives a 14×14 sub-image through 196 virtual input neurons feeding a multi-layer perceptron with (500+1) hidden neurons and (10+1) output neurons (core 3 uses (490+1) hidden neurons and 10 output neurons). The {0,1} weights from excitatory neurons are learned offline; a global inhibitory neuron in each layer (firing threshold of 2, subtracting half the sum of the previous-layer activities) projects 1-weights to all other neurons of that layer. The per-core activities are combined through the L1 and L2 routers into a global output sum layer that produces the output classification.]

Fig. 7: MNIST classification setup. Input images are split with interleaved sub-sampling into four independent 14×14 images. The sub-image pixels are converted to rate-based Poisson-distributed spike trains and sent to four multi-layer perceptrons (MLPs) resulting from quantization-aware training in Keras, following [2], [3]. Layer-wise inhibitory neurons are used to compensate for the rescaling of synaptic weights trained with −1 and +1 values in Keras to values of 0 and 1 in MorphIC. Average-pooling the core activities into a global output sum layer leads to a classification accuracy of 97.8%.
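Our reading of the weight-rescaling compensation mentioned in the caption of Fig. 7 can be summarized with the following short derivation, which shows why subtracting half of the previous-layer activity recovers the {−1,+1}-weighted sums learned in Keras, up to a harmless factor of two. With $w_b = (w+1)/2 \in \{0,1\}$ for a trained weight $w \in \{-1,+1\}$, the pre-activation of a neuron with inputs $x_j$ becomes

$$
\sum_j w_j x_j = \sum_j (2 w_{b,j} - 1)\, x_j = 2\Big(\sum_j w_{b,j}\, x_j - \tfrac{1}{2}\sum_j x_j\Big),
$$

so accumulating the {0,1}-weighted inputs and subtracting half the sum of the previous-layer activities, which is the role of the per-layer inhibitory neuron, preserves the sign (and, up to a factor of two, the value) of the original {−1,+1}-weighted sum.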

[Fig. 8 panels: 22×22 input stimulus; 4@16×16 outputs of the convolution layers; 4@8×8 average pooling; learned synapses in the 8@256×1 output FC layer; Calcium-based teacher signal.]

Fig. 8: MorphIC implementation of the 8-pattern CNN classification benchmark proposed in [30]. Left: the CNN architecture consists of four 7×7 line detection kernels at 0°, 45°, 90° and 135°, followed by average pooling layers and a fully-connected (FC) 8-neuron output layer. Right: the input pattern activities, the neuron activations of the convolution and pooling layers, and the synaptic weights learned online with S-SDSP are illustrated. 100% accuracy is reached on 800 test samples.

S-SDSP online learning is demonstrated in Fig. 8, where we reproduced the benchmark that was proposed in [30] for an analog SDSP implementation. Eight patterns are classified by a spiking CNN. Each MorphIC core implements a fixed-weight convolutional layer with a line detection kernel followed by an average pooling layer. The pooling layers are connected with plastic weights to an 8-neuron fully-connected (FC) output layer in core 0. The resulting weights allow classifying with 100% accuracy a test set consisting of 100 different Poisson realizations of each input pattern. Future work should focus on extending S-SDSP to multi-layer training.

Finally, a comparison of MorphIC with the three previously-proposed binary SNNs is provided in Table I. While Loihi has a programmable learning engine, it does not demonstrate online learning in a binary-weight configuration; MorphIC and the chip from Seo et al. are thus the only ones to demonstrate embedded online learning with binary weights. The high-density claim of S-SDSP online learning on binary weights is demonstrated with an order-of-magnitude density advantage compared to the S-STDP rule from Seo et al., which is further emphasized when considering process normalization to 65nm. Regarding power, MorphIC has a competitive energy per SOP despite using a less advanced CMOS process.
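The technology normalization used in Table I can be reproduced with the short check below; the helper function name is ours, and the scaling simply follows the footnote of Table I (square of the node ratio, or the 13.5 factor from [31] for Loihi's 14nm FinFET process).

```python
# Reproducing the 65nm-normalized synapse densities of Table I.
def normalize_to_65nm(raw_density, node_nm=None, finfet_factor=None):
    """Scale a raw per-mm2 density to its 65nm-equivalent value."""
    factor = finfet_factor if finfet_factor else (65.0 / node_nm) ** 2
    return raw_density / factor

print(normalize_to_65nm(80e3, node_nm=45))           # Seo et al.: ~38.3k/mm2
print(normalize_to_65nm(674e3, node_nm=28))          # TrueNorth:  ~125k/mm2
print(normalize_to_65nm(2.5e6, finfet_factor=13.5))  # Loihi (max): ~185k/mm2, reported as 190k
```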

IV. CONCLUSION

In this paper, we presented the MorphIC quad-core spiking neuromorphic processor to leverage the power and density advantages of binary weights with online-learning SNNs.

TABLE I: Comparison of state-of-the-art binary SNN processor chips.

| Reference                    | Seo et al. [17] | TrueNorth [15] | Loihi [16]             | This work    |
|------------------------------|-----------------|----------------|------------------------|--------------|
| Technology                   | 45nm SOI        | 28nm LP        | 14nm FinFET            | 65nm LP      |
| Area (excl. pads)            | 0.8mm²          | 389mm²         | 51.8mm²                | 2.86mm²      |
| # cores                      | 1               | 4096           | 128                    | 4            |
| # neurons / core             | 256             | 256            | max. 1024              | 512          |
| # synapses / core            | 64k             | 64k            | 114k to 1M             | 528k         |
| Synaptic width               | 1-bit           | 1-bit          | 9- to 1-bit            | 1-bit        |
| On-chip learning             | S-STDP          | –              | Programmable           | S-SDSP       |
| Neuron density† (raw/norm.)  | 320 / 153       | 2.6k / 494     | max. 2.5k / max. 190   | 716 / 716    |
| Synapse density† (raw/norm.) | 80k / 38.3k     | 674k / 125k    | 282k–2.5M / 21k–190k   | 738k / 738k  |
| Incremental energy/SOP       | N/A             | N/A            | (>23.6pJ at 0.75V)*    | 30pJ at 0.8V |
| Total energy/SOP             | N/A             | 26pJ at 0.775V | N/A                    | 51pJ at 0.8V |

* Simulation results, excluding the cost of neuron and learning engine updates.
† Neuron and synapse densities (per mm²) are obtained by dividing the total number of neurons or synapses by the chip area, excluding pads. As the raw density performances are strongly dependent on the selected technology node, values normalized to a 65nm technology node are also provided. Normalization is carried out using the node factor (e.g., a (65/45)²-fold reduction for normalizing 45nm to 65nm), except for Loihi, where we used the 13.5 factor from [31] for 14nm FinFET normalization to a bulk 65nm node.

Using the proposed stochastic spike-dependent synaptic plasticity (S-SDSP) learning rule, we demonstrated this claim with a density of 738k synapses per mm² in 65nm CMOS. This is an order-of-magnitude improvement compared to the only previously-proposed binary SNN with demonstrated online learning, from Seo et al. MorphIC also integrates a hierarchical routing fabric for large-scale chip interconnection, where distance information allows modulating the synaptic update probabilities, in accordance with small-world brain network models.


REFERENCES

[1] P. N. Whatmough et al., "A 28nm SoC with a 1.2GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications," Proc. of IEEE International Solid-State Circuits Conference (ISSCC), 2017.

[2] B. Moons et al., "Minimum energy quantized neural networks," Proc. of IEEE Asilomar Conference on Signals, Systems, and Computers, 2017.

[3] M. Courbariaux et al., "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.

[4] R. Andri et al., "YodaNN: An architecture for ultralow power binary-weight CNN acceleration," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 48-60, 2018.

[5] B. Moons et al., "BinarEye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS," Proc. of IEEE Custom Integrated Circuits Conference (CICC), 2018.

[6] D. Bankman et al., "An always-on 3.8µJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28nm CMOS," Proc. of IEEE International Solid-State Circuits Conference (ISSCC), 2018.

[7] C. S. Poon and K. Zhou, "Neuromorphic silicon neurons and large-scale neural networks: challenges and opportunities," Frontiers in Neuroscience, vol. 5, p. 108, 2011.

[8] M. R. Azghadi et al., "Spike-based synaptic plasticity in silicon: design, implementation, application, and challenges," Proceedings of the IEEE, vol. 102, no. 5, pp. 717-737, 2014.

[9] N. Zheng and P. Mazumder, "Online supervised learning for hardware-based multilayer spiking neural networks through the modulation of weight-dependent spike-timing-dependent plasticity," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 9, pp. 4287-4302, 2017.

[10] H. Mostafa, "Supervised learning based on temporal coding in spiking neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 7, pp. 3227-3235, 2017.

[11] E. Neftci et al., "Event-driven random back-propagation: Enabling neuromorphic deep learning machines," Frontiers in Neuroscience, vol. 11, p. 324, 2017.

[12] F. Zenke and S. Ganguli, "SuperSpike: Supervised learning in multilayer spiking neural networks," Neural Computation, vol. 30, no. 6, pp. 1514-1541, 2018.

[13] P. Knag et al., "A sparse coding neural network ASIC with on-chip learning for feature extraction and encoding," IEEE Journal of Solid-State Circuits, vol. 50, no. 4, pp. 1070-1079, 2015.

[14] G. K. Chen et al., "A 4096-neuron 1M-synapse 3.8pJ/SOP spiking neural network with on-chip STDP learning and sparse weights in 10nm FinFET CMOS," Proc. of IEEE Symposium on VLSI Circuits (VLSI-C), 2018.

[15] F. Akopyan et al., "TrueNorth: Design and tool flow of a 65mW 1 million neuron programmable neurosynaptic chip," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537-1557, 2015.

[16] M. Davies et al., "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro, vol. 38, no. 1, pp. 82-99, 2018.

[17] J.-S. Seo et al., "A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons," Proc. of IEEE Custom Integrated Circuits Conference (CICC), 2011.

[18] C. Frenkel et al., "A fully-synthesized 20-gate digital spike-based synapse with embedded online learning," Proc. of IEEE International Symposium on Circuits and Systems (ISCAS), pp. 17-20, 2017.

[19] J. M. Brader, W. Senn and S. Fusi, "Learning real-world stimuli in a neural network with spike-driven synaptic dynamics," Neural Computation, vol. 19, no. 11, pp. 2881-2912, 2007.

[20] C. Frenkel et al., "A 0.086-mm² 12.7-pJ/SOP 64k-synapse 256-neuron online-learning digital spiking neuromorphic processor in 28-nm CMOS," IEEE Transactions on Biomedical Circuits and Systems, vol. 13, no. 1, pp. 145-158, 2019.

[21] S. Moradi et al., "A scalable multicore architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)," IEEE Transactions on Biomedical Circuits and Systems, vol. 12, no. 1, pp. 106-122, 2018.

[22] D. Bassett and E. D. Bullmore, "Small-world brain networks," The Neuroscientist, vol. 12, no. 6, pp. 512-523, 2006.

[23] J. Park et al., "Hierarchical address event routing for reconfigurable large-scale neuromorphic systems," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2408-2422, 2017.

[24] J. Navaridas et al., "Understanding the interconnection network of SpiNNaker," Proc. of ACM International Conference on Supercomputing (ICS), pp. 286-295, 2009.

[25] K. A. Boahen, "Point-to-point connectivity between neuromorphic chips using address events," IEEE Transactions on Circuits and Systems II, vol. 47, no. 5, pp. 416-434, 2000.

[26] W. Senn and S. Fusi, "Convergence of stochastic learning in perceptrons with binary synapses," Physical Review E, vol. 71, no. 6, p. 061907, 2005.

[27] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 1999.

[28] C. Cheng and K. K. Parhi, "High-speed parallel CRC implementation based on unfolding, pipelining, and retiming," IEEE Transactions on Circuits and Systems II, vol. 53, no. 10, pp. 1017-1021, 2006.

[29] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits," 1998 [Online]. Available: http://yann.lecun.com/exdb/mnist/

[30] G. Indiveri, F. Corradi and N. Qiao, "Neuromorphic architectures for spiking deep neural networks," Proc. of IEEE International Electron Devices Meeting (IEDM), 2015.

[31] K. Mistry, "10nm technology leadership," Leading at the Edge: Intel Technology and Manufacturing Day, 2017 [Online]. Available: https://newsroom.intel.com/newsroom/wp-content/uploads/sites/11/2017/03/Kaizad-Mistry-2017-Manufacturing.pdf