A 65-nm 738k-Synapse/mm2 Quad-Core
Binary-Weight Digital Neuromorphic Processor
with Stochastic Spike-Driven Online Learning
Charlotte Frenkel, Jean-Didier Legat and David Bol
ICTEAM Institute, Université catholique de Louvain, Louvain-la-Neuve, Belgium
Email: {charlotte.frenkel, jean-didier.legat, david.bol}@uclouvain.be
Abstract—Recent trends in the field of artificial neural networks (ANNs) and convolutional neural networks (CNNs) investigate weight binarization for full on-chip weight storage to minimize circuit resources and to avoid the high energy cost of off-chip memory accesses. In parallel, spiking neural network (SNN) architectures are explored to further reduce power when processing sparse event-based data streams, while on-chip spike-based online learning targets applications constrained in power and resources during the training phase. However, leveraging high-density on-chip online learning in binary-weight SNNs is still an open challenge. In this work, we demonstrate MorphIC, a quad-core binary-weight digital neuromorphic processor embedding a stochastic version of the spike-driven synaptic plasticity (S-SDSP) learning rule and a hierarchical routing fabric for large-scale chip interconnection. The MorphIC SNN processor embeds a total of 2k leaky integrate-and-fire (LIF) neurons and more than two million plastic synapses for an active silicon area of 2.86mm² in 65nm CMOS, achieving a high density of 738k synapses/mm².
I. INTRODUCTION
The massive deployment of neural network accelerators as inference devices is currently hindered by the memory footprint and power consumption required for high-accuracy classification [1]. Two trends are currently being explored in order to solve this issue. The first trend consists in optimizing current artificial neural network (ANN) and convolutional neural network (CNN) architectures. Weight quantization down to binarization is a promising approach, as it simplifies the operations and minimizes the memory footprint, thus avoiding the high energy cost of off-chip memory accesses if all the weights can be stored in on-chip memory [2]. The accuracy drop induced by quantization can be mitigated to acceptable levels for many applications with the use of quantization-aware training techniques that propagate binary weights during the forward pass and keep full-resolution weights for backpropagation updates [3]. The associated off-chip learning setup for quantization-aware training is shown in Fig. 1(a): this strategy allows binary-weight on-chip neural networks to perform inference with a favorable energy-accuracy tradeoff, as recently demonstrated by binary CNN chips (e.g., [4]–[6]).
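To make this off-chip setup concrete, the sketch below shows the core mechanism of quantization-aware training with binary weights: weights are binarized in the forward pass while full-resolution shadow weights receive the gradient updates through a straight-through estimator. This is a minimal PyTorch sketch for illustration only; the class names are ours, and it is not the Keras flow used for the MNIST demonstration in Section III.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Binarize weights in the forward pass, pass gradients straight through."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)                      # binary weights for the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Straight-through estimator: the gradient updates the full-resolution
        # weights, clipped to the region |w| <= 1 as in [3].
        return grad_out * (w.abs() <= 1).float()

class BinaryLinear(torch.nn.Linear):
    """Linear layer whose full-resolution weights are binarized on the fly."""
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)
        return torch.nn.functional.linear(x, w_bin, self.bias)
```

After training, only the binarized weights need to be stored on-chip for inference, as in Fig. 1(a).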
The second trend consists in changing the neural network architecture and data representation, which is currently being explored with bio-inspired spiking neural networks (SNNs) as a power-efficient processing alternative for sparse event-based data streams [7]. Embedded online learning is a key feature of SNNs, as it enables on-the-fly adaptation to the environment [8]. Moreover, by avoiding the use of an off-chip optimizer, on-chip online learning allows SNNs to target applications that are power- and resource-constrained during both the training and the inference phases, as shown in Fig. 1(b). Spike-based online learning is an active research area, both in the development of new rules for high-accuracy learning in multi-layer networks (e.g., [9]–[12]) and in the demonstration of silicon implementations in applications such as unsupervised learning for image denoising and reconstruction [13], [14]. However, these approaches currently rely on multi-bit weights.
C. Frenkel is with Université catholique de Louvain as a Research Fellow from the National Foundation for Scientific Research (FNRS) of Belgium.
Fig. 1: Learning strategies for binary-weight neural networks. (a) Quantization-aware off-chip learning setup: binary weights are used during the forward pass while full-resolution weights are kept for backpropagation updates [3]. Training is carried out in an off-chip high-performance optimizer, while inference is carried out in the power- and resource-constrained device. (b) On-chip online learning setup, where data-driven weight updates are carried out in parallel with inference in the power- and resource-constrained device.
These two trends mostly evolve in parallel, as only three chips have previously been proposed to leverage the density and power advantage of binary weights with SNNs. First, the IBM TrueNorth is the largest-scale neuromorphic chip with 1M neurons and 256M 1-bit synapses; however, it does not embed online learning [15]. Second, the recently-proposed Intel Loihi has a configurable synaptic resolution that can be reduced to 1 bit and embeds a programmable co-processor for on-chip learning, though this has not been demonstrated using a binary synaptic resolution [16]. Finally, Seo et al. propose a stochastic version of the spike-timing-dependent plasticity (S-STDP) rule for online learning in binary synapses [17]. However, S-STDP requires the design of a custom SRAM memory with both row and column accesses, which severely degrades the density advantage of their approach.
It has been suggested in [18] that the spike-dependent synaptic plasticity (SDSP) learning rule proposed by Brader et al. in [19] allows for a more efficient resource usage than STDP: all the information necessary for learning is available in the post-synaptic neuron at pre-synaptic spike time. SDSP requires neither an expensive local synaptic storage of spike timings nor a custom SRAM with both row and column accesses. Therefore, in this work, we propose an efficient stochastic implementation of SDSP compatible with a standard high-density foundry SRAM in order to leverage embedded online learning in binary-weight SNNs. We demonstrate this approach with MorphIC, a quad-core digital neuromorphic processor with a hierarchical routing fabric for large-scale chip interconnection. MorphIC was prototyped in 65nm CMOS and embeds 2k neurons and more than 2M synapses in an active silicon area of 2.86mm², achieving a high density of 738k 1-bit online-learning synapses per mm². This is an order-of-magnitude density improvement over the binary-weight online-learning SNN processor from Seo et al. [17].
Fig. 2: Block diagram of the MorphIC quad-core neuromorphic processor.
The remainder of this paper is structured as follows. The architecture and implementation of the MorphIC SNN processor are provided in Section II. The specifications, measurements and benchmarking results are then presented in Section III.
II. ARCHITECTURE AND IMPLEMENTATION
A block diagram of the MorphIC quad-core spiking neuromorphic processor is shown in Fig. 2, illustrating its hierarchical routing fabric for large-scale chip interconnection: level-0 (L0) routers handle intra-core connectivity, level-1 (L1) routers handle inter-core connectivity and level-2 (L2) routers handle inter-chip connectivity (Section II-A). A block diagram of the MorphIC core is shown in Fig. 3: each core embeds 512 leaky integrate-and-fire (LIF) neurons, 256k L0 synapses configured as a crossbar array, 256k L1 synapses and 16k L2 synapses. Time multiplexing is used to increase the neuron and synapse densities by using shared update circuits and offloading neuron and synapse states to local SRAM memory, based on the strategy we previously proposed for the ODIN SNN in [20]. Each synapse embeds online learning with a stochastic implementation of the spike-dependent synaptic plasticity (S-SDSP) learning rule (Section II-B).
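To make the time-multiplexing principle concrete, the following behavioral Python sketch shows a single shared update circuit sweeping a neuron-state memory instead of instantiating 512 physical neurons. The state format, threshold value and the split between synaptic and leakage events are illustrative assumptions, not the exact ODIN/MorphIC datapath.

```python
import numpy as np

NUM_NEURONS = 512
v_mem = np.zeros(NUM_NEURONS, dtype=np.int32)   # neuron states, held in on-chip SRAM
V_THR = 64                                      # firing threshold (arbitrary here)

def synaptic_event(weights_row):
    """One input spike: add each 1-bit weight to its target neuron, fire on threshold."""
    spikes = []
    for i in range(NUM_NEURONS):                 # neurons visited one per clock cycle
        v_mem[i] += weights_row[i]               # weights_row[i] is 0 or 1
        if v_mem[i] >= V_THR:
            spikes.append(i)
            v_mem[i] = 0                         # reset after firing
    return spikes

def leakage_event(leak=1):
    """Periodic time-reference event implementing the leak of the LIF neurons."""
    np.subtract(v_mem, leak, out=v_mem)
    np.maximum(v_mem, 0, out=v_mem)
```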
A. Hierarchical event routing
Clustering groups of neurons with dense local and sparse long-range connectivity allows minimizing memory requirements while keeping flexibility and scalability [21]. This organization is found in the brain, where it is known as a small-world network [22]. Hierarchy is therefore a key concept in SNN event routing infrastructures for large-scale networks [15], [16], [21], [23], [24]. MorphIC uses a heterogeneous hierarchical routing fabric with different router types at each level.
Fig. 3: Block diagram of the MorphIC core. Each core features 512 LIF neurons and 528k binary-weight synapses with embedded online learning. Per core, the L0 and L1 synapses are stored in a 64kB SRAM (256k L0 and 256k L1 synapses), while the neuron states and the 16k L2 synapses are stored in an 8kB SRAM.
1) L0 – A crossbar architecture ensures dense local all-to-all connectivity of the 512 neurons inside each core. 256k L0 synapses are thus available per core.
2) L1 – A star router ensures low-latency local broadcasting of events inside MorphIC chips. L1 connectivity information is stored in the source neuron: a 3-bit word allows independently enabling source-based routing to each of the three other local cores. 256k L1 synapses are thus available per core.
3) L2 – A mesh router with dimension-ordered routing (i.e. x direction before y direction) manages inter-chip connectivity. Each neuron has 32 L2 synapses, for a total of 16k per core. L2 connectivity information is stored in the source neuron and is destination-based: each neuron can target a specific L2 synapse address in any combination of cores in another MorphIC chip, at a distance of up to 4 chips in any direction. Address-event representation (AER) transactions [25] are used for L2 event routing.
This combination of routers is deadlock-free and allows reaching a fan-in of 512 (L0) + 512 (L1) + 32 (L2) and a fan-out of 512 (L0) + 3×512 (L1) + 4 (L2). A key feature of MorphIC is that synapses at all routing levels embed online learning (Section II-B): distance information is contained in event packets and modulates the probability of synaptic weight update, in accordance with small-world network models. To the best of our knowledge, no other previously-proposed SNN features online hierarchical learning.
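For illustration, the sketch below mimics the three routing decisions taken when a neuron spikes. The packet fields, function names and core indexing are assumptions made for readability, not the actual MorphIC packet format.

```python
def route_spike(src_neuron, l1_mask, l2_dest=None):
    """Fan one spike event out through the L0/L1/L2 hierarchy."""
    events = []
    # L0: crossbar, all-to-all connectivity to the 512 neurons of the local core.
    events.append(("L0", "local_core", src_neuron))
    # L1: star router; a 3-bit source-based mask enables each of the 3 other cores.
    for core_idx, enabled in enumerate(l1_mask):
        if enabled:
            events.append(("L1", core_idx, src_neuron))
    # L2: mesh router, destination-based, dimension-ordered (x before y),
    # up to 4 chips away in any direction.
    if l2_dest is not None:
        dx, dy, syn_addr = l2_dest
        hops = ["E" if dx > 0 else "W"] * abs(dx) + ["N" if dy > 0 else "S"] * abs(dy)
        events.append(("L2", hops, syn_addr))
    return events

# Example: spike from neuron 42, broadcast to two other cores, plus one L2 target
# two chips east and one chip north at synapse address 5 (hypothetical values).
print(route_spike(42, l1_mask=(1, 0, 1), l2_dest=(2, 1, 5)))
```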
B. Stochastic spike-dependent synaptic plasticity (S-SDSP)
As the spike-timing-dependent plasticity (STDP) learning rule relies on the relative timing between pre- and post-synaptic spikes, it requires a local synaptic buffering of spike timings, which leads to critical overheads as buffering circuitry has to be replicated inside each synapse [18]. In order to avoid this problem, the stochastic binary approach proposed by Seo et al. in [17] involves the design of a custom SRAM with both row and column accesses to carry out STDP updates each time pre- and post-synaptic spikes occur. However, beyond increasing the design time, custom SRAMs do not benefit from DRC-pushed rules for foundry bitcells and induce a strong area penalty compared to single-port high-density foundry SRAMs [18]. Therefore, STDP cannot be implemented efficiently in silicon.
Fig. 4: Time-multiplexed S-SDSP update logic. The up and down signals represent the S-SDSP update conditions of Eq. (2); a 9-bit pseudo-random word rnd from the unfolded LFSR is compared to the programmed probabilities param_q± to decide whether the binary weight wb is updated.

Fig. 5: (a) Circuit diagram of a Galois 17-bit LFSR with characteristic polynomial x^17 + x^3 + 1. (b) Equivalent compact representation. (c) 9-unfolded 17-bit Galois LFSR for single-cycle 9-bit pseudo-random word generation for the S-SDSP online learning rule.
The spike-dependent synaptic plasticity (SDSP) learning rule [19] avoids this drawback: the synaptic weight w is updated each time a pre-synaptic event occurs, according to Eq. (1). The update depends solely on the state of the post-synaptic neuron at the time of the pre-synaptic spike, i.e. the membrane potential Vmem compared to threshold θm and the Calcium concentration Ca compared to thresholds θ1, θ2 and θ3. The Calcium concentration represents an image of the recent firing activity of the neuron; it disables SDSP updates for high and low post-synaptic neuron activities and helps prevent overfitting [19]. A single-port high-density foundry SRAM can therefore be used for high-density time-multiplexed implementations. However, as SDSP relies on discrete positive and negative steps, it cannot be applied directly to binary weights.

$$
\begin{cases}
w \rightarrow w+1 & \text{if } V_{mem}(t_{pre}) \geq \theta_m,\ \theta_1 \leq Ca(t_{pre}) < \theta_3\\
w \rightarrow w-1 & \text{if } V_{mem}(t_{pre}) < \theta_m,\ \theta_1 \leq Ca(t_{pre}) < \theta_2
\end{cases}
\qquad (1)
$$
Senn and Fusi proposed a stochastic learning rule for binary synapses in [26]. However, the update conditions rely on the total synaptic input of the post-synaptic neuron at the time of the pre-synaptic spike, which is difficult to extract in time-multiplexed implementations. Therefore, we propose a stochastic spike-dependent synaptic plasticity (S-SDSP) learning rule suitable for binary weights, as formulated in Eq. (2). It results from the fusion of the stochastic mechanism proposed in [26] with the SDSP update conditions. ζ+ and ζ– are binary random variables with probabilities q+ and q– of being at 1, respectively. The synaptic weight wb therefore goes from 0 to 1 (resp. 1 to 0) with probability q+ (resp. q–), depending on the update conditions.

$$
\begin{cases}
w_b \rightarrow w_b + \zeta^{+}(1-w_b) & \text{if } V_{mem}(t_{pre}) \geq \theta_m,\ \theta_1 \leq Ca(t_{pre}) < \theta_3\\
w_b \rightarrow w_b - \zeta^{-}\,w_b & \text{if } V_{mem}(t_{pre}) < \theta_m,\ \theta_1 \leq Ca(t_{pre}) < \theta_2
\end{cases}
\qquad (2)
$$
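As a behavioral reference, the Python sketch below applies Eq. (2) to one binary weight at a pre-synaptic spike. The variable names and the use of a software random generator are ours and do not reflect the hardware implementation described next.

```python
import numpy as np

rng = np.random.default_rng(0)

def s_sdsp_update(w_b, v_mem, ca, theta_m, theta_1, theta_2, theta_3,
                  q_plus, q_minus):
    """Stochastic SDSP update of a binary weight at a pre-synaptic spike time.

    w_b     : current binary weight (0 or 1)
    v_mem   : post-synaptic membrane potential at the pre-synaptic spike time
    ca      : post-synaptic Calcium concentration at the pre-synaptic spike time
    q_plus, q_minus : potentiation / depression probabilities
    """
    if v_mem >= theta_m and theta_1 <= ca < theta_3:
        # Potentiation condition of Eq. (2): 0 -> 1 with probability q_plus.
        if rng.random() < q_plus:
            return 1
    elif v_mem < theta_m and theta_1 <= ca < theta_2:
        # Depression condition of Eq. (2): 1 -> 0 with probability q_minus.
        if rng.random() < q_minus:
            return 0
    return w_b  # no update otherwise
```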
The proposed S-SDSP update logic is shown in Fig. 4. The binary random variables ζ± can be generated with probabilities q± using linear feedback shift register (LFSR)-based pseudo-random number generation. In order to generate the q± probabilities with a resolution similar to those used in [26] and to leave sufficient margin for distance-based modulation of q± (Section II-A), a 9-bit resolution is required. As S-SDSP updates must be computed in a single clock cycle, 9 successive iterations of a given LFSR are parallelized by using the unfolding algorithm from [27] with an unfolding factor of 9, as suggested in [28] to avoid instantiating parallel LFSRs and save switching power. The unfolding process and the resulting unfolded LFSR are illustrated in Fig. 5. Unfolding leads the combinational logic resources (here, a single XOR gate) to be multiplied by the unfolding factor, while the LFSR period is divided by the unfolding factor. In order to avoid inducing correlation between synapses, the unfolded LFSR period must be one order of magnitude higher than the number of synapses per neuron; a 17-bit LFSR depth has thus been selected. The overhead incurred by the resulting S-SDSP update logic is negligible as it is shared through time multiplexing among all the L0, L1 and L2 synapses in a MorphIC core.
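A behavioral model of this pseudo-random number generation is sketched below: a Galois LFSR with polynomial x^17 + x^3 + 1, advanced nine steps per call so that one 9-bit word is produced per "cycle" as in the 9-unfolded hardware of Fig. 5. The bit ordering, output-bit choice and seed handling are our own assumptions, not the RTL.

```python
POLY = (1 << 3) | 1          # lower terms of p(x) = x^17 + x^3 + 1

def lfsr_step(state):
    """One Galois step: multiply the 17-bit state by x modulo x^17 + x^3 + 1."""
    msb = (state >> 16) & 1
    state = (state << 1) & 0x1FFFF
    if msb:
        state ^= POLY
    return state, msb

def unfolded_lfsr(state, n_bits=9):
    """Software equivalent of the 9-unfolded LFSR: advance by 9 steps and
    return the 9 output bits as one pseudo-random word."""
    word = 0
    for _ in range(n_bits):
        state, bit = lfsr_step(state)
        word = (word << 1) | bit
    return state, word

state = 0x1ACE5                      # any non-zero 17-bit seed
state, rnd = unfolded_lfsr(state)    # rnd is compared to the 9-bit q± parameters
```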
Fig. 6: (a) Chip microphotograph (1.87mm × 1.87mm die, showing the four core synapse and neuron SRAMs and the surrounding logic). (b) Specifications and measurements:
Technology: 65nm LP CMOS
Implementation: Digital
Area: 3.5mm² (incl. pads), 2.86mm² (excl. pads)
Number of cores: 4
Total # neurons (type): 2048 (LIF)
Total # synapses (hier.): 1M (L0), 1M (L1), 64k (L2)
Fan-in (hier.): 512 (L0), 512 (L1), 32 (L2)
Fan-out (hier.): 512 (L0), 3×512 (L1), 4 (L2)
Online learning: Stochastic SDSP, 1-bit weight
Time constant: Biological to accelerated
Supply voltage: 0.8V – 1.2V
Max. clock frequency: 55MHz (0.8V) – 210MHz (1.2V)
Leakage power (Pleak): 45µW @ 0.8V
Idle power (Pidle): 41.3µW/MHz @ 0.8V
Energy/SOP (ESOP): 30pJ @ 0.8V
III. MEASUREMENTS AND BENCHMARKING RESULTS
MorphIC was prototyped in an 8-metal 65-nm low-power (LP) CMOS process. A chip microphotograph is presented together with specifications and die measurement results at 25°C in Figs. 6(a) and 6(b), respectively. The power consumption P of digital SNNs can be modeled by Eq. (3) [20]:
$$
P = P_{leak} + P_{idle} \times f_{clk} + E_{SOP} \times r_{SOP}
\qquad (3)
$$
where Pleak is the leakage power, Pidle is the idle power (i.e. active clock, without network activity), ESOP is the energy per synaptic operation (SOP), fclk is the clock frequency and rSOP is the SOP processing rate. ESOP is an incremental definition of the energy per SOP, as it does not include contributions from leakage and idle power. At 0.8V, when including the leakage and idle power contributions at maximum fclk and rSOP, i.e. 55MHz and 110MSOP/s (27.5MSOP/s/core) [20], MorphIC consumes a total energy of 51pJ per SOP.
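For concreteness, the sketch below evaluates Eq. (3) with the measured values of Fig. 6(b) to recover the 51pJ figure; the numbers come from the paper, the script itself is only illustrative arithmetic.

```python
# Worked example of Eq. (3) at 0.8V, maximum clock frequency and SOP rate.
P_leak = 45e-6            # W
P_idle = 41.3e-6          # W per MHz of clock frequency
E_SOP  = 30e-12           # J per SOP (incremental)
f_clk  = 55               # MHz
r_SOP  = 110e6            # SOP/s (27.5 MSOP/s per core, 4 cores)

P_total = P_leak + P_idle * f_clk + E_SOP * r_SOP   # ≈ 5.6 mW
E_total_per_SOP = P_total / r_SOP                   # ≈ 51 pJ
print(round(E_total_per_SOP * 1e12, 1), "pJ/SOP")   # -> 51.1 pJ/SOP
```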
Offline learning performance with quantization-aware training is demonstrated with the MNIST dataset of handwritten digits [29]. Using the four cores of MorphIC and all the available neuron resources with the network topology shown in Fig. 7, an accuracy of 97.8% is reached.
Fig. 7: MNIST classification setup. Input 28×28-pixel MNIST images are split with interleaved sub-sampling into four independent 14×14-pixel images. The sub-image pixels are converted to rate-based Poisson-distributed spike trains and sent to four multi-layer perceptrons (MLPs) resulting from quantization-aware training in Keras, following [2], [3]: each core implements one sub-image MLP with 196 virtual input neurons, (500+1) hidden neurons ((490+1) in core 3) and (10+1) output neurons. Layer-wise inhibitory neurons are used to compensate for the rescaling of synaptic weights trained with −1 and +1 values in Keras to values of 0 and 1 in MorphIC. Average-pooling the core activities into a global 10-neuron output sum layer leads to a classification accuracy of 97.8%.
Fig. 8: MorphIC implementation of the 8-pattern CNN classification benchmark proposed in [30]. Left: the CNN architecture consists of four 7×7 line detection kernels at 0°, 45°, 90° and 135° applied to a 22×22 input stimulus, followed by average pooling layers (4@8×8) and a fully-connected (FC) 8-neuron output layer trained with a Calcium-based teacher signal. Right: the input pattern activities, the neuron activations of the convolution (4@16×16) and pooling layers, and the synaptic weights learned online with S-SDSP are illustrated. 100% accuracy is reached on 800 test samples.
S-SDSP online learning is demonstrated in Fig. 8, where we reproduce the benchmark that was proposed in [30] for an analog SDSP implementation. Eight patterns are classified by a spiking CNN. Each MorphIC core implements a fixed-weight convolutional layer with a line detection kernel followed by an average pooling layer. The pooling layers are connected with plastic weights to an 8-neuron fully-connected (FC) output layer in core 0. The resulting weights allow classifying with 100% accuracy a test set consisting of 100 different Poisson realizations of each input pattern. Future work should focus on extending S-SDSP to multi-layer training.
Finally, a comparison of MorphIC with the three previously-proposed binary SNNs is provided in Table I. While Loihi has a programmable learning engine, it does not demonstrate online learning in a binary-weight configuration; MorphIC and the chip from Seo et al. are thus the only ones to demonstrate embedded online learning with binary weights. The high-density claim of S-SDSP online learning on binary weights is supported by an order-of-magnitude density advantage over the S-STDP approach from Seo et al., which is further emphasized when considering process normalization to 65nm. Regarding power, MorphIC has a competitive energy per SOP despite using a less advanced CMOS process.
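As a quick check of the normalization described in the Table I footnote, the snippet below rescales the raw synapse densities by the node factor; Loihi is omitted since it uses the dedicated 13.5 FinFET factor from [31]. The values are taken from Table I, and the dictionary layout is ours.

```python
# Node-factor normalization of raw synapse densities to a 65nm node.
raw_densities = {                 # (raw synapse density [synapses/mm²], node [nm])
    "Seo et al. [17]": (80e3, 45),
    "TrueNorth [15]":  (674e3, 28),
    "This work":       (738e3, 65),
}
for chip, (raw, node) in raw_densities.items():
    norm = raw * (node / 65) ** 2     # area scales by (65/node)², density by the inverse
    print(f"{chip}: {norm/1e3:.1f}k synapses/mm² normalized to 65nm")
# -> Seo et al. [17]: 38.3k, TrueNorth [15]: ~125k, This work: 738.0k
```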
IV. CONCLUSION
In this paper, we presented the MorphIC quad-core spiking neuromorphic processor to leverage the power and density advantages of binary weights with online-learning SNNs. Using the proposed stochastic spike-dependent synaptic plasticity (S-SDSP) learning rule, we demonstrated this claim with a density of 738k synapses per mm² in 65nm CMOS, an order-of-magnitude improvement compared to the only previously-proposed binary SNN with demonstrated online learning, from Seo et al. MorphIC also integrates a hierarchical routing fabric for large-scale chip interconnection, where distance information allows modulating the synaptic update probabilities, in accordance with small-world brain network models.
TABLE I: Comparison of state-of-the-art binary SNN processor chips.
Reference Seo et al. [17] TrueNorth [15] Loihi [16] This work
Technology 45nm SOI 28nm LP 14nm FinFET 65nm LP
Area (excl. pads) 0.8mm2 389mm2 51.8mm2 2.86mm2
# cores 1 4096 128 4
# neurons / core 256 256 max. 1024 512
# synapses / core 64k 64k 114k to 1M 528k
Synaptic width 1-bit 1-bit 9- to 1-bit 1-bit
On-chip learning S-STDP – Programmable S-SDSP
Neuron density† (raw) 320 2.6k max. 2.5k 716
Neuron density† (norm.) 153 494 max. 190 716
Synapse density† (raw) 80k 674k 282k to 2.5M 738k
Synapse density† (norm.) 38.3k 125k 21k to 190k 738k
Incremental energy/SOP N/A N/A (>23.6pJ at 0.75V)∗ 30pJ at 0.8V
Total energy/SOP N/A 26pJ at 0.775V N/A 51pJ at 0.8V
∗ Simulation results, excluding the cost of neuron and learning engine updates.
† Neuron and synapse densities are obtained by dividing the total number of neurons or synapses by the chip area, excluding pads. As the obtained raw density performances are strongly dependent on the selected technology node, values normalized to a 65nm technology node are provided. Normalization is carried out by using the node factor (e.g., a (65/45)²-fold reduction for normalizing 45nm to 65nm), except for Loihi, where we used the 13.5 factor from [31] for 14nm FinFET normalization to a bulk 65nm node.
REFERENCES
[1] P. N. Whatmough et al., "A 28nm SoC with a 1.2 GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications," Proc. of IEEE International Solid-State Circuits Conference (ISSCC), 2017.
[2] B. Moons et al., "Minimum energy quantized neural networks," Proc. of IEEE Asilomar Conference on Signals, Systems, and Computers, 2017.
[3] M. Courbariaux et al., "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[4] R. Andri et al., "YodaNN: An architecture for ultralow power binary-weight CNN acceleration," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 48-60, 2018.
[5] B. Moons et al., "BinarEye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS," Proc. of IEEE Custom Integrated Circuits Conference (CICC), 2018.
[6] D. Bankman et al., "An always-on 3.8µJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28nm CMOS," Proc. of IEEE International Solid-State Circuits Conference (ISSCC), 2018.
[7] C. S. Poon and K. Zhou, "Neuromorphic silicon neurons and large-scale neural networks: challenges and opportunities," Frontiers in Neuroscience, vol. 5, p. 108, 2011.
[8] M. R. Azghadi et al., "Spike-based synaptic plasticity in silicon: design, implementation, application, and challenges," Proceedings of the IEEE, vol. 102, no. 5, pp. 717-737, 2014.
[9] N. Zheng and P. Mazumder, "Online supervised learning for hardware-based multilayer spiking neural networks through the modulation of weight-dependent spike-timing-dependent plasticity," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 9, pp. 4287-4302, 2017.
[10] H. Mostafa, "Supervised learning based on temporal coding in spiking neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 7, pp. 3227-3235, 2017.
[11] E. Neftci et al., "Event-driven random back-propagation: Enabling neuromorphic deep learning machines," Frontiers in Neuroscience, vol. 11, p. 324, 2017.
[12] F. Zenke and S. Ganguli, "SuperSpike: Supervised learning in multilayer spiking neural networks," Neural Computation, vol. 30, no. 6, pp. 1514-1541, 2018.
[13] P. Knag et al., "A sparse coding neural network ASIC with on-chip learning for feature extraction and encoding," IEEE Journal of Solid-State Circuits, vol. 50, no. 4, pp. 1070-1079, 2015.
[14] G. K. Chen et al., "A 4096-neuron 1M-synapse 3.8pJ/SOP spiking neural network with on-chip STDP learning and sparse weights in 10nm FinFET CMOS," Proc. of IEEE Symposium on VLSI Circuits (VLSI-C), 2018.
[15] F. Akopyan et al., "TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537-1557, 2015.
[16] M. Davies et al., "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro, vol. 38, no. 1, pp. 82-99, 2018.
[17] J.-S. Seo et al., "A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons," Proc. of IEEE Custom Integrated Circuits Conference (CICC), 2011.
[18] C. Frenkel et al., "A fully-synthesized 20-gate digital spike-based synapse with embedded online learning," Proc. of IEEE International Symposium on Circuits and Systems (ISCAS), pp. 17-20, 2017.
[19] J. M. Brader, W. Senn and S. Fusi, "Learning real-world stimuli in a neural network with spike-driven synaptic dynamics," Neural Computation, vol. 19, no. 11, pp. 2881-2912, 2007.
[20] C. Frenkel et al., "A 0.086-mm2 12.7-pJ/SOP 64k-synapse 256-neuron online-learning digital spiking neuromorphic processor in 28-nm CMOS," IEEE Transactions on Biomedical Circuits and Systems, vol. 13, no. 1, pp. 145-158, 2019.
[21] S. Moradi et al., "A scalable multicore architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)," IEEE Transactions on Biomedical Circuits and Systems, vol. 12, no. 1, pp. 106-122, 2018.
[22] D. Bassett and E. D. Bullmore, "Small-world brain networks," The Neuroscientist, vol. 12, no. 6, pp. 512-523, 2006.
[23] J. Park et al., "Hierarchical address event routing for reconfigurable large-scale neuromorphic systems," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2408-2422, 2017.
[24] J. Navaridas et al., "Understanding the interconnection network of SpiNNaker," Proc. of ACM International Conference on Supercomputing (ICS), pp. 286-295, 2009.
[25] K. A. Boahen, "Point-to-point connectivity between neuromorphic chips using address events," IEEE Transactions on Circuits and Systems II, vol. 47, no. 5, pp. 416-434, 2000.
[26] W. Senn and S. Fusi, "Convergence of stochastic learning in perceptrons with binary synapses," Physical Review E, vol. 71, no. 6, p. 061907, 2005.
[27] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 1999.
[28] C. Cheng and K. K. Parhi, "High-speed parallel CRC implementation based on unfolding, pipelining, and retiming," IEEE Transactions on Circuits and Systems II, vol. 53, no. 10, pp. 1017-1021, 2006.
[29] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits," 1998 [Online]. Available: http://yann.lecun.com/exdb/mnist/.
[30] G. Indiveri, F. Corradi and N. Qiao, "Neuromorphic architectures for spiking deep neural networks," Proc. of IEEE International Electron Devices Meeting (IEDM), 2015.
[31] K. Mistry, "10nm technology leadership," Leading at the Edge: Intel Technology and Manufacturing Day, 2017 [Online]. Available: https://newsroom.intel.com/newsroom/wp-content/uploads/sites/11/2017/03/Kaizad-Mistry-2017-Manufacturing.pdf.