IEEE TRANSACTIONS ON ELECTRON DEVICES, VOL. XX, NO. X, XXX 2015 1
Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element
Geoffrey W. Burr, Senior Member, Robert M. Shelby, Severin Sidler, Carmelo di Nolfo, Junwoo Jang, Irem Boybat, Student Member, Rohit S. Shenoy, Member, Pritish Narayanan, Member, Kumar Virwani, Member, Emanuele U. Giacometti, Bülent Kurdi, and Hyunsang Hwang, Member
Abstract: Using 2 phase-change memory (PCM) devices per synapse, a 3-layer perceptron network with 164,885 synapses is trained on a subset (5,000 examples) of the MNIST database of handwritten digits using a backpropagation variant suitable for nonvolatile memory (NVM) + selector crossbar arrays, obtaining a training (generalization) accuracy of 82.2% (82.9%). Using a neural network (NN) simulator matched to the experimental demonstrator, extensive tolerancing is performed with respect to NVM variability, yield, and the stochasticity, linearity and asymmetry of the NVM-conductance response. We show that a bidirectional NVM with a symmetric, linear conductance response of high dynamic range is capable of delivering the same high classification accuracies on this problem as a conventional, software-based implementation of this same network.
I. INTRODUCTION
DENSE arrays of nonvolatile memory (NVM) and selector device-pairs (Fig. 1) can implement neuro-inspired non-Von Neumann computing [1], [2], using pairs [2] of NVM devices as programmable (plastic) bipolar synapses. Work to date has emphasized the Spike-Timing-Dependent-Plasticity (STDP) algorithm [1], [2], motivated by synaptic measurements in real brains. However, experimental NVM demonstrations have been limited in size (≤100 synapses), and few results have reported quantitative performance metrics such as classification accuracy. Worse yet, it has been difficult to be sure whether the relatively poor metrics reported to date might be due to immaturities or inefficiencies in the STDP learning algorithm (as it is currently implemented), rather than reflective of problems introduced by the imperfections of the NVM devices.
Unlike STDP, backpropagation is a widely-used, well-studied method for training artificial neural networks, offering benchmarkable performance on datasets such as handwritten
G. W. Burr (e-mail: [email protected]), R. M. Shelby, S. Sidler, C. di Nolfo, P. Narayanan, K. Virwani, and B. Kurdi are with IBM Research-Almaden, San Jose, CA 95120.
I. Boybat is now with EPFL, Lausanne, Switzerland.
J. Jang is with the Department of Creative IT Engineering, Pohang University of Science and Technology, Pohang 790-784, Korea.
R. S. Shenoy is now with Intel, Santa Clara, CA 95054.
E. U. Giacometti was an intern at IBM Research-Almaden, San Jose, CA in 2014.
H. Hwang is with the Department of Material Science and Engineering, Pohang University of Science and Technology, Pohang 790-784, Korea.
Manuscript received March 9, 2015; revised May, 2015.
Fig. 1. Neuro-inspired non-Von Neumann computing [1], [2], in which neurons activate each other through dense networks of programmable synaptic weights, can be implemented using dense crossbar arrays of nonvolatile memory (NVM) and selector device-pairs.
Fig. 2. In forward evaluation of a multi-layer perceptron, each layer's neurons drive the next layer through weights wij and a nonlinearity f(). Input neurons are driven by pixels from successive MNIST images (cropped to 22×24); the 10 output neurons identify which digit was presented.
digits (MNIST) [3]. Although proposed earlier, it gained great popularity in the 1980s [3], [4], and with the advent of GPUs, backpropagation now dominates the neural network field. In the present work, we use backpropagation to train a relatively simple multi-layer perceptron network (Fig. 2). During forward evaluation of this network, each layer's inputs (xi) drive the next layer's neurons through weights wij and a nonlinearity f() (Fig. 2). Supervised learning occurs (Fig. 3) by backpropagating error terms δj to adjust each weight wij as the second step. A 3-layer network is capable of accuracies, on previously unseen test images (generalization), of up to ~97% [3] (Fig. 4); even higher accuracy is possible by first pre-training the weights in each layer [5]. (Here we use tanh() as the
Fig. 3. In supervised learning, error terms δj are backpropagated, adjusting each weight wij to minimize an energy function by gradient descent, reducing classification error between computed (x_l) and desired (g_l) output vectors.
Fig. 4. A 3-layer perceptron network can classify previously unseen (test) MNIST handwritten digits with up to ~97% accuracy [3]. Training on a subset of the images sacrifices some generalization accuracy but speeds up training.
nonlinear function f(), and one bias (always-ON) neuron is added to each layer other than the output layer, in addition to those neurons shown in Fig. 2.) As with STDP, low-power neurons should be achievable by emphasizing brief spikes [6] and local-only clocking. However, note that no CMOS neuron circuitry is built or even specified in this paper; our focus will be solely on the effects of the imperfections of the NVM elements.
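To make the network structure concrete, the forward pass just described (528-250-125-10 neurons, tanh() nonlinearity, one always-ON bias neuron added to each non-output layer) can be sketched in a few lines of Python. The random weight initialization and input scaling below are illustrative placeholders, not the values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [528, 250, 125, 10]          # cropped 22x24 input, 2 hidden layers, 10 digits
# +1 row per weight matrix for the always-ON bias neuron of each layer
weights = [rng.uniform(-0.1, 0.1, (sizes[k] + 1, sizes[k + 1]))
           for k in range(3)]

def forward(x, weights):
    """x: flattened 22x24 cropped MNIST image, here scaled to [-1, 1]."""
    for w in weights:
        x = np.append(x, 1.0)        # bias ("always-ON") neuron
        x = np.tanh(x @ w)           # x_j = f(sum_i x_i * w_ij)
    return x                          # 10 output activations

x = rng.uniform(-1.0, 1.0, 528)       # stand-in for one cropped image
out = forward(x, weights)
```

The 10 output activations are then compared against the one-hot target vector (g_l in Fig. 3) to drive the backpropagated corrections.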
We choose to work with phase-change memory (PCM) since we have access to large PCM arrays in hardware. We discuss the consequences of the fundamental asymmetry in PCM conductance response: small conductance increases can be implemented through partial-SET pulses, but the RESET (conductance decrease) operation tends to be quite abrupt. However, we also discuss the use of bidirectional NVM devices (such as non-filamentary RRAM [7]). We show that such a bidirectional NVM with a symmetric, linear conductance response is fully capable of delivering the same high classification accuracies (on the problem we study, handwritten digit recognition) as a conventional, software-based implementation of the same neural network.
Fig. 5. By comparing total read signal between pairs of bitlines, summation of synaptic weights (encoded as conductance differences, wij = G+ - G-) is highly parallel.
Fig. 6. Backpropagation calls for each weight to be updated by Δwij = η · xi · δj, where η is the learning rate. Colormap shows log(occurrences), in the 1st layer, during NN training (blue curve, Fig. 4); white contours identify the quantized increase in the integer weight.
Fig. 7. In a crossbar array, efficient learning requires neurons to update weights in parallel, firing pulses whose overlap at the various NVM devices implements training. (The square root of the learning rate is used to scale both xi and δj.) Colormap shows log(occurrences), in the 1st layer, during NN training (red curve, Fig. 8); white contours identify the quantized increase in the integer weight.
Fig. 8. Computer NN simulations show that a crossbar-compatible weight-update rule (Fig. 7) is just as effective as the conventional update rule (Fig. 6).
Fig. 9. The conductance response of an NVM device exhibits imperfections, including nonlinearity, stochasticity, varying maxima, asymmetry between increasing/decreasing responses, and non-responsive devices (at low or high G).
II. CONSIDERATIONS FOR A CROSSBAR IMPLEMENTATION

By encoding synaptic weight in the conductance difference between a pair of nonvolatile memory devices, wij = G+ - G- [2], forward propagation simply compares total read signal on bitlines (Fig. 5). This can be performed by encoding x using some combination of voltage-domain or time-domain encoding (either number of read pulses or pulse duration). These CMOS circuitry choices are interesting and important topics, but are beyond the scope of this paper.
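As a sketch of this differential read (Fig. 5), the following Python models each weight as a conductance pair, and a column's weighted sum as the difference of two bitline currents. The array size, conductance range, and activation encoding below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 528, 250
G_plus  = rng.uniform(0.0, 20e-6, (n_in, n_out))   # conductances [S], illustrative
G_minus = rng.uniform(0.0, 20e-6, (n_in, n_out))

x = rng.uniform(0.0, 1.0, n_in)   # activations, encoded as read signals

# In hardware, each bitline integrates its column's currents in parallel;
# here a matrix-vector product stands in for that analog summation.
I_plus  = x @ G_plus              # total current on the "+" bitlines
I_minus = x @ G_minus             # total current on the "-" bitlines
weighted_sum = I_plus - I_minus   # equals sum_i x_i * (G+_ij - G-_ij)
```

The key property is that the subtraction commutes with the summation, so a single differential sense per column pair recovers the signed weighted sum.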
Any nonvolatile memory device that can offer a nondestructive parallel read (as shown in Fig. 5) of memory states that can be smoothly adjusted up or down through a wide range of analog values could potentially be used in this application. Here we focus on NVM devices that offer a range of analog conductance states.
This paper is concerned with how real NVM devices will respond to programming instructions during in situ training of their artificial neural network. Unfortunately, the conventional backpropagation algorithm [4] calls for weight updates Δwij ∝ xi·δj (Fig. 6), which forces upstream (i) and downstream (j) neurons to exchange information uniquely for each and every synapse. This serial, element-by-element information exchange between neurons is highly undesirable in a crossbar array implementation. One alternative is to have each neuron, downstream and upstream, fire pulses based on their local knowledge of xi and δj, respectively. The presence of a nonlinear selector is critical to ensure that NVM programming occurs only when pulses from both the upstream and downstream neurons overlap. This allows neurons to modify weights in parallel, making learning much more efficient [1] (Fig. 7). (Note that to reduce peak power, one might choose to stagger these write pulses across the array, one sub-block at a time.) Fig. 8 shows, using a simulation of the neural network in Figs. 2, 3, that this adaptation for nonvolatile memory implementation has no adverse effect on accuracy.
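The pulse-overlap idea can be illustrated with a toy model: each neuron locally quantizes sqrt(η)·|xi| (upstream) or sqrt(η)·|δj| (downstream) into a few pulses placed in shared timeslots, and a synapse receives one programming pulse per timeslot in which both of its neurons fire. The "first-k-slots" pulse placement below is a simplification of the actual timeslot patterns of Fig. 7, so the overlap count only coarsely tracks the desired product xi·δj:

```python
import numpy as np

def to_pulses(value, n_slots=4):
    # Fire pulses in the first k of n_slots, with k set by quantizing |value|.
    # (Illustrative placement; Fig. 7 uses specific timeslot patterns.)
    k = min(n_slots, int(round(abs(value) * n_slots)))
    return np.arange(n_slots) < k          # boolean timeslot occupancy

def overlap_updates(x, delta, eta=0.25, n_slots=4):
    s = np.sqrt(eta)  # sqrt of learning rate scales BOTH x and delta (Fig. 7)
    up   = np.array([to_pulses(s * xi, n_slots) for xi in x])      # (n_in, slots)
    down = np.array([to_pulses(s * dj, n_slots) for dj in delta])  # (n_out, slots)
    # A synapse is programmed once per timeslot in which pulses from
    # both its upstream and downstream neurons overlap at the selector.
    return up.astype(int) @ down.astype(int).T                     # (n_in, n_out)

n = overlap_updates(np.array([0.9, 0.3, 0.0]), np.array([0.8, 0.1]))
```

Neither neuron ever needs to know the other's value; the selector's nonlinearity performs the "multiplication" locally at each cross-point.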
However, while modifying the update rule is clearly not a problem, the conductance response of any real nonvolatile memory device exhibits imperfections that can decidedly affect neural network performance. These imperfections include nonlinearity, stochasticity, varying maxima, asymmetry between increasing/decreasing responses, and non-responsive devices at low or high conductance (Fig. 9). The initial version
Fig. 10. Bounding G values reduces NN training accuracy slightly, but unidirectionality and nonlinearity in G-response can strongly degrade accuracy. (Curves, top to bottom: linear, unbounded; linear, bounded, bidirectional; linear, bounded, unidirectional; nonlinear, bounded, unidirectional.) Figure insets map NVM-pair synapse states on a diamond-shaped plot of G+ vs. G- (weight is vertical position) for a sampled subset of the weights, starting from a random initial weight distribution.
Fig. 11. If G values can only be increased (asymmetric G-response), a synapse at point A (G+ saturated) can only increase G-, leading to a low weight value (B). If response at small G values differs from that at large G (nonlinear G-response), alternating weight updates can no longer cancel. As synapses tend to get herded into the same portion of the G-diamond (C → D), the decrease in average weight can lead to network freeze-out.
of this work [8] was the first paper to study the relative importance of each of these factors. This expanded version adds significantly more explanatory details, as well as several new plots detailing paths for future improvement.
Bounding G values would appear to reduce neural networktraining accuracy slightly, as shown by the difference betweenthe blue and red (top two) curves in Fig. 10. However,
Fig. 12. Synapses with large conductance values (inset, right edge of G-diamond) can be refreshed (moved left) while preserving the weight (to some accuracy), by RESETs to both conductances followed by a partial-SET of one. If such RESETs are too infrequent, weight evolution stagnates and NN accuracy degrades.
Fig. 13. Mushroom-cell [9], 1T1R PCM devices (180nm node) with 2 metal interconnect layers enable 512×1024 arrays. A 1-bit sense amplifier measures G values, passing the data to software-based neurons. Weights are increased (decreased) by identical 25ns partial-SET pulses that increase G+ (G-) (Fig. 7), or refreshed by RESETs to both conductances followed by an iterative SET procedure (Fig. 12).
unidirectionality and nonlinearity in the G-response strongly degrade accuracy (bottom two curves, green and magenta). Figure insets (Fig. 10) map nonvolatile-memory-pair synapse states on a diamond-shaped plot of G+ vs. G- (weight is vertical position). In this context (Fig. 11), a PCM-based synapse with a highly asymmetric G-response (only partial-SET can be done gradually) moves only unidirectionally, from left to right. (Bipolar filamentary RRAM or CBRAM would have an identical problem, except that SET is the abrupt step and it is the RESET step which can be performed gradually.)
Once one G value is saturated, subsequent training can only increase the other G value, reducing weight magnitude. Nonlinearity in G-response further encourages weights of low value: if the response at small G values differs from that at large G, alternating weight updates no longer cancel. As synapses are herded into the same portion of the G-diamond (Fig. 11), the decrease in average weight can lead to network freeze-out (Fig. 10 inset). In such a condition, the network updates very few if any weights, meaning that it stops evolving towards higher accuracy. Worse yet, since the few weight updates that do occur are quite likely to lead to weight-magnitude decay, previously trained information is steadily erased and accuracy can actually decrease (bottom two curves, Fig. 10).
One solution to the highly asymmetric response of PCM devices is occasional RESET [2], moving synapses back to the left edge of the G-diamond while preserving weight value (using an iterative SET procedure, Fig. 12 inset). However, if this is not done frequently enough, weight stagnation will degrade neural network accuracy (Fig. 12). (An analogous approach for bipolar filamentary RRAM or CBRAM would be occasional SET.)
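A toy model of this occasional-RESET strategy, in arbitrary illustrative units: the two conductances of a pair can only step upward (partial-SET) until saturating, and a refresh RESETs both devices and rebuilds the weight on one of them via iterative SET:

```python
GMAX, DG = 20.0, 1.0   # saturation level and step size, illustrative units

def update(gp, gm, sign):
    # Weight increase: pulse G+; weight decrease: pulse G- (unidirectional NVM).
    if sign > 0:
        gp = min(GMAX, gp + DG)
    else:
        gm = min(GMAX, gm + DG)
    return gp, gm

def refresh(gp, gm):
    # RESET both conductances, then iterative-SET one device so that the
    # weight w = G+ - G- is preserved while headroom is restored.
    w = gp - gm
    return (min(GMAX, w), 0.0) if w >= 0 else (0.0, min(GMAX, -w))

gp, gm = 0.0, 0.0
for sign in [+1] * 8 + [-1] * 3:   # 8 increases, 3 decreases: net weight +5
    gp, gm = update(gp, gm, sign)
# Both conductances have drifted rightward (gp=8, gm=3); refresh moves the
# synapse back to the left edge of the G-diamond with the same weight.
gp, gm = refresh(gp, gm)
```

Without the refresh, continued training would eventually saturate one device, after which all further updates could only shrink the weight magnitude.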
III. EXPERIMENTAL RESULTS
We implemented a 3-layer perceptron of 164,885 synapses (Figs. 2, 3) on a 500×661 array of mushroom-cell [9], 1T1R PCM devices (180nm node, Fig. 13). While the weight-update algorithm (Fig. 7) is fully compatible with a crossbar implementation, our hardware allows only sequential access to each
[Fig. 14 flowchart, comparing OUR IMPLEMENTATION (hardware + software) with an EVENTUAL FULL-HARDWARE CROSSBAR IMPLEMENTATION: write random weights to PCM G values, then read PCM G values (crossbar: randomize weights). For each MNIST example, forward propagation computes f(Σ xi wij) in software (crossbar: measure I+ - I- ∝ Σ xi wij), repeating for each NN layer; backpropagation computes δj corrections and the pulses to apply (Fig. 7) to each synapse (G+ or G-), repeating for each NN layer (crossbar: upstream (xi) and downstream (δj) neurons apply pulses directly, programming the synapse). After each mini-batch (5 examples), pulses are applied to the PCM G values; after every 20 examples, 2 RESETs (+1 incremental SET) are applied to synapses with large G values.]
Fig. 14. Although G values are measured sequentially, weight-summation and weight-update procedures in our software-based neurons closely mimic the column (and row) integrations and pulse-overlap programming needed for parallel operations across a crossbar array. However, since occasional RESET is triggered when both G+ and G- are large, serial device access is required to obtain and then re-program individual conductances.
Fig. 15. Training and test accuracy for a 3-layer perceptron of 164,885 hardware synapses, with all weight operations taking place on a 500×661 array (2 PCM devices/synapse × 164,885 synapses, plus 730 unused devices) of mushroom-cell [9] PCM devices (Fig. 13). Also shown is a matched computer simulation of this NN, using parameters extracted from the experiment. The inset maps the final PCM conductances.
Fig. 16. 50-point cumulative distributions of experimentally measured conductances for the 500×661 PCM array (330,500 PCM devices), showing variability and stuck-ON pixel rate. Insets show the measured RESET accuracy (from all ~708k RESETs) and the rate and stochasticity of G-response (from all ~31 million partial-SET pulses), plotted as a colormap of G-per-pulse vs. G.
Fig. 17. Fitted G-response vs. number of pulses (blue: average; red: ±1σ responses), obtained from our computer model (inset) for the rate and stochasticity of G-response (G-per-pulse vs. G) matched to experiment (Fig. 16).
PCM device (Fig. 13). For read, a sense amplifier measures G values, passing the data to software-based neurons. Although this measurement is performed sequentially, the weight-summation and weight-update procedures in the software-based neurons closely mimic the column- and row-based integrations. (Again, since no particular CMOS circuitry has been specified, we assume that the 8-bit value of xi is implemented completely accurately. Any problems introduced by inaccurate encoding of xi values by real CMOS hardware could be easily assessed using our tolerancing simulator.)
Weights are increased (decreased) by identical partial-SET pulses (Fig. 7) that increase G+ (G-) (Fig. 14). The deviation from a true crossbar implementation occurs upon occasional RESET (Fig. 12), triggered when either G+ or G- is large, thus requiring both knowledge of and control over individual G values. Serial device access is required, both to measure the G values (to determine which lie in the L-shaped region at the right side of the G-diamond) and then to fire two RESET pulses (at both G+ and G-) followed by an iterative SET procedure to increase one of those two conductances until the correct synaptic weight is restored. Since the time and energy associated with this process are large, it is highly desirable to perform occasional RESET as infrequently and as inaccurately as possible.
Fig. 15 shows measured accuracies for a hardware-synapse neural network, with all weight operations taking place on PCM devices. To reduce test time, weight updates for each mini-batch of 5 MNIST examples were applied together. Fig. 16 plots measured G-response, stochasticity, variability, stuck-ON pixel rate, and RESET accuracy. By matching all parameters including stochasticity (Fig. 17) to those measured during the experiment, our neural network computer simulation was able to precisely reproduce the measured accuracy trends (Fig. 15).
IV. TOLERANCING AND POWER CONSIDERATIONS
We can now use this matched neural network simulation to explore the importance of nonvolatile memory imperfections. Fig. 18 shows final training (test) accuracy as a function of variations in nonvolatile memory and neural network parameters away from the conditions used in our hardware demo (green dotted line). NN performance is highly robust to stochasticity (Fig. 18(a)), variable maxima (c), the presence of non-responsive devices (d,e), and infrequent and inaccurate RESETs (f,g). A mini-batch of size 1 allows weight updates to be applied immediately (Fig. 18(h)). However, as mentioned earlier, nonlinearity and asymmetry in G-response (Fig. 18(b)) limit the maximum possible accuracy (here, to ~85%), and require precise tuning of the learning rate η and the slope of the neuron response (f') (Fig. 18(i,j)). Too low a learning rate and no weight receives any update; too high, and the imperfections in the NVM response generate chaos. The narrow distribution of these parameters means that the experiment must be tuned very carefully. An extension of an existing neural network technique to a crossbar-based neural network has been found to provide a much broader distribution of usable learning rates. This technique is currently under investigation and will be the subject of a future publication.
V. DISCUSSION

While the asymmetric G-response of PCM makes it necessary to occasionally stop training, measure all conductances, and apply RESETs and iterative SETs, energy usage can be reasonable if RESETs are infrequent (Fig. 19, inset), and if the learning rate is low (Fig. 19).
Neural network performance with bidirectional nonvolatile-memory-based synapses can deliver high classification accuracy if the G-response is linear and symmetric (Fig. 20, green curve) rather than nonlinear (red curve). Asymmetry in G-response (blue curve) strongly degrades performance. In Fig. 21, we further explore the trends with an ideal but nonlinear NVM, varying both the initial steepness of the G-response and the choice between fully bidirectional weight updates (when increasing weight, for instance, we both increase G+ and decrease G- together) and alternating bidirectional updates (we choose one, but not both, of these two steps). Clearly, a less-steep response is favorable, and the distinction between fully and alternating bidirectional has the most impact for steeply nonlinear G-responses.
The most ideal NVM, with a linear and symmetric conductance response in both directions, would result in more regularly distributed weight values and fewer freeze-outs, leading to higher accuracies. In Fig. 22 and Fig. 23, we show that a gentle linear response (e.g., a large number of identical pulses is needed to change the conductance from minimum to maximum and vice-versa) is advantageous compared to a steep response. While both the alternating-bidirectional and fully-bidirectional update schemes deliver higher accuracies than an NVM with a nonlinear conductance response, only the fully-bidirectional update scheme reaches the same high test accuracies exhibited by networks in which the NVM conductances are unbounded (Fig. 23, inset). Fig. 24 replots the same data from Fig. 23 on a logarithmic vertical scale, to accentuate the high-accuracy region.
The reason for this difference is shown in Fig. 25, where one example of a desired sequence of weight updates is contrasted
Fig. 18. Matched simulations show that an NVM-based NN is highly robust to stochasticity, variable maxima, the presence of non-responsive devices, and infrequent or inaccurate RESETs. A mini-batch size of 1 avoids the need to accumulate weight updates before applying them. However, the nonlinear and asymmetric G-response limits accuracy to ~85%, and requires the learning rate and neuron response (f') to be precisely tuned.
Fig. 19. Despite the higher power involved in RESET rather than partial SET (~30pJ and ~3pJ for highly-scaled PCM [1]), total energy costs of training can be minimized if RESETs are sufficiently infrequent (inset). Low-energy training requires low learning rates, which minimize the number of synaptic programming pulses. At higher learning rates, even a bidirectional, linear NVM requiring no RESET and offering low power (~1pJ per pulse) can lead to large training energy.
Fig. 20. NN performance is improved if G-response is linear and symmetric (green curve) rather than nonlinear (red). However, asymmetry between the up- and down-going G-responses (blue), if not corrected in the weight-update rule (Fig. 7), can strongly degrade performance by favoring particular regions of the G-diamond (Figs. 10, 11).
Fig. 21. NN performance with a bidirectional but nonlinear G-response (same basic shape as Fig. 20, red inset curve) improves as the response becomes more gentle and the initial slope less steep. The choice between always updating both conductances when updating the weight (fully bidirectional) and alternating between updating G+ or G- but not both (alternating bidirectional) has the most impact when the G-response is steeply nonlinear. (Note that due to changes in learning rate and the slope of the nonlinear function f(), the red curve in Fig. 20 is not duplicated here.)
Fig. 22. NN performance when alternating between updating G+ or G- but not both (alternating bidirectional), with a linear G-response. This update method cannot reach the performance of a network with unbounded synaptic weights, even when the dynamic range of the linear response is large (e.g., when the change due to any one pulse is only a small fraction of the range between the minimum and maximum conductances).
Fig. 23. NN performance (classification accuracy during training) when updating both G+ and G- (fully bidirectional scheme), with a linear G-response. The inset shows that, when the dynamic range of the linear response is large, the classification accuracy can now reach that of the original network (a test accuracy of ~94% when trained with 5,000 images; ~97% when trained with all 60,000 images).
Fig. 24. The same NN performance data for a linear, symmetric NVM under the fully bidirectional scheme (Fig. 23), here replotted on a logarithmic scale that accentuates the high-accuracy region. Here only the classification accuracy on the training set is shown, which can reach close to 100%, at which point the accuracy on the test set of 10,000 images that the network has never seen before (Fig. 23) becomes the only relevant way to gauge network performance.
to the actual weight updates that get implemented in these two update schemes. In Fig. 25(b), we show that when the state of the synapse is at the boundaries of the G-diamond, there is a significant chance that the next weight update under the alternating-bidirectional scheme will have little or no impact, simply because a conductance that is already saturated cannot be increased (decreased) any further. In the fully-bidirectional update scheme, some amount of weight update will still occur at the edges of the G-diamond, leading to smaller discrepancies between the desired and actual weight changes, and thus higher performance. In addition, because the weights only move "up" and "down" the G-diamond in the fully-bidirectional scheme, the synapses stay in the center stripe of the G-diamond (Fig. 26(b)), where they have access to the full dynamic range available. In contrast, because each weight update in the alternating-bidirectional scheme moves along a diagonal line, some number of synapses end up at the edges of the G-diamond, where the effective dynamic range they can access is significantly reduced (Fig. 26(a)).
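The difference between the two schemes at a G-diamond edge can be sketched as follows; the units, step size, and function names are illustrative, not the simulator's actual implementation:

```python
GMAX, DG = 20.0, 1.0                     # illustrative conductance bounds and step
clip = lambda g: max(0.0, min(GMAX, g))  # conductances are bounded in [0, GMAX]

def step_fully(gp, gm, sign):
    # Fully bidirectional: move BOTH conductances on every weight update.
    return clip(gp + sign * DG), clip(gm - sign * DG)

def step_alternating(gp, gm, sign, use_gplus):
    # Alternating bidirectional: move only ONE conductance per update,
    # alternating between G+ and G- over successive updates.
    if use_gplus:
        return clip(gp + sign * DG), gm
    return gp, clip(gm - sign * DG)

gp, gm = 10.0, 0.0   # synapse at a diamond edge: G- already at its minimum
# Fully bidirectional: the weight still increases (G+ moves, even though
# the clipped G- step is lost).
assert step_fully(gp, gm, +1) == (11.0, 0.0)
# Alternating scheme, on a turn that targets G-: the update is lost entirely.
assert step_alternating(gp, gm, +1, use_gplus=False) == (10.0, 0.0)
```

The clipped-but-partial update of the fully bidirectional scheme is what keeps synapses in the center stripe of the G-diamond, while the all-or-nothing behavior of the alternating scheme strands them at the edges.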
These results demonstrate conclusively that NVM devices should be fully capable of delivering the same classification accuracy on the MNIST handwritten digits as a conventional implementation of this artificial neural network. All that is required of the NVM device is that it offer a bidirectional, linear, and symmetric response in conductance, with large dynamic range (e.g., the change due to any one pulse represents only a small fraction of the entire conductance range available).
Other future work will be needed to demonstrate a full crossbar-array implementation, including dedicated CMOS circuitry for summation of synaptic weights during both forward and backpropagation through nearly-identical high-performance nonlinear selector devices. The values of neurons (x) and backpropagated errors (δ) will need to be stored in CMOS circuitry and presented to the crossbar, through some combination of analog voltage levels, number of read pulses, and/or duration of read pulses. The need to synchronize write-pulse timing between upstream and downstream neurons, and techniques to disperse the high-energy writes in time (to reduce the load on write drivers and voltage supplies), must also be addressed in future work.
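In software terms, the summation the crossbar performs during forward propagation is a matrix-vector product with the signed weights w = G+ − G−; the sketch below illustrates this equivalence (array sizes and conductance values are arbitrary, not those of the demonstrator).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical crossbar: each synapse is a conductance pair (G+, G-),
# and its effective signed weight is the difference w = G+ - G-.
n_in, n_out = 4, 3
g_plus = rng.uniform(0.0, 1.0, size=(n_in, n_out))
g_minus = rng.uniform(0.0, 1.0, size=(n_in, n_out))

def crossbar_forward(x):
    """The analog multiply-accumulate performed by the array: activations x
    drive the rows (as voltage levels or read pulses), and each column
    integrates the current I_j = sum_i x_i * (G+_ij - G-_ij), which is
    exactly a matrix-vector product with the signed weights."""
    return x @ (g_plus - g_minus)

x = np.array([0.0, 0.5, 1.0, 0.2])   # example neuron activations
print(crossbar_forward(x))           # one summed current per output column
```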
VI. CONCLUSIONS

Using 2 phase-change memory (PCM) devices per synapse,
a 3-layer perceptron with 164,885 synapses was trained with backpropagation on a subset (5000 examples) of the MNIST database of handwritten digits to high accuracy (82.2% on the training set, 82.9% on the test set). A weight-update rule compatible with NVM+selector crossbar arrays was developed, and was shown to have no adverse effect on accuracy. A novel G-diamond concept (Fig. 11) was introduced to illustrate issues created by nonlinearity and asymmetry in the NVM conductance response. Asymmetry can be mitigated by an occasional-RESET strategy (Fig. 12), which can be both infrequent and inaccurate (Figs. 12, 18(f,g)).
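The tolerance of the occasional-RESET strategy to inaccuracy follows from the fact that only the difference G+ − G− contributes to the weight; a hypothetical sketch of such a step (the bounds and saturation threshold are assumed for illustration):

```python
G_MIN, G_MAX = 0.0, 1.0       # assumed conductance bounds (arbitrary units)
THRESHOLD = 0.9 * G_MAX       # assumed saturation level that triggers RESET

def occasional_reset(gp, gm):
    """If either conductance nears saturation, RESET both toward G_MIN and
    re-program one of them so that the weight w = G+ - G- is preserved.
    The re-programming need not be precise: a residual error merely shifts
    the weight slightly, and subsequent training can correct it."""
    if gp < THRESHOLD and gm < THRESHOLD:
        return gp, gm                     # still inside the safe region
    w = gp - gm
    if w >= 0:
        return G_MIN + w, G_MIN           # carry the weight on G+
    return G_MIN, G_MIN - w               # carry the weight on G-

gp, gm = 0.95, 0.60                       # G+ near saturation
new_gp, new_gm = occasional_reset(gp, gm)
print(new_gp - new_gm)                    # weight is preserved after RESET
```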
Using a neural network (NN) simulator matched to the experimental demonstrator, extensive tolerancing was performed (Fig. 18). Results show that network parameters such as learning rate and the slope of the nonlinear neuron-response function (Figs. 18(i,j)), and the nonlinearity, symmetry, and bounded nature of the conductance response (Figs. 9–11, 20–26), are critical to achieving high performance in an NVM-based neural network.
Our results show that all NVM-based neural networks (not just those based on PCM) can be expected to be highly resilient to "random" effects (NVM variability, yield, and stochasticity), but will be highly sensitive to "gradient" effects that act to steer all synaptic weights. A learning rate just high enough to avoid network freeze-out is shown to be advantageous for both high accuracy and low training energy. We also show that a bidirectional NVM with a symmetric, linear conductance response of high dynamic range (each conductance step is relatively small) would be fully capable of delivering the same high classification accuracies on the MNIST handwritten-digit database as a conventional, software-based implementation, ranging from >94% when trained on 5000 examples to >97% when trained on the full set of 60,000 training examples.
Fig. 25. Because of the finite bounds on conductance values, any desired sequence of weight changes (left-hand portion of (a)) will not be fully implemented in an NVM-based neuromorphic system. Parts (b) and (c) illustrate the actual weight updates that occur in (b) an alternating bidirectional update scheme, in which we alternate between updating G+ or G− but not both, and (c) a fully bidirectional update scheme, in which we always update both G+ and G−. With the alternating bidirectional scheme, synapses whose conductance values are located at/near the boundaries of the G-diamond can potentially lead to a situation where a large weight update is completely ignored. In contrast, in the fully bidirectional scheme, such large weight updates are simply reduced in the magnitude of their effect, and synapses tend to remain in the center of the G-diamond.
Fig. 26. When the G-response is steeply nonlinear, a fully bidirectional scheme exhibits lower accuracy (see Fig. 21), because any single weight update could potentially make two overly large conductance changes instead of just one. However, the fully bidirectional scheme provides better performance for a linear response with high dynamic range (compare Figs. 22 and 23), because the small symmetric changes of each conductance move the synaptic weight up and down along the central vertical axis of the G-diamond. In contrast, the alternating bidirectional scheme can move some synapses to the left or right edges of the G-diamond, where the effective dynamic range (maximum weight magnitude) is significantly reduced.
REFERENCES
[1] B. L. Jackson, B. Rajendran, G. S. Corrado, M. Breitwisch, G. W. Burr, R. Cheek, K. Gopalakrishnan, S. Raoux, C. T. Rettner, A. Padilla, A. G. Schrott, R. S. Shenoy, B. N. Kurdi, C. H. Lam, and D. S. Modha, "Nanoscale electronic synapses using phase change devices," ACM Journal on Emerging Technologies in Computing Systems, vol. 9, no. 2, p. 12, 2013.
[2] M. Suri, O. Bichler, D. Querlioz, O. Cueto, L. Perniola, V. Sousa, D. Vuillaume, C. Gamrat, and B. DeSalvo, "Phase change memory as synapse for ultra-dense neuromorphic systems: application to complex visual pattern extraction," in IEDM Technical Digest, 2011, p. 4.4.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, p. 2278, 1998.
[4] D. Rumelhart, G. E. Hinton, and J. L. McClelland, "A general framework for parallel distributed processing," in Parallel Distributed Processing. MIT Press, 1986.
[5] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, p. 504, 2006.
[6] B. Rajendran, Y. Liu, J.-S. Seo, K. Gopalakrishnan, L. Chang, D. J. Friedman, and M. B. Ritter, "Specifications of nanoscale devices and circuits for neuromorphic computational systems," IEEE Trans. Electron Devices, vol. 60, no. 1, pp. 246–253, 2013.
[7] J.-W. Jang, S. Park, G. W. Burr, H. Hwang, and Y.-H. Jeong, "Optimization of conductance change in Pr1−xCaxMnO3-based synaptic devices for neuromorphic systems," IEEE Electron Device Letters, vol. 36, no. 5, pp. 457–459, 2015.
[8] G. W. Burr, R. M. Shelby, C. di Nolfo, J. W. Jang, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. Kurdi, and H. Hwang, "Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element," in IEDM, 2014, p. 29.5.
[9] M. Breitwisch, T. Nirschl, C. F. Chen, Y. Zhu, M. H. Lee, M. Lamorey, G. W. Burr, E. Joseph, A. Schrott, J. B. Philipp, R. Cheek, T. D. Happ, S. H. Chen, S. Zaidi, P. Flaitz, J. Bruley, R. Dasaka, B. Rajendran, S. Rossnagel, M. Yang, Y. C. Chen, R. Bergmann, H. L. Lung, and C. Lam, "Novel lithography-independent pore phase change memory," in Symposium on VLSI Technology, 2007, pp. 100–101.
Geoffrey W. Burr (S'87–M'96–SM'13) received the Ph.D. degree in electrical engineering from the California Institute of Technology, Pasadena, CA, USA, in 1996.
In 1996, he joined IBM Research–Almaden, San Jose, CA, USA, where he is currently a Principal Research Staff Member. His current research interests include nonvolatile memory and cognitive computing.
Robert M. Shelby received the Ph.D. degree in chemistry from the University of California at Berkeley, Berkeley, CA, USA.
He joined IBM, Armonk, NY, USA, in 1978. He is currently a Research Staff Member with IBM Research–Almaden, San Jose, CA, USA.
Dr. Shelby is a fellow of the Optical Society of America.
Severin Sidler received the B.S. degree in electrical engineering from EPFL, Lausanne, Switzerland, where he is continuing his studies in electrical engineering. He expects to receive the M.S. degree in 2017. At present, he is an intern at IBM Research–Almaden, San Jose, CA, USA.
His current research interests include cognitive computing and systems engineering.
Carmelo di Nolfo received the B.S. and M.S. degrees in electrical and computer engineering from the University of Liège, Belgium, in 2012 and 2014, respectively. During his M.S. degree work, he studied at École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, and spent six months as an intern at IBM Research–Almaden, San Jose, CA, USA. He joined the SyNAPSE team at IBM Research–Almaden, San Jose, CA, USA, in March 2015.
Junwoo Jang received the B.S. degree in electrical engineering from the Pohang University of Science and Technology (POSTECH), South Korea, in 2012. He moved to the Department of Creative IT Engineering for his graduate studies, and expects to receive his Ph.D. degree in 2016. He spent four months visiting IBM Research–Almaden, San Jose, CA, USA, in late 2013.
His current research interests include circuit design for cognitive computing systems.
Irem Boybat (S'15) received the B.S. degree in electronics engineering from Sabanci University, Istanbul, Turkey. She is continuing her studies in electrical engineering at EPFL, Lausanne, Switzerland, and is expected to receive the M.S. degree in 2015. She was an intern at IBM Research–Almaden, San Jose, CA, USA, between September 2014 and February 2015.
Her current research interests include cognitive computing, semi-custom design, and embedded systems.
Rohit S. Shenoy (M'04) received the B.Tech. degree in engineering physics from IIT Bombay, Mumbai, India, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, USA.
He was a Research Staff Member with IBM Research–Almaden, San Jose, CA, USA, from 2005 to 2014. He joined Intel, Santa Clara, CA, USA, as a Device Engineer in 2014, where he is involved in NAND flash development.
Pritish Narayanan (M'14) received the Ph.D. degree in electrical and computer engineering from the University of Massachusetts Amherst, Amherst, MA, USA, in 2013.
He joined IBM Research–Almaden, San Jose, CA, USA, as a Research Staff Member. His current research interests include emerging technologies for logic, nonvolatile memory, and cognitive computing.
Kumar Virwani (S'05–M'10) received the Ph.D. degree from the University of Arkansas at Fayetteville, Fayetteville, AR, USA, in 2007.
He has been with IBM Research–Almaden, San Jose, CA, USA, since 2008. He has developed novel conducting scanning probe microscopy methods for electrical characterization of mixed ionic-electronic conductor materials and devices. He is involved in projects on storage-class memory, lithium-air batteries, low-k dielectrics, and photovoltaics.
Emanuele U. Giacometti received the B.S. in electronic engineering and his International Masters in Nanotechnologies from the Politecnico di Torino, Torino, Italy, in 2012 and 2014, respectively. After classes at the Institut Polytechnique de Grenoble (INP), France, and École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, he spent six months as an intern at IBM Research–Almaden, San Jose, CA, USA. He will begin his Ph.D. studies in late 2015.
Bülent N. Kurdi received the Ph.D. degree from the Institute of Optics, University of Rochester, Rochester, NY, USA, in 1989.
After eleven years at IBM Research–Almaden, San Jose, CA, USA, he joined Wavesplitter Technologies, Inc., Fremont, CA, USA, in 2000. He returned to IBM Research–Almaden in 2003. He is now Senior Manager of the Novel Device Prototyping & Characterization department.
Hyunsang Hwang received his Ph.D. in Materials Science from the University of Texas at Austin, Austin, TX, USA, in 1992. After five years at LG Semiconductor Corporation, he became a Professor of Materials Science and Engineering at the Gwangju Institute of Science and Technology, Gwangju, South Korea, in 1997. In 2012, he moved to the Materials Science and Engineering department at Pohang University of Science and Technology (POSTECH), South Korea.