IEEE TRANSACTIONS ON ELECTRON DEVICES, VOL. XX, NO. X, XXX 2015 1
Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element
Geoffrey W. Burr, Senior Member, Robert M. Shelby, Severin Sidler, Carmelo di Nolfo, Junwoo Jang, Irem Boybat, Student Member, Rohit S. Shenoy, Member, Pritish Narayanan, Member, Kumar Virwani, Member, Emanuele U. Giacometti, Bülent Kurdi, and Hyunsang Hwang, Member
Abstract: Using 2 phase-change memory (PCM) devices per synapse, a 3-layer perceptron network with 164,885 synapses is trained on a subset (5,000 examples) of the MNIST database of handwritten digits using a backpropagation variant suitable for nonvolatile memory (NVM) + selector crossbar arrays, obtaining a training (generalization) accuracy of 82.2% (82.9%). Using a neural network (NN) simulator matched to the experimental demonstrator, extensive tolerancing is performed with respect to NVM variability, yield, and the stochasticity, linearity and asymmetry of the NVM-conductance response. We show that a bidirectional NVM with a symmetric, linear conductance response of high dynamic range is capable of delivering the same high classification accuracies on this problem as a conventional, software-based implementation of this same network.
I. INTRODUCTION
DENSE arrays of nonvolatile memory (NVM) and selector device-pairs (Fig. 1) can implement neuro-inspired non-Von Neumann computing [1], [2], using pairs [2] of NVM devices as programmable (plastic) bipolar synapses. Work to date has emphasized the Spike-Timing-Dependent-Plasticity (STDP) algorithm [1], [2], motivated by synaptic measurements in real brains. However, experimental NVM demonstrations have been limited in size (≤100 synapses), and few results have reported quantitative performance metrics such as classification accuracy. Worse yet, it has been difficult to be sure whether the relatively poor metrics reported to date might be due to immaturities or inefficiencies in the STDP learning algorithm (as it is currently implemented), rather than reflective of problems introduced by the imperfections of the NVM devices.
Unlike STDP, backpropagation is a widely-used, well-studied method for training artificial neural networks, offering benchmarkable performance on datasets such as handwritten
G. W. Burr (e-mail: [email protected]), R. M. Shelby, S. Sidler, C. di Nolfo, P. Narayanan, K. Virwani, and B. Kurdi are with IBM Research-Almaden, San Jose, CA 95120.
I. Boybat is now with EPFL, Lausanne, Switzerland.
J. Jang is with the Department of Creative IT Engineering, Pohang University of Science and Technology, Pohang 790-784, Korea.
R. S. Shenoy is now with Intel, Santa Clara, CA 95054.
E. U. Giacometti was an intern at IBM Research-Almaden, San Jose, CA in 2014.
H. Hwang is with the Department of Material Science and Engineering, Pohang University of Science and Technology, Pohang 790-784, Korea.
Manuscript received March 9, 2015; revised May, 2015.
Fig. 1. Neuro-inspired non-Von Neumann computing [1], [2], in which neurons activate each other through dense networks of programmable synaptic weights, can be implemented using dense crossbar arrays of nonvolatile memory (NVM) and selector device-pairs.
Fig. 2. In forward evaluation of a multi-layer perceptron, each layer's neurons drive the next layer through weights wij and a nonlinearity f(). Input neurons are driven by pixels from successive MNIST images (cropped to 22×24); the 10 output neurons identify which digit was presented.
digits (MNIST) [3]. Although proposed earlier, it gained great popularity in the 1980s [3], [4], and with the advent of GPUs, backpropagation now dominates the neural network field. In the present work, we use backpropagation to train a relatively simple multi-layer perceptron network (Fig. 2). During forward evaluation of this network, each layer's inputs (xi) drive the next layer's neurons through weights wij and a nonlinearity f() (Fig. 2). Supervised learning occurs (Fig. 3) by backpropagating error terms δj to adjust each weight wij as the second step. A 3-layer network is capable of accuracies, on previously unseen test images (generalization), of up to ~97% [3] (Fig. 4); even higher accuracy is possible by first pre-training the weights in each layer [5]. (Here we use tanh() as the
Fig. 3. In supervised learning, error terms δj are backpropagated, adjusting each weight wij to minimize an energy function by gradient descent, reducing classification error between computed (x_l) and desired (g_l) output vectors.
Fig. 4. A 3-layer perceptron network can classify previously unseen (test) MNIST handwritten digits with up to ~97% accuracy [3]. Training on a subset of the images sacrifices some generalization accuracy but speeds up training.
nonlinear function f(), and one bias (always-ON) neuron is added to each layer other than the output layer, in addition to those neurons shown in Fig. 2.) As with STDP, low-power neurons should be achievable by emphasizing brief spikes [6] and local-only clocking. However, note that no CMOS neuron circuitry is built or even specified in this paper; our focus will be solely on the effects of the imperfections of the NVM elements.
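To make the network structure concrete, the forward pass just described (528-250-125-10 neurons, tanh() nonlinearity, one always-ON bias neuron added to each non-output layer) can be sketched in a few lines of Python. The random weight initialization and input scaling below are illustrative placeholders, not the values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [528, 250, 125, 10]          # cropped 22x24 input, 2 hidden layers, 10 digits
# +1 row per weight matrix for the always-ON bias neuron of each layer
weights = [rng.uniform(-0.1, 0.1, (sizes[k] + 1, sizes[k + 1]))
           for k in range(3)]

def forward(x, weights):
    """x: flattened 22x24 cropped MNIST image, here scaled to [-1, 1]."""
    for w in weights:
        x = np.append(x, 1.0)        # bias ("always-ON") neuron
        x = np.tanh(x @ w)           # x_j = f(sum_i x_i * w_ij)
    return x                          # 10 output activations

x = rng.uniform(-1.0, 1.0, 528)       # stand-in for one cropped image
out = forward(x, weights)
```

The 10 output activations are then compared against the one-hot target vector (g_l in Fig. 3) to drive the backpropagated corrections.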
We choose to work with phase-change memory (PCM) since we have access to large PCM arrays in hardware. We discuss the consequences of the fundamental asymmetry in PCM conductance response: small conductance increases can be implemented through partial-SET pulses, but the RESET (conductance decrease) operation tends to be quite abrupt. However, we also discuss the use of bidirectional NVM devices (such as non-filamentary RRAM [7]). We show that such a bidirectional NVM with a symmetric, linear conductance response is fully capable of delivering the same high classification accuracies (on the problem we study, handwritten digit recognition) as a conventional, software-based implementation of the same neural network.
Fig. 5. By comparing total read signal between pairs of bitlines, summation of synaptic weights (encoded as conductance differences, wij = G+ - G-) is highly parallel.
Fig. 6. Backpropagation calls for each weight to be updated by Δwij = η · xi · δj, where η is the learning rate. Colormap shows log(occurrences), in the 1st layer, during NN training (blue curve, Fig. 4); white contours identify the quantized increase in the integer weight.
Fig. 7. In a crossbar array, efficient learning requires neurons to update weights in parallel, firing pulses whose overlap at the various NVM devices implements training. (The square root of the learning rate is used to scale both xi and δj.) Colormap shows log(occurrences), in the 1st layer, during NN training (red curve, Fig. 8); white contours identify the quantized increase in the integer weight.
Fig. 8. Computer NN simulations show that a crossbar-compatible weight-update rule (Fig. 7) is just as effective as the conventional update rule (Fig. 6).
Fig. 9. The conductance response of an NVM device exhibits imperfections, including nonlinearity, stochasticity, varying maxima, asymmetry between increasing/decreasing responses, and non-responsive devices (at low or high G).
II. CONSIDERATIONS FOR A CROSSBAR IMPLEMENTATION

By encoding synaptic weight in the conductance difference between a pair of nonvolatile memory devices, wij = G+ - G- [2], forward propagation simply compares total read signal on bitlines (Fig. 5). This can be performed by encoding x using some combination of voltage-domain or time-domain encoding (either number of read pulses or pulse duration). These CMOS circuitry choices are interesting and important topics, but are beyond the scope of this paper.
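As a sketch of this differential read (Fig. 5), the following Python models each weight as a conductance pair, and a column's weighted sum as the difference of two bitline currents. The array size, conductance range, and activation encoding below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 528, 250
G_plus  = rng.uniform(0.0, 20e-6, (n_in, n_out))   # conductances [S], illustrative
G_minus = rng.uniform(0.0, 20e-6, (n_in, n_out))

x = rng.uniform(0.0, 1.0, n_in)   # activations, encoded as read signals

# In hardware, each bitline integrates its column's currents in parallel;
# here a matrix-vector product stands in for that analog summation.
I_plus  = x @ G_plus              # total current on the "+" bitlines
I_minus = x @ G_minus             # total current on the "-" bitlines
weighted_sum = I_plus - I_minus   # equals sum_i x_i * (G+_ij - G-_ij)
```

The key property is that the subtraction commutes with the summation, so a single differential sense per column pair recovers the signed weighted sum.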
Any nonvolatile memory device that can offer a nondestructive parallel read (as shown in Fig. 5) of memory states that can be smoothly adjusted up or down through a wide range of analog values could potentially be used in this application. Here we focus on NVM devices that offer a range of analog conductance states.
This paper is concerned with how real NVM devices will respond to programming instructions during in situ training of their artificial neural network. Unfortunately, the conventional backpropagation algorithm [4] calls for weight updates Δwij ∝ xi·δj (Fig. 6), which forces upstream (i) and downstream (j) neurons to exchange information uniquely for each and every synapse. This serial, element-by-element information exchange between neurons is highly undesirable in a crossbar array implementation. One alternative is to have each neuron, downstream and upstream, fire pulses based on their local knowledge of xi and δj, respectively. The presence of a nonlinear selector is critical to ensure that NVM programming occurs only when pulses from both the upstream and downstream neurons overlap. This allows neurons to modify weights in parallel, making learning much more efficient [1] (Fig. 7). (Note that to reduce peak power, one might choose to stagger these write pulses across the array, one sub-block at a time.) Fig. 8 shows, using a simulation of the neural network in Figs. 2, 3, that this adaptation for nonvolatile memory implementation has no adverse effect on accuracy.
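The pulse-overlap idea can be illustrated with a toy model: each neuron locally quantizes sqrt(η)·|xi| (upstream) or sqrt(η)·|δj| (downstream) into a few pulses placed in shared timeslots, and a synapse receives one programming pulse per timeslot in which both of its neurons fire. The "first-k-slots" pulse placement below is a simplification of the actual timeslot patterns of Fig. 7, so the overlap count only coarsely tracks the desired product xi·δj:

```python
import numpy as np

def to_pulses(value, n_slots=4):
    # Fire pulses in the first k of n_slots, with k set by quantizing |value|.
    # (Illustrative placement; Fig. 7 uses specific timeslot patterns.)
    k = min(n_slots, int(round(abs(value) * n_slots)))
    return np.arange(n_slots) < k          # boolean timeslot occupancy

def overlap_updates(x, delta, eta=0.25, n_slots=4):
    s = np.sqrt(eta)  # sqrt of learning rate scales BOTH x and delta (Fig. 7)
    up   = np.array([to_pulses(s * xi, n_slots) for xi in x])      # (n_in, slots)
    down = np.array([to_pulses(s * dj, n_slots) for dj in delta])  # (n_out, slots)
    # A synapse is programmed once per timeslot in which pulses from
    # both its upstream and downstream neurons overlap at the selector.
    return up.astype(int) @ down.astype(int).T                     # (n_in, n_out)

n = overlap_updates(np.array([0.9, 0.3, 0.0]), np.array([0.8, 0.1]))
```

Neither neuron ever needs to know the other's value; the selector's nonlinearity performs the "multiplication" locally at each cross-point.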
However, while modifying the update rule is clearly not a problem, the conductance response of any real nonvolatile memory device exhibits imperfections that can decidedly affect neural network performance. These imperfections include nonlinearity, stochasticity, varying maxima, asymmetry between increasing/decreasing responses, and non-responsive devices at low or high conductance (Fig. 9). The initial version
Fig. 10. Bounding G values reduces NN training accuracy slightly, but unidirectionality and nonlinearity in G-response can strongly degrade accuracy. (Curves, top to bottom: linear, unbounded; linear, bounded, bidirectional; linear, bounded, unidirectional; nonlinear, bounded, unidirectional.) Figure insets map NVM-pair synapse states on a diamond-shaped plot of G+ vs. G- (weight is vertical position) for a sampled subset of the weights, starting from a random initial weight distribution.
Fig. 11. If G values can only be increased (asymmetric G-response), a synapse at point A (G+ saturated) can only increase G-, leading to a low weight value (B). If response at small G values differs from that at large G (nonlinear G-response), alternating weight updates can no longer cancel. As synapses tend to get herded into the same portion of the G-diamond (C → D), the decrease in average weight can lead to network freeze-out.
of this work [8] was the first paper to study the relative importance of each of these factors. This expanded version adds significantly more explanatory details, as well as several new plots detailing paths for future improvement.
Bounding G values would appear to reduce neural networktraining accuracy slightly, as shown by the difference betweenthe blue and red (top two) curves in Fig. 10. However,
Fig. 12. Synapses with large conductance values (inset, right edge of G-diamond) can be refreshed (moved left) while preserving the weight (to some accuracy), by RESETs to both conductances followed by a partial-SET of one. If such RESETs are too infrequent, weight evolution stagnates and NN accuracy degrades.
Fig. 13. Mushroom-cell [9], 1T1R PCM devices (180nm node) with 2 metal interconnect layers enable 512×1024 arrays. A 1-bit sense amplifier measures G values, passing the data to software-based neurons. Weights are increased (decreased) by identical 25ns partial-SET pulses that increase G+ (G-) (Fig. 7), or refreshed by RESETs to both conductances followed by an iterative SET procedure (Fig. 12).
unidirectionality and nonlinearity in the G-response strongly degrade accuracy (bottom two curves, green and magenta). Figure insets (Fig. 10) map nonvolatile-memory-pair synapse states on a diamond-shaped plot of G+ vs. G- (weight is vertical position). In this context (Fig. 11), a PCM-based synapse with a highly asymmetric G-response (only partial-SET can be done gradually) moves only unidirectionally, from left to right. (Bipolar filamentary RRAM or CBRAM would have an identical problem, except that SET is the abrupt step and it is the RESET step which can be performed gradually.)
Once one G value is saturated, subsequent training can only increase the other G value, reducing weight magnitude. Nonlinearity in G-response further encourages weights of low value: if the response at small G values differs from that at large G, alternating weight updates no longer cancel. As synapses are herded into the same portion of the G-diamond (Fig. 11), the decrease in average weight can lead to network freeze-out (Fig. 10 inset). In such a condition, the network updates very few if any weights, meaning that it stops evolving towards higher accuracy. Worse yet, since the few weight updates that do occur are quite likely to lead to weight-magnitude decay, previously trained information is steadily erased and accuracy can actually decrease (bottom two curves, Fig. 10).
One solution to the highly asymmetric response of PCM devices is occasional RESET [2], moving synapses back to the left edge of the G-diamond while preserving weight value (using an iterative SET procedure, Fig. 12 inset). However, if this is not done frequently enough, weight stagnation will degrade neural network accuracy (Fig. 12). (An analogous approach for bipolar filamentary RRAM or CBRAM would be occasional SET.)
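A toy model of this occasional-RESET strategy, in arbitrary illustrative units: the two conductances of a pair can only step upward (partial-SET) until saturating, and a refresh RESETs both devices and rebuilds the weight on one of them via iterative SET:

```python
GMAX, DG = 20.0, 1.0   # saturation level and step size, illustrative units

def update(gp, gm, sign):
    # Weight increase: pulse G+; weight decrease: pulse G- (unidirectional NVM).
    if sign > 0:
        gp = min(GMAX, gp + DG)
    else:
        gm = min(GMAX, gm + DG)
    return gp, gm

def refresh(gp, gm):
    # RESET both conductances, then iterative-SET one device so that the
    # weight w = G+ - G- is preserved while headroom is restored.
    w = gp - gm
    return (min(GMAX, w), 0.0) if w >= 0 else (0.0, min(GMAX, -w))

gp, gm = 0.0, 0.0
for sign in [+1] * 8 + [-1] * 3:   # 8 increases, 3 decreases: net weight +5
    gp, gm = update(gp, gm, sign)
# Both conductances have drifted rightward (gp=8, gm=3); refresh moves the
# synapse back to the left edge of the G-diamond with the same weight.
gp, gm = refresh(gp, gm)
```

Without the refresh, continued training would eventually saturate one device, after which all further updates could only shrink the weight magnitude.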
III. EXPERIMENTAL RESULTS
We implemented a 3-layer perceptron of 164,885 synapses (Figs. 2, 3) on a 500×661 array of mushroom-cell [9], 1T1R PCM devices (180nm node, Fig. 13). While the weight-update algorithm (Fig. 7) is fully compatible with a crossbar implementation, our hardware allows only sequential access to each
[Fig. 14 flowchart, comparing OUR IMPLEMENTATION (hardware + software) with an EVENTUAL FULL-HARDWARE CROSSBAR IMPLEMENTATION: write random weights to PCM G values, then read PCM G values (crossbar: randomize weights). For each MNIST example, forward propagation computes f(Σ xi wij) in software (crossbar: measure I+ - I- ∝ Σ xi wij), repeating for each NN layer; backpropagation computes δj corrections and the pulses to apply (Fig. 7) to each synapse (G+ or G-), repeating for each NN layer (crossbar: upstream (xi) and downstream (δj) neurons apply pulses directly, programming the synapse). After each mini-batch (5 examples), pulses are applied to the PCM G values; after every 20 examples, 2 RESETs (+1 incremental SET) are applied to synapses with large G values.]
Fig. 14. Although G values are measured sequentially, weight-summation and weight-update procedures in our software-based neurons closely mimic the column (and row) integrations and pulse-overlap programming needed for parallel operations across a crossbar array. However, since occasional RESET is triggered when both G+ and G- are large, serial device access is required to obtain and then re-program individual conductances.
Fig. 15. Training and test accuracy for a 3-layer perceptron of 164,885 hardware synapses, with all weight operations taking place on a 500×661 array (2 PCM devices/synapse × 164,885 synapses, plus 730 unused devices) of mushroom-cell [9] PCM devices (Fig. 13). Also shown is a matched computer simulation of this NN, using parameters extracted from the experiment. The inset maps the final PCM conductances.
Fig. 16. 50-point cumulative distributions of experimentally measured conductances for the 500×661 PCM array (330,500 PCM devices), showing variability and stuck-ON pixel rate. Insets show the measured RESET accuracy (from all ~708k RESETs) and the rate and stochasticity of G-response (from all ~31 million partial-SET pulses), plotted as a colormap of G-per-pulse vs. G.
Fig. 17. Fitted G-response vs. number of pulses (blue: average; red: ±1σ responses), obtained from our computer model (inset) for the rate and stochasticity of G-response (G-per-pulse vs. G) matched to experiment (Fig. 16).
PCM device (Fig. 13). For read, a sense amplifier measures G values, passing the data to software-based neurons. Although this measurement is performed sequentially, the weight-summation and weight-update procedures in the software-based neurons closely mimic the column- and row-based integrations. (Again, since no particular CMOS circuitry has been specified, we assume that the 8-bit value of xi is implemented completely accurately. Any problems introduced by inaccurate encoding of xi values by real CMOS hardware could be easily assessed using our tolerancing simulator.)
Weights are increased (decreased) by identical partial-SET pulses (Fig. 7) that increase G+ (G-) (Fig. 14). The deviation from a true crossbar implementation occurs upon occasional RESET (Fig. 12), triggered when either G+ or G- is large, thus requiring both knowledge of and control over individual G values. Serial device access is required, both to measure the G values (to determine which lie in the L-shaped region at the right side of the G-diamond) and then to fire two RESET pulses (at both G+ and G-) followed by an iterative SET procedure to increase one of those two conductances until the correct synaptic weight is restored. Since the time and energy associated with this process are large, it is highly desirable to perform occasional RESET as infrequently and as inaccurately as possible.
Fig. 15 shows measured accuracies for a hardware-synapse neural network, with all weight operations taking place on PCM devices. To reduce test time, weight updates for each mini-batch of 5 MNIST examples were applied together. Fig. 16 plots measured G-response, stochasticity, variability, stuck-ON pixel rate, and RESET accuracy. By matching all parameters including stochasticity (Fig. 17) to those measured during the experiment, our neural network computer simulation was able to precisely reproduce the measured accuracy trends (Fig. 15).
IV. TOLERANCING AND POWER CONSIDERATIONS
We can now use this matched neural network simulation to explore the importance of nonvolatile memory imperfections. Fig. 18 shows final training (test) accuracy as a function of variations in nonvolatile memory and neural network parameters away from the conditions used in our hardware demo (green dotted line). NN performance is highly robust to stochasticity (Fig. 18(a)), variable maxima (c), the presence of non-responsive devices (d,e), and infrequent and inaccurate RESETs (f,g). A mini-batch of size 1 allows weight updates to be applied immediately (Fig. 18(h)). However, as mentioned earlier, nonlinearity and asymmetry in G-response (Fig. 18(b)) limit the maximum possible accuracy (here, to ~85%), and require precise tuning of the learning rate η and the slope of the neuron response (f') (Fig. 18(i,j)). Too low a learning rate and no weight receives any update; too high, and the imperfections in the NVM response generate chaos. The narrow distribution of these parameters means that the experiment must be tuned very carefully. An extension of an existing neural network technique to a crossbar-based neural network has been found to provide a much broader distribution of usable learning rates. This technique is currently under investigation and will be the subject of a future publication.
V. DISCUSSION

While the asymmetric G-response of PCM makes it necessary to occasionally stop training, measure all conductances, and apply RESETs and iterative SETs, energy usage can be reasonable if RESETs are infrequent (Fig. 19, inset), and if the learning rate is low (Fig. 19).
Neural network performance with bidirectional nonvolatile-memory-based synapses can deliver high classification accuracy if the G-response is linear and symmetric (Fig. 20, green curve) rather than nonlinear (red curve). Asymmetry in G-response (blue curve) strongly degrades performance. In Fig. 21, we further explore the trends with an ideal but nonlinear NVM, varying both the initial steepness of the G-response and the choice between fully bidirectional weight updates (when increasing weight, for instance, we both increase G+ and decrease G- together) and alternating bidirectional updates (we choose one, but not both, of these two steps). Clearly, a less-steep response is favorable, and the distinction between fully and alternating bidirectional has the most impact for steeply nonlinear G-responses.
The most ideal NVM, with a linear and symmetric conductance response in both directions, would result in more regularly distributed weight values and fewer freeze-outs, leading to higher accuracies. In Fig. 22 and Fig. 23, we show that a gentle linear response (e.g., a large number of identical pulses is needed to change the conductance from minimum to maximum and vice-versa) is advantageous compared to a steep response. While both the alternating-bidirectional and fully-bidirectional update schemes deliver higher accuracies than an NVM with a nonlinear conductance response, only the fully-bidirectional update scheme reaches the same high test accuracies exhibited by networks in which the NVM conductances are unbounded (Fig. 23, inset). Fig. 24 replots the same data from Fig. 23 on a logarithmic vertical scale, to accentuate the high-accuracy region.
The reason for this difference is shown in Fig. 25, where one example of a desired sequence of weight updates is contrasted
Fig. 18. Matched simulations show that an NVM-based NN is highly robust to stochasticity, variable maxima, the presence of non-responsive devices, and infrequent or inaccurate RESETs. A mini-batch size of 1 avoids the need to accumulate weight updates before applying them. However, the nonlinear and asymmetric G-response limits accuracy to ~85%, and requires the learning rate and neuron response (f') to be precisely tuned.
Fig. 19. Despite the higher power involved in RESET rather than partial SET (~30pJ and ~3pJ for highly-scaled PCM [1]), total energy costs of training can be minimized if RESETs are sufficiently infrequent (inset). Low-energy training requires low learning rates, which minimize the number of synaptic programming pulses. At higher learning rates, even a bidirectional, linear NVM requiring no RESET and offering low power (~1pJ per pulse) can lead to large training energy.
Fig. 20. NN performance is improved if G-response is linear and symmetric (green curve) rather than nonlinear (red). However, asymmetry between the up- and down-going G-responses (blue), if not corrected in the weight-update rule (Fig. 7), can strongly degrade performance by favoring particular regions of the G-diamond (Figs. 10, 11).
Fig. 21. NN performance with a bidirectional but nonlinear G-response (same basic shape as Fig. 20, red inset curve) improves as the response becomes more gentle and the initial slope less steep. The choice between always updating both conductances when updating the weight (fully bidirectional) and alternating between updating G+ or G- but not both (alternating bidirectional) has the most impact when the G-response is steeply nonlinear. (Note that due to changes in learning rate and the slope of the nonlinear function f(), the red curve in Fig. 20 is not duplicated here.)
Fig. 22. NN performance when alternating between updating G+ or G- but not both (alternating bidirectional), with a linear G-response. This update method cannot reach the performance of a network with unbounded synaptic weights, even when the dynamic range of the linear response is large (e.g., when the change due to any one pulse is only a small fraction of the range between the minimum and maximum conductances).
Fig. 23. NN performance (classification accuracy during training) when updating both G+ and G- (fully bidirectional scheme), with a linear G-response. The inset shows that, when the dynamic range of the linear response is large, the classification accuracy can now reach that of the original network (a test accuracy of ~94% when trained with 5,000 images; ~97% when trained with all 60,000 images).
Fig. 24. The same NN performance data for a linear, symmetric NVM under the fully bidirectional scheme (Fig. 23), here replotted on a logarithmic scale that accentuates the high-accuracy region. Here only the classification accuracy on the training set is shown, which can reach close to 100%, at which point the accuracy on the test set of 10,000 images that the network has never seen before (Fig. 23) becomes the only relevant way to gauge network performance.
to the actual weight updates that get implemented in these two update schemes. In Fig. 25(b), we show that when the state of the synapse is at the boundaries of the G-diamond, there is a significant chance that the next weight update under the alternating-bidirectional scheme will have little or no impact, simply because a conductance that is already saturated cannot be increased (decreased) any further. In the fully-bidirectional update scheme, some amount of weight update will still occur at the edges of the G-diamond, leading to smaller discrepancies between the desired and actual weight changes, and thus higher performance. In addition, because the weights only move "up" and "down" the G-diamond in the fully-bidirectional scheme, the synapses stay in the center stripe of the G-diamond (Fig. 26(b)), where they have access to the full dynamic range available. In contrast, because each weight update in the alternating-bidirectional scheme moves along a diagonal line, some number of synapses end up at the edges of the G-diamond, where the effective dynamic range they can access is significantly reduced (Fig. 26(a)).
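The difference between the two schemes at a G-diamond edge can be sketched as follows; the units, step size, and function names are illustrative, not the simulator's actual implementation:

```python
GMAX, DG = 20.0, 1.0                     # illustrative conductance bounds and step
clip = lambda g: max(0.0, min(GMAX, g))  # conductances are bounded in [0, GMAX]

def step_fully(gp, gm, sign):
    # Fully bidirectional: move BOTH conductances on every weight update.
    return clip(gp + sign * DG), clip(gm - sign * DG)

def step_alternating(gp, gm, sign, use_gplus):
    # Alternating bidirectional: move only ONE conductance per update,
    # alternating between G+ and G- over successive updates.
    if use_gplus:
        return clip(gp + sign * DG), gm
    return gp, clip(gm - sign * DG)

gp, gm = 10.0, 0.0   # synapse at a diamond edge: G- already at its minimum
# Fully bidirectional: the weight still increases (G+ moves, even though
# the clipped G- step is lost).
assert step_fully(gp, gm, +1) == (11.0, 0.0)
# Alternating scheme, on a turn that targets G-: the update is lost entirely.
assert step_alternating(gp, gm, +1, use_gplus=False) == (10.0, 0.0)
```

The clipped-but-partial update of the fully bidirectional scheme is what keeps synapses in the center stripe of the G-diamond, while the all-or-nothing behavior of the alternating scheme strands them at the edges.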
These results demonstrate conclusively that NVM devices should be fully capable of delivering the same classification accuracy on the MNIST handwritten digits as a conventional implementation of this artificial neural network. All that is required of the NVM device is that it offer a bidirectional, linear, and symmetric response in conductance, with large dynamic range (e.g., the change due to any one pulse represents only a small fraction of the entire conductance range available).
Other future work will be needed to demonstrate a full crossbar-array implementation, including dedicated CMOS circuitry for summation of synaptic weights during both forward and backpropagation through nearly-identical high-performance nonlinear selector devices. The values of neurons (x) and backpropagated errors (δ) will need to be stored in CMOS circuitry and presented to the crossbar, through some combination of analog voltage levels, number of read pulses, and/or duration of read pulses. The need to synchronize write-pulse timing between upstream and downstream neurons, and techniques to disperse the high-energy writes in time (to reduce the load on write drivers and voltage supplies), must also be addressed in future work.
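In software terms, the summation the crossbar performs during forward propagation is a matrix-vector product with the signed weights w = G+ − G−; the sketch below illustrates this equivalence (array sizes and conductance values are arbitrary, not those of the demonstrator).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical crossbar: each synapse is a conductance pair (G+, G-),
# and its effective signed weight is the difference w = G+ - G-.
n_in, n_out = 4, 3
g_plus = rng.uniform(0.0, 1.0, size=(n_in, n_out))
g_minus = rng.uniform(0.0, 1.0, size=(n_in, n_out))

def crossbar_forward(x):
    """The analog multiply-accumulate performed by the array: activations x
    drive the rows (as voltage levels or read pulses), and each column
    integrates the current I_j = sum_i x_i * (G+_ij - G-_ij), which is
    exactly a matrix-vector product with the signed weights."""
    return x @ (g_plus - g_minus)

x = np.array([0.0, 0.5, 1.0, 0.2])   # example neuron activations
print(crossbar_forward(x))           # one summed current per output column
```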
VI. CONCLUSIONS

Using 2 phase-change memory (PCM) devices per synapse,
a 3-layer perceptron with 164,885 synapses was trained with backpropagation on a subset (5000 examples) of the MNIST database of handwritten digits to high accuracy (82.2% on the training set, 82.9% on the test set). A weight-update rule compatible with NVM+selector crossbar arrays was developed, and was shown to have no adverse effect on accuracy. A novel G-diamond concept (Fig. 11) was introduced to illustrate issues created by nonlinearity and asymmetry in the NVM conductance response. Asymmetry can be mitigated by an occasional-RESET strategy (Fig. 12), which can be both infrequent and inaccurate (Figs. 12, 18(f,g)).
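The tolerance of the occasional-RESET strategy to inaccuracy follows from the fact that only the difference G+ − G− contributes to the weight; a hypothetical sketch of such a step (the bounds and saturation threshold are assumed for illustration):

```python
G_MIN, G_MAX = 0.0, 1.0       # assumed conductance bounds (arbitrary units)
THRESHOLD = 0.9 * G_MAX       # assumed saturation level that triggers RESET

def occasional_reset(gp, gm):
    """If either conductance nears saturation, RESET both toward G_MIN and
    re-program one of them so that the weight w = G+ - G- is preserved.
    The re-programming need not be precise: a residual error merely shifts
    the weight slightly, and subsequent training can correct it."""
    if gp < THRESHOLD and gm < THRESHOLD:
        return gp, gm                     # still inside the safe region
    w = gp - gm
    if w >= 0:
        return G_MIN + w, G_MIN           # carry the weight on G+
    return G_MIN, G_MIN - w               # carry the weight on G-

gp, gm = 0.95, 0.60                       # G+ near saturation
new_gp, new_gm = occasional_reset(gp, gm)
print(new_gp - new_gm)                    # weight is preserved after RESET
```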
Using a neural network (NN) simulator matched to the experimental demonstrator, extensive tolerancing was performed (Fig. 18). Results show that network parameters such as learning rate and the slope of the nonlinear neuron-response function (Figs. 18(i,j)), and the nonlinearity, symmetry, and bounded nature of the conductance response (Figs. 9–11, 20–26), are critical to achieving high performance in an NVM-based neural network.
Our results show that all NVM-based neural networks (not just those based on PCM) can be expected to be highly resilient to "random" effects (NVM variability, yield, and stochasticity), but will be highly sensitive to "gradient" effects that act to steer all synaptic weights. A learning rate just high enough to avoid network freeze-out is shown to be advantageous for both high accuracy and low training energy. We also show that a bidirectional NVM with a symmetric, linear conductance response of high dynamic range (each conductance step is relatively small) would be fully capable of delivering the same high classification accuracies on the MNIST handwritten-digit database as a conventional, software-based implementation, ranging from >94% when trained on 5000 examples to >97% when trained on the full set of 60,000 training examples.
Fig. 25. Because of the finite bounds on conductance values, any desired sequence of weight changes (left-hand portion of (a)) will not be fully implemented in an NVM-based neuromorphic system. Parts (b) and (c) illustrate the actual weight updates that occur in (b) an alternating bidirectional update scheme, in which we alternate between updating G+ or G− but not both, and (c) a fully bidirectional update scheme, in which we always update both G+ and G−. With the alternating bidirectional scheme, synapses whose conductance values are located at/near the boundaries of the G-diamond can potentially lead to a situation where a large weight update is completely ignored. In contrast, in the fully bidirectional scheme, such large weight updates are simply reduced in the magnitude of their effect, and synapses tend to remain in the center of the G-diamond.
Fig. 26. When the G-response is steeply nonlinear, a fully bidirectional scheme exhibits lower accuracy (see Fig. 21), because any single weight update could potentially make two overly large conductance changes instead of just one. However, the fully bidirectional scheme provides better performance for a linear response with high dynamic range (compare Figs. 22 and 23), because the small symmetric changes of each conductance move the synaptic weight up and down along the central vertical axis of the G-diamond. In contrast, the alternating bidirectional scheme can move some synapses to the left or right edges of the G-diamond, where the effective dynamic range (maximum weight magnitude) is significantly reduced.
REFERENCES
[1] B. L. Jackson, B. Rajendran, G. S. Corrado, M. Breitwisch, G. W. Burr, R. Cheek, K. Gopalakrishnan, S. Raoux, C. T. Rettner, A. Padilla, A. G. Schrott, R. S. Shenoy, B. N. Kurdi, C. H. Lam, and D. S. Modha, "Nanoscale electronic synapses using phase change devices," ACM Journal on Emerging Technologies in Computing Systems, vol. 9, no. 2, p. 12, 2013.
[2] M. Suri, O. Bichler, D. Querlioz, O. Cueto, L. Perniola, V. Sousa, D. Vuillaume, C. Gamrat, and B. DeSalvo, "Phase change memory as synapse for ultra-dense neuromorphic systems: application to complex visual pattern extraction," in IEDM Technical Digest, 2011, p. 4.4.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, p. 2278, 1998.
[4] D. Rumelhart, G. E. Hinton, and J. L. McClelland, "A general framework for parallel distributed processing," in Parallel Distributed Processing. MIT Press, 1986.
[5] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, p. 504, 2006.
[6] B. Rajendran, Y. Liu, J.-S. Seo, K. Gopalakrishnan, L. Chang, D. J. Friedman, and M. B. Ritter, "Specifications of nanoscale devices and circuits for neuromorphic computational systems," IEEE Trans. Electron Devices, vol. 60, no. 1, pp. 246–253, 2013.
[7] J.-W. Jang, S. Park, G. W. Burr, H. Hwang, and Y.-H. Jeong, "Optimization of conductance change in Pr1−xCaxMnO3-based synaptic devices for neuromorphic systems," IEEE Electron Device Letters, vol. 36, no. 5, pp. 457–459, 2015.
[8] G. W. Burr, R. M. Shelby, C. di Nolfo, J. W. Jang, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. Kurdi, and H. Hwang, "Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element," in IEDM, 2014, p. 29.5.
[9] M. Breitwisch, T. Nirschl, C. F. Chen, Y. Zhu, M. H. Lee, M. Lamorey, G. W. Burr, E. Joseph, A. Schrott, J. B. Philipp, R. Cheek, T. D. Happ, S. H. Chen, S. Zaidi, P. Flaitz, J. Bruley, R. Dasaka, B. Rajendran, S. Rossnagel, M. Yang, Y. C. Chen, R. Bergmann, H. L. Lung, and C. Lam, "Novel lithography-independent pore phase change memory," in Symposium on VLSI Technology, 2007, pp. 100–101.
Geoffrey W. Burr (S'87–M'96–SM'13) received the Ph.D. degree in electrical engineering from the California Institute of Technology, Pasadena, CA, USA, in 1996.
In 1996, he joined IBM Research–Almaden, San Jose, CA, USA, where he is currently a Principal Research Staff Member. His current research interests include nonvolatile memory and cognitive computing.
Robert M. Shelby received the Ph.D. degree in chemistry from the University of California at Berkeley, Berkeley, CA, USA.
He joined IBM, Armonk, NY, USA, in 1978. He is currently a Research Staff Member with IBM Research–Almaden, San Jose, CA, USA.
Dr. Shelby is a fellow of the Optical Society of America.
Severin Sidler received the B.S. degree in electrical engineering from EPFL, Lausanne, Switzerland, where he is continuing his studies in electrical engineering. He expects to receive the M.S. degree in 2017. At present, he is an intern at IBM Research–Almaden, San Jose, CA, USA.
His current research interests include cognitive computing and systems engineering.
Carmelo di Nolfo received the B.S. and M.S. degrees in electrical and computer engineering from the University of Liège, Belgium, in 2012 and 2014, respectively. During his M.S. degree work, he studied at École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, and spent six months as an intern at IBM Research–Almaden, San Jose, CA, USA. He joined the SyNAPSE team at IBM Research–Almaden, San Jose, CA, USA, in March 2015.
Junwoo Jang received the B.S. degree in electrical engineering from the Pohang University of Science and Technology (POSTECH), South Korea, in 2012. He moved to the Department of Creative IT Engineering for his graduate studies, and expects to receive his Ph.D. degree in 2016. He spent four months visiting IBM Research–Almaden, San Jose, CA, USA, in late 2013.
His current research interests include circuit design for cognitive computing systems.
Irem Boybat (S'15) received the B.S. degree in electronics engineering from Sabanci University, Istanbul, Turkey. She is continuing her studies in electrical engineering at EPFL, Lausanne, Switzerland, and is expected to receive the M.S. degree in 2015. She was an intern at IBM Research–Almaden, San Jose, CA, USA, between September 2014 and February 2015.
Her current research interests include cognitive computing, semi-custom design, and embedded systems.
Rohit S. Shenoy (M'04) received the B.Tech. degree in engineering physics from IIT Bombay, Mumbai, India, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, USA.
He was a Research Staff Member with IBM Research–Almaden, San Jose, CA, USA, from 2005 to 2014. He joined Intel, Santa Clara, CA, USA, as a Device Engineer in 2014, where he is involved in NAND flash development.
Pritish Narayanan (M'14) received the Ph.D. degree in electrical and computer engineering from the University of Massachusetts Amherst, Amherst, MA, USA, in 2013.
He joined IBM Research–Almaden, San Jose, CA, USA, as a Research Staff Member. His current research interests include emerging technologies for logic, nonvolatile memory, and cognitive computing.
Kumar Virwani (S'05–M'10) received the Ph.D. degree from the University of Arkansas at Fayetteville, Fayetteville, AR, USA, in 2007.
He has been with IBM Research–Almaden, San Jose, CA, USA, since 2008. He has developed novel conducting scanning probe microscopy methods for electrical characterization of mixed ionic-electronic conductor materials and devices. He is involved in projects on storage-class memory, lithium-air batteries, low-k dielectrics, and photovoltaics.
Emanuele U. Giacometti received the B.S. in electronic engineering and his International Masters in Nanotechnologies from the Politecnico di Torino, Torino, Italy, in 2012 and 2014, respectively. After classes at the Institut Polytechnique de Grenoble (INP), France, and École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, he spent six months as an intern at IBM Research–Almaden, San Jose, CA, USA. He will begin his Ph.D. studies in late 2015.
Bülent N. Kurdi received the Ph.D. degree from the Institute of Optics, University of Rochester, Rochester, NY, USA, in 1989.
After eleven years at IBM Research–Almaden, San Jose, CA, USA, he joined Wavesplitter Technologies, Inc., Fremont, CA, USA, in 2000. He returned to IBM Research–Almaden in 2003. He is now Senior Manager of the Novel Device Prototyping & Characterization department.
Hyunsang Hwang received his Ph.D. in Materials Science from the University of Texas at Austin, Austin, TX, USA, in 1992. After five years at LG Semiconductor Corporation, he became a Professor of Materials Science and Engineering at the Gwangju Institute of Science and Technology, Gwangju, South Korea, in 1997. In 2012, he moved to the Materials Science and Engineering department at Pohang University of Science and Technology (POSTECH), South Korea.