8
Exploring STT-MRAM based In-Memory Computing Paradigm with Application of Image Edge Extraction Zhezhi He * , Shaahin Angizi , Deliang Fan *†‡ Department of Electrical and Computer Engineering, University of Central Florida, Orlando, 32816 USA { * Elliot.He, Angizi}@Knights.ucf.edu, [email protected] Abstract—In this paper, we propose a novel Spin-Transfer Torque Magnetic Random-Access Memory (STT-MRAM) array design that could simultaneously work as non-volatile memory and implement a reconfigure in-memory logic operation with- out add-on logic circuits to the memory chip. The computed output could be simply read out like a typical MRAM bit-cell through the modified peripheral circuit. Such intrinsic in-memory computation can be used to process data locally and transfers the “cooked” data to the primary processing unit (i.e. CPU or GPU) for complex computation with high precision requirement. It greatly reduces power-hungry and long distance data commu- nication, and further leads to extreme parallelism within memory. In this work, we further propose an in-memory edge extraction algorithm as a case study to demonstrate the efficiency of in- memory preprocessing methodology. The simulation results show that our edge extraction method reduces data communication as much as 8x for grayscale image, thus greatly reducing system energy consumption. Meanwhile, the F-measure result shows only 10% degradation compared to conventional edge detection operator, such as Prewitt, Sobel and Roberts. Index Terms—In-memory computing, STT-MRAM, image pro- cessing, edge detection I. I NTRODUCTION The era of big data reveals the emerging needs to rethink and redesign the current computing architecture that can support memory-oriented processing for large datasets at exascale (10 18 bytes/s or flops) [1]. Especially, the ability of conven- tional computing platforms to address these needs is beginning to stall fundamentally due to the exiting bottlenecks either in architecture (i.e. memory wall [2]) or device technology (power wall [2]). In the existing von Neumann computing architecture, the separation of memory and computing units interconnected via buses has faced serious challenges, such as long memory access latency, significant congestion at I/Os, limited memory bandwidth, and huge leakage power consumption in big data-driven applications [3]. To mitigate these challenges, in-memory computing architecture and non- volatile memory technology have been proposed to inte- grate memory and logic, leading to a more energy efficient information processing platform. However, such in-memory processing platforms typically impose additional area overhead to the memory chip owning to add-on logic elements, and thus sacrificing valuable memory capacity [3]. This material is based upon work supported in part by the National Science Foundation under Grant No.1740126. From the device technology perspective, there are many recent researches carried out about emerging Non-Volatile Memories (NVM) that are promising to design such in- memory computing concept, for example, Resistive Random- Access Memory (RRAM) [3], Phase Change Memory (PCM) [4], Spin-Transfer Torque Magnetic Random-Access Memory (STT-MRAM) [5]. With great development of fabrication technology and commercialization progress, STT-MRAM has emerged as a leading non-volatile memory candidate owning to its unique features [5], [6], such as non-volatility, zero standby leakage, high write/read speed, compatibility with CMOS fabrication process, scalability, excellent endurance and high integration density. They are among the most promi- nent features which could pave a novel way to realize area and energy-efficient system supporting in-memory computing, normally-off computing, instant-on computing [5]–[8]. Memory Processor Data storage Computation Heavy data communication Memory Processor High-precsion Computation light data communication Data storage Low-precison computation & Fig. 1: In-memory computing to reduce the data communica- tion and increase the computation parallelism. For big data processing, one potential solution is extracting the abstract of data, and utilizing such extracted data for further processing. As depicted in Fig. 1, from in-memory computing architecture perspectives, memory takes care of data feature extraction with bit-wise low precision computation in extreme parallel approach, while the primary processing unit handles high precision operation with limited I/O bandwidth. In this work, edge extraction is used as a case study to demon- strate the efficiency improvement with in-memory computing paradigm. Edge detection is one of the fundamental operations in computer vision and image processing, which drastically reduces the complexity of original image while extracts the crucial boundary information for further processing procedures [9]. Currently, most edge detection algorithms can be classified into two categories [10]: search-based (e.g. Sobel [11], Prewitt [12], Roberts [13]) or zero-crossing based (e.g. Canny [9], 2017 IEEE 35th International Conference on Computer Design 1063-6404/17 $31.00 © 2017 IEEE DOI 10.1109/ICCD.2017.78 439

Exploring STT-MRAM Based In-Memory Computing Paradigm with

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Exploring STT-MRAM Based In-Memory Computing Paradigm with

Exploring STT-MRAM based In-MemoryComputing Paradigm with Application of Image

Edge ExtractionZhezhi He∗, Shaahin Angizi†, Deliang Fan‡

∗†‡Department of Electrical and Computer Engineering, University of Central Florida, Orlando, 32816 USA∗Elliot.He, †[email protected], ‡ [email protected]

Abstract—In this paper, we propose a novel Spin-TransferTorque Magnetic Random-Access Memory (STT-MRAM) arraydesign that could simultaneously work as non-volatile memoryand implement a reconfigure in-memory logic operation with-out add-on logic circuits to the memory chip. The computedoutput could be simply read out like a typical MRAM bit-cellthrough the modified peripheral circuit. Such intrinsic in-memorycomputation can be used to process data locally and transfersthe “cooked” data to the primary processing unit (i.e. CPU orGPU) for complex computation with high precision requirement.It greatly reduces power-hungry and long distance data commu-nication, and further leads to extreme parallelism within memory.In this work, we further propose an in-memory edge extractionalgorithm as a case study to demonstrate the efficiency of in-memory preprocessing methodology. The simulation results showthat our edge extraction method reduces data communication asmuch as 8x for grayscale image, thus greatly reducing systemenergy consumption. Meanwhile, the F-measure result showsonly ∼10% degradation compared to conventional edge detectionoperator, such as Prewitt, Sobel and Roberts.

Index Terms—In-memory computing, STT-MRAM, image pro-cessing, edge detection

I. INTRODUCTION

The era of big data reveals the emerging needs to rethink andredesign the current computing architecture that can supportmemory-oriented processing for large datasets at exascale(1018 bytes/s or flops) [1]. Especially, the ability of conven-tional computing platforms to address these needs is beginningto stall fundamentally due to the exiting bottlenecks eitherin architecture (i.e. memory wall [2]) or device technology(power wall [2]). In the existing von Neumann computingarchitecture, the separation of memory and computing unitsinterconnected via buses has faced serious challenges, suchas long memory access latency, significant congestion atI/Os, limited memory bandwidth, and huge leakage powerconsumption in big data-driven applications [3]. To mitigatethese challenges, in-memory computing architecture and non-volatile memory technology have been proposed to inte-grate memory and logic, leading to a more energy efficientinformation processing platform. However, such in-memoryprocessing platforms typically impose additional area overheadto the memory chip owning to add-on logic elements, and thussacrificing valuable memory capacity [3].

This material is based upon work supported in part by the National ScienceFoundation under Grant No.1740126.

From the device technology perspective, there are manyrecent researches carried out about emerging Non-VolatileMemories (NVM) that are promising to design such in-memory computing concept, for example, Resistive Random-Access Memory (RRAM) [3], Phase Change Memory (PCM)[4], Spin-Transfer Torque Magnetic Random-Access Memory(STT-MRAM) [5]. With great development of fabricationtechnology and commercialization progress, STT-MRAM hasemerged as a leading non-volatile memory candidate owningto its unique features [5], [6], such as non-volatility, zerostandby leakage, high write/read speed, compatibility withCMOS fabrication process, scalability, excellent enduranceand high integration density. They are among the most promi-nent features which could pave a novel way to realize areaand energy-efficient system supporting in-memory computing,normally-off computing, instant-on computing [5]–[8].

Memory Processor

Data storage ComputationHeavy data communication

Memory Processor

High-precsionComputation

light data communication

Data storage

Low-precison computation

Red

uce

da

ta

com

mu

nic

atio

n

&

Fig. 1: In-memory computing to reduce the data communica-tion and increase the computation parallelism.

For big data processing, one potential solution is extractingthe abstract of data, and utilizing such extracted data for furtherprocessing. As depicted in Fig. 1, from in-memory computingarchitecture perspectives, memory takes care of data featureextraction with bit-wise low precision computation in extremeparallel approach, while the primary processing unit handleshigh precision operation with limited I/O bandwidth. In thiswork, edge extraction is used as a case study to demon-strate the efficiency improvement with in-memory computingparadigm. Edge detection is one of the fundamental operationsin computer vision and image processing, which drasticallyreduces the complexity of original image while extracts thecrucial boundary information for further processing procedures[9]. Currently, most edge detection algorithms can be classifiedinto two categories [10]: search-based (e.g. Sobel [11], Prewitt[12], Roberts [13]) or zero-crossing based (e.g. Canny [9],

2017 IEEE 35th International Conference on Computer Design

1063-6404/17 $31.00 © 2017 IEEE

DOI 10.1109/ICCD.2017.78

439

Page 2: Exploring STT-MRAM Based In-Memory Computing Paradigm with

Marr-Hildreth[14]).Thesearch-basedalgorithmsarebasedonlookingforthelocalmaximumofthefirst-orderderiva-tives,withapredefinedthresholdtodecidetheedgepixels.Meanwhile,thezero-crossingbasedalgorithmsidentifytheedgethroughseekingthezerocrossingpointsinthesecond-orderderivativesoftheinputimage.However,suchderivativesoperationiscomputationexpensive,whichurgesthediscoveryofotherenergyefficientedgeextractionmethod.

Inthispaper,weproposeanovelSTT-MRAMarraythatcouldworkasconventionalmemoryandin-memorycomput-ingkernel,whichisinspiredbytherelatedworksin[15]–[17].Fortheadditionalin-memorycomputingfunction,nomemorycapacityissacrificed,whileanymemorybit-cellsallocatedinthesamerow/columncanbeefficientlyleveragedtorealizebit-wiseAND/OR/XORandotherlinear/nonlinearoperationswithlargerfan-ins.Moreover,weproposealocaledgeextrac-tionmethodfordigitalimage(i.e.binaryandgrayscaleimage),utilizingtheintroducedin-memorycomputingplatform.Thus,onlyimagefeatureisextractedandtransferredtotheprimaryprocessingunit(i.e.CPUorGPU)forfurtherprocessing.

II.SPIN-TRANSFERTORQUERANDOM-ACCESSMEMORY

Atypical MagneticTunnelJunction(MTJ)structure,asshowninFig.2a,consistsoftwoferromagneticlayerswithatunnelbarriersandwichedbetweenthem.DuetotheTunnelMagnetoResistance(TMR)effect[18]–[20],theresistanceofMTJishigh(low)whenthemagnetizationoftwoferromag-neticlayersareinanti-parallel(parallel)state.TheTMRratioisdefinedas(RAP-RP)/RP,whichmayvaryfrom10%to400%dependingonmaterialsandtemperature[18]–[21].Thus,thedataarestoredasthemagnetizationdirectioninthefreelayer,whichcouldbeprogrammedthroughcurrentinducedSpin-TransferTorque(STT).Notethat,theMTJwithPerpendicularMagneticAnisotropy(PMA)isusedinthiswork.The1T1Rbit-celliswidelyusedinthetypicalSTT-MRAMdesign,asdepictedinFig.2b,whichiscorrespondinglycontrolledbyBitLine(BL), WordLine(WL)andSourceLine(SL).ThebiasingconditionsofmemoryreadandwritearepresentedinFig.2c.Forbothmemoryreadandwriteoperation,theWLisenabled,whichturnsontheaccesstransistor.Then,avoltagedrop-VDDor+VDDisappliedacrosstheBLandSL,inordertorealizewrite‘1’or‘0’respectively.Formemoryread,asensingcurrent(IREAD)isappliedontheBLandconsequentlygeneratesasensingvoltage,whichcanbedetectedbysenseamplifier.

Wejointlyusethe Non-Equilibrium Green’sFunction(NEGF)andLandau-Lifshitz-Gilbert(LLG)equationtomodelSTT-MRAMbitcell(i.e. MTJ)forcircuitlevelsimulation.ThemagnetizationdynamicsofFL(m)canbemodeledbyLLGequationwithspin-transfertorqueterms,whichcanbemathematicallydescribedas[22],[23]:

dm

dt=−|γ|m×Heff+α m×

dm

dt

+|γ|β(m×mp×m)−|γ|β (m×mp)

Pinned LayerTunneling barrierFree Layer

Anti-ParallelState (AP):RAP -> 1

ParallelState (P):RP -> 0

WL

BL SL

WWL

WBL

SL

RBL

RWL

Pinned LayerTunneling barrierFree Layer

Heavy Metal

Pinned LayerTunneling barrierFree Layer

Heavy Metal

X

YZ

WL

BL SL

XY

Z

Write 0

XY

Z

Anti-ParallelState (AP):RAP -> 1

ParallelState (P):RP -> 0

P AP AP P Write 1

Spin current generation due to spin Hall effect

Write current Read current Spin current

WWL

WBL

SL

RBL

RWL

Pinned LayerTunnel barrierFree Layer

Pinned LayerTunneling barrierFree Layer

Heavy Metal

X

YZ

XY

Z

Write 00

XY

Z

Anti-ParallelState (AP):RAP -> 1

ParallelState (P):RP -> 0

P AP AP P

Write 11

Spin current generation due to spin Hall effectSpin current generation due to spin Hall effect

Write current Read current Spin current

WWL

WBL

SL

RBL

RWL

IWRITE

IWRITE

IWRITE

IREAD

Pinned LayerTunnel barrierFree Layer

X

YZ

XY

Z

Anti-ParallelState (AP):RAP -> 1

ParallelState (P):RP -> 0

P AP AP P

IWRITE

IWRITE

(1)

Pinned LayerTunneling barrierFree Layer

Anti-ParallelState (AP):RAP -> 1

ParallelState (P):RP -> 0

WL

BL SL

WWL

WBL

SL

RBL

RWL

Pinned LayerTunneling barrierFree Layer

Heavy Metal

Pinned LayerTunneling barrierFree Layer

Heavy Metal

X

YZ

WL

BL SL

XY

Z

Write 0

XY

Z

Anti-ParallelState (AP):RAP -> 1

ParallelState (P):RP -> 0

P AP AP P Write 1

Spin current generation due to spin Hall effect

Write current Read current Spin current

WWL

WBL

SL

RBL

RWL

Pinned LayerTunnel barrierFree Layer

Pinned LayerTunneling barrierFree Layer

Heavy Metal

X

YZ

XY

Z

Write 00

XY

Z

Anti-ParallelState (AP):RAP -> 1

ParallelState (P):RP -> 0

P AP AP P

Write 11

Spin current generation due to spin Hall effectSpin current generation due to spin Hall effect

Write current Read current Spin current

WWL

WBL

SL

RBL

RWL

IWRITE

IWRITE

IWRITE

IREAD

Pinned LayerTunnel barrierFree Layer

X

YZ

XY

Z

Anti-ParallelState (AP):RAP -> 1

ParallelState (P):RP -> 0

P AP AP P

IWRITE

IWRITE

(a)

(b)

OperationsWrite‘1’(‘0’)

Read

WL VDD VDDBL GND(VDD) IREADSL VDD(GND) GND

(c)

Fig.2:(a)Devicestructureofconventionalmagnetictunneljunctioninparallel-andanti-parallelstates, withcurrent-inducedspin-transfertorqueswitchingscheme.(b)Bit-cellschematicof1T1RSTT-MRAM.(c)BiasingconditionsforSTT-MRAMoperations.

β=|h

2µ0e|IcP

AMTJtFLMs(2)

wherehisthereducedplankconstant,γisthegyromagneticratio,IcisthechargecurrentflowingthroughMTJ,tFListhethicknessoffreelayer, isthesecondSpintransfertorquecoefficient,andHeffistheeffectivemagneticfield,Pistheeffectivepolarizationfactor,AMTJ isthecrosssectionalareaofMTJ,mpistheunitpolarizationdirection.TheFig.3ahasshownthenormalizedmagnetizationdynamicsoffreelayerinx-,y-andz-axis,whenperformingtheSTT-MRAMwriteschemeasdescribedinFig.2c.

TABLEI.SIMULATIONPARAMETERS

Parameter Value

Freelayerdimension,(W ×L×t)FL 65×65×2nm3

Polarizationfactor,P 0.4GilbertDampingFactor,α 0.007SaturationMagnetization,Ms 850kA/m

Oxidethickness,tox 1.2nmRAproduct,RAp/TMR 10.58Ω·µm2/171.2%

Supplyvoltage 1VCMOStechnology 45nmSTT-MRAMcellarea 48F2

Accesstransistorwidth 9FCellaspectRatio 1.34

BasedonthesimulationparameterslistedinTableI,themagnetizationdynamicfromLLGequationcanprovidetherelativeangleθbetweenthemagnetizationofPL(z)andFL(m).Therefore,thereal-timeconductanceofMTJ(GMTJ)isgivenby[24]:

GMTJ=GP+GAP2

+GP−GAP2

cosθ (3)

whereGPandGAParetheconductanceof MTJinparallel(θ=0)andanti-parallel(θ=180)configurations.Both

440

Page 3: Exploring STT-MRAM Based In-Memory Computing Paradigm with

0 1 2 3 4-1

0

1

Normalized

magnetization

AP to P switching (write '0')

Mx My Mz

0 1 2 3 4

Time (s)

-1

0

1

Normalized

magnetization

P to AP switching (write '1')

Mx My Mz

1 1.1 1.2 1.3 1.4 1.5

Oxide thickness t (nm)

0

50

100

150

200

250

300

RA product ( m 2)

RA of ParallelRA of Anti-parallel

(a) (b)

Fig.3:(a)Thetransientplotofthenormalizedmagnetizationswitchinginx-,y-andz-axis,withprovidesSTT-MRAMwritescheme.(b)TheResistance-Areaproductw.r.tthethicknessofMTJtunneloxide(tox).

GPandGAP areobtainedfromtheatomisticlevelsimula-tionframeworkbasedonNon-EquilibriumGreen’sFunction(NEGF)[25],whiletheResistance-AreaProductwithrespecttothethicknessofMTJtunneloxideisshowninFig.3b.

III.STT-MRAMBASEDIN-MEMORYCOMPUTING

InadditiontotheconventionalmemoryfunctionofSTT-MRAM,through modifyingtherow/columndecodersofmemorysub-arrayandsensingcircuitry,areconfigurablein-memorylogic(i.e.AND,OR,XORandetc.)canbeimple-mentedwithoutadd-onlogiccircuits.

Vsense

RM1

R1

Isense

RM2

R2

SA

Rref

Iref

Vref

VP,P VAP,P VAP,AP

ANDOR

VP,P VAP,P VAP,AP

ANDOR

Vsense

RM1

R1

Isense

SA

Rref

Iref

Vref

VP VAP

Read

VP VAP

Read

A.Sensing-basedin-memorycomputingscheme

Vsense

RM1

R1

Isense

RM2

R2

SA

Rref

Iref

Vref

VP,P VAP,P VAP,AP

ANDOR

VP,P VAP,P VAP,AP

ANDOR

Vsense

RM1

R1

Isense

SA

Rref

Iref

Vref

VP VAP

Read

VP VAP

Read

(a) (b)

Fig.4:TheideaofvoltagecomparisonbetweenVsenseandVreffor(a)memoryreadand(b)in-memorylogicoperation.

Thekeyideatoperform memoryreadandin-memorycomputingistochoosedifferentthresholdswhensensingtheselectedmemorycell(s).AsshowninFig.4a,formemoryreadoperation,asinglememorycellisaddressedandroutedinthememoryreadpathtogenerateasensevoltage(Vsense),whichwillbecomparedwithareferencevoltage(Vref).Owingtotheparallel-oranti-parallelstateofselectedSTT-MRAMbit-cell(RM1),thesensevoltageareVPorVAP(VP<VAP)respectively.Thus,throughsettingthereferencevoltageat(VAP+VP)/2,thesenseamplifieroutputsbinary‘1’whenVsense>Vref,otherwisethesenseamplifieroutputs‘0’.

Fig.4bdepictsthesensing-based methodofin-memoryBooleancomputing,wheretwomemorybit-cells(RM1 andRM2)aresensedsimultaneously.R1andR2correspondstotheaccesstransistorswithinthesensingpath.OwingtothedifferentresistancecombinationsoftwoselectedSTT-MRAMbit-cells(i.e.RAP,RAP;RAP,RP;RP,RP),threedifferentsensevoltagesVsense(VAP,AP;VAP,P;VP,P)couldbegener-atedrespectively.Considersettingthereferencevoltageas(VAP,AP+VAP,P)/2throughtuningreferenceresistance,thesenseamplifieronlyoutputs‘1’whenbothselectedSTT-MRAMsareinanti-parallelstate(Vsense>Vref).Thus,thissensingoperationwithmodifiedreferencevoltageperformsanANDlogicoperationtakenthebinarydatastoredinRM1andRM2aslogicinputs.Similarly,whenthereferencevoltageisshiftedto(VP,P+VAP,P)/2,theORlogicoperationcanbeperformedaswell.Therefore,throughtuningthereferencevoltageforcomparison,thesenseamplifiercanperformreconfigurablein-memorycomputations.

B.in-memorycomputingplatformdesign

BasedontheSTT-MRAMdevicemodelingapproachdis-cussedabove,wehaveperformedthecircuitlevelSTT-MRAMsimulationsinCadenceSpectrewith45nmNCSUPDK[26]asCMOSlibrary.TheMTJresistivemodelisobtainedfromthemodelingapproachdescribedinSectionIIwithrespecttothedeviceparameterslistedinTable.I.Fig.5adepictsthearchitectureofmemorysub-array,wherememoryread/writepathofthespecificbit-cellisenabledbytherow/columndecoders.AsshowninFig.5c,themodifiedrow/columndecoderscanenableeithersingleline(memorywrite/read)ordoublelines(bit-wiseBooleancomputation),dependingontheaddresses(Addr1andAddr2)provided.Formemorywrite,thevoltagedropacrossBLandSLisgeneratedbytheVoltageDrivers(VD),whichrealizethefastmemoryswitchingasdepictedinFig.3a.FormemoryreadandBooleancomputation,asmallsensecurrent(Isense 3µA)isinjectedintothereadpathtogenerateasensevoltage(Vsense),whichistakenastheinputofmodifiedsensecircuit.AsshowninFig.5d,themodifiedsensecircuitcanprovidememoryread,AND/NAND,OR/NORandXOR/XNORfunctions,throughcombiningtwosenseamplifiers(i.e.StrongARMlatchshowninFig.5b),externalCMOSlogicgateandcontrolunits.OwingtothecomplementaryoutputsofSA,themodifiedsensingcircuitcanprovideNANDandNORwithoutadditionalcost.Moreover,accordingtotheBooleanrepresentationofp q=(p∨q)∧(p∧q),theXORlogiccanberealizedwithtwosenseamplifiers(i.e.performingANDandNORlogicrespectively)andanadditionalCMOSNORgate.Itisnoteworthythat,othercomputationrequiringtwothresholdcanalsobeimplementbyaddingadditionalreferencebranches.Theoperationofsuchsensecircuitisdeterminedbythecontrolsignals(ENAND,ENORandENM),whilethedesiredresultisacquiredbytheselectionsignal(SEL)oftheoutputmultiplexer.Notethat,onlyonesenseamplifierisusedduringAND/OR/memory-readoperation,inordertoreducethepowerconsumptionofsensing.

441

Page 4: Exploring STT-MRAM Based In-Memory Computing Paradigm with

Ro

w D

eco

de

rColumn Decoder

C_AddrR

_Ad

dr

VD VD VD VDVD VD

Ise

nse

SenseCircuit

BL

[0]

WL[0]

SL

[0]

BL

[1]

SL

[1]

BL

[m]

WL[1]

WL[n]

SL

[m]

(a)

CLK

CLK CLK

IN+ IN-

OUT+ OUT-

SAIN+

IN- OUT-

OUT+

RMTJ

RMTJ/2 RMTJ/2

W1 W2

R

R

W1 W2

(b) Addr1

decoder1

decoder2

Addr2

(c)

SANAND

AND

SANOR

OR

XOR

XNOR

Vsense

Vref1

Vref2

SELl3

ENAND

Iref

ENOR

ENM

Iref

Vsense

RMRAM1

R1'

Isense

RMRAM2

R2'

RAND

ROR

RM

(d)

Fig. 5: (a) The modified sub-array structure of STT-MRAM.(b) The schematic of StrongARM latch as Sense Amplifier(SA). (c) Modified decoder which provides single/multiplelines enable function. (d) Modified sense circuit for regularmemory W/R and in-memory computing operations.

C. performance evaluation

Fig. 6 depicts the transient simulation result of the sensecircuit under a 2ns period clock signal (CLK), which takesthe data stored in MRAM1 and MRAM2 as inputs. WhenCLK is high, the sense amplifier is in pre-charge phase andthe output is reset to ‘0’. When CLK is low, the sense am-plifier is in sampling phase, and generates logic computationresult depending on the reference voltage configuration. Vcmpincludes all the input signals of SAs, which are sense voltage(Vsense) and two reference voltage (Vref1 and Vref2). Vref1 isset to (VAP,AP+VAP,P)/2, and Vref2 is set to (VP,P+VAP,P)/2, forperforming AND and OR respectively. Note that, the rippleof Vcmp are resulted from the kickback noise due to theclock switching of SA. In general, the transient curve hasdemonstrated the correct logic operation of sense circuit witharound 200ps output latency.

Furthermore, in order to validate the variation tolerance of

00.5

1

CL

K

(V)

00.5

1

Inp

ut

MRAM1 MRAM2

0

10

20

30

Vcm

p

(mV

) Vsense

Vref1

Vref2

00.5

1

VO

R

(V)

00.5

1

VA

ND

(V)

0 1 2 3 4 5 6

Time (ns)

00.5

1

VX

OR

(V)

Pre-charge Sampling Sampling Pre-charge SamplingPre-charge

'0''0'

'0' '0'

'1'

'1'

'1'

'1''0'

('0', '0') ('0', '1') ('1', '1')

Fig. 6: Transient simulation results of in-memory computingoperations (i.e. AND, OR and XOR).

Fig. 7: Monte-Carlo simulation result of sense voltage (Vsense)distribution for (top) conventional memory read operationwith single STT-MRAM, and in-memory computing operationwith (middle) two selected STT-MRAM bit-cells (down) fourselected STT-MRAM bit-cells.

sense circuit, we have performed Monte-Carlo simulation with100000 trials. A σ = 2% variation is added to the Resistance-Area product (RAP), and a σ = 5% process variation isadded on the TMR. The simulation result of sense voltage(Vsense) distributions in Fig. 7 shows the sense margin forconventional memory read, two fan-in in-memory computationand four fan-ins sense-based operation. It can be seen thatsense margin gradually reduces when increasing the numberof fan-ins (selected STT-MRAM cells for computation). Notethat, such sense margin could be improved by increasing thesense current, but with a sacrifice of read operation energyefficiency.

TABLE II. Performance evaluation of STT-MRAM based in-memory computing platform

Metrics Memory mode ComputemodeWrite Read

Dynamic Energy 826.149pJ 870.042pJ 985.851pJMat Dynamic Energy 9.151pJ 14.637pJ 21.303pJ

Subarray Dynamic Energy 2.218pJ 3.590pJ 5.252pJLeakage Power 830.847mW

Area 1.271mm x 5.216mm = 6.632mm2

In order to evaluate the performance of the proposed STT-MRAM based in-memory computing platform, we configure

442

Page 5: Exploring STT-MRAM Based In-Memory Computing Paradigm with

the memorychiporganizationbydividingitinto multipleBanksconsistingofmultiple Mats.Each Matincludesmul-tiplesub-arraysorganizedinaH-treeroutingmanner. Weemploymodifiedself-consistentNVSim[27]alongwithanin-housedevelopedC++codetoverifythearchitecturelevelperformanceofSTT-MRAMin-memorycomputingplatform,whichincludesthemetricsforread/writeinmemorymodeandlogicoperationsincompute mode.TableIIliststheenergyconsumptionofasample4MBmemorywith512word-widthin45nmprocessnode.Notethat,theenergyoverheadcomesfromthepreviousdescribedmodifieddecoders,senseamplifier,etc.

IV.EDGEDETECTIONUSINGIN-MEMORYBOOLEANCOMPUTING

Inthissection,anin-memoryedgeextractionalgorithmisproposedtoacquiretheimageedgefeature,whichleveragestheintrinsicin-memorycomputingfunctionalityandmassivelyreducesthedatacommunicationbetweenmemoryandmainprocessingunit.

A.BinaryImage

Consideringabinaryimagewithdimensionofm×n

2 by 2

m

n n-1

m-1

Sliding window Input image

(binary matrix)Output Edge map(binary matrix)

VD VD

Vsense

4RAP

4 RP

3 RAP,1 RP

2 RAP,2 RP

1 RAP,3 RPEdge pixel

Vref1

Vref2

.AsdescribedinFig.8a,a2by2edgedetectionoperatorshiftfromthetop-lefttothebottom-rightcorner.Ifandonlyifallfourneighboringpixelsareoftheidenticalmagnitude(binary‘1’or‘0’),itindicatestheabsenceofimagevariation.Then,thecorrespondingpixelattheedgemapissettobinary‘0’(black).Fortheothercombinations,thecorrespondingedgemappixelissettobinary‘1’(white).ThecorrespondingedgeextractionalgorithmisdescribedinAlgorithm1.

(a)

(b) (c)

Fig.8:(a)Theedgeextractionprocedureforbinaryimage(i.e.individualbit-plane).(b)Originalbinaryimageand(c)itsedgemapextractedfrommemory.

Inordertoextracttheedgemapthroughtheproposedin-memorycomputingplatform,thedimensionofmemorysub-arrayshouldbesufficienttostoretheentirebinaryimagematrix.Forextractingimageedge,fourneighboringSTT-MRAMbit-cellsareselectedsimultaneouslybythemodified

Algorithm1In-memoryedgeextractionalgorithmforbinaryimage(singlebit-plane)

Input:inputbinaryimageI(m,n)Output:edgemapOUT=edge(I);fori=1tom-1doforj=1ton-1dotmp=sum(I(i:i+1,j:j+1));iftmp<1or>3thenOUT(i,j)=1;elseOUT(i,j)=0;endifendforendforreturn OUT

Algorithm 2 In-memoryedgeextractionalgorithmforgrayscaleimage(multiplebit-planes)

Input:inputgrayscaleimageI(m,n,8);p:numberofbitplanesusedforedgeextractionOutput:edgemapOUT;OUT=zeros(m-1,n-1);fork=8+1-pto8dotmpedge=edge(I(:,:,k));fori=1tom-1doforj=1ton-1doOUT=or(tmpedge(i,j),OUT(i,j));endforendforendforreturn OUT

rowandcolumndecoders.Then,theedgedetectionoperatorcanbeintrinsicallyperformedthroughthepreviousdescribedmodifiedsenseamplifierbyconfiguringthetworeferencevoltagesto(V4AP+V3AP,P)/2and(V4P+V3P,AP)/2,respectively,asmarkedinFig.8a.Abinaryimageanditsextractededgeimagearedisplayedin8band8casanexample.Notethat,insuchprocess,theedgemapcouldbedirectly’readout’usingthemodifiedsenseamplifier.While,intraditionaldesign,therawimageneedstobereadoutfrommemoryandsenttoprocessingunitforedgeextractioncomputation.

B.GrayscaleImage

Theproposedin-memoryedgeextractionalgorithmcouldalsobeextendedtogreysalceimage,asdescribedinAl-gorithm2.Consideringan8-bitgrayscaleimage(Fig.9a)isdividedintoeightbit-planes,from mostsignificantbit(MSB)toleastsignificantbit(LSB)asshowninFigs.9bto9i.Itpresentsthatthehighorderbit-planescontainsmoreinformationforlocalprocessing,whiletheloworderbit-planearenoisy.Inordertoperformtheedgeextractionbasedonthegrayscaleimage,thesamealgorithmshowninAlgorithm1isappliedonbit-planesseparately.Then,anin-memoryORoperationisimplementedoneachpixeloverthecalculated

443

Page 6: Exploring STT-MRAM Based In-Memory Computing Paradigm with

(a) grayscale image (b) 8th bit-plane (MSB) (c) 7th bit-plane

(d) 6th bit-plane (e) 5th bit-plane (f) 4th bit-plane

(g) 3rd bit-plane (h) 2nd bit-plane (i) 1st bit-plane (LSB)

Fig. 9: (a) grayscale image of lena. The bit-planes from MSBto LSB are shown (b-i), which indicate that most imageinformations are included in (b-e).

bit-planes. The extracted edge images are shown in Fig. 10,through the combination of different numbers of bit-planes.

(a) (b)

(c) (d)

Fig. 10: The edge image that combines (i.e. OR operation)the in-memory edge extraction result of (a) 8th (b) 8th-7th (c)8th-6th (d) 8th-5th most significant bit-planes.

The F-measure (2 · Precision · recallprecision+ recall

) is taken as indi-

cator of extracted edge quality. We perform the edge detec-tion evaluation based on the Berkeley Segmentation Dataset(BSD300) [28], which contains 300 natural images and human

annotations as ground truth. The simulation results listed inTable III shows that the best F-measure (only 8th bit-plane) ofour in-memory edge method merely has 0.05∼0.06 (∼10%)degradation in comparison with Prewitt [12], Sobel [11] andRoberts [13] such conventional edge operators. Increasingthe number of bit-planes for computation may involves moreimage details for edge extraction, but introducing amounts ofnoise as well which lowers F-measure of extracted edge image.

TABLE III. Evaluation of different edge detection algorithmson BSD300 dataset

Conventionaledge oeprator

Canny[1]

Prewitt[2]

Sobel[3]

Roberts[4]

F-measure 0.58 0.48 0.48 0.47

Our methods: 1 bit-plane(Fig. 10a)

2 bit-planes(Fig. 10b)

3 bit-planes(Fig. 10c)

4 bit-planes(Fig. 10d)

F-measure 0.42 0.40 0.35 0.32

Fig. 11: Energy consumption of conventional edge detectionalgorithm using ASIC and our in-memory edge extractionmethod for one image (lena.tif, 512× 512 pixels).

We further employ NVsim (memory setup in Section III-C)and design compiler (processing element synthesis) to evaluatethe image edge extraction system energy consumption in 45nmCMOS technology. Fig. 11 shows the energy consumptioncomparison between various image edge extraction algorithms.It shows that our edge extraction method achieves more than8× energy reduction in comparison with conventional edgeextraction algorithms. Note that, the total energy consumptionincludes two parts: computation energy and memory accessenergy. Since the memory access energy for transferring datafrom memory to processing element (PE) is much larger thancomputation energy, reducing data communication betweenmemory and PE greatly reduces the total energy dissipa-tion. The “computation energy” for in-memory computingalgorithm shown in Fig. 11 comes from decoders and SAsconfiguration, data buffer, copy and write-back operationsin in-memory data processing. However, such overhead ismuch smaller, due to local processing, compared with memoryaccess energy.

C. Edge-based Motion Detection

In additional to using the F-measure for evaluating our in-memory edge extraction quantitatively, we utilize the edge-based motion detection [29] to demonstrate the extractedimages have relatively good quality for the edge relatedapplication. Note that, the extracted edge from memory is

444

Page 7: Exploring STT-MRAM Based In-Memory Computing Paradigm with

(a) (b) (c)

(d) (e) (f)

(g) (h) (i)

Fig. 12: (a-c) The three successive frames captured from astationary camera. The corresponding edge representation ofmoving object are shown in (d-f) using Canny edge detector.(g-h) our in-memory edge extraction algorithm.

forwarded to the main processing unit (i.e. CPU or GPU) forprocessing. Such motion detection is performed with singleface dataset [30], where three successive frames and its motiondetection result using conventional canny edge operator andour in-memory edge extraction method are taken as exampleand shown in Fig. 12. It can be seen that the object motion issuccessfully detected using our extracted edge maps, similaras Canny edge detector.

V. CONCLUSION

In this work, we have presented a STT-MRAM based in-memory computing platform and novel edge extraction algo-rithm for in-memory image preprocessing. In-memory com-puting has emerged as promising computation architecture forbig-data application, where the data communication betweenmemory and primary processing unit (i.e. CPU and GPU)merely involves the preprocessed data. Such computationmethodology drastically reduces the data communication, thusleading to energy-efficient operation and extreme computationparallelism. As a case study, we have proposed an image edgeextraction algorithm that can be intrinsically performed by ourin-memory computing platform with 8× energy reduction, andonly 10% F-measure score degradation.

REFERENCES

[1] Y. Wang et al., “An energy-efficient nonvolatile in-memory computingarchitecture for extreme learning machine by domain-wall nanowiredevices,” IEEE TNANO, vol. 14, no. 6, pp. 998–1012, 2015.

[2] B. Hoefflinger, “Itrs: The international technology roadmap for semi-conductors,” in Chips 2020. Springer, 2011, pp. 161–174.

[3] P. Chi et al., “Prime: A novel processing-in-memory architecture forneural network computation in reram-based main memory,” in Proceed-ings of the 43rd ISCA. IEEE Press, 2016, pp. 27–39.

[4] H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Ra-jendran, M. Asheghi, and K. E. Goodson, “Phase change memory,”Proceedings of the IEEE, vol. 98, no. 12, pp. 2201–2227, 2010.

[5] Y. Kim et al., “Write-optimized reliable design of stt mram,” inProceedings of the 2012 ACM/IEEE ISLPED. ACM, 2012, pp. 3–8.

[6] X. Fong et al., “Spin-transfer torque memories: Devices, circuits, andsystems,” Proceedings of the IEEE, vol. 104, pp. 1449–1488, 2016.

[7] Y. Seo et al., “High performance and energy-efficient on-chip cacheusing dual port (1r/1w) spin-orbit torque mram,” IEEE Journal onEmerging and Selected Topics in Circuits and Systems, vol. 6, no. 3,pp. 293–304, 2016.

[8] G. Prenat et al., “Ultra-fast and high-reliability sot-mram: From cachereplacement to normally-off computing,” IEEE Transactions on Multi-Scale Computing Systems, vol. 2, pp. 49–60, 2016.

[9] J. Canny, “A computational approach to edge detection,” IEEE Transac-tions on pattern analysis and machine intelligence, no. 6, pp. 679–698,1986.

[10] X. Jia, H. Huang, Y. Sun, J. Yuan, and D. M. Powers, “A noveledge detection approach using a fusion model,” Multimedia Tools andApplications, vol. 75, no. 2, pp. 1099–1133, 2016.

[11] I. Sobel, “Camera models and machine perception,” DTIC Document,Tech. Rep., 1970.

[12] J. M. Prewitt, “Object enhancement and extraction,” Picture processingand Psychopictorics, vol. 10, no. 1, pp. 15–19, 1970.

[13] L. G. Roberts, “Machine perception of three-dimensional soups,” Ph.D.dissertation, Massachusetts Institute of Technology, 1963.

[14] D. Marr and E. Hildreth, “Theory of edge detection,” Proceedings ofthe Royal Society of London B: Biological Sciences, vol. 207, no. 1167,pp. 187–217, 1980.

[15] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: Aprocessing-in-memory architecture for bulk bitwise operations in emerg-ing non-volatile memories,” in Design Automation Conference (DAC),2016 53nd ACM/EDAC/IEEE. IEEE, 2016, pp. 1–6.

[16] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing inmemory with spin-transfer torque magnetic ram,” arXiv preprintarXiv:1703.02118, 2017.

[17] Z. He, S. Angizi, F. Parveen, and D. Fan, “Leveraging dual-modemagnetic crossbar for ultra-low energy in-memory data encryption,” inProceedings of the on Great Lakes Symposium on VLSI 2017. ACM,2017, pp. 83–88.

[18] G. Autes, J. Mathon, and A. Umerski, “Strong enhancement of thetunneling magnetoresistance by electron filtering in an fe/mgo/fe/gaas(001) junction,” Physical review letters, p. 217202, 2010.

[19] P. Mavropoulos et al., “Half-metallic ferromagnets for magnetic tunneljunctions by ab initio calculations,” Physical Review B, vol. 72, no. 17,p. 174428, 2005.

[20] M. Bowen et al., “Nearly total spin polarization in la 2/3 sr 1/3 mno3 from tunneling experiments,” Applied Physics Letters, vol. 82, no. 2,pp. 233–235, 2003.

[21] J. Hayakawa et al., “Dependence of giant tunnel magnetoresistance ofsputtered cofeb/mgo/cofeb magnetic tunnel junctions on mgo barrierthickness and annealing temperature,” Japanese Journal of AppliedPhysics, vol. 44, no. 4L, p. L587, 2005.

[22] M. J. Donahue, “Oommf user’s guide, version 1.0,” -6376, 1999.[23] X. Fong, S. K. Gupta, N. N. Mojumder, S. H. Choday, C. Augustine, and

K. Roy, “Knack: A hybrid spin-charge mixed-mode simulator for eval-uating different genres of spin-transfer torque mram bit-cells,” in 2011International Conference on Simulation of Semiconductor Processes andDevices, Sept 2011, pp. 51–54.

[24] D. Fan, S. Maji, K. Yogendra, M. Sharad, and K. Roy, “Injection-lockedspin hall-induced coupled-oscillators for energy efficient associativecomputing,” IEEE Transactions on Nanotechnology, vol. 14, no. 6, pp.1083–1093, Nov 2015.

[25] G. Panagopoulos et al., “A framework for simulating hybrid mtj/cmoscircuits: Atoms to system approach,” in DATE. IEEE, 2012.

[26] (2011) Ncsu eda freepdk45. [Online]. Available: http://www.eda.ncsu.edu/wiki/FreePDK45:Contents

[27] X. Dong, C. Xu, N. Jouppi, and Y. Xie, “Nvsim: A circuit-level perfor-mance, energy, and area model for emerging non-volatile memory,” inEmerging Memory Technologies. Springer, 2014, pp. 15–50.

445

Page 8: Exploring STT-MRAM Based In-Memory Computing Paradigm with

[28] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection andhierarchical image segmentation,” IEEE transactions on pattern analysisand machine intelligence, vol. 33, no. 5, pp. 898–916, 2011.

[29] A. Sappa and F. Dornaika, “An edge-based approach to motion detec-tion,” Computational Science–ICCS 2006, pp. 563–570, 2006.

[30] E. Maggio and A. Cavallaro, “Hybrid particle filter and mean shifttracker with adaptive transition model,” in Proceedings. (ICASSP ’05).IEEE International Conference on Acoustics, Speech, and Signal Pro-cessing, 2005., vol. 2, March 2005, pp. 221–224.

446