
ISCA 2020 Submission #518 — Confidential Draft: DO NOT DISTRIBUTE

TIMELY: Pushing Data Movements and Interfaces in PIM Accelerators Towards Local and in Time Domain

Abstract—Resistive-random-access-memory (ReRAM) based processing-in-memory (R2PIM) accelerators show promise in bridging the gap between Internet of Things devices' constrained resources and Convolutional/Deep Neural Networks' (CNNs/DNNs') prohibitive energy cost. Specifically, R2PIM accelerators enhance energy efficiency by eliminating the cost of weight movements and improving the computational density through ReRAM's high density. However, the energy efficiency is still limited by the dominant energy cost of input and partial sum (Psum) movements and the cost of the digital-to-analog (D/A) and analog-to-digital (A/D) interfaces. In this work, we identify three energy-saving opportunities in R2PIM accelerators: analog data locality, time-domain interfacing, and input access reduction, and propose an innovative R2PIM accelerator called TIMELY, with three key contributions: (1) TIMELY adopts analog local buffers (ALBs) within ReRAM crossbars to greatly enhance data locality, minimizing the energy overheads of both input and Psum movements; (2) TIMELY largely reduces the energy of each single D/A (and A/D) conversion and the total number of conversions by using time-domain interfaces (TDIs) and the employed ALBs, respectively; (3) we develop an only-once input read (O2IR) mapping method to further decrease the energy of input accesses and the number of D/A conversions. The evaluation with more than 10 CNN/DNN models and various chip configurations shows that TIMELY outperforms the baseline R2PIM accelerator, PRIME, by one order of magnitude in energy efficiency while maintaining better computational density (up to 31.2×) and throughput (up to 736.6×). Furthermore, comprehensive studies are performed to evaluate the effectiveness of the proposed ALB, TDI, and O2IR innovations in terms of energy savings and area reduction.

I. INTRODUCTION

While deep learning-powered Internet of Things (IoT) devices promise to revolutionize the way we live and work by enhancing our ability to recognize, analyze, and classify the world around us, this revolution has yet to be unleashed. IoT devices – such as smart phones, smart sensors, and drones – have limited energy and computation resources since they are battery-powered and have a small form factor. On the other hand, high-performance Convolutional/Deep Neural Networks (CNNs/DNNs) come at the cost of prohibitive energy consumption [58] and can have hundreds of layers [24] and tens of millions of parameters [62]. Therefore, CNN/DNN-based applications can drain the battery of an IoT device very quickly if executed frequently [71], and require an increase in form factor for storing and executing CNNs/DNNs [10], [56]. The situation continues to worsen due to the fact that

This work was supported in part by NSF 1816833, 1719160, 1725447, 1730309, and NSF grant CCF 1740352 and SRC nCORE NC-2766-A.

CNNs/DNNs are becoming increasingly complex as they are designed to solve more diverse and bigger tasks [32].

To close the gap between the constrained resources of IoT devices and the growing complexity of CNNs/DNNs, many energy-efficient accelerators have been proposed [1], [6], [9], [12], [30], [31]. As the energy cost of CNN/DNN accelerators is dominated by the memory accesses of inputs, weights, and partial sums (Psums) (see Fig. 1 (a)) (e.g., up to 95% in DianNao [12]), processing-in-memory (PIM) accelerators have emerged as a promising solution in which the computation is moved into the memory arrays and weight movements are eliminated (see Fig. 1 (b)). Among PIM accelerators built on various memory technologies [5], [13], [43], [44], [54], [56], [61], [73], resistive-random-access-memory-(ReRAM)-based PIM (R2PIM) accelerators have gained extensive research interest due to ReRAM's high density (e.g. 25×–50× higher than SRAM [68], [74]). However, the energy efficiency of R2PIM accelerators (such as PRIME [13], ISAAC [56], and PipeLayer [61]) is still limited by two bottlenecks (see Fig. 1 (b)): (1) although the weights are kept stationary in memory, the energy cost of the data movements due to inputs and Psums is still large (as high as 81% in PRIME [13]); (2) the energy of the interfacing circuits (such as the analog-to-digital converters (ADCs) and digital-to-analog converters (DACs)) is another limiting factor (as high as 61% in ISAAC [56]).

To address the aforementioned energy bottlenecks, we analyze and identify opportunities for greatly enhancing the energy efficiency of R2PIM accelerators (see Section III-A), and develop three novel techniques that strive to push data movements and interfaces in PIM accelerators towards local and in time domain (see Section III-B). While these three techniques are in general effective for enhancing the energy efficiency of PIM accelerators, we evaluate them in an R2PIM accelerator, and demonstrate an improvement in energy efficiency of one order of magnitude over state-of-the-art R2PIM accelerators. The contributions of this paper are as follows:

• We propose three new ideas for aggressively improving the energy efficiency of R2PIM accelerators: (1) adopting analog local buffers (ALBs) within memory crossbars to enhance (analog) data locality, (2) time-domain interfaces (TDIs) to reduce the energy cost of a single D/A (and A/D) conversion, and (3) a new mapping method called only-once input read (O2IR) to further reduce the number of input/Psum accesses and D/A conversions.

• We develop an innovative R2PIM architecture (see Section IV), TIMELY (Time-domain, In-Memory Execution, LocalitY), that integrates the three aforementioned ideas to (1) maximize (analog) data locality via the ALBs and O2IR and (2) minimize the D/A (and A/D) interfaces' energy cost by making use of the more energy-efficient TDIs, the ALBs, and the O2IR method. TIMELY outperforms the most competitive R2PIM accelerators in both energy efficiency (over PRIME) and computational density (over PipeLayer) (see Fig. 1 (c)).

Fig. 1. An illustration of (a) the "memory walls" in CNN/DNN accelerators due to the data movements of inputs, weights, and Psums, and an example of their energy breakdown [9], (b) the energy efficiency bottlenecks of PIM accelerators: (1) input and Psum movements (i.e. Bottleneck ①) and (2) the DAC/ADC interfacing (i.e. Bottleneck ②), and (c) benchmarking the energy efficiency and computational density of the proposed TIMELY against state-of-the-art CNN/DNN accelerators, including a non-PIM accelerator (Eyeriss [9]) and R2PIM accelerators (PRIME [13], ISAAC [56], and PipeLayer [61]). (Superscripts in (c): a = 8-bit MAC, b = 16-bit MAC; * excludes the area of off-chip DRAM.)

• We perform a thorough evaluation of TIMELY against 4 state-of-the-art R2PIM accelerators on >10 CNN and DNN models under various chip configurations, and show that TIMELY achieves up to an 18.2× improvement (over ISAAC) in energy efficiency, a 31.2× improvement (over PRIME) in computational density, and a 736.6× improvement (over PRIME) in throughput, demonstrating a promising architecture for accelerating CNNs and DNNs. Furthermore, we perform ablation studies to evaluate the effectiveness of each of TIMELY's features (i.e., ALB, TDI, and O2IR) in reducing energy and area costs, and demonstrate that TIMELY's innovative ideas can be generalized to other R2PIM accelerators.

II. BACKGROUND

This section provides the background of R2PIM CNN/DNN accelerators. First, we introduce CNNs and the input reuse opportunities in CNNs' convolutional (CONV) operations in Section II-A, and ReRAM basics in Section II-B. Second, we compare the digital-to-time converter (DTC)/time-to-digital converter (TDC) and the DAC/ADC, which are two types of digital-to-analog (D/A) and analog-to-digital (A/D) interfaces, in terms of energy cost and accuracy in Section II-C.

A. CNN and Input Reuse

A CNN is composed of multiple CONV layers. Given the CNN parameters in Table I, the computation in a CONV layer can be described as:

O[v][u][x][y] = Σ_{k=0}^{C−1} Σ_{i=0}^{G−1} Σ_{j=0}^{Z−1} I[v][k][Sx+i][Sy+j] × W[u][k][i][j] + B[u],
0 ≤ v < M, 0 ≤ u < D, 0 ≤ x < F, 0 ≤ y < E    (1)

where O, I, W, and B denote the matrices of the output feature maps, input feature maps, filters, and biases, respectively. Fully-connected (FC) layers typically follow the CONV layers. Different from CONV layers, the filters of FC layers are of the same size as the input feature maps [9]. Eq. (1) can describe FC layers with the additional constraints Z = H, G = W, S = 1, and E = F = 1.
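The loop nest behind Eq. (1) can be sketched as follows; the function name `conv_layer`, the nested-list layout, and the no-padding ("valid") sliding are illustrative assumptions, not the paper's mapping:

```python
# A direct loop-nest sketch of Eq. (1), using the dimension names from
# Table I: M batches, C/D input/output channels, G x Z filters, stride S.

def conv_layer(I, W, B, S):
    """I: M*C*H*Win inputs, W: D*C*G*Z filters, B: D biases, S: stride."""
    M, C, H, Win = len(I), len(I[0]), len(I[0][0]), len(I[0][0][0])
    D, G, Z = len(W), len(W[0][0]), len(W[0][0][0])
    F, E = (H - G) // S + 1, (Win - Z) // S + 1   # output height / width
    # Initialize every output pixel with its channel's bias B[u].
    O = [[[[B[u] for _ in range(E)] for _ in range(F)]
          for u in range(D)] for _ in range(M)]
    for v in range(M):
        for u in range(D):
            for x in range(F):
                for y in range(E):
                    for k in range(C):
                        for i in range(G):
                            for j in range(Z):
                                O[v][u][x][y] += (I[v][k][S * x + i][S * y + j]
                                                  * W[u][k][i][j])
    return O
```

With a 3×3 single-channel input, one 2×2 identity-diagonal filter, zero bias, and S = 1, each output is the sum of the two diagonal input pixels under the window.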

Three types of input reuse exist in CNN CONV operations yielding 3-D Psums. Consider the example in Fig. 2, where C and M are set to 1 for simplicity because input reuse is independent of them. First (see Fig. 2 (a)), one input feature map is shared by multiple (e.g. two in Fig. 2) output channels' filters. Second (see Fig. 2 (b)), as the filters slide horizontally, input pixels are reused to generate outputs in the same row – e.g. b and f are each used twice to generate w and x, respectively. Third (see Fig. 2 (c)), as the filters slide vertically, input pixels are reused to generate outputs in the same column – e.g. e and f are each used twice to generate w and y, respectively. Given a layer with D output channels, a filter size of Z, and a stride of S, each input pixel is reused DZ²/S² times [69]. For example, f is reused 8 times in Fig. 2, where D=2, Z=2, and S=1. Note that weights are private in DNN CONV operations, which is the main difference between CNNs and DNNs [77].
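The DZ²/S² reuse count can be checked by walking the CONV loop nest and counting how often each input coordinate is touched; `reuse_counts` and its argument names are illustrative, and the formula holds for interior pixels (border pixels see fewer windows):

```python
# Count how many MACs each input pixel of an H x W feature map feeds,
# for D output channels, Z x Z filters, and stride S.

from collections import Counter

def reuse_counts(H, W, D, Z, S):
    uses = Counter()
    F = (H - Z) // S + 1   # output rows
    E = (W - Z) // S + 1   # output columns
    for _u in range(D):    # every output channel reads the same input
        for x in range(F):
            for y in range(E):
                for i in range(Z):
                    for j in range(Z):
                        uses[(S * x + i, S * y + j)] += 1
    return uses
```

For Fig. 2's setting (4×4 input, D=2, Z=2, S=1), the interior pixel f at position (1, 1) is counted D·Z²/S² = 8 times, while the corner pixel a at (0, 0) appears in only one window per channel.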

    B. ReRAM Basics

ReRAM is a type of nonvolatile memory that stores data through resistance modulation [29], [40], [65], [66], [68]. A ReRAM cell with a metal-insulator-metal (MIM) structure consists of top/bottom electrodes and a metal-oxide layer [68]. Analog multiplication can be performed in ReRAM cells (see Fig. 3 (a)), with the biased voltages serving as inputs, the ReRAM cells' conductances as weights, and the resulting currents as outputs. Addition is realized through current summing among ReRAM cells of the same column [27], [72] – e.g. I1 = V1/R11 + V2/R12 in Fig. 3 (a). At the circuit level, digital inputs are read from the input memory, converted to analog voltages by DACs, and then applied to the ReRAM cells. The resulting analog Psums are converted to digital values by ADCs and then stored back into the output memory.
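The in-crossbar dot product can be sketched numerically as Ohm's-law multiplication plus Kirchhoff current summing per column; the function name, the R[i][j] row/column indexing (following Table I), and the values are illustrative, not device-accurate:

```python
# Ideal crossbar model: column current I_j = sum_i V_i / R[i][j],
# i.e. row voltages (inputs) weighted by cell conductances (weights).

def crossbar_column_currents(V, R):
    """V: row voltages; R: R[i][j] cell resistances; returns column currents."""
    rows, cols = len(R), len(R[0])
    return [sum(V[i] / R[i][j] for i in range(rows)) for j in range(cols)]
```

For V = [1.0, 2.0] V and resistances [[2, 4], [4, 2]] Ω, the first column carries 1/2 + 2/4 = 1.0 A and the second 1/4 + 2/2 = 1.25 A.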

C. DTC/TDC vs. DAC/ADC

As shown in Fig. 3 (b), a DTC/TDC can perform the conversion between an analog time signal and the corresponding digital signal; a DAC/ADC can do so between an analog voltage signal and the digital signal. One analog signal can be represented as a continuous delay with a fixed high/low voltage (corresponding to 1/0) in the time domain (e.g. Tx in Fig. 3 (b)) [2], [4], [7], [11], or as a discrete voltage in the voltage domain (e.g. Vx in Fig. 3 (b)). Compared with a DTC/TDC, which can be implemented using digital circuits [3], [41], [48]–[50], [75], a DAC/ADC typically relies on analog circuits that (1) are more power consuming, (2) are vulnerable to noise and to process, voltage, and temperature (PVT) variations, and (3) benefit much less from process scaling in energy efficiency [47].

Fig. 2. Illustrating the three types of input reuses.

Fig. 3. (a) ReRAM operation basics and (b) the two types of interfacing circuits.
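The time-domain encoding can be sketched behaviorally: a DTC maps an n-bit code to a delay of code × Tdel, and a TDC quantizes a delay back into unit delays. Tdel is the unit delay from Table I; the linear, ideal model below (and the function names) are illustrative and ignore circuit nonidealities:

```python
# Behavioral sketch of time-domain interfacing (cf. Fig. 3 (b)).

def dtc(code, n_bits, Tdel):
    """Digital-to-time: emit a pulse delayed by code * Tdel."""
    assert 0 <= code < 2 ** n_bits
    return code * Tdel

def tdc(delay, n_bits, Tdel):
    """Time-to-digital: count how many unit delays fit into the pulse."""
    return min(int(delay / Tdel), 2 ** n_bits - 1)
```

An 8-bit DTC spans codes 0–255, i.e. a full scale on the order of 256·Tdel (cf. the 8-bit DTC characteristic in Fig. 6 (f)), and an ideal round trip through `dtc` then `tdc` recovers the code.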

    III. OPPORTUNITIES AND INNOVATIONS

This section aims to answer the question of "how can TIMELY outperform state-of-the-art R2PIM accelerators?" by first identifying R2PIM accelerators' energy-saving opportunities that inspire us to propose TIMELY. Note that all parameters used in this section are summarized in Table I.

TABLE I
A SUMMARY OF PARAMETERS USED IN TIMELY

CNN Params | Description
M          | batch size of 3-D feature maps
C/D        | input / output channel
H/W        | input feature map height / width
Z/G        | filter height / width
S          | stride size
E/F        | output feature map height / width

Archi. Params | Description
B        | # of ReRAM bit cells in one crossbar array is B²
NCB      | # of ReRAM crossbar arrays in one sub-Chip is NCB²
Rij      | the resistance of the ReRAM bit cell at the ith row and jth column of a crossbar array
Ti       | the time input for the ith row of a ReRAM crossbar array
To,8b/4b | the time Psum for 8-bit inputs and 4-bit weights
VDD      | the logic high voltage of the time-domain signals
Vth      | the threshold voltage of a comparator
Cc       | the charging capacitance
Tdel     | the unit delay of a DTC/TDC
γ        | one DTC/TDC is shared by γ rows/columns in one ReRAM crossbar array
φ        | the reset phase of a sub-Chip (reset: φ=1)
χ        | the number of sub-Chips in one TIMELY chip
ε        | the potential error of one X-subBuf

Energy Params | Description
eDTC | the energy of one conversion in a DTC
eTDC | the energy of one conversion in a TDC
eDAC | the energy of one conversion in a DAC
eADC | the energy of one conversion in an ADC
eP   | the unit energy of accessing a P-subBuf
eX   | the unit energy of accessing an X-subBuf
eR2  | the unit energy of accessing the ReRAM input/output buffers

    A. Opportunities

We first identify three opportunities for greatly reducing the energy costs of R2PIM accelerators by analyzing the performance limitations in state-of-the-art designs. Specifically, Opportunity #1 is motivated by the energy bottlenecks of (1) input and Psum movements (i.e., Bottleneck ① in Fig. 1 (b)) and (2) the interfacing circuits (i.e., Bottleneck ② in Fig. 1 (b)); Opportunity #2 is inspired by the bottleneck of the interfacing circuits; and Opportunity #3 is motivated by both bottlenecks.

Opportunity #1. Enhancing (analog) data locality can greatly reduce the energy/time costs of both the data movements and the D/A and A/D interfaces. We identify this opportunity based on the following considerations. Since in-ReRAM processing computes in the analog domain, the operands, including inputs, weights, and Psums, are all analog. If we can mostly access analog operands locally, we can expect large energy savings on input and Psum movements and can largely remove the need to activate the D/A and A/D interfaces. In prior works, the input/Psum movements and interfaces dominate the energy cost of R2PIM accelerators. First, input and Psum accesses involve energy-hungry data movements. While weights stay stationary in R2PIM accelerators, input and Psum accesses are still needed. Although one input/Psum access can be shared by B ReRAM cells in the same row/column, a large number of input and Psum accesses are still required for the dot-product operations in a B×B ReRAM crossbar array. For example, more than 55 million inputs and 15 million Psums need to be accessed during VGG-D [60] and ResNet-50 [24] inference, respectively (see Fig. 4 (a)). While inputs require only a memory read, Psums involve both a memory write and a read, resulting in a large energy cost. As an example, 36% and 47% of the total energy in PRIME [13] is spent on input and Psum accesses, respectively (see Fig. 4 (b)). Second, voltage-domain D/A and A/D conversions involve a large energy cost. For example, in PRIME, aside from the data movement energy, most of the remaining energy is consumed by D/A and A/D conversions (see Fig. 4 (b)).

Opportunity #2. Time-domain interfacing can reduce the energy cost of a single D/A (and A/D) conversion. Since time-domain D/A and A/D conversion is more energy efficient than voltage-domain conversion (see Section II-C), we have an opportunity to use DTCs and TDCs for interfacing between the digital signals stored in memory and the analog signals computed in the ReRAM crossbar arrays. In prior works, DACs and ADCs limit the energy efficiency of R2PIM accelerators.

Fig. 4. (a) The number of input/Psum accesses (all CONV layers), (b) the energy breakdown of PRIME [13], and (c) the energy breakdown of ISAAC [56].

Although ISAAC optimizes the energy cost of the DAC/ADC interface, the interface energy is still as large as 61% of the total in ISAAC (see Fig. 4 (c)). Specifically, ISAAC [56] decreases the number of ADCs by sharing one ADC among 128 ReRAM bitlines; the ADC sampling rate thus increases by 128×, increasing the energy cost of each A/D conversion.

Opportunity #3. Reducing the number of input accesses can save the energy cost of both the input accesses and the D/A conversions. We find that the input reuse of CNNs can still be improved over the prior works, reducing the energy overhead of input accesses and the corresponding interfaces. Though each input connected to one row of a ReRAM array is naturally shared by the B ReRAM cells along that row, each input on average has to be accessed DZ²/S²/B times. Taking ISAAC [56] as an example, one 16-bit input involves DZ²/S²/B times the unit eDRAM read energy (i.e. 4416× the energy of a 16-bit ReRAM MAC), the input register file read energy (i.e. 264.5× the energy of a 16-bit ReRAM MAC), and the D/A conversion energy (i.e. 109.7× the energy of a 16-bit ReRAM MAC). For MSRA-3 [25], adopted by ISAAC, each input of the CONV layers is read, and the interfaces activated, 47 times on average.

    B. TIMELY Innovations

The three aforementioned opportunities inspire us to develop the three innovations in TIMELY for greatly improving acceleration energy efficiency. Fig. 5 (a) and (b) show a conceptual view of the difference between existing R2PIM accelerators and TIMELY. Specifically, TIMELY mostly moves data in the analog domain, as compared to the fully digital data movements in the existing designs, and adopts DTCs and TDCs instead of DACs and ADCs for interfacing.

Innovation #1. TIMELY adopts ALBs to aggressively enhance (analog) data locality, leading to about an NCB× reduction in data movement energy cost per input and per Psum compared with existing designs, assuming a total of NCB × NCB crossbars in each sub-Chip (multiple sub-Chips compose one chip). One key difference between TIMELY and existing R2PIM designs resides in their sub-Chip design (see Fig. 5 (a) vs. (b)). Specifically, each crossbar (i.e. CB in Fig. 5) in existing designs fetches inputs from a high-cost memory (e.g. the input buffers in Fig. 5 (a)). Therefore, for each sub-Chip, there is an energy cost of B·NCB²·eR2 for accessing B·NCB² inputs. In TIMELY (see Fig. 5 (b)), an input fetched from the high-cost memory is shared by one row of the sub-Chip thanks to the adopted ALBs (e.g. the X-subBufs in Fig. 5 (b)) that are sandwiched between the crossbar arrays, resulting in an energy cost of B·NCB·eR2 + B·NCB²·eX for handling the same number of inputs and leading to an energy reduction of NCB× per input (see Fig. 5 (c)). Similarly, each crossbar in existing R2PIM accelerators directly writes and reads Psums to and from the high-cost output buffers, whereas in TIMELY the Psums in each column of the sub-Chip are accumulated before being written back to the output buffers, leading to an energy cost reduction of NCB× per Psum (see Fig. 5 (c)). Furthermore, accessing the high-cost memory costs about one order of magnitude more energy than accessing a local buffer. Specifically, the average energy of one high-cost memory access in PRIME is about 9× and 33× higher than that of the P-subBufs and X-subBufs [74] in TIMELY, respectively. NCB is typically >10 (e.g. NCB = 12 in PRIME). Therefore, about an NCB× energy reduction for handling input/Psum accesses can be achieved in TIMELY. Additionally, the much reduced input/output buffer size requirements in TIMELY make it possible to eliminate inter-sub-Chip memory (see Fig. 6 (a) and Fig. 9 (c)), leading to additional energy savings.

Fig. 5. A high-level view of (a) a sub-Chip within a chip of state-of-the-art R2PIMs and (b) TIMELY's sub-Chip, (c) the energy cost per input and per Psum in state-of-the-art R2PIMs and TIMELY (data access: eX + eR2/NCB per input and eP + 2eR2/NCB per Psum in TIMELY; interfacing: eDTC/NCB per input and eTDC/NCB per Psum instead of eDAC and eADC), and (d) the normalized energy of different data accesses and interfaces (normalized: eR2 = 1, eP = 0.11, eX = 0.03; eDTC = 0.02·eDAC; eTDC = 0.05·eADC), where eR2, eX, and eP are the unit energies of accessing the ReRAM input/output buffers, X-subBuf, and P-subBuf, respectively, while eDAC, eADC, eDTC, and eTDC denote the energy of one DAC, ADC, DTC, and TDC conversion [13], [39], [42], [49], [56], [63], respectively.
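The "about NCB×" claim for inputs can be checked with a back-of-envelope model; the unit energies below are normalized values consistent with the text (a high-cost buffer access ≈ 33× an X-subBuf access, so eX ≈ 0.03 when eR2 = 1) and PRIME's NCB = 12, and are illustrative rather than measured:

```python
# Per-input data-access energy: existing designs pay one high-cost
# buffer read per crossbar; TIMELY shares one read across a row of
# NCB crossbars, adding a cheap local X-subBuf access per crossbar.

def per_input_access_energy(eR2, eX, NCB):
    existing = eR2               # every crossbar reads the big buffer
    timely = eR2 / NCB + eX      # one shared read + one local access
    return existing, timely

existing, timely = per_input_access_energy(eR2=1.0, eX=0.03, NCB=12)
saving = existing / timely       # approaches NCB as eX -> 0
```

With these numbers the saving works out to roughly 8.8×, i.e. "about NCB×" once the local-buffer access is an order of magnitude cheaper than the high-cost memory.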

Innovation #2. TIMELY adopts TDIs and ALBs to minimize the energy cost of a single conversion and the total number of conversions, respectively. As a result, TIMELY reduces the interfacing energy cost per input and per Psum by q1·NCB and q2·NCB, respectively, compared with current practices, where q1 = eDAC/eDTC and q2 = eADC/eTDC. It is well recognized that the energy cost of the ADC/DAC interfaces is another bottleneck in existing R2PIM accelerators, in addition to that of the data movements. For example, the energy cost of the ADCs and DACs in ISAAC accounts for >61% of its total energy cost. In contrast, TIMELY adopts (1) TDCs/DTCs instead of ADCs/DACs to implement the interfacing circuits of the crossbars and (2) only one TDC/DTC conversion for each row/column of one sub-Chip, whereas each row/column of a crossbar needs one ADC/DAC conversion in existing designs, leading to a total of q1·NCB× and q2·NCB× reduction per input and per Psum, respectively, compared to existing designs. Specifically, q1 and q2 are about 50 and 20 [39], [42], [49], [56], [63], respectively.
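The interfacing arithmetic above is multiplicative: a cheaper conversion (q1 or q2) times fewer conversions (one per sub-Chip row/column instead of one per crossbar row/column, i.e. ÷NCB). A small sketch with the quoted values, as an illustrative simplification:

```python
# Innovation #2's interfacing-energy reduction per input (D/A side)
# and per Psum (A/D side): q1 = eDAC/eDTC, q2 = eADC/eTDC.

def interfacing_reduction(q1, q2, NCB):
    per_input = q1 * NCB   # cheaper DTC conversion x NCB fewer of them
    per_psum = q2 * NCB    # cheaper TDC conversion x NCB fewer of them
    return per_input, per_psum

per_input, per_psum = interfacing_reduction(q1=50, q2=20, NCB=12)
# -> 600x per input and 240x per Psum for NCB = 12
```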

    Innovation #3. TIMELY employs O2IR to further reducethe number and thus energy cost of input accesses and D/Aconversions. As accessing the input and output buffers in sub-

    4

  • t

    (f)

    In Out

    GNDIin=Iout

    To X-subBufs

    ...

    ...

    ...

    T1

    T2

    TB

    Io1

    ...

    ...

    ...

    ...

    ...


Fig. 6. An illustration of (a) the TIMELY architecture; (b) ① X-subBuf; (c) ② P-subBuf; (d) ③ I-adder; (e) ④ ReRAM crossbar located in the first crossbar column and last row of a sub-Chip, and ⑤ charging units and comparators; (f) the input/output characteristic of an 8-bit DTC; and (g) the input/output characteristic of dot-product operations and charging & comparison units in the leftmost ReRAM column of a sub-Chip.

Chips costs about one order of magnitude more energy than accessing local buffers between the crossbar arrays (see the left part of Fig. 5 (d)), we propose an O2IR strategy to increase input reuse opportunities, minimizing the cost of input accesses and the associated D/A conversions.

    IV. TIMELY ARCHITECTURE

In this section, we first show an architecture overview (see Section IV-A), and then describe how the TIMELY architecture integrates the three innovations for aggressively improving energy efficiency in Sections IV-B, IV-C, and IV-D, respectively. In addition, we introduce our pipeline design for enhancing throughput in Section IV-E and the software-hardware interface design for offering programmability in Section IV-F. The parameters used in TIMELY are summarized in Table I.

    A. Overview

Fig. 6 (a) shows the TIMELY architecture, which consists of a number of sub-Chips connected via a bus [13], [56]. Specifically, each sub-Chip includes DTCs/TDCs (on the left/at the bottom), ReRAM input/output buffers (on the left/at the bottom), ReRAM crossbars (see ④ in Fig. 6 (a)), each having B×B bit cells, a mesh grid of local ALB buffers between the ReRAM crossbar arrays – i.e., X-subBufs (see ① in Fig. 6 (a)) and P-subBufs (see ② in Fig. 6 (a)) – current adders (i.e., I-adders, ③ in Fig. 6 (a)), and a block of shift-and-add, ReLU, and max-pooling units.

The TIMELY architecture processes CNN/DNN inference as follows. The pre-trained weights are pre-loaded into TIMELY's ReRAM arrays. Inputs of the CNN/DNN layers are fetched into the input buffers of the sub-Chip(s) that handle the corresponding layers, starting from the first CNN/DNN layer. Within each sub-Chip, the inputs are applied to the DTCs, which convert the digital inputs into analog time signals; these are then shared by the ReRAM bit cells in the same row of all crossbar arrays along the horizontal direction to perform dot products with the corresponding resistive weights. The calculated Psums at the same column of all crossbars in the vertical direction are aggregated in I-adders, converted into a voltage signal and then an analog time signal by a block of charging units and comparators (see ⑤ in Fig. 6 (a)), and finally converted into a digital signal via a TDC. Note that the output of each P-subBuf is connected to the I-adder separately. Finally, the resulting digital signals are applied to the block of shift-and-add, ReLU, and max-pooling units, and then written to the output buffers.

    B. Enhancing (Analog) Data Locality

Within each sub-Chip of TIMELY, after the digital inputs are converted into time signals by DTCs and before the Psums are converted into digital signals by TDCs, the converted inputs and calculated Psums are moved in the analog domain with the aid of the adopted ALBs (see Fig. 6 (a)). In this subsection, we first introduce the data movement mechanism and then present the operation of the local analog buffers.

Data Movement Mechanism. In the TIMELY architecture, time inputs from the DTCs move horizontally across the ReRAM crossbar arrays in the same row via X-subBufs (see ① in Fig. 6 (a)), maximizing input reuse and minimizing high-cost memory accesses. Meanwhile, the resulting current Psums move vertically via P-subBufs (see ② in Fig. 6 (a)). Note that only the crossbars in the leftmost column fetch inputs from DTCs, while those in all the remaining columns fetch inputs from their analog local time buffers (i.e., the X-subBufs to their left). Similarly, only the outputs of the crossbars in the bottommost row are converted into digital signals via TDCs before being stored back into the output buffers, while those in all the remaining rows are passed into the crossbars right below via analog current buffers (i.e., the P-subBufs right below them). In this way, TIMELY processes most data movements in the analog domain within each sub-Chip, greatly enhancing data locality and thus improving energy efficiency and throughput.
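The routing rule above can be stated in a few lines. This is a minimal sketch with hypothetical helper names (`input_source`, `output_sink` are not from the paper); it only encodes the boundary condition that the leftmost column talks to DTCs and the bottommost row talks to TDCs:

```python
# Sketch of TIMELY's intra-sub-Chip routing rule: only the leftmost crossbar
# column takes inputs from DTCs and only the bottommost row drives TDCs;
# everything else moves through the analog local buffers (ALBs).

def input_source(col):
    """Where a crossbar at column `col` (0 = leftmost) gets its time inputs."""
    return "DTC" if col == 0 else "X-subBuf (left neighbor)"

def output_sink(row, n_rows):
    """Where a crossbar at row `row` (0 = topmost) sends its current Psums."""
    return "TDC" if row == n_rows - 1 else "P-subBuf (below)"

assert input_source(0) == "DTC"
assert input_source(3) == "X-subBuf (left neighbor)"
assert output_sink(2, 3) == "TDC"
assert output_sink(0, 3) == "P-subBuf (below)"
```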

Local Analog Buffers. The local analog buffers make it possible to handle most (analog) data movements locally in TIMELY. Specifically, an X-subBuf buffers the time signals (i.e., outputs of DTCs) by latching them, i.e., copying the input delay time to the latch outputs (see Fig. 6 (b)), while a P-subBuf buffers the current signal output from a ReRAM crossbar array, i.e., copying the input current to its outputs (see Fig. 6 (c)). The key is that X-subBufs and P-subBufs are more energy and area efficient than input/output buffers (see Fig. 5 (d)). In particular, an X-subBuf consists of two cross-coupled inverters that form a positive feedback loop to speed up the response at its output and thus reduce the delay between its inputs and outputs [67]. Since cross-coupled inverters invert the input, a third inverter is used to invert the signal back. X-subBufs are reset in each pipeline-cycle by setting φ high (see Fig. 6 (b)). The P-subBuf is implemented using an NMOS-pair current mirror (see Fig. 6 (c)) [38].

    C. Time-Domain Dot Products and DTCs/TDCs Interfacing

TIMELY performs dot products with time-domain inputs from the DTCs and converts the time-domain dot-product results into digital signals via TDCs. In this subsection, we first present dot-product operations in TIMELY and then introduce their associated DTCs/TDCs.

Dot Products. First, let us consider Psums in one ReRAM crossbar array. Take the first column of the ReRAM crossbar array in Fig. 6 (e) as an example. A total of B time inputs Ti (i = 1, 2, ..., B) are applied to their corresponding ReRAM bit cells with resistance values R1i (i.e., weights) to generate a Psum current (i.e., a time-controlled current) based on Kirchhoff's Law. Then, let us focus on Psums in one sub-Chip. The Psum currents at the same column of all NCB crossbars in the vertical direction are aggregated in an I-adder [35] (see ③ in Fig. 6 (a)) and then converted into a voltage signal Vo1 by charging a capacitor (e.g., Cc in Fig. 6 (e)). Note that TIMELY accumulates NCB Psums by charging the capacitor with the output current of the I-adders; the current value is set by the weight and the charging time by the time input. Based on charge conservation, we can derive the time-domain Psum To,8b/4b (see To1 in Fig. 6 (e)), where 8b/4b denotes 8-bit inputs and 4-bit weights, to be:

To,8b/4b = (Cc · Vth) / (VDD · ∑_{i=1}^{B·NCB} Ti / R1i)    (2)

where VDD is the logic-high voltage of the time-domain signals and Vth is the threshold voltage of the comparators (see ⑤ in Fig. 6 (e)). Fig. 6 (g) shows the input/output characteristic of the dot-product and charging-and-comparison operations. To handle the inversely proportional output, TIMELY adds a reciprocal function before sending the computed outputs to adders or to the ReLU (if addition is not needed).
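The inverse relationship in Eq. (2) can be checked numerically. The sketch below uses arbitrary illustrative device values (the capacitance, voltages, times, and resistances are assumptions, not TIMELY's actual parameters) and shows that applying a reciprocal to the output time recovers the dot product ∑ Ti/R1i:

```python
# Numerical sketch of Eq. (2): the time-domain Psum is inversely
# proportional to the dot product, so a reciprocal step recovers it.
# All device values are illustrative assumptions.

def psum_time(T, R, Cc, Vth, VDD):
    """To = Cc*Vth / (VDD * sum(Ti/R1i)), per Eq. (2)."""
    return Cc * Vth / (VDD * sum(t / r for t, r in zip(T, R)))

def dot_from_time(To, Cc, Vth, VDD):
    """Reciprocal step: dot = Cc*Vth / (VDD * To)."""
    return Cc * Vth / (VDD * To)

Cc, Vth, VDD = 1e-12, 0.6, 1.2   # assumed capacitance / voltages
T = [10e-9, 20e-9]               # time-encoded inputs
R = [1e5, 2e5]                   # bit-cell resistances (weights)

To = psum_time(T, R, Cc, Vth, VDD)
dot = dot_from_time(To, Cc, Vth, VDD)
# The reciprocal recovers sum(Ti/R1i) = 2e-13 exactly (up to float error).
assert abs(dot - sum(t / r for t, r in zip(T, R))) < 1e-18
```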

To realize dot products with 8-bit weights and inputs, we employ a sub-ranging design [20], [45], [79] in which each 8-bit weight is mapped into two adjacent columns of bit cells holding its 4 most significant bits (MSBs) and 4 least significant bits (LSBs), respectively. The charging capacitors associated with the MSB-weight and LSB-weight columns are Cc and Cc/2, respectively. The To,8b/4b values of the MSB-weight and LSB-weight columns are added after reciprocal operations to obtain the dot-product result for 8-bit weights.

DTCs/TDCs. We adopt 8-bit DTCs/TDCs for TIMELY based on the measurement-validated designs in [42], [49]. The input/output characteristic of an 8-bit DTC is shown in Fig. 6 (f), where the digital codes "11111111" and "00000000" correspond to the time-domain analog signals with the minimum and maximum delays, respectively, and the dynamic range of the time-domain analog signals is 256×Tdel, with Tdel being the unit delay. Meanwhile, a TDC's input/output characteristic can be obtained from Fig. 6 (g) by swapping the V and t axes. In TIMELY, Tdel is designed to be 50 ps, leading to a conversion time of 25 ns (including a design margin) for the 8-bit DTC/TDC. In addition, to trade off energy efficiency and computational density, one DTC/TDC is shared by γ (γ ≥ 1) ReRAM crossbar rows/columns.

D. TIMELY's Only-Once Input Read Mapping Method

O2IR follows three principles: (1) to reuse inputs across different filters, we map these filters in parallel within the crossbar arrays (see Fig. 7 (a)); (2) to reuse inputs when sliding a filter vertically within an input feature map, we duplicate the filters with a shifted offset equal to Z×S (see Fig. 7 (b)), where Z and S are the filter height (= width) and the stride, respectively; and (3) to reuse inputs when sliding a filter horizontally within an input feature map, we transfer inputs to the adjacent X-subBufs with a step equal to S (see Fig. 7 (c)). Single-direction input transfer between adjacent X-subBufs can be implemented by adding only one switch and one control signal to each X-subBuf.
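Principles (2) and (3) imply a concrete input-read count per filter sweep. The sketch below uses hypothetical helper functions (not from the paper); the pixel counts for Fig. 7's running example (4×4 input, 3×3 filter, stride 1, two vertically duplicated filter copies) are taken from the figure annotations ("fetch 12 pixels", "only fetch 4 new pixels"):

```python
# Sketch of O2IR's input reads for one horizontal sweep of a Z x Z filter
# with stride S over a W-column input map, with D vertically duplicated
# filter copies computed per pipeline-cycle.

def o2ir_reads(W, Z, S, D):
    rows = Z + (D - 1) * S           # input rows covered per pipeline-cycle
    first = rows * Z                 # first cycle: fetch the full window
    steps = (W - Z) // S             # remaining horizontal filter positions
    return first + steps * rows * S  # each step fetches only rows*S new pixels

def naive_reads(W, Z, S, D):
    # Without reuse, every filter position re-reads its whole Z x Z window
    # (times the D vertical positions handled per cycle).
    steps = (W - Z) // S + 1
    return steps * D * Z * Z

# Fig. 7's example: 12 pixels in the first cycle, then only 4 new ones.
assert o2ir_reads(4, 3, 1, 2) == 12 + 4
assert naive_reads(4, 3, 1, 2) == 2 * 2 * 9
```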

    E. Pipeline Design

    To enhance throughput, we adopt pipeline designs betweenand within sub-Chips, i.e., inter-sub-Chip and intra-sub-Chip


Fig. 7. The proposed O2IR: (a) mapping filters using the same inputs into the same rows of crossbars; (b) duplicating filters with a vertical offset of Z×S between adjacent ReRAM columns; and (c) temporally shifting inputs by an amount equal to S.


pipeline. Different sub-Chips work in a pipelined fashion. Note that a layer-by-layer weight mapping strategy is adopted in TIMELY, where one CNN/DNN layer is mapped into one sub-Chip if the ReRAM crossbars' capacity is large enough; otherwise, the layer is mapped into multiple sub-Chips. Within one sub-Chip, the following operations are pipelined: reading inputs from input buffers, DTC conversion, analog-domain computation (including dot-product and charging-and-comparison operations), TDC conversion, and writing back to output buffers. The pipeline-cycle time is determined by the slowest stage. Let us take the operations within one sub-Chip as an example to illustrate the pipeline in TIMELY. Assuming the first data is read from the input buffers at the first cycle, it spends three cycles to complete digital-to-time conversion, analog-domain computation, and time-to-digital conversion, and is written back to the output buffers at the fifth cycle. Meanwhile, at the fifth cycle, the fifth, fourth, third, and second data are read, converted by a DTC, computed in the analog domain, and converted by a TDC, respectively.
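The five-stage schedule described above can be expressed compactly. A small sketch follows (the stage names come from the text; `stage_of` is a hypothetical helper, and the schedule is the standard in-order pipeline pattern):

```python
# Sketch of TIMELY's intra-sub-Chip pipeline occupancy: data item i enters
# stage s at cycle i + s, so five items are in flight at steady state.

STAGES = ["read", "DTC", "analog compute", "TDC", "write"]

def stage_of(data_idx, cycle):
    """Stage occupied by data item `data_idx` (1-based) at `cycle` (1-based),
    or None if the item is not in flight."""
    s = cycle - data_idx
    return STAGES[s] if 0 <= s < len(STAGES) else None

# At cycle 5, data 1 is written back while data 2..5 fill the other stages,
# matching the walk-through in the text.
assert stage_of(1, 5) == "write"
assert stage_of(2, 5) == "TDC"
assert stage_of(3, 5) == "analog compute"
assert stage_of(4, 5) == "DTC"
assert stage_of(5, 5) == "read"
```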

    F. Software-Hardware Interface

A software-hardware interface is adopted to allow developers to configure TIMELY for different CNNs/DNNs, enabling programmability. Similar to the interface in PRIME, three stages are involved from software programming to hardware execution. First, the CNN/DNN is loaded into an NN parser [78] that automatically extracts the model parameters. Second, with the extracted parameters, a compiler optimizes the mapping strategy to increase the utilization of the ReRAM crossbar arrays and then generates execution commands (including commands for weight mapping and input data path configuration). Third, the controller (see Fig. 6 (a)) loads the commands from the compiler to (1) write the pre-trained weights to the mapped addresses and (2) configure the peripheral circuits to set up the input paths for computation.

V. DISCUSSION

Although local buffers have been adopted in digital accelerators [9], [77], using local buffers in R2PIMs is challenging because (1) an improper design can largely compromise R2PIMs' high computational density, and (2) it may cause more frequent, high-overhead A/D and D/A conversions. To the best of our knowledge, TIMELY is the first to implement and maximize analog data locality via ALBs, which have at least one order of magnitude lower access energy cost than the two-level memories in PRIME [13]/ISAAC [56]/PipeLayer [61]. Additionally, TIMELY maximizes data locality without degrading R2PIMs' computational density. Although a recent R2PIM accelerator, CASCADE [14], has adopted analog buffers, it uses analog ReRAM buffers only to reduce the number of A/D conversions and thereby the computation energy; TIMELY uses ALBs to minimize both computation energy and data movement energy. Taking PRIME as an example, computation energy accounts for only 17% of the chip energy. To minimize computation energy, TIMELY not only reduces the number of A/D conversions via ALBs, but also decreases the energy of each A/D conversion via TDCs.

Analog computation and local buffers are efficient, but they potentially introduce accuracy loss to TIMELY. The accuracy loss is mainly attributed to the non-ideal characteristics of analog circuits. To address this challenge, TIMELY not only leverages CNNs'/DNNs' algorithmic resilience to counter hardware vulnerability [8], [46], [76], but also minimizes the potential errors introduced by hardware, thereby achieving a favorable trade-off between energy efficiency and accuracy. First, we choose time and current signals to minimize potential errors: compared with analog voltage signals, analog current signals and digitally implemented time signals tolerate larger errors caused by their loads, and analog time signals are less sensitive to noise and PVT variations [47]. Second, the adopted ALBs help improve the accuracy of the time inputs and Psums by increasing the driving ability for their loads. However, the larger the number of ALBs, the smaller the number of ReRAM crossbar arrays in a sub-Chip, compromising computational density. Based on system-level evaluations, we adopt one X-subBuf between each pair of neighboring ReRAM crossbar arrays and one P-subBuf between each ReRAM crossbar array and its I-adder, achieving a good trade-off between accuracy loss and computational density reduction. Third, we limit the number of cascaded X-subBufs in the horizontal direction to reduce the accumulated errors (including noise) of the time inputs, which can then be tolerated by a design margin for the time inputs: we assign a design margin (i.e., more than 40 ps) for the unit delay (i.e., 50 ps) of the DTC conversion. We do not cascade P-subBufs, to avoid introducing errors in the Psums.

TIMELY adopts pipeline designs to address the speed limit of time-signal operations and thus improve throughput. Adjusting the number of ReRAM rows/columns shared by one DTC/TDC allows trading off TIMELY's throughput against its computational density. TIMELY compensates for the area increase due to O2IR's shifted weight duplication (see Fig. 7 (b) and (c)) by saving peripheral-circuit area. Besides, thanks to ReRAM's high-density storage, TIMELY also replicates weights to improve computation parallelism and thus throughput, similar to prior designs [13], [56], [61].

VI. EVALUATION

In this section, we first introduce the experimental setup and then compare TIMELY with state-of-the-art designs in terms of energy efficiency, computational density, and throughput. After that, we demonstrate the effectiveness of TIMELY's key features – ALB, TDI, and O2IR – and show that these features are generalizable. Finally, we discuss area scaling.

    A. Experiment Setup

TIMELY Configuration. For a fair comparison with PRIME/ISAAC, we adopt PRIME/ISAAC's parameters, including ReRAM and ReLU parameters from PRIME [13], and max-pooling operations (scaled up to 65 nm) and HyperTransport links from ISAAC [56] (see Table II). For TIMELY's specific components, we use silicon-verified results [42], [49] for


TABLE II
TIMELY PARAMETERS.

TIMELY sub-Chip (component | params/spec | energy (fJ)/compo. | area (µm²)/compo.):
  DTC                   | resolution 8 bits, number 16×32          |  37.5 | 240
  ReRAM crossbar        | size 256×256, number 16×12, 4 bits/cell  |  1792 | 100
  Charging + comparator | number 12×256                            |  41.7 |  40
  TDC                   | resolution 8 bits, number 12×32          |   145 | 310
  X-subBuf              | number 12×16×256                         |  0.62 |   5
  P-subBuf              | number 15×12×256                         |   2.3 |   5
  I-adder               | number 12×256                            |  36.8 |  40
  ReLU                  | number 2                                 |   205 | 300
  MaxPool               | number 1                                 |   330 | 240
  Input buffer          | size/number 2 KB / 1                     | 12736 |  50
  Output buffer         | size/number 2 KB / 1                     | 31039 |  50
  Total: 0.86 mm²

TIMELY chip (40 MHz):
  sub-Chip              | number 106ᵃ, 0.86 mm² each; total 91ᵃ mm²

Inter chips:
  Hyper link            | 1 link at 1.6 GHz, 6.4 GB/s bandwidth    |  1620 | 5.7 mm²

ᵃ Scaling TIMELY to an area of 0.86χ mm² by adjusting the number of sub-Chips (i.e., χ) based on applications.

DTCs and TDCs, and adopt Cadence-simulated results for the X-subBuf, P-subBuf, I-adder, charging circuit, and comparator based on [36], [38], [67], including their drivers and loads during simulation. The supporting digital units (reciprocal, shifter, and adder) consume negligibly small amounts of area and energy. All design parameters of the peripheral circuits are based on a commercial 65 nm CMOS process. The power supply is 1.2 V, and the clock rate is 40 MHz. The reset phase φ in Fig. 6 is 25 ns. The pipeline-cycle time is determined by the latency of 8 (setting γ to 8) DTCs/TDCs, which is larger than that of the other pipelined operations. The latencies of reading the corresponding inputs, analog-domain computation, and writing outputs back to the output buffers are 16 ns [22], 150 ns [22], and 160 ns [22], respectively. In addition, the I-adders and their inputs do not contribute to the total area because we place the I-adders and the interconnections between each P-subBuf and I-adder under the charging capacitors and ReRAM crossbars, leveraging different IC layers. We adopt 106 sub-Chips in the experiments for a fair comparison with the baselines (e.g., TIMELY vs. ISAAC: 91 mm² vs. 88 mm²).
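A quick sanity check on the pipeline-cycle time: with γ = 8 and a 25 ns conversion, the DTC/TDC stage takes 8 × 25 ns = 200 ns, longer than the 16/150/160 ns of the other stages, so it sets the cycle time (matching the 200 ns per pipeline-cycle quoted in the throughput comparison later). In sketch form:

```python
# Back-of-envelope check that the DTC/TDC stage is the pipeline bottleneck.
# Latencies are the ones quoted in the text.

gamma = 8                 # ReRAM rows/columns sharing one DTC/TDC
t_conv = 25e-9            # one 8-bit DTC/TDC conversion (incl. margin), s
stage_latencies = {
    "read inputs": 16e-9,
    "DTC/TDC": gamma * t_conv,   # 8 serialized conversions
    "analog compute": 150e-9,
    "write outputs": 160e-9,
}
cycle = max(stage_latencies.values())
bottleneck = max(stage_latencies, key=stage_latencies.get)
assert abs(cycle - 200e-9) < 1e-12 and bottleneck == "DTC/TDC"
```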

Methodology. We first compare TIMELY with four state-of-the-art R2PIM accelerators (PRIME [13], ISAAC [56], PipeLayer [61], and AtomLayer [54]) in terms of peak energy efficiency and computational density. For this set of experiments, the performance data of the baselines are those reported in the corresponding papers. Second, for the evaluation on various benchmarks, we consider only PRIME [13] and ISAAC [56] because (1) there is a lack of detailed design information for obtaining results for PipeLayer [61] and AtomLayer [54], and (2) more importantly, such a comparison is sufficient given that PRIME [13] is the most competitive baseline in terms of energy efficiency (see Fig. 1 (c)). For this set of evaluations, we build an in-house simulator to evaluate the energy and throughput of PRIME, ISAAC, and TIMELY. Before using our simulator, we validate it against PRIME's simulator [13] and ISAAC's analytical calculations [56]: we set up our simulator to mimic PRIME and ISAAC and compare its results with their original ones. The resulting errors of the energy and throughput evaluations are 8% and zero, respectively, which is acceptable given TIMELY's one order of magnitude improvement in energy efficiency (see Section VI-B). Due to the lack of ISAAC's mapping information, we validate our simulator only against PRIME's simulator to obtain the energy error, by adopting PRIME's component parameters and weight mapping strategy [13] in our simulator. Since PRIME does not support an inter-layer pipeline, we validate our simulator only against ISAAC's analytical calculations to obtain the throughput error, by using ISAAC's component parameters and balanced inter-layer pipeline [56] in our simulator. The inter-layer pipeline corresponds to TIMELY's inter-sub-Chip pipeline.

TABLE III
ADOPTED BENCHMARKS AND DATASETS.

Benchmarks                             | Why consider these CNN/DNN models
VGG-Dᵃ, CNN-1ᵇ, MLP-Lᵇ                 | For a fair comparison with PRIME (i.e., the benchmarks in [13])
VGG-1/-2/-3/-4ᵃ, MSRA-1/-2/-3ᵃ         | For a fair comparison with ISAAC (i.e., the benchmarks in [56])
ResNet-18/-50/-101/-152ᵃ, SqueezeNetᵃ  | To show TIMELY's performance on diverse and more recent CNNs

ᵃ ImageNet ILSVRC dataset [15]; ᵇ MNIST dataset [16]

Benchmarks. We evaluate TIMELY on a total of 15 benchmarks (see Table III), which fall into three groups: (1) the CNN and DNN models adopted by PRIME [13], including VGG-D [60], CNN-1 [13], and MLP-L [13]; (2) the CNNs used by ISAAC [56], including the VGG models (VGG-1/-2/-3/-4) [60] and MSRA models (MSRA-1/-2/-3) [25]; and (3) more recent CNNs, including a set of ResNets (ResNet-18/-50/-101/-152) [24] and SqueezeNet [28]. We adopt these models either for a fair comparison with the baselines or for evaluating TIMELY's performance on more recent CNNs. Together, these models cover various model structures and thus different computational and data movement patterns, enabling a thorough evaluation of TIMELY against state-of-the-art designs. Specifically, VGG [60] is known for its large redundancy, ResNet [24] has a deeper structure that results in a large number of parameters, SqueezeNet [28] is a compact model, and MLP-L [13] consists of only FC layers.

    B. Evaluation Results

We first evaluate TIMELY's peak energy efficiency and computational density against those reported in [13], [56], [61], and [54]. Next, we evaluate TIMELY's energy efficiency and throughput on various CNN and DNN models. Finally, we present an energy breakdown analysis to validate the effectiveness of TIMELY's innovations, and demonstrate that TIMELY's principles can be generalized to state-of-the-art R2PIM accelerators to further improve their performance.


[Fig. 8 plots omitted. Inset of (b): crossbars' number in one chip – PRIME: 1024, ISAAC: 16128, TIMELY: 20352.]
ᵃ Adopting 8-bit inputs/outputs/weights when compared with PRIME, which uses 6-bit inputs/outputs and 8-bit weights; thus, the improvement over PRIME is slightly higher than the results in (a) and (b). ᵇ Adopting 16-bit inputs/outputs/weights when compared with ISAAC, which uses the same data precision.

    Fig. 8. (a) The normalized energy efficiency and (b) throughput of TIMELY over PRIME and ISAAC, respectively, considering various CNNs and DNNs.

TABLE IV
PEAK PERFORMANCE COMPARISON.

                | Energy efficiency | Improvement | Computational density | Improvement
                | (TOPs/W)          | of TIMELY   | (TOPs/(s×mm²))        | of TIMELY
PRIMEᵃ [13]     | 2.10              | +10.0×      | 1.23                  | +31.2×
ISAACᵇ [56]     | 0.38              | +18.2×      | 0.48                  | +20.0×
PipeLayerᵇ [61] | 0.14              | +49.3×      | 1.49                  | +6.4×
AtomLayerᵇ [54] | 0.68              | +10.1×      | 0.48                  | +20.0×
TIMELYᵃ         | 21.00             | n/a         | 38.33                 | n/a
TIMELYᵇ         | 6.90              | n/a         | 9.58                  | n/a

ᵃ one operation: 8-bit MAC; ᵇ one operation: 16-bit MAC

Overall Peak Performance. Compared with representative R2PIM accelerators (see Table IV), TIMELY improves energy efficiency by 10× to 49.3× (10.0× over the strongest baseline, PRIME [13]) and computational density by 6.4× to 31.2× (6.4× over the strongest baseline, PipeLayer [61]). These large improvements result from TIMELY's innovative features – ALB, TDI, O2IR, and intra-sub-Chip pipelines – which aggressively reduce the energy cost of the dominant data movements and increase the number of operations performed in the same time and area. In Table IV, we ensure that TIMELY's precision matches that of each baseline for a fair comparison: we consider an 8-bit TIMELY design when comparing with PRIME and a 16-bit TIMELY design when comparing with ISAAC, PipeLayer, and AtomLayer.
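The improvement columns in Table IV can be reproduced directly from the peak numbers (8-bit rows compare against the 8-bit TIMELY variant, 16-bit rows against the 16-bit variant). A small cross-check with the values copied from Table IV:

```python
# Cross-checking Table IV: TIMELY's reported improvements are the ratios
# of the peak numbers, variant-matched by precision.

baselines = {  # name: (energy eff. TOPs/W, density TOPs/(s*mm^2), variant)
    "PRIME":     (2.10, 1.23, "8b"),
    "ISAAC":     (0.38, 0.48, "16b"),
    "PipeLayer": (0.14, 1.49, "16b"),
    "AtomLayer": (0.68, 0.48, "16b"),
}
timely = {"8b": (21.00, 38.33), "16b": (6.90, 9.58)}

for name, (ee, cd, var) in baselines.items():
    tee, tcd = timely[var]
    print(f"{name}: {tee/ee:.1f}x energy efficiency, {tcd/cd:.1f}x density")
```

Running this reproduces the 10.0×/31.2× (PRIME), 18.2×/20.0× (ISAAC), 49.3×/6.4× (PipeLayer), and 10.1×/20.0× (AtomLayer) figures from Table IV.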

Energy Efficiency on Various CNN and DNN Models. We evaluate TIMELY on various models (1 MLP and 13 CNNs) to validate that its superior performance generalizes to different computational and data movement patterns. Fig. 8 (a) shows the normalized energy efficiency of TIMELY over PRIME and ISAAC. We can see that TIMELY outperforms both PRIME and ISAAC on all CNN and DNN models. Specifically, TIMELY is on average 10× and 14.8× more energy efficient than PRIME and ISAAC, respectively (see the geometric means in the rightmost part of Fig. 8 (a)). This set of experimental results demonstrates that TIMELY's superior energy efficiency is independent of the CNN/DNN, i.e., of the computational and data movement patterns. In addition, as shown in Fig. 8 (a), TIMELY's energy efficiency improvement decreases for small or compact CNNs, such as CNN-1 [13] and SqueezeNet [28], because their data movement energy costs are relatively small: these models can be mapped into one ReRAM bank of PRIME or one ReRAM tile of ISAAC, and thus do not require high-cost memory accesses, limiting the energy savings achievable by TIMELY.

Throughput on Various CNNs. Fig. 8 (b) shows TIMELY's normalized throughput over PRIME and ISAAC on various CNNs (a total of 8) considering three chip configurations (16, 32, and 64 chips). As the throughput is a function of the weight duplication ratio, we consider only the CNNs for which PRIME or ISAAC provides the corresponding weight duplication ratios. Compared to PRIME, TIMELY enhances the throughput by 736.6× for the 16-chip, 32-chip, and 64-chip configurations on VGG-D. TIMELY's throughput advantage results from its intra-sub-Chip pipeline, which minimizes the latency between two pipelined outputs. In addition, PRIME can work in both a memory mode and a computation mode (i.e., accelerating CNNs), limiting the number of crossbars available for CNN computation on a chip to over 20× fewer than TIMELY's (i.e., 1024 vs. 20352; see the right corner of Fig. 8 (b)). Compared to ISAAC on 7 CNNs, TIMELY on average enhances the throughput by 2.1×, 2.4×, and 2.7× for the 16-chip, 32-chip, and 64-chip configurations, respectively. In Fig. 8 (b), we consider only the 64-chip (or the 32-chip and 64-chip) configurations for large CNNs, such as MSRA-1/-2/-3, to ensure that all the models can be mapped into one TIMELY or ISAAC accelerator. TIMELY's enhanced throughput stems from the fact that ISAAC adopts serial operations and requires 22 pipeline-cycles (each 100 ns) to finish one 16-bit MAC operation, whereas TIMELY employs intra-sub-Chip pipelines and needs two pipeline-cycles (each 200 ns).

Accuracy. We observe ≤ 0.1% inference accuracy loss across various CNN and DNN models in system-level simulations that include circuit-level errors extracted from Cadence simulation. The simulation methodology is adapted from prior work [33], [34]. Specifically, we first obtain the noise and PVT variations of the X-subBuf, P-subBuf, I-adder, DTC, and TDC (by Monte-Carlo simulations in Cadence); the errors follow a Gaussian distribution. We then add the equivalent noise during training and use the trained weights for inference. Note


Fig. 9. The effectiveness of TIMELY's innovations: (a) a breakdown of energy savings (over PRIME on VGG-D) achieved by different features, i.e., (① and ②) vs. ③ of TIMELY; (b) comparing the energy costs of the interfacing circuits in TIMELY and PRIME; energy breakdowns with regard to both (c) memory types and (d) data types in both TIMELY and PRIME; and (e) the contributing factors for the energy savings per data type in TIMELY (see (d)).

that prior work has shown that adding Gaussian noise during training can yield negligible accuracy loss [51], [52], [55]. To achieve ≤ 0.1% accuracy loss, we set the number of cascaded X-subBufs to 12. The accumulated error of the cascaded X-subBufs is √12·ε [18], where ε is the potential error of one X-subBuf. √12·ε is less than 20×28 ps, which can be tolerated by the design margin of 40×28 ps and thus does not cause a loss of inference accuracy.
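The error-budget argument can be sketched as follows; the √N accumulation rule and the 28 ps unit are from the text, while the helper-function name is ours.

```python
import math

# Sketch of the cascaded X-subBuf error budget described above.
N_CASCADED = 12   # number of cascaded X-subBufs
UNIT_PS = 28      # time unit (ps) used for the bounds in the text

def accumulated_error_ps(eps_ps, n=N_CASCADED):
    """Accumulated error of n cascaded buffers: sqrt(n) * per-buffer error."""
    return math.sqrt(n) * eps_ps

bound_ps = 20 * UNIT_PS      # stated bound on sqrt(12)*eps: 560 ps
margin_ps = 40 * UNIT_PS     # stated design margin: 1120 ps
assert bound_ps < margin_ps  # the accumulated error fits within the margin
```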

    C. Effectiveness of TIMELY’s Innovations

Effectiveness of TIMELY's Innovations on Energy Savings. We present an energy breakdown analysis to demonstrate how TIMELY reduces the energy consumption on VGG-D as compared with PRIME, the most competitive R2PIM accelerator in terms of energy efficiency. In Fig. 8 (a), we can see that TIMELY improves the energy efficiency by 15.6× as compared to PRIME.

Overview. We first show the breakdown of energy savings achieved by TIMELY's different features. TIMELY's ALB and O2IR contribute up to 99% of the energy savings, and its TDI accounts for the remaining 1% (see Fig. 9 (a)).

Effectiveness of TIMELY's ALB and O2IR. We compare TIMELY's energy breakdown with regard to both memory types and data types with those of PRIME in Fig. 9 (c) and (d), respectively. In Fig. 9 (c), TIMELY's ALB and O2IR together reduce the energy consumption of memory accesses by 93% compared with PRIME. Specifically, the ALB and O2IR features enable TIMELY to fully exploit local buffers within its sub-Chips, minimizing accesses to the L1 memory and removing the need to access an L2 memory.

In Fig. 9 (d), TIMELY reduces the energy consumption associated with the data movement of Psums, inputs, and outputs by 99.9%, 95.8%, and 87.1%, respectively. The contributing factors are summarized in Fig. 9 (e). Specifically, (1) TIMELY handles most of the Psums locally via the P-subBufs within the sub-Chips, aggressively reducing the energy cost of Psum data movements; (2) TIMELY's O2IR feature ensures all the input data are fetched only once from the L1 memory, while its ALB feature (i.e., the X-subBufs here) allows the fetched inputs to be stored and transferred via X-subBufs between the crossbars; and (3) thanks to the employed P-subBufs and X-subBufs, TIMELY removes the need for an L2 memory, which has 146.7×/6.9× higher read/write energy than an L1 memory, reducing the energy cost of writing outputs back to memory (an L1 memory in TIMELY vs. an L2 memory in PRIME). Furthermore, as another way to see the effectiveness of TIMELY's O2IR feature, we summarize both PRIME's and TIMELY's total numbers of input accesses to the L1 memory in Table V (considering the first six CONV layers as examples). TIMELY requires about 88.9% fewer L1 memory accesses.

TABLE V
THE TOTAL NO. OF L1 MEMORY ACCESSES FOR READING INPUTS IN TIMELY AND PRIME [13] CONSIDERING VGG-D.

            CONV1    CONV2    CONV3    CONV4    CONV5    CONV6
PRIME [13]  1.35 M   28.90 M  7.23 M   14.45 M  3.61 M   7.23 M
TIMELY      0.15 M   3.21 M   0.80 M   1.61 M   0.40 M   0.80 M
Save by     88.9%    88.9%    88.9%    88.9%    88.9%    88.9%
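Table V's "Save by" row follows directly from its first two rows; a small script reproducing it (per-layer access counts in millions, copied from the table):

```python
# Input reads to the L1 memory per CONV layer of VGG-D (in millions),
# taken from Table V.
prime  = [1.35, 28.90, 7.23, 14.45, 3.61, 7.23]
timely = [0.15, 3.21, 0.80, 1.61, 0.40, 0.80]

savings = [1 - t / p for p, t in zip(prime, timely)]
print([round(100 * s, 1) for s in savings])  # ~88.9 for every layer
```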

Effectiveness of TIMELY's TDI. Although the DTCs and TDCs contribute only 1% of TIMELY's energy savings over PRIME, the total energy of the DTCs and TDCs in TIMELY is 99.6% less than that of the ADCs and DACs in PRIME (see Fig. 9 (b)). This is because (1) the unit energy of one DTC/TDC is about 30%/23% of that of a DAC/ADC; (2) the increased analog data locality due to the ALBs largely reduces the need to activate DTCs and TDCs; and (3) TIMELY's O2IR feature aggressively reduces the required DTC conversions thanks to its much-reduced input accesses to the L1 memory (see Table V).

Effectiveness of TIMELY's Innovations on Area Reduction. We analyze the area breakdown to show how TIMELY's innovations reduce the area of peripheral circuits, which helps to improve the computational density (see Table IV). In Fig. 10 (a), the percentage of the ReRAM crossbar area in TIMELY (i.e., 2.2%) is 5.5× higher than that in ISAAC (i.e., 0.4%) [56]; the percentage of the ReRAM array area in PRIME is small enough to be ignored [13]. TIMELY's higher percentage of ReRAM crossbar array area benefits from the area-efficient circuit implementations of its ALB, TDI, and O2IR features. Specifically, as shown in Fig. 10 (b), X-subBufs and P-subBufs occupy 55.2% of the chip area, DTCs and TDCs occupy 28% of the chip area, and the area of the CMOS logic introduced by O2IR is negligible.

Fig. 10. (a) The percentage of ReRAM crossbar area in the total area of PRIME [13], ISAAC [56], and TIMELY, and (b) the area breakdown of TIMELY (DTC 14.2%, TDC 13.8%, P-subBuf 26.7%, X-subBuf 28.5%, charging + comparator 14.2%, ReRAM 2.2%).

Fig. 11. Comparing the total energy cost of intra-bank data movements between PRIME and PRIME with TIMELY's ALB and O2IR applied: (a) applying ALB and O2IR to the PRIME architecture and (b) the resulting energy reduction.
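As a quick sanity check, the component shares read off Fig. 10 (b) can be cross-checked against the 55.2% and 28% figures quoted in the text (simple arithmetic; the dictionary keys are our shorthand for the figure's labels):

```python
# Area shares of TIMELY's chip from Fig. 10 (b), in percent.
area = {
    "DTC": 14.2, "TDC": 13.8,
    "P-subBuf": 26.7, "X-subBuf": 28.5,
    "charging + comparator": 14.2, "ReRAM": 2.2,
}
print(round(area["X-subBuf"] + area["P-subBuf"], 1))  # 55.2 (ALB buffers)
print(round(area["DTC"] + area["TDC"], 1))            # 28.0 (TDIs)
print(round(sum(area.values()), 1))                   # 99.6 (shares as drawn)
```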

Generalization of TIMELY's Innovations. TIMELY's innovative features are generalizable and can be applied to state-of-the-art R2PIM accelerators to boost their energy efficiency. To demonstrate this, we apply ALB and O2IR to PRIME based on the following considerations. The ALB feature, together with O2IR, contributes the dominant share of the energy savings (see Fig. 9 (a)). From the perspective of data accesses and interfaces, PRIME uses the same architecture shown in Fig. 5 (a) as ISAAC [56]/PipeLayer [61]. To evaluate, we modify the PRIME architecture as shown in Fig. 11 (a). In PRIME's original design [13], the Mem subarray block serves as the main memory, and CNN/DNN computations are performed only in the FF subarray. A Mem subarray and an FF subarray compose one bank, and multiple banks are connected by a bus. We add X-subBufs and P-subBufs between the 128 ReRAM crossbar arrays in the FF subarray of each bank, and modify the weight mapping and input access dataflow based on O2IR, while keeping PRIME's original design outside the FF subarray. Thus, ALB and O2IR affect only the intra-bank energy. In this experiment, we adopt the same component parameters as those used in PRIME's original design. Fig. 11 (b) shows that applying the ALB and O2IR principles to the FF subarrays in PRIME reduces the intra-bank data-movement energy by 68%.

    D. Discussion

Area scaling of TIMELY (by adjusting the number of sub-Chips shown in Table II) does not affect throughput and only slightly affects energy. This is because the throughput is determined only by the intra-sub-Chip pipeline (see Section IV-E); adjusting the number of sub-Chips in one chip only changes the inter-chip energy (i.e., the energy of memory L3 in Fig. 9 (c)), which accounts for a negligible part of the total energy.

VII. RELATED WORK

Non-PIM CNN/DNN Accelerators. Although memory is only used for data storage in non-PIM accelerators, computing units are being pushed closer to compact memories to reduce energy and area. For accelerators with off-chip DRAM, DRAM accesses consume two orders of magnitude more energy than on-chip memory accesses (e.g., 130× higher than a 32-KB cache at 45 nm [26]). As a result, DRAM consumes more than 95% of the total energy in DianNao [12], [17]. To break through this off-chip bottleneck, on-chip SRAM is widely used as the mainstream on-chip memory solution [23]. However, SRAM's low density limits its on-die capacity even with technology scaling; for example, EIE adopts a 10-MB SRAM that takes 93.2% of the total area [23]. To address the area issue, on-chip eDRAM is used in RANA [64] and DaDianNao [10], as eDRAM can save about 74% area while providing 32 KB capacity in 65 nm [53], [64]. However, the refresh energy in eDRAM can be dominant (e.g., about 10× as high as the data-access energy [64]). For FPGA-based designs, performance is also limited by memory accesses [19], [21], [53], [57], with limited flexibility in choosing memory technologies. Different from these non-PIM accelerators, TIMELY improves energy efficiency by computing in memory and enhances computational density by adopting high-density ReRAM.

PIM CNN/DNN Accelerators. While PIM accelerators integrate computing units in memory to save the energy of accessing weights, the achievable energy efficiency and computational density remain limited. The limited energy efficiency is induced by the energy cost of input and Psum movements and the overhead of interfacing circuits: PRIME [13] spends 81% of its total energy accessing inputs and Psums, and ISAAC [56] consumes 61% of its total energy operating DACs/ADCs. The limited computational density is related to memory technologies. Processing in SRAM, for example, faces this limitation: not only does one SRAM bit-cell typically store only a 1-bit weight [5], [21], [37], [59], [70], but SRAM's bit-cell structure (e.g., the 6T [21], [37], 8T [59], [70], or 10T [5] structure) also decreases density. The proposed TIMELY adopts high-density ReRAM and addresses the two key energy challenges with techniques including ALBs, TDIs, and O2IR. TIMELY achieves up to 27.6× improvement in energy efficiency (over ISAAC), 31.2× improvement in computational density (over PRIME), and 736.6× improvement in throughput (over PRIME). Similar to the effect of the ALBs used in TIMELY, a recent R2PIM accelerator [14] also increases the amount of data in the analog domain for energy optimization; however, it only optimizes the computation energy (including the energy of interfacing circuits).

    VIII. CONCLUSIONS

In this paper, we analyze existing designs of R2PIM accelerators and identify three opportunities to greatly enhance their energy efficiency: analog data locality, time-domain interfacing, and input access reduction. These three opportunities inspire the three key features of TIMELY: (1) ALBs, (2) interfacing with TDCs/DTCs, and (3) an O2IR mapping method. TIMELY outperforms the state-of-the-art in both energy efficiency and computational density while maintaining better throughput.


REFERENCES

[1] "NVIDIA Deep Learning Accelerator (NVDLA)." [Online]. Available: http://nvdla.org/primer.html

[2] A. Amravati, S. B. Nasir, S. Thangadurai, I. Yoon, and A. Raychowdhury, "A 55nm time-domain mixed-signal neuromorphic accelerator with stochastic synapses and embedded reinforcement learning for autonomous micro-robots," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018.

[3] A. Amravati, S. B. Nasir, S. Thangadurai, I. Yoon, and A. Raychowdhury, "A 55nm time-domain mixed-signal neuromorphic accelerator with stochastic synapses and embedded reinforcement learning for autonomous micro-robots," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018.

[4] M. Bavandpour, M. R. Mahmoodi, and D. B. Strukov, "Energy-efficient time-domain vector-by-matrix multiplier for neurocomputing and beyond," IEEE Transactions on Circuits and Systems II: Express Briefs, 2019.

[5] A. Biswas and A. P. Chandrakasan, "Conv-ram: An energy-efficient sram with embedded convolution computation for low-power cnn-based machine learning applications," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018.

[6] I. Bratt, "Arm's First-Generation Machine Learning Processor," Hot Chips, 2018. [Online]. Available: https://www.hotchips.org/hc30/2conf/2.07 ARM ML Processor HC30 ARM 2018 08 17.pdf

[7] N. Cao, M. Chang, and A. Raychowdhury, "14.1 a 65nm 1.1-to-9.1tops/w hybrid-digital-mixed-signal computing platform for accelerating model-based and model-free swarm robotics," in 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019.

[8] L. Chen, J. Li, Y. Chen, Q. Deng, J. Shen, X. Liang, and L. Jiang, "Accelerator-friendly neural-network training: Learning variations and defects in rram crossbar," in Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2017.

[9] Y. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016.

[10] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "Dadiannao: A machine-learning supercomputer," in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014.

[11] Z. Chen and J. Gu, "19.7 a scalable pipelined time-domain dtw engine for time-series classification using multibit time flip-flops with 140 giga-cell-updates/s throughput," in 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019.

[12] T. Chen, Z. Du, N. Sun et al., "Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.

[13] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016.

[14] T. Chou, W. Tang, J. Botimer, and Z. Zhang, "Cascade: Connecting rrams to extend analog dataflow in an end-to-end in-memory processing paradigm," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '52. New York, NY, USA: ACM, 2019, pp. 114–125. [Online]. Available: http://doi.acm.org/10.1145/3352460.3358328

[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[16] L. Deng, "The mnist database of handwritten digit images for machine learning research [best of the web]," IEEE Signal Processing Magazine, vol. 29, no. 6, 2012.

[17] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "Shidiannao: Shifting vision processing closer to the sensor," in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 2015.

[18] P. Dudek, S. Szczepanski, and J. V. Hatfield, "A high-resolution cmos time-to-digital converter utilizing a vernier delay line," IEEE Journal of Solid-State Circuits, vol. 35, no. 2, pp. 240–247, Feb 2000.

[19] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth fpga accelerator for sparse matrix-vector multiplication," in 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines, 2014.

[20] B. P. Ginsburg and A. P. Chandrakasan, "500-ms/s 5-bit adc in 65-nm cmos with split capacitor array dac," IEEE Journal of Solid-State Circuits, vol. 42, no. 4, April 2007.

[21] S. K. Gonugondla, M. Kang, and N. Shanbhag, "A 42pj/decision 3.12tops/w robust in-memory machine learning classifier with on-chip training," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), Feb 2018.

[22] A. Grossi, E. Vianello, M. M. Sabry, M. Barlas, L. Grenouillet, J. Coignus, E. Beigne, T. Wu, B. Q. Le, M. K. Wootters, C. Zambelli, E. Nowak, and S. Mitra, "Resistive ram endurance: Array-level characterization and correction techniques targeting deep learning applications," IEEE Transactions on Electron Devices, vol. PP, 2019.

[23] S. Han, X. Liu, H. Mao et al., "Eie: efficient inference engine on compressed deep neural network," in ACM International Symposium on Computer Architecture (ISCA), 2016.

[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[25] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015.

[26] M. Horowitz, "1.1 computing's energy problem (and what we can do about it)," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014.

[27] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, "Dot-product engine for neuromorphic computing: Programming 1t1m crossbar to accelerate matrix-vector multiplication," in Proceedings of the 53rd Annual Design Automation Conference (DAC), 2016.

[28] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and

ding of computation in sram," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 8326–8330.

[35] A. Kaur and R. Pandey, "Current mode computational circuits for analog signal processing," International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, vol. 3, no. 4, 2014.

[36] A. Kaur and R. Pandey, "Current mode computational circuits for analog signal processing," International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, 2014.

[37] W. Khwa, J. Chen, J. Li, X. Si, E. Yang, X. Sun, R. Liu, P. Chen, Q. Li, S. Yu, and M. Chang, "A 65nm 4kb algorithm-dependent computing-in-memory sram unit-macro with 2.3ns and 55.8tops/w fully parallel product-sum operation for binary dnn edge processors," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), Feb 2018.

[38] O. Krestinskaya, I. Fedorova, and A. P. James, "Memristor load current mirror circuit," in 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2015.

[39] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Braendli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 3.1 mw 8b 1.2 gs/s single-channel asynchronous sar adc with alternate comparators for enhanced speed in 32 nm digital soi cmos," IEEE Journal of Solid-State Circuits, 2013.

[40] C. Lee, H. Lin, C. Lien, Y. Chih, and J. Chang, "A 1.4mb 40-nm embedded reram macro with 0.07um2 bit cell, 2.7ma/100mhz low-power read and hybrid write verify for high endurance application," in 2017 IEEE Asian Solid-State Circuits Conference (A-SSCC), Nov 2017.

[41] S. Levantino, G. Marzin, and C. Samori, "An adaptive pre-distortion technique to mitigate the dtc nonlinearity in digital plls," IEEE Journal of Solid-State Circuits, 2014.

[42] M. Li, C. Yang, and Y. Ueng, "A 5.28-gb/s ldpc decoder with time-domain signal processing for ieee 802.15.3c applications," IEEE Journal of Solid-State Circuits, 2017.

[43] S. Li, A. O. Glova, X. Hu, P. Gu, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, "Scope: A stochastic computing engine for dram-based in-situ accelerator," in Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-51 '18, 2018.

[44] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, "Drisa: A dram-based reconfigurable in-situ accelerator," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-50 '17, 2017.

[45] C. Liu, S. Chang, G. Huang, and Y. Lin, "A 10-bit 50-ms/s sar adc with a monotonic capacitor switching procedure," IEEE Journal of Solid-State Circuits, vol. 45, no. 4, 2010.

[46] C. Liu, M. Hu, J. P. Strachan, and H. Li, "Rescuing memristor-based neuromorphic design with high defects," in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), June 2017.

[47] M. Liu, L. R. Everson, and C. H. Kim, "A scalable time-based integrate-and-fire neuromorphic core with brain-inspired leak and local lateral inhibition capabilities," in 2017 IEEE Custom Integrated Circuits Conference (CICC), 2017.

[48] A. Mantyniemi, T. Rahkonen, and J. Kostamovaara, "A cmos time-to-digital converter (tdc) based on a cyclic time domain successive approximation interpolation method," IEEE Journal of Solid-State Circuits, 2009.

[49] D. Miyashita, R. Yamaki, K. Hashiyoshi, H. Kobayashi, S. Kousai, Y. Oowaki, and Y. Unekawa, "An ldpc decoder with time-domain analog and digital mixed-signal processing," IEEE Journal of Solid-State Circuits, 2014.

[50] D. Miyashita, R. Yamaki, K. Hashiyoshi, H. Kobayashi, S. Kousai, Y. Oowaki, and Y. Unekawa, "A 10.4 pj/b (32, 8) ldpc decoder with time-domain analog and digital mixed-signal processing," in 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers, 2013, pp. 420–421.

[51] A. S. Morcos, D. G. Barrett, N. C. Rabinowitz, and M. Botvinick, "On the importance of single directions for generalization," arXiv preprint arXiv:1803.06959, 2018.

[52] A. Polino, R. Pascanu, and D. Alistarh, "Model compression via distillation and quantization," arXiv preprint arXiv:1802.05668, 2018.

[53] M. Poremba, S. Mittal, D. Li, J. S. Vetter, and Y. Xie, "Destiny: A tool for modeling emerging 3d nvm and edram caches," in 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015.

[54] X. Qiao, X. Cao, H. Yang, L. Song, and H. Li, "Atomlayer: A universal reram-based cnn accelerator with atomic layer computation," in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 2018.

[55] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, "How does batch normalization help optimization?" in Advances in Neural Information Processing Systems, 2018, pp. 2483–2493.

[56] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016.

[57] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, "From high-level deep neural models to fpgas," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.

[58] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," arXiv preprint arXiv:1701.06538, 2017.

[59] X. Si, J. Chen, Y. Tu, W. Huang, J. Wang, Y. Chiu, W. Wei, S. Wu, X. Sun, R. Liu, S. Yu, R. Liu, C. Hsieh, K. Tang, Q. Li, and M. Chang, "24.5 a twin-8t sram computation-in-memory macro for multiple-bit cnn-based machine learning," in 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019.

[60] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. ICLR, 2015.

[61] L. Song, X. Qian, H. Li, and Y. Chen, "Pipelayer: A pipelined reram-based accelerator for deep learning," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017.

[62] V. Sze, Y. Chen, T. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, no. 12, 2017.

[63] W. Tseng, J. Wu, and Y. Chu, "A cmos 8-bit 1.6-gs/s dac with digital random return-to-zero," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 58, no. 1, Jan 2011.

[64] F. Tu, W. Wu, S. Yin, L. Liu, and S. Wei, "Rana: Towards efficient neural acceleration with refresh-optimized embedded dram," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018.

[65] I. Valov, R. Waser, J. R. Jameson, and M. N. Kozicki, "Electrochemical metallization memories – fundamentals, applications, prospects," Nanotechnology, vol. 22, 2011.

[66] R. Waser, R. Dittmann, G. Staikov, and K. Szot, "Redox-based resistive switching memories – nanoionic mechanisms, prospects, and challenges," Advanced Materials, vol. 21, 2009.

[67] L. Weitao, L. Fule, and W. Zhihua, "High-resolution and high-speed integrated cmos ad converters for low-power applications," Springer, 2018.

[68] H.-S. P. Wong, H. Lee, S. Yu, Y. Chen, Y. Wu, P. Chen, B. Lee, F. T. Chen, and M. Tsai, "Metal-oxide rram," Proceedings of the IEEE, vol. 100, no. 6, 2012.

[69] H. Yang, Y. Zhu, and J. Liu, "End-to-End Learning of Energy-Constrained Deep Neural Networks," arXiv e-prints, 2018.

[70] J. Yang, Y. Kong, Z. Wang, Y. Liu, B. Wang, S. Yin, and L. Shi, "24.4 sandwich-ram: An energy-efficient in-memory bwn architecture with pulse-width modulation," in 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019.

[71] T.-J. Yang, Y.-H. Chen, and V. Sze, "Designing energy-efficient convolutional neural networks using energy-aware pruning," arXiv preprint, 2017.

[72] P. Yao, H. Wu, B. Gao, S. B. Eryilmaz, X. Huang, W. Zhang, Q. Zhang, N. Deng, L. Shi, H.-S. P. Wong, and H. Qian, "Face classification using electronic synapses," Nature Communications, 2017.

[73] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and H. Qian, "Fully hardware-implemented memristor convolutional neural network," Nature, vol. 577, no. 7792, 2020.

[74] S. Yu, "Neuro-inspired computing with emerging nonvolatile memory," Proceedings of the IEEE, 2018.

[75] C. Zhang, J. Gu, L. Gao, T. Ouyang, and B. Wang, "Time-domain computing circuits for addition and multiplication computation," in 2017 International Conference on Electron Devices and Solid-State Circuits (EDSSC), 2017.

[76] J. Zhang, Z. Wang, and N. Verma, "In-memory computation of a machine-learning classifier in a standard 6t sram array," IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 915–924, 2017.


[77] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, "Cambricon-x: An accelerator for sparse neural networks," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.

[78] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, "Dnnbuilder: an automated tool for building high-performance dnn hardware accelerators for fpgas," in Proceedings of the International Conference on Computer-Aided Design. ACM, 2018, p. 56.

[79] Y. Zhu, C. Chan, U. Chio, S. Sin, S. U, R. P. Martins, and F. Maloberti, "A 10-bit 100-ms/s reference-free sar adc in 90 nm cmos," IEEE Journal of Solid-State Circuits, vol. 45, no. 6, 2010.
