


TIE: Energy-efficient Tensor Train-based Inference Engine for Deep Neural Network

Chunhua Deng*
Rutgers University
[email protected]

Fangxuan Sun*
Nanjing University
[email protected]

Xuehai Qian
University of Southern California
[email protected]

Jun Lin
Nanjing University
[email protected]

Zhongfeng Wang
Nanjing University
[email protected]

Bo Yuan
Rutgers University
[email protected]

ABSTRACT

In the era of artificial intelligence (AI), deep neural networks (DNNs) have emerged as the most important and powerful AI technique. However, large DNN models are both storage and computation intensive, posing significant challenges for adopting DNNs in resource-constrained scenarios. Thus, model compression becomes a crucial technique to ensure the wide deployment of DNNs.

This paper advances the state of the art by considering tensor train (TT) decomposition, a very promising but not yet explored compression technique in the architecture domain. The method features extremely high compression ratios. However, the challenge is that inference on TT-format DNN models inherently incurs a massive amount of redundant computations, causing significant energy consumption. Thus, the straightforward application of TT decomposition is not feasible.

To address this fundamental challenge, this paper develops a computation-efficient inference scheme for TT-format DNNs, which enjoys two key merits: 1) it achieves the theoretical limit on the number of multiplications, thus eliminating all redundant computations; and 2) its multi-stage processing scheme reduces the intensive memory access to all tensor cores, bringing significant energy savings.

Based on the novel inference scheme, we develop TIE, a TT-format compressed DNN-targeted inference engine. TIE is highly flexible, supporting different types of networks for different needs. A 16-processing-element (PE) prototype is implemented using CMOS 28nm technology. Operating at 1000 MHz, the TIE accelerator occupies 1.74 mm² and consumes 154.8 mW. Compared with EIE, TIE achieves 7.22×∼10.66× better area efficiency and 3.03×∼4.48× better energy efficiency on different workloads, respectively. Compared with CIRCNN, TIE achieves 5.96× and 4.56× higher throughput and energy efficiency, respectively. The results show that TIE exhibits significant advantages over state-of-the-art solutions.

*Both authors contributed equally to this research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ISCA '19, June 22–26, 2019, Phoenix, AZ, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6669-4/19/06. . . $15.00
https://doi.org/10.1145/3307650.3322258

CCS CONCEPTS
• Computer systems organization → Embedded hardware.

KEYWORDS
Deep learning, tensor-train decomposition, compression, acceleration

ACM Reference Format:
Chunhua Deng, Fangxuan Sun, Xuehai Qian, Jun Lin, Zhongfeng Wang, and Bo Yuan. 2019. TIE: Energy-efficient Tensor Train-based Inference Engine for Deep Neural Network. In The 46th Annual International Symposium on Computer Architecture (ISCA '19), June 22–26, 2019, Phoenix, AZ, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3307650.3322258

1 INTRODUCTION

In the emerging artificial intelligence (AI) era, deep neural networks (DNNs) [25] have become the most important and popular machine learning (ML) technique. Thanks to their unique large-scale structure consisting of thousands of neurons and millions of connections, DNNs obtain very strong learning and representation capability. Thus, DNNs are able to deliver and guarantee very high task accuracy in many real-life intelligence-demanding applications [5, 15, 44, 56, 73].

However, the unprecedented high task performance of DNNs does not come for free: executing these large-scale DNNs normally demands large storage space and powerful computing resources. This is because, different from many other ML approaches, the performance of DNNs is strongly related to their scale: both theoretical analysis and empirical experiments have suggested that deepening or widening DNN models can bring better representation capability and higher classification accuracy [29, 79]. Motivated by this promising property, today's AI researchers tend to use and scale up large DNN models to achieve the required high accuracy in various AI tasks. The consequence of this trend, however, is that state-of-the-art DNNs are all very storage-intensive and computation-intensive, which poses severe challenges for the efficient implementation of DNNs in many resource-constrained embedded and mobile systems.

In order to facilitate and promote the widespread deployment of DNNs in a broader scope of application scenarios, both the ML and hardware communities have conducted extensive investigations on compressing DNNs with affordable accuracy loss. Specifically, due to the well-recognized and verified redundancy of DNN models [24, 27, 28], different compression approaches, such as pruning [27],


clustering [24], low-rank decomposition [46] and low bit-width [47], etc., have been proposed and adopted to remove the redundancy in the structure, layers, weights or numerical precision of DNN models. Correspondingly, several compression-oriented DNN accelerators [6, 26, 75] have also been customized for those compression approaches to achieve high hardware performance.

Among various DNN compression techniques, tensor-train (TT) decomposition [52] is unique due to its extremely high compression ratios. For instance, experiments in [50] show that applying TT decomposition to the fully-connected (FC) layers of VGG-16 [50] on the ImageNet dataset [17] can bring a record-breaking 50000× compression ratio, while other compression approaches typically achieve much less compression on FC layers. Moreover, due to the generality of TT decomposition, this approach can also be applied to compressing convolutional (CONV) layers via decomposing the weight tensors of CONV layers [23]. To date, the ML community [23, 50, 74, 77] has successfully verified the effectiveness of TT decomposition and its variant (TT ring) [81] on several representative types of DNNs, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

From the perspective of tensor theory, the impressive compression capability of TT decomposition comes from its unique tensor factorization scheme. As illustrated in Fig. 1, TT decomposition can decompose a $d$-dimensional $n_1 \times n_2 \times \dots \times n_d$ tensor $\mathcal{A}$ into $d$ 3-dimensional $r_{k-1} \times n_k \times r_k$ tensor cores, where $r_k$ is the pre-set rank value. Thanks to this special representation scheme, only $\sum_{k=1}^{d} n_k r_{k-1} r_k$ parameters need to be stored in the TT format, whereas conventionally $\prod_{k=1}^{d} n_k$ parameters are required for an explicit representation. Since in practice $r_k$ is typically small, the compression ratio, defined as $\prod_{k=1}^{d} n_k / \sum_{k=1}^{d} n_k r_{k-1} r_k$, can be very significant and hence brings orders-of-magnitude reduction in storage cost.

Due to the promising advantages of TT decomposition for model compression, exploiting an efficient DNN hardware architecture based on TT decomposition (referred to as TT-DNN) is very attractive. Considering the exceedingly high compression ratios that TT decomposition can bring, such a specialized architecture, if properly developed, has the potential to execute state-of-the-art CNN and RNN models using very small memory resources, thereby leading to highly area- and energy-efficient solutions for today's resource-constrained but memory- and compute-intensive DNN accelerator designs.
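To make the storage arithmetic concrete, the short sketch below (an illustration, not part of the original paper) computes the explicit and TT-format parameter counts and the resulting compression ratio; the sample shapes follow the Fig. 1 example with $n=(3,4,5)$ and ranks $r=(1,2,2,1)$, which yields 60 versus 32 parameters.

```python
import numpy as np

def tt_param_counts(n, r):
    """Parameter counts for a d-dim tensor stored explicitly vs. in TT format.

    n: list of mode sizes [n_1, ..., n_d]
    r: list of TT ranks  [r_0, r_1, ..., r_d] with r_0 = r_d = 1
    """
    explicit = int(np.prod(n))                                   # prod_k n_k
    tt = sum(n[k] * r[k] * r[k + 1] for k in range(len(n)))      # sum_k n_k r_{k-1} r_k
    return explicit, tt, explicit / tt

# Fig. 1 example: a 3x4x5 tensor (a reshaped 5x12 matrix), d = 3, ranks (1, 2, 2, 1)
explicit, tt, cr = tt_param_counts([3, 4, 5], [1, 2, 2, 1])
print(explicit, tt, round(cr, 2))   # 60 32 1.88
```

For larger layers with small ranks the same formula is what produces the orders-of-magnitude compression ratios quoted above.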

However, realizing a high-performance TT-DNN accelerator is far from trivial: it must overcome the severe challenge of the inefficient inference scheme for a TT-format DNN model. Specifically, the current TT-format inference scheme contains a massive amount of redundant computations, leading to significantly higher computational cost in the inference phase. Moreover, those inherent redundant computations also incur intensive memory accesses, since the tensor cores need to be frequently accessed when calculating each element of the output tensor, thereby causing high energy consumption. As a result, despite the high compression ratios, the inherent inefficiency of the TT-format inference scheme directly impedes the potential deployment of TT-DNN accelerators in energy-constrained applications.

To address this fundamental challenge, this paper exploits a hardware-friendly inference scheme for TT-DNNs. By carefully reviewing and investigating the root cause and mechanism of the redundant computation in the current TT-format inference scheme, we derive the theoretical limit on the minimum number of multiplications needed for TT-format inference. Then, leveraging the same methodology as in the theoretical analysis, we develop a computation-efficient inference scheme for TT-DNNs. The proposed TT-format inference scheme has two benefits: 1) it is very compact, as its required number of multiplications is identical to the theoretical limit, thus eliminating all unnecessary redundant computations; and 2) thanks to its multi-stage processing style, the computing engine only needs to access one tensor core in each stage, thereby leading to significant savings in memory access.

Based on the proposed inference scheme, we develop TIE, the TT-DNN Inference Engine, a novel specialized hardware architecture for TT-DNNs. TIE is designed to fully reap the benefits of our proposed hardware-friendly inference scheme and achieves high computation efficiency as well as simple memory access. Also, TIE is highly flexible and can be adapted to various network types, rank values, numbers of tensor dimensions, and combinations of factorization factors, thereby making itself well suited for various application scenarios and tasks.

We implement a prototype TIE design using CMOS 28nm technology. With 16 processing elements (PEs) operating at 1000 MHz, the TIE accelerator occupies 1.74 mm² and consumes 154.8 mW. Compared to state-of-the-art compressed DNN-oriented accelerators using other compression methods, such as sparsification (EIE [26]) and structured matrices (CIRCNN [18]), the TT decomposition-based TIE exhibits significant advantages in hardware performance. Compared with EIE, TIE achieves 7.22×∼10.66× better area efficiency and 3.03×∼4.48× better energy efficiency on different workloads, respectively. Compared with CIRCNN, TIE achieves 5.96× and 4.56× higher throughput and energy efficiency, respectively.

2 TT-BASED DNN COMPRESSION

2.1 TT Decomposition & TT Format
As introduced in Section 1, TT decomposition is an efficient compression approach to reduce DNN model sizes. In general, the key idea of TT decomposition is to decompose a large multi-dimensional tensor into a set of small 3-dimensional tensors. Specifically, for a $d$-dimensional $n_1 \times n_2 \times \dots \times n_d$ tensor $\mathcal{A}$, after TT decomposition $\mathcal{A}$ is stored in the TT format using $d$ tensor cores $\mathcal{G}_k \in \mathbb{R}^{r_{k-1}\times n_k \times r_k}$, where $k = 1,2,\dots,d$, and each element of $\mathcal{A}$ can be reconstructed as follows [50]:

$$\mathcal{A}(j_1,\dots,j_d) = \mathcal{G}_1[j_1]\times\mathcal{G}_2[j_2]\times\cdots\times\mathcal{G}_d[j_d], \qquad (1)$$

where $\mathcal{G}_k[j_k] \in \mathbb{R}^{r_{k-1}\times r_k}$ is the $j_k$-th slice of the $k$-th tensor core $\mathcal{G}_k$ with $j_k = 1,2,\dots,n_k$, and $r_k$ is the rank of a tensor core. Accordingly, the number of parameters to represent $\mathcal{A}$ is reduced from $\prod_{k=1}^{d} n_k$ to $\sum_{k=1}^{d} n_k r_{k-1} r_k$. It is worth noting that, though the values of $r_k$ can vary since the TT decomposition of an arbitrary tensor is not unique, $r_0$ and $r_d$ are always set to 1 to satisfy the boundary condition.
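As a quick illustration of Eqn. (1), the hedged sketch below (not from the paper; the random cores and shapes are made up) reconstructs every element of a small tensor from its TT cores by multiplying the corresponding slices; since $r_0 = r_d = 1$, each chain of slice products collapses to a scalar.

```python
import itertools
import numpy as np

def tt_reconstruct(cores):
    """Rebuild the full tensor from TT cores via Eqn. (1).

    cores[k] has shape (r_{k-1}, n_k, r_k), with r_0 = r_d = 1.
    """
    ns = [c.shape[1] for c in cores]
    A = np.empty(ns)
    for j in itertools.product(*[range(n) for n in ns]):
        mat = np.eye(1)
        for k, jk in enumerate(j):
            mat = mat @ cores[k][:, jk, :]      # G_k[j_k] is an r_{k-1} x r_k matrix
        A[j] = mat[0, 0]                        # 1x1 result because r_0 = r_d = 1
    return A

# toy cores for a 3x4x5 tensor with ranks (1, 2, 2, 1), as in the Fig. 1 example
rng = np.random.default_rng(0)
cores = [rng.standard_normal(s) for s in [(1, 3, 2), (2, 4, 2), (2, 5, 1)]]
print(tt_reconstruct(cores).shape)   # (3, 4, 5)
```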

In practical use, the value of $r_k$ is typically set to a small value, so that the parameter saving resulting from TT decomposition can be very significant.


Figure 1: Representing a matrix via the TT decomposition of its reshaped tensor. (Example: a $3\times 4\times 5$ tensor with $d=3$ and ranks $(r_0,r_1,r_2,r_3)=(1,2,2,1)$ is stored in cores $\mathcal{G}_1(1\times 3\times 2)$, $\mathcal{G}_2(2\times 4\times 2)$, $\mathcal{G}_3(2\times 5\times 1)$; the 60 original parameters reduce to 32.)

Figure 2: TT-format inference scheme.

Consequently, leveraging TT decomposition to perform efficient compression on DNN models is very attractive, since the fully-connected (FC) and convolutional (CONV) layers of DNNs are in the format of matrices and tensors, which can be decomposed and represented in the TT format. As suggested in [50], in order to maintain high test accuracy, TT decomposition is typically not applied directly to the 2D weight matrix or 4D weight tensor but to their reshaped format. For instance, as illustrated in Fig. 1, in order to store a 5×12 weight matrix in the TT format, the weight matrix is first reshaped to a 3-dimensional tensor with $d=3$, and then it is decomposed and stored in three tensor cores ($\mathcal{G}_1 \sim \mathcal{G}_3$).

2.2 TT-Format Inference & Training on DNNs
In general, when a DNN model is stored in the TT format, the corresponding inference and training schemes need to be re-formulated, since the underlying representation of the weight matrices and tensors of FC and CONV layers has been changed.

Inference on TT-format FC layers. Conventionally, with weight matrix $\mathbf{W}\in\mathbb{R}^{M\times N}$, input vector $\mathbf{x}\in\mathbb{R}^{N}$ and output vector $\mathbf{y}\in\mathbb{R}^{M}$, the inference procedure on an FC layer is $\mathbf{y}=\mathbf{W}\mathbf{x}$.¹ When the weight matrix is represented in the TT format, this inference scheme is re-formulated as follows [52]:

$$\mathcal{Y}(i_1,\dots,i_d) = \sum_{j_1,\dots,j_d}\mathcal{G}_1[i_1,j_1]\,\mathcal{G}_2[i_2,j_2]\cdots\mathcal{G}_d[i_d,j_d]\,\mathcal{X}(j_1,\dots,j_d), \qquad (2)$$

¹Biases are combined with $\mathbf{W}$ for simplicity.

where $\mathcal{Y}\in\mathbb{R}^{m_1\times m_2\times\dots\times m_d}$ and $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times\dots\times n_d}$ are the reshaped $\mathbf{y}$ and $\mathbf{x}$ in tensor format with $M=\prod_{k=1}^{d}m_k$ and $N=\prod_{k=1}^{d}n_k$, respectively. The weight matrix $\mathbf{W}$ is first reshaped into a $d$-dimensional tensor $\mathcal{G}$, and then $\mathcal{G}$ is decomposed and represented in the TT format with $d$ tensor cores $\mathcal{G}_k$, where $k=1,2,\dots,d$. Notice that here $\mathcal{G}_k\in\mathbb{R}^{r_{k-1}\times m_k\times n_k\times r_k}$ is different from the representation in Eqn. (1) in Section 2.1. This is because, as indicated in [50], this 4D representation of tensor cores is better suited for describing matrix-vector multiplication in the TT format.² Accordingly, as illustrated in Fig. 2, $\mathcal{G}_k$ can be viewed as a 2D $m_k$-by-$n_k$ array, where each element of this array ($\mathcal{G}_k[i_k,j_k]$ in Eqn. (2)) is an $r_{k-1}$-by-$r_k$ matrix.
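For reference, the sketch below is a direct, deliberately naive transcription of Eqn. (2) (an illustration under assumed random cores, not the paper's implementation): every output element is formed by a fresh chain of slice products, which is exactly the redundancy analyzed in Section 3.1.

```python
import itertools
import numpy as np

def tt_fc_inference_naive(cores, X):
    """Naive TT-format FC inference per Eqn. (2).

    cores[k] has shape (r_{k-1}, m_k, n_k, r_k) with r_0 = r_d = 1;
    X is the input reshaped to an n_1 x ... x n_d tensor.
    Returns the output tensor Y of shape m_1 x ... x m_d.
    """
    ms = [c.shape[1] for c in cores]
    ns = [c.shape[2] for c in cores]
    Y = np.zeros(ms)
    for i in itertools.product(*[range(m) for m in ms]):
        acc = 0.0
        for j in itertools.product(*[range(n) for n in ns]):
            mat = np.eye(1)
            for k in range(len(cores)):
                mat = mat @ cores[k][:, i[k], j[k], :]   # G_k[i_k, j_k]
            acc += mat[0, 0] * X[j]
        Y[i] = acc
    return Y

# toy layer (made-up sizes): m = (2, 2), n = (3, 4), ranks (1, 2, 1)
rng = np.random.default_rng(1)
cores = [rng.standard_normal(s) for s in [(1, 2, 3, 2), (2, 2, 4, 1)]]
X = rng.standard_normal((3, 4))
print(tt_fc_inference_naive(cores, X).shape)   # (2, 2)
```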

Inference on CONV layers in the TT format. Due to its generality for arbitrary tensors, TT decomposition can also enable efficient inference on a CONV layer, which is affiliated with a weight tensor. Specifically, there are two methods to represent the conventional 4D weight tensor of a CONV layer in the TT format. The first one is to directly apply TT decomposition to the 4D tensor and obtain the corresponding tensor cores. However, as indicated in [23], such a method is not very efficient for CONV layers with small kernel sizes (e.g., 1×1 convolution). Therefore, another method [23] is to reshape the 4D weight tensor to a 2D matrix, and then use the same procedure as for inference on FC layers to perform inference on the CONV layer. As illustrated in Fig. 3, such a transform is mathematically rigorous since the 2D convolution between the 3D input tensor and the 4D weight tensor is equivalent to a matrix-by-matrix multiplication [33]. Consequently,

²We can still use 3D tensor cores to describe the inference scheme; the drawback is the complicated indexing. The 4D representation can be viewed as folding the original 3D tensor.


both the inference on FC layers and on CONV layers can be executed on the same TT-format inference engine.

Figure 3: Converting computation on a CONV layer to matrix multiplication. Here $H' = H - f + 1$ and $W' = W - f + 1$.
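The sketch below illustrates the conversion in Fig. 3 (a hedged example with made-up sizes, stride 1 and no padding): the $f\times f\times C_{in}\times C_{out}$ weight tensor is flattened to an $f^2 C_{in}\times C_{out}$ matrix, the input is unfolded into $H'W'$ patches of length $f^2 C_{in}$, and the 2D convolution becomes a single matrix multiplication.

```python
import numpy as np

def conv_as_matmul(inp, w):
    """2D valid convolution (stride 1) expressed as a matrix multiplication.

    inp: input tensor of shape (H, W, Cin)
    w:   weight tensor of shape (f, f, Cin, Cout)
    Returns an output tensor of shape (H', W', Cout) with H' = H - f + 1.
    """
    H, W, Cin = inp.shape
    f, _, _, Cout = w.shape
    Hp, Wp = H - f + 1, W - f + 1
    # unfold input patches into an (H'W') x (f*f*Cin) matrix
    patches = np.stack([inp[i:i + f, j:j + f, :].ravel()
                        for i in range(Hp) for j in range(Wp)])
    w_mat = w.reshape(f * f * Cin, Cout)        # reshape weights to f^2*Cin x Cout
    return (patches @ w_mat).reshape(Hp, Wp, Cout)

# made-up sizes just to exercise the shapes
rng = np.random.default_rng(2)
out = conv_as_matmul(rng.standard_normal((8, 8, 3)), rng.standard_normal((3, 3, 3, 16)))
print(out.shape)   # (6, 6, 16)
```

The resulting $f^2 C_{in}\times C_{out}$ weight matrix is then what gets reshaped and TT-decomposed, exactly as for an FC layer.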

Training TT-format DNN models. In general, after the sizes of the tensor cores $\mathcal{G}_k$ have been determined, a DNN model in the TT format can either be trained from scratch or obtained from a pre-trained non-TT-format model. Specifically, the train-from-scratch strategy assigns initial values to each tensor core and then performs the backward propagation scheme in [50] to update them. On the other hand, if converting a trained non-TT-format model to the TT format is needed, the standard TT decomposition in [52] is first applied to the weight matrix/tensor of each FC/CONV layer of the model to form the initial values of the tensor cores. Then a backward propagation-based fine-tuning process is performed to retain the original high accuracy.

2.3 Compression & Accuracy Performance
Based on the training and inference schemes described in Section 2.2, TT-format DNN models can be trained and tested. Table 1 - Table 3 list the test accuracy and compression ratio (CR) of different types of DNN models (convolutional neural networks (CNNs) and recurrent neural networks (RNNs)) on different datasets. Here the CR is measured as the reduction in the number of parameters of the model. Specifically, the experimental settings are as follows:

Table 1: FC-dominated CNN on ImageNet.
FC-dominated CNN [50] (NIPS'16)   Accuracy (%)   CR for FC layers   CR for overall network
VGG-16 (baseline)                 69.1           1×                 1×
TT-VGG-16                         67.8           30.9×              7.4×

Table 2: CONV-dominated CNN on CIFAR-10.
CONV-dominated CNN [23] (NIPS'17)   Accuracy (%)   CR for CONV layers   CR for overall network
CNN (baseline)                      90.7           1×                   1×
TT-CNN                              89.3           3.3×                 3.27×

Table 3: RNN on Youtube Celebrities Face Data.
RNN [77] (ICML'17)   Accuracy³ (%)   CR for FC layers   CR for overall network
LSTM (baseline)      33.2            1×                 1×
TT-LSTM              75.5            15283×             196×
GRU (baseline)       34.2            1×                 1×
TT-GRU               80.0            11683×             195×

³Notice that in this experiment TT decomposition significantly increases test accuracy. As indicated in [77], this is mainly because plain LSTM/GRU are not adequate to model very high-dimensional sequential data. Also notice that the TT-decomposed models achieve test accuracy similar to the state-of-the-art work [51], which reports 80.8% accuracy.

FC-dominated CNN [50]: Two FC layers (FC6 and FC7) are in the TT format, where $d$=6, $m_1\sim m_6$=4, and $r_1\sim r_5$=4. For FC6 and FC7, $n_1\sim n_6$=[2,7,8,8,7,4] and [4,4,4,4,4,4], respectively.

CONV-dominated CNN [23]: The 2nd $\sim$ 6th CONV layers are in the TT format, where $d$=4; $m$=[3,4,4,4] and [3,4,8,4] for the 2nd and the 3rd $\sim$ 6th layers, respectively; $n$=[3,4,4,4] and [3,4,8,4] for the 2nd $\sim$ 3rd and the 4th $\sim$ 6th layers, respectively; $r_1\sim r_3$=[22,20,20], [27,22,22], and [23,23,23] for the 2nd, the 3rd, and the 4th $\sim$ 6th layers, respectively.

LSTM or GRU-based RNN [77]: All the input-to-hidden layers are in the TT format, where $d$=4, $m$=[4,4,4,4], $n$=[4,20,20,36] and $r_2\sim r_4$=4.

From Table 1 - Table 3 we can see that TT decomposition enables significant reductions in the number of parameters of the decomposed layers and in the overall DNN model sizes. Meanwhile, it preserves high task accuracy on different datasets, making it very attractive for practical deployment of DNN models. However, its inherent drawback of low computational efficiency in the inference phase, which will be elaborated in Section 3.1, impedes its wide adoption in practical systems.

3 EFFICIENT TT-FORMAT INFERENCE: CHALLENGE & SOLUTION

3.1 Challenge of TT-format Inference
As described in Eqn. (2) of Section 2.2, the inference on the TT-format layers of DNN models can be performed via multi-dimensional summation of the products of slices of different tensor cores. This implementation, though classical and straightforward, incurs a severe challenge that leads to low computational efficiency.

In general, the low computational efficiency of the TT-format inference scheme comes from its inherent redundant computations. Recall that in Eqn. (2), calculating a specific element of the output tensor ($\mathcal{Y}(i_1,\dots,i_d)$) requires consecutive multiplications of $\mathcal{G}_k[i_k,j_k]$ over all $j_k$'s. Since each $\mathcal{Y}(i_1,\dots,i_d)$ always shares part of its indices $i_k$ with many other elements, calculating those index-sharing elements inherently repeats the same consecutive multiplications among the same $\mathcal{G}_k[i_k,j_k]$'s, thereby causing unnecessary computational redundancy. Fig. 4 illustrates the existence of such redundancy in the calculation of a 3-dimensional tensor: the calculation procedures of $\mathcal{Y}(0,0,0)$ and $\mathcal{Y}(1,0,0)$ have two identical matrix-vector multiplication stages out of all three stages. Notice that Fig. 4 only shows the computational redundancy for these two specific output tensor elements; in general, such replicated multiplications in Eqn. (2) exist for any pair of output tensor elements that share part of their indices.

To further quantify the computational redundancy, we perform an analytic evaluation of the total number of multiplications consumed in Eqn. (2) and of the minimum required number of multiplications for calculating all $\mathcal{Y}(i_1,\dots,i_d)$'s, respectively.


Figure 4: Example of redundant computations in the conventional TT-format inference scheme.

For simplicity, only multiplications are counted toward the computational cost.

Analysis of the total number of multiplications in Eqn. (2): First we examine the total number of multiplications consumed in Eqn. (2). As indicated in Fig. 4, computing one $\mathcal{Y}(i_1,\dots,i_d)$ needs $d$ stages of matrix-vector multiplication between a length-$r_i$ vector and an $r_i$-by-$r_{i-1}$ matrix. Therefore, the total number of multiplications to calculate all $M$ $\mathcal{Y}(i_1,\dots,i_d)$'s is:

$$\mathrm{MUL}_{naive} = MN\sum_{i=1}^{d} r_i r_{i-1}. \qquad (3)$$

Analysis of the minimum number of multiplications for $\mathcal{Y}(i_1,\dots,i_d)$: Next we analyze the minimum number of required multiplications for calculating all $\mathcal{Y}(i_1,\dots,i_d)\in\mathcal{Y}$. The general procedure is to first determine the computational cost for $\mathcal{Y}(i_1,\dots,i_{d-1},:)$⁴ when $i_1\sim i_{d-1}$ are specific. Based on that, we then determine the number of non-redundant multiplications for calculating $\mathcal{Y}(i_1,\dots,i_{d-2},:,:)$ when $i_1\sim i_{d-2}$ are specific. Notice that here all the computations involving $\mathcal{G}_d[i_d,j_d]$ are not considered since they have been included before, thereby avoiding counting repeated computations. After that, we continue to perform the same analysis from the $(d-2)$-th dimension down to the 1st dimension of $\mathcal{Y}$, and finally we obtain the minimum required number of multiplications for calculating all $\mathcal{Y}(i_1,\dots,i_d)$'s as $\mathcal{Y}(:,\dots,:,:)$. In general, such a recursive counting method ensures that all the multiplications involved in the calculation of all $\mathcal{Y}(i_1,\dots,i_d)$'s are included while none of them is counted repeatedly.

Specifically, the detail of the above analysis procedure is as follows. First let us consider the computational cost for $\mathcal{Y}(i_1,\dots,i_{d-1},:)$ (referred to as stage-1). Recall that Eqn. (2) indicates that calculating one $\mathcal{Y}(i_1,\dots,i_{d-1},i_d)$ requires all $\mathcal{X}(j_1,\dots,j_{d-1},j_d)$'s and $d$ $\mathcal{G}_k[i_k,j_k]$'s, where $k=1,2,\dots,d$. Also, notice that as illustrated in Fig. 4, $\mathcal{X}(j_1,\dots,j_{d-1},j_d)$ only shares one common index $j_k$ with one slice of tensor core $\mathcal{G}_k[i_k,j_k]$. Based on these two observations, in order to facilitate the analysis on $\mathcal{Y}(i_1,\dots,i_{d-1},:)$ we further partition the counting procedure into $d$ steps, where in the $k$-th step we consider the involvement of $\mathcal{X}(j_1,\dots,j_{d-k},:,\dots,:)$. Accordingly, when considering the number of multiplications involving $\mathcal{X}(j_1,\dots,j_{d-1},:)$ and $\mathcal{G}_d[i_d,j_d]$ for calculating $\mathcal{Y}(i_1,\dots,i_{d-1},:)$, the computational cost is $r_{d-1}r_d n_d$. Notice that this cost inherently corresponds to paralleling the $d$-th dimension of the input $\mathcal{X}(j_1,\dots,j_{d-1},j_d)$ (as shown in stage-1 of Fig. 5). Interestingly, though such a paralleled input scheme does not reduce any computational redundancy involving $\mathcal{G}_d[i_d,j_d]$ (stage-1 in Fig. 5), it saves the computation involving $\mathcal{G}_{d-1}[i_{d-1},j_{d-1}]$ (stage-2 in Fig. 5). This is because now only one, instead of $n_d$, length-$r_{d-1}$ vector is multiplied with $\mathcal{G}_{d-1}[i_{d-1},j_{d-1}]$. Besides, since all the computations involving $\mathcal{G}_d[i_d,j_d]$ have been considered before, the additional computational cost for $\mathcal{Y}(i_1,\dots,i_{d-1},:)$ with a specific $j_{d-1}$ is $r_{d-1}r_d n_d + r_{d-2}r_{d-1}$. Therefore, the number of multiplications for calculating $\mathcal{Y}(i_1,\dots,i_{d-1},:)$ with $\mathcal{X}(j_1,\dots,j_{d-2},:,:)$ is $(r_{d-1}r_d n_d + r_{d-2}r_{d-1})n_{d-1}$. By recursively applying this analysis, we can then derive the number of multiplications for calculating $\mathcal{Y}(i_1,\dots,i_{d-1},:)$ with $\mathcal{X}(:,\dots,:,:)$ as:

$$\mathrm{MUL}_{\mathcal{Y}(i_1,\dots,i_{d-1},:)} = m_d\sum_{i=1}^{d}\Big(r_i r_{i-1}\prod_{t=1}^{i} n_t\Big). \qquad (4)$$

⁴':' used in the $i$-th dimension of a tensor denotes all elements in the $i$-th dimension of this tensor.

Next we consider the additional computational cost for calculating $\mathcal{Y}(i_1,\dots,i_{d-2},:,:)$ (referred to as stage-2). Similar to the previous analysis on recursive computation, when computing $\mathcal{Y}(i_1,\dots,i_{d-2},:,:)$, the computation involved in $\mathcal{Y}(i_1,\dots,i_{d-2},i_{d-1},:)$ has already been considered and should not be counted again. Therefore, the additional number of multiplications for calculating $\mathcal{Y}(i_1,\dots,i_{d-2},:,:)$ with $\mathcal{X}(:,\dots,:,:)$ is:

$$\mathrm{MUL}_{extra\ for\ \mathcal{Y}(i_1,\dots,i_{d-2},:,:)} = (m_{d-1}m_d - m_d)\sum_{i=1}^{d-1}\Big(r_i r_{i-1}\prod_{t=1}^{i} n_t\Big). \qquad (5)$$


Figure 5: Partially paralleling the inputs in Stage-3. Redundant computations are partially reduced.

By generalizing Eqn. (5), we can derive the additional number of multiplications for calculating $\mathcal{Y}(i_1,\dots,i_l,:,\dots,:)$ with $\mathcal{X}(:,\dots,:,:)$ in stage-$l$ as:

$$\mathrm{MUL}_{extra\ for\ \mathcal{Y}(i_1,\dots,i_l,:,\dots,:)} = \Big(\prod_{j=l}^{d} m_j - \prod_{j=l+1}^{d} m_j\Big)\sum_{i=1}^{l}\Big(r_i r_{i-1}\prod_{t=1}^{i} n_t\Big) = (m_l - 1)\prod_{j=l+1}^{d} m_j\sum_{i=1}^{l}\Big(r_i r_{i-1}\prod_{t=1}^{i} n_t\Big). \qquad (6)$$

Consequently, because $\mathcal{Y}$ has $d$ dimensions, the total minimum number of multiplications for calculating $\mathcal{Y}(:,\dots,:,:)$ over all $d$ stages is:

$$\mathrm{MUL}_{\mathcal{Y}(:,\dots,:,:)} = \mathrm{MUL}_{\mathcal{Y}(i_1,\dots,i_{d-1},:)} + \mathrm{MUL}_{extra\ for\ \mathcal{Y}(i_1,\dots,i_{d-2},:,:)} + \cdots + \mathrm{MUL}_{extra\ for\ \mathcal{Y}(i_1,:,\dots,:)} = \sum_{l=1}^{d}\Big((m_l - 1)\prod_{j=l+1}^{d} m_j\sum_{i=1}^{l}\big(r_i r_{i-1}\prod_{t=1}^{i} n_t\big)\Big). \qquad (7)$$

Eqn. (7) gives the analytical result for the minimum number of multiplications needed to perform TT-format inference. Comparing this theoretical limit with Eqn. (3), we find that the conventional TT-format scheme contains significant computational redundancy. For instance, for the FC6 layer in VGG-16 with $d$=6 and $r_i$=4, the number of multiplications consumed in Eqn. (3) is 1073 times that in Eqn. (7). Obviously, such huge redundancy in multiplications causes very low computational efficiency.
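The short sketch below (an illustration, not from the paper) evaluates Eqn. (3) and Eqn. (7) for arbitrary $(m, n, r)$ settings; the toy configuration at the bottom is made up, and the exact redundancy ratio for a real layer depends on the chosen reshaping and rank settings.

```python
import math

def mul_naive(m, n, r):
    """Eqn. (3): MUL_naive = M * N * sum_i r_i * r_{i-1}."""
    M, N = math.prod(m), math.prod(n)
    return M * N * sum(r[i] * r[i - 1] for i in range(1, len(r)))

def mul_min(m, n, r):
    """Eqn. (7): theoretical minimum number of multiplications."""
    d = len(m)
    total = 0
    for l in range(1, d + 1):
        tail = math.prod(m[l:])                      # prod_{j=l+1}^{d} m_j
        inner = sum(r[i] * r[i - 1] * math.prod(n[:i]) for i in range(1, l + 1))
        total += (m[l - 1] - 1) * tail * inner
    return total

# made-up toy setting: d = 3, m = (4, 4, 4), n = (3, 4, 5), ranks (1, 4, 4, 1)
m, n, r = [4, 4, 4], [3, 4, 5], [1, 4, 4, 1]
print(mul_naive(m, n, r), mul_min(m, n, r))
```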

3.2 Compact TT-format Inference Scheme
To address the challenge of low computational efficiency, we propose a novel computation-efficient TT-format inference scheme. With a carefully designed computation procedure, our approach, namely the compact inference scheme, calculates all the elements of the output tensor $\mathcal{Y}$ in parallel without any redundant computations, thereby significantly improving computational efficiency over the conventional TT-format inference scheme.

In general, the design of this compact inference scheme is inspired by the theoretical analysis of the minimum required number of computations in Section 3.1. Recall that in that analysis the minimum number of multiplications is counted under the assumption that the computations involving $\mathcal{G}_k[i_k,j_k]$ are not counted again in the subsequent computations involving $\mathcal{G}_{k-1}[i_{k-1},j_{k-1}]$. To achieve this, in our design we perform the computation on the different $\mathcal{G}_k$'s one by one. In other words, different from Eqn. (2), which calculates one output tensor element using $d$ $\mathcal{G}_k[i_k,j_k]$'s with $k=1,2,\dots,d$, our scheme performs the computation using all $m_k n_k$ $\mathcal{G}_k[i_k,j_k]$'s for only one specific $k$, and then never uses them again for other $k$'s. Consequently, such a computing arrangement breaks the original data dependency and eliminates all the potential computational redundancy.

Algorithm 1: Compact TT-format Inference Scheme.

Input: X, G̃_1, ..., G̃_d, m = [m_1, ..., m_d], n = [n_1, ..., n_d], r = [r_0, r_1, ..., r_d]
Output: Y

1:  X' = Reshape(X, [n_d, −1])
2:  V'_{d+1} = X'
3:  for h = d to 1 do
4:      V_h = MatMul(G̃_h, V'_{h+1})
5:      V'_h = Transform(V_h, h)

6:  Function Transform(V, h)
7:      V' = Transpose(V)
8:      V' = Reshape(V', [n_{h−1}, −1])
9:      // split
10:     t' = new [n_{h−1}, r_{h−1}]
11:     T' = new [∏_{k=1}^{h−2} n_k · ∏_{k=h}^{d} m_k] t'
12:     for j = 1 to ∏_{k=1}^{h−2} n_k · ∏_{k=h}^{d} m_k do
13:         T'[j] = V'[:, (j−1)·r_{h−1} : j·r_{h−1}]
14:         T'[j] = Reshape(T'[j], [n_{h−1}·r_{h−1}])
15:     // assemble
16:     V' = new [n_{h−1}·r_{h−1}, ∏_{k=1}^{h−2} n_k · ∏_{k=h}^{d} m_k]
17:     for j = 1 to ∏_{k=1}^{h−2} n_k · ∏_{k=h}^{d} m_k do
18:         V'[:, j] = T'[j]
19:     Return V'

Fig. 6 illustrates the key idea of the proposed compact inference scheme.


Figure 6: Example of the compact TT-format inference scheme. (a) Fully paralleling the input to eliminate redundant computations in Stage-1; Stage-2 and Stage-3 still contain redundant computations. (b) Completely compact computations in Stage-2 and Stage-3; Stage-1 (omitted here) is identical to the case in Fig. 6(a).

Recall that, as analyzed in Section 3.1 and shown in Fig. 5, partially paralleling multiple inputs $\mathcal{X}(j_1,\dots,j_{d-1},j_d)$ as $\mathcal{X}(j_1,\dots,j_{d-1},:)$ reduces the redundant computations involving $\mathcal{G}_{d-1}[i_{d-1},j_{d-1}]$. To be consistent with this, we further fully parallelize the computation over all the inputs $\mathcal{X}(j_1,\dots,j_{d-1},j_d)$ (as $\mathcal{X}(:,\dots,:)$) and all the slices of $\mathcal{G}_d$ (see Fig. 6(a)). As revealed in Fig. 6, such parallel computation corresponds to a compact matrix-format multiplication that replaces the original summation in Eqn. (2) over index $j_d$. Different from Fig. 5, the computations in stage-1 of Fig. 6(a) are now performed for every $\mathcal{X}(j_1,\dots,j_{d-1},j_d)$ and every slice of $\mathcal{G}_d$; therefore the input tensor $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times\dots\times n_d}$ needs to be properly transformed into a new matrix format to guarantee functional validity and to suit matrix multiplication. In general, such a transform converts a tensor $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times\dots\times n_d}$ into a matrix $\mathbf{X}'\in\mathbb{R}^{n_d\times\prod_{i=1}^{d-1}n_i}$, and the mapping principle for this transform is as follows:

$$\mathcal{X}(j_1,\dots,j_{d-1},j_d)\rightarrow\mathbf{X}'(p,q), \qquad (8)$$

where $p=j_d$ and $q=\sum_{l=1}^{d-1}j_l\prod_{i=1}^{l-1}n_i$. Then, a compact matrix-format multiplication can be performed as:

$$\mathbf{V}_d = \tilde{\mathbf{G}}_d\mathbf{X}', \qquad (9)$$

where $\mathbf{V}_d$ is the intermediate matrix to be sent to stage-2, and $\tilde{\mathbf{G}}_d$ is the matrix format of the unfolded $\mathcal{G}_d$ (see Fig. 6(a)).

It should be noted that Fig. 6(a) only performs compact matrix-format computation in stage-1, while in stage-2 and stage-3 $\mathcal{G}_2[i_2,j_2]$ and $\mathcal{G}_1[i_1,j_1]$ are still fetched and processed serially, thereby causing redundant computations. As analyzed in Section 3.1, avoiding the redundant computations in each computing stage demands the involvement of all the output values from the previous stage (e.g., $\mathbf{V}_3$) and the matrix format of the unfolded tensor core (e.g., $\tilde{\mathbf{G}}_2$). However, as illustrated in Fig. 6, $\tilde{\mathbf{G}}_2$ and $\mathbf{V}_3$, or $\tilde{\mathbf{G}}_1$ and $\mathbf{V}_2$, cannot simply be multiplied because 1) their sizes do not fit direct matrix multiplication; and 2) the elements of $\mathbf{V}_h$, as the intermediate values from stage-$(d-h+1)$, and $\tilde{\mathbf{G}}_{h-1}$, as the matrix format of the unfolded tensor core in stage-$(d-h+2)$, are not in the correct


positions to produce correct results even if they could be multiplied. To address this problem, we propose to transform $\mathbf{V}_h$ before it is multiplied with $\tilde{\mathbf{G}}_{h-1}$. In general, in stage-$(d-h+1)$ such a transform converts $\mathbf{V}_h\in\mathbb{R}^{(m_h r_{h-1})\times(\prod_{k=1}^{h-1}n_k\prod_{k=1}^{d-h}m_{d-k+1})}$ to $\mathbf{V}'_h\in\mathbb{R}^{(n_{h-1}r_{h-1})\times(\prod_{k=1}^{h-2}n_k\prod_{k=1}^{d-h+1}m_{d-k+1})}$, and the mapping principle for this transform is as follows:

$$\mathbf{V}_h(p,q)\rightarrow\mathbf{V}'_h(p',q'), \qquad (10)$$

where $p=i_h r_{h-1}+t_{h-1}$, $p'=j_{h-1}r_{h-1}+t_{h-1}$, $q=\big(\sum_{l=1}^{h-1}j_l\prod_{i=1}^{l-1}n_i\big)\prod_{k=1}^{d-h}m_{d-k+1}+\sum_{g=2}^{d-h}\big(\prod_{k=g}^{d-h}m_{d-k+1}\big)i_{d-g+2}$, and $q'=\big(\sum_{l=1}^{h-2}j_l\prod_{i=1}^{l-1}n_i\big)\prod_{k=1}^{d-h+1}m_{d-k+1}+\sum_{g=2}^{d-h+1}\big(\prod_{k=g}^{d-h+1}m_{d-k+1}\big)i_{d-g+2}$. After performing this transform, the compact matrix-format computation for stage-$(d-h+2)$ is:

$$\mathbf{V}_{h-1} = \tilde{\mathbf{G}}_{h-1}\mathbf{V}'_h, \qquad (11)$$

where $\mathbf{V}_{h-1}$ will then be sent to stage-$(d-h+3)$ and transformed again.

Overall, as illustrated in Fig. 6(b), the entire compact TT-format inference scheme contains $d$ computation stages. Each stage first transforms the output $\mathbf{V}_h$ of the previous stage and then performs a matrix multiplication between the transformed input $\mathbf{V}'_h$ and the corresponding $\tilde{\mathbf{G}}_{h-1}$. By using this scheme, all the elements of the final output tensor $\mathcal{Y}$ are obtained simultaneously at the output of the final stage without any redundant computations. Notice that for practical implementation, the transform described in Eqn. (10) can be equivalently achieved by performing 4-step matrix-wise operations (see Transform in Fig. 6(b)). Putting it all together, the proposed compact TT-format inference scheme is described in Algorithm 1.
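To show the compact scheme end to end, the sketch below is a functional illustration rather than a literal transcription of Algorithm 1: it performs the same $d$-stage processing, one unfolded core per stage, but realizes the Transform bookkeeping with einsum/reshape instead of the explicit transpose/split/assemble, so it is mathematically equivalent while not reproducing the paper's exact data layout. The result is checked against the naive Eqn. (2) computation.

```python
import itertools
import numpy as np

def tt_fc_inference_compact(cores, X):
    """Multi-stage TT inference: one core per stage, no repeated slice products.

    cores[k] has shape (r_{k-1}, m_k, n_k, r_k), r_0 = r_d = 1; X has shape n_1 x ... x n_d.
    """
    d = len(cores)
    ns = [c.shape[2] for c in cores]
    ms = [c.shape[1] for c in cores]
    v = X.reshape(ns + [1])                      # trailing rank axis (r_d = 1)
    for k in reversed(range(d)):                 # stage (d - k): contract core k
        r_prev, m_k, n_k, r_k = cores[k].shape
        left = int(np.prod(ns[:k])) if k > 0 else 1           # not-yet-consumed n-modes
        right = int(np.prod(ms[k + 1:])) if k + 1 < d else 1  # already-produced m-modes
        v = v.reshape(left, n_k, right, r_k)
        # contract over (n_k, r_k); output axes: (left, m_k, right, r_{k-1})
        v = np.einsum('abce,pmbe->amcp', v, cores[k])
    return v.reshape(ms)                         # Y(i_1, ..., i_d)

def tt_fc_inference_naive(cores, X):             # brute-force Eqn. (2), for checking only
    ms = [c.shape[1] for c in cores]
    ns = [c.shape[2] for c in cores]
    Y = np.zeros(ms)
    for i in itertools.product(*[range(m) for m in ms]):
        for j in itertools.product(*[range(n) for n in ns]):
            mat = np.eye(1)
            for k in range(len(cores)):
                mat = mat @ cores[k][:, i[k], j[k], :]
            Y[i] += mat[0, 0] * X[j]
    return Y

# made-up toy layer: m = (2, 2, 4), n = (3, 2, 2), ranks (1, 2, 3, 1)
rng = np.random.default_rng(3)
cores = [rng.standard_normal(s) for s in [(1, 2, 3, 2), (2, 2, 2, 3), (3, 4, 2, 1)]]
X = rng.standard_normal((3, 2, 2))
assert np.allclose(tt_fc_inference_compact(cores, X), tt_fc_inference_naive(cores, X))
```

Each stage touches one core exactly once, which is the property the hardware in Section 4 exploits to keep memory access per stage confined to a single tensor core.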

Notice that at each stage of the compact TT-format inference scheme, the intermediate values should be buffered on-chip for the processing of the next stage. The extra storage capacity needed to hold the intermediate values from stage-$(d-h+1)$ is $\max_{h=1,\dots,d}\; r_{h-1}\prod_{k=1}^{h-1}n_k\prod_{k=h}^{d}m_k$. Considering that both the input and output of each stage should be stored, the overall storage overhead is $2\times\max_{h=1,\dots,d}\; r_{h-1}\prod_{k=1}^{h-1}n_k\prod_{k=h}^{d}m_k$. However, since the activation size is typically much smaller than the weight size, the storage overhead brought by the compact TT-format inference scheme is acceptable.

4 HARDWARE ARCHITECTURE

Based on the proposed efficient TT-format inference scheme in Section 3.2, we develop TIE, the corresponding hardware architecture of the tensor train-based inference engine.

4.1 Data Mapping and Processing Scheme
Fig. 6 shows the overall computing flow of the proposed TIE. In general, it contains two types of operation: reshaping the inputs ($\mathbf{x}$ to $\mathbf{X}'$ and $\mathbf{V}_h$ to $\mathbf{V}'_h$), and multiplying $\mathbf{V}'_h$ with $\tilde{\mathbf{G}}_{h-1}$. Considering that reshaping $\mathbf{x}$ to $\mathbf{X}'$ can be prepared offline and reshaping $\mathbf{V}_h$ to $\mathbf{V}'_h$ can be performed by the carefully designed memory access scheme in Section 4.4, the datapath of TIE is mainly responsible for executing matrix multiplication. Fig. 7 illustrates the detailed data mapping and processing scheme of an example 2-PE datapath for the multiplication between a 3×2 $\tilde{\mathbf{G}}_{h-1}$ and a 2×4 $\mathbf{V}'_h$ matrix. Here, each PE is equipped with 3 multiply-accumulate (MAC) units. In each clock cycle, one column of $\tilde{\mathbf{G}}_{h-1}$ is broadcast to all the PEs, where each multiplier of a PE receives one element of the column. Meanwhile, two elements in the same row of $\mathbf{V}'_h$ are sent to the two PEs, respectively, and each of these elements is broadcast to all the multipliers of its corresponding PE. After finishing the computation in the current cycle, in the next cycle the PEs move on to process the next column of $\tilde{\mathbf{G}}_{h-1}$ and the next row of $\mathbf{V}'_h$. In general, with $N_{PE}$ PEs each equipped with $N_{MAC}$ MAC units, the proposed processing scheme produces an $N_{MAC}\times N_{PE}$-size sub-block of the result matrix $\mathbf{V}_{h-1}=\tilde{\mathbf{G}}_{h-1}\mathbf{V}'_h$ in $N_{Gcol}$ cycles, where $N_{Gcol}$ is the number of columns of $\tilde{\mathbf{G}}_{h-1}$. Notice that when the number of rows of $\tilde{\mathbf{G}}_{h-1}$ ($N_{Grow}$) is larger than $N_{MAC}$, or the number of columns of $\mathbf{V}'_h$ ($N_{Vcol}$) is larger than $N_{PE}$, it takes the PEs multiple passes of $N_{Gcol}$ cycles to calculate the entire $\mathbf{V}_{h-1}$ (see Fig. 7).
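The sketch below (an illustrative behavioral model, not the RTL) mimics this mapping cycle by cycle: in every cycle one column of $\tilde{\mathbf{G}}$ is broadcast across the MAC lanes and one row element of $\mathbf{V}'$ per PE is broadcast within its PE, so an $N_{MAC}\times N_{PE}$ output sub-block is accumulated over $N_{Gcol}$ cycles; the outer loops tile over larger matrices.

```python
import numpy as np

def pe_array_matmul(G, Vp, n_mac=3, n_pe=2):
    """Cycle-style model of the TIE datapath computing V_out = G @ Vp.

    Each pass over the N_Gcol columns of G produces one n_mac x n_pe
    sub-block of the result; larger matrices take multiple passes.
    """
    n_grow, n_gcol = G.shape
    _, n_vcol = Vp.shape
    V_out = np.zeros((n_grow, n_vcol))
    for rb in range(0, n_grow, n_mac):           # tile over result rows
        for cb in range(0, n_vcol, n_pe):        # tile over result columns
            acc = np.zeros((n_mac, n_pe))        # per-MAC accumulators
            for c in range(n_gcol):              # one column of G per clock cycle
                g_col = G[rb:rb + n_mac, c]      # broadcast to all PEs (one element per MAC lane)
                v_row = Vp[c, cb:cb + n_pe]      # one element per PE, broadcast to its MACs
                acc[:len(g_col), :len(v_row)] += np.outer(g_col, v_row)
            V_out[rb:rb + n_mac, cb:cb + n_pe] = acc[:len(g_col), :len(v_row)]
    return V_out

rng = np.random.default_rng(4)
G, Vp = rng.standard_normal((3, 2)), rng.standard_normal((2, 4))
assert np.allclose(pe_array_matmul(G, Vp), G @ Vp)
```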

4.2 Overall Architecture
Based on the data mapping and processing scheme described above, the overall architecture of TIE is shown in Fig. 8. For the inference task on one layer, the datapath of TIE performs $d$ stages of matrix multiplications between the matrix $\mathbf{V}'_h$ read from one working SRAM and $\tilde{\mathbf{G}}_{h-1}$ read from the weight SRAM. During the computation in each stage, as indicated in Section 4.1, sub-blocks of the result matrix $\mathbf{V}_{h-1}=\tilde{\mathbf{G}}_{h-1}\mathbf{V}'_h$ become available in the PEs; these sub-blocks, once available, are written to the other working SRAM. According to this scheme, two working SRAMs are used to avoid potential read/write conflicts. After the entire $\mathbf{V}_{h-1}$ has been written to one working SRAM, in order to continue the next stage of computation correctly, this SRAM then outputs $\mathbf{V}'_{h-1}$, the reshaped $\mathbf{V}_{h-1}$, to the datapath via a specifically designed memory read scheme (described in Section 4.4). Therefore, the two working SRAMs act as source and destination memories, respectively, and exchange their roles in every stage. Notice that during the last stage of computation, for $\mathbf{V}_1$, the calculated elements of $\mathbf{V}_1$ need to be sent to the activation units first and then written to the working SRAM.

4.3 Datapath & Weight SRAM
Datapath. As shown in Fig. 8, TIE consists of an array of $N_{PE}$ PEs that perform matrix multiplication. Specifically, each PE contains $N_{MAC}$ MAC units and $N_{MAC}$ activation units. Notice that, using the processing scheme in Section 4.1, the $N_{MAC}N_{PE}$ elements of the result matrix become available simultaneously in the registers of all PEs every $N_{Gcol}$ cycles, and they are then written to the working SRAM in parallel.

Weight SRAM. The weight SRAM of TIE stores the weight parameters of the tensor cores $\tilde{\mathbf{G}}_h$. Notice that though each layer of a TT-format DNN model is affiliated with $d$ tensor cores, the sequential access to the different $\tilde{\mathbf{G}}_h$'s in the different computation stages, as described in Section 3.2, enables the simple storing strategy of placing all $\tilde{\mathbf{G}}_h$'s in the same weight SRAM sequentially from $h=1$ to $d$. However, different from such sequential placement for consecutive $\tilde{\mathbf{G}}_h$'s, the data allocation within the same $\tilde{\mathbf{G}}_h$ may not always be sequential in the weight SRAM.


Figure 7: Example 2-PE processing scheme with 3 MAC units in each PE. Calculated elements in the result matrix are shown in red. (The accompanying flow diagram depicts the overall computing flow: $\mathbf{X}$ → Reshape (Eqn. (8)) → Multiply with $\tilde{\mathbf{G}}_d$ → Reshape (Eqn. (10)) → Multiply with $\tilde{\mathbf{G}}_{d-1}$ → ... → Multiply with $\tilde{\mathbf{G}}_1$ → $\mathcal{Y}$.)

Figure 8: Overall architecture of TIE (PE array, main controller, two working SRAMs with routing networks, and the tensor-core weight SRAM storing $\tilde{\mathbf{G}}_1\sim\tilde{\mathbf{G}}_d$).

For instance, as illustrated in Fig. 9, when the number of rows of $\tilde{\mathbf{G}}_h$ is larger than the number of PEs, in order to be consistent with the processing scheme described in Section 4.1, the elements in the same column of $\tilde{\mathbf{G}}_h$ need to be stored in different rows of the weight SRAM in an interleaved way (see the sketch after Fig. 9). In short, the entire data allocation of the weight SRAM is sequential at the inter-$\tilde{\mathbf{G}}_h$ level and interleaved at the intra-$\tilde{\mathbf{G}}_h$ level.

Figure 9: Data allocation in the weight SRAM (example with $N_{MAC}=4$ and $N_{PE}=4$).
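As an illustration of one layout consistent with this description and with the Fig. 9 example (this is our reading of the figure, not code from the paper), the sketch below chops each column of $\tilde{\mathbf{G}}_h$ into $N_{MAC}$-element chunks and places the chunks of all columns round-robin, so that in each clock cycle a single SRAM row supplies one column chunk to the $N_{MAC}$ MAC lanes while reads remain sequential.

```python
import numpy as np

def interleave_weight_sram(G, n_mac=4):
    """Illustrative interleaved weight-SRAM layout for one unfolded core G (rows x cols).

    SRAM row (i // n_mac) * n_cols + j holds elements G[i : i + n_mac, j],
    so each column is split across SRAM rows while consecutive reads stay sequential.
    """
    n_rows, n_cols = G.shape
    n_chunks = (n_rows + n_mac - 1) // n_mac
    sram = np.zeros((n_chunks * n_cols, n_mac))
    for j in range(n_cols):
        for i in range(n_rows):
            sram[(i // n_mac) * n_cols + j, i % n_mac] = G[i, j]
    return sram

# 12x3 example as in Fig. 9: entries 1..36 numbered column by column
G = np.arange(1, 37).reshape(3, 12).T
print(interleave_weight_sram(G)[:4])   # rows: [1..4], [13..16], [25..28], [5..8]
```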

Algorithm 2: Data Read Scheme for Working SRAM.

Input: Memory[N_g, N_r/N_g, M]. The N_r component SRAMs are divided into N_g groups; each group contains N_r/N_g component SRAMs, and each component SRAM contains M elements. N_PE denotes the number of PEs.
Output: Data

1:  // Number of total cycles for reading data
2:  N_c = ⌊N_g · N_r · M / N_PE⌋
3:  while (k < N_c) do
4:      for j = 1 to ⌊M / (N_PE/N_g)⌋ do
5:          for i_r = 1 to N_r/N_g do
6:              Data = Read(Memory(:, i_r, [j·⌊M/(N_PE/N_g)⌋ : (j+1)·⌊M/(N_PE/N_g)⌋]))
7:              Data = ReArrange(Data)
8:              k++

9:  Function ReArrange(Data)
10:     Data' = new [N_g × ⌊M/(N_PE/N_g)⌋]
11:     for j = 1 to ⌊M/(N_PE/N_g)⌋ do
12:         for i_g = 1 to N_g do
13:             Data'[j·N_g + i_g] = Data(i_g, j)
14:     Return Data'

4.4 Working SRAM
As indicated in Algorithm 1, a transform from $\mathbf{V}_h$ to $\mathbf{V}'_h$ is required in each stage of computation to guarantee the functional correctness of the proposed inference scheme. Conventionally, such a transform, including matrix reshape and transpose, demands extra memory resources to implement those matrix operations [9], thereby significantly degrading the hardware performance of the entire TIE design in both area efficiency and power efficiency.


To address this problem, we carefully design efficient read and write schemes for the working SRAMs to achieve zero-cost matrix transform. Our general methodology is to ensure that the datapath always reads the required elements of $\mathbf{V}'_h$ from the working SRAM that stores $\mathbf{V}_h$, thereby enabling an on-the-fly matrix transform for $\mathbf{V}_h$. To achieve this, the key idea is to partition each working SRAM into multiple groups with a well-designed data selection mechanism. Next we describe the proposed schemes in detail.

Writing scheme. Recall that in the proposed computing scheme (Section 4.1) each PE calculates $N_{MAC}$ elements in the same column of $\mathbf{V}_h$ every $N_{Gcol}$ cycles. To make the data allocation in the working SRAM consistent with the corresponding matrix format of $\mathbf{V}_h$, the different PEs assemble the calculated elements located in the same positions of their MAC units and write them to one row of the component SRAMs. In general, during the writing phase the calculated elements in the $i$-th MAC units of the different PEs form one row of data to be written to memory. Notice that, as mentioned before, each of the two working SRAMs is partitioned into multiple groups, where each group contains multiple component SRAMs. Based on this type of memory organization, multiple columns of $\mathbf{V}_h$ can be written to multiple component SRAMs simultaneously without access conflicts.

Reading scheme. As mentioned before, the matrix transform operation on $\mathbf{V}_h$ is performed during the reading phase. To achieve that, we design a partitioned group-based data selection mechanism, described in detail in Algorithm 2. The key idea of this mechanism is to utilize the indices of the SRAM groups, the component SRAMs and the elements to locate and read the targeted elements of $\mathbf{V}'_h$ in a mathematically equivalent manner.

Fig. 10 illustrates the working SRAM reading scheme based on the proposed data selection mechanism. It is seen that in each cycle the required elements of V'h can be located and read from the rows of different component SRAMs within a memory group. As shown in Fig. 10, after being assembled, these data form a row vector of the targeted V'h, and they are then distributed to their corresponding PEs for calculating Vh−1 = Gh−1 V'h.

Notice that besides the transform from Vh to V'h, inference over the entire DNN model also requires a transform from V1 of the current layer to X' of the next layer.5 Interestingly, our mathematical analysis shows that such an inter-layer transform is identical to the intra-layer transform described before. Therefore, when TIE is performing the computation between two consecutive layers, it still utilizes the proposed working SRAM read scheme.
5 Before being transformed, V1 needs to be processed by the activation units in the PEs first.

5 EVALUATION

5.1 Experimental Methodology
Simulation and CAD Tools. The high-level functional behavior of TIE was modeled by a bit-accurate, cycle-accurate simulator. Based on that, we developed the RTL model using Verilog and verified its functional validity. Then, the verified RTL model was synthesized using Synopsys Design Compiler with a CMOS 28nm library. Here the gate-level netlist was annotated with the toggle rate obtained from the switching activity extracted during simulation. After that, we used Synopsys IC Compiler to perform place and route and generate the layout (see Fig. 11). Then, Synopsys PrimeTime PX was used to estimate the power consumption. Notice that the area and power of the memory part were reported by CACTI.


Benchmarks. To evaluate the performance of TIE on different tasks, we choose several workloads from two models used in image classification and video classification tasks, respectively. Here, same-size layers with different TT-decomposition settings are viewed as different workloads. Table 4 lists the information of the four benchmark layers, including the size, TT-decomposition settings (d, n, m, and r), and compression ratio.

5.2 Hardware Performance
Design Configuration. Table 5 shows the configuration information of the TIE hardware. The entire design consists of 16 PEs with 16-bit quantization. Each PE is equipped with 16 MACs and 16 activation units, where each MAC contains one 16-bit multiplier and one 24-bit accumulator. Regarding the memory, a 16 KB weight SRAM is used to store up to 8192 16-bit weights on the chip. According to Section 2.3, such a budgeted capacity for the weight SRAM is sufficient for most TT-DNN models. The working SRAM contains two copies acting as a ping-pong buffer, where each copy has a capacity of 384 KB. Therefore the total capacity of the working SRAM is 384 × 2 = 768 KB.
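As a quick sanity check of the quoted memory budget, the sketch below reproduces the capacity arithmetic; the values come from Table 5, and the two-bytes-per-weight figure follows from the 16-bit quantization.

```python
# Capacity arithmetic behind Table 5 (16-bit quantization, i.e. 2 bytes per weight).
weight_sram_bytes = 8192 * 16 // 8        # 8192 16-bit weights
assert weight_sram_bytes == 16 * 1024     # 16 KB weight SRAM

working_sram_total_kb = 384 * 2           # two ping-pong copies of 384 KB each
assert working_sram_total_kb == 768       # 768 KB working SRAM in total
```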

Hardware Resources and Performance. Fig. 11 shows the overall hardware resources and performance of the TIE design. Operating at 1000 MHz, the 16-PE TIE occupies 1.74 mm2 of silicon area and consumes 154.8 mW of power. Notice that all the memory used in TIE is on-chip SRAM due to the high compression ratio brought by TT decomposition. The area and power breakdown is shown in Table 6.

5.3 Comparison with EIE, CIRCNN, and Eyeriss
In this subsection, we compare TIE with two state-of-the-art compressed DNN-targeted accelerators: EIE [26] and CIRCNN [18]. Different from TIE, model compression in EIE and CIRCNN comes from other sources: for EIE, model compression is achieved via network sparsification; for CIRCNN, model compression comes from a structured topology. Moreover, to evaluate the performance of TIE on CONV layers, we also compare TIE with a representative CONV-oriented work: Eyeriss [12].

Comparison with EIE. Table 7 summarizes the design parameters and hardware performance of EIE and TIE. Due to the different technology nodes adopted in the two works, the clock frequency, silicon area, and power consumption of EIE are also projected to the same 28nm technology for fair comparison. Such projection is based on the scaling rule used in [26]: linear, quadratic, and constant scaling for frequency, area, and power, respectively.
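A minimal sketch of this projection is given below, assuming only the scaling rule stated above and the reported EIE numbers from Table 7; the helper function name is ours.

```python
# Technology projection used for Table 7 (scaling rule from [26]): frequency
# scales linearly with the node ratio, area quadratically, power stays constant.
def project(freq_mhz, area_mm2, power_mw, node_from=45.0, node_to=28.0):
    s = node_from / node_to
    return freq_mhz * s, area_mm2 / (s * s), power_mw

# EIE reported at 45nm: 800 MHz, 40.8 mm^2, 590 mW (Table 7)
freq, area, power = project(800, 40.8, 590)
print(f"{freq:.0f} MHz, {area:.1f} mm^2, {power:.0f} mW")   # ~1286 MHz, ~15.8 mm^2, 590 mW
```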

Fig. 12 compares the hardware performance of EIE and TIE on two benchmarks (VGG-FC6 and VGG-FC7) in terms of throughput, area efficiency, and energy efficiency. We see that TIE achieves a throughput comparable to EIE. More importantly, thanks to the high compression ratio brought by TT decomposition, TIE achieves 7.22× ∼ 10.66× better area efficiency and 3.03× ∼ 4.48× better energy efficiency on the different workloads, respectively.

Comparison with CIRCNN. Table 8 compares the hardware performance of CIRCNN and TIE. Notice that here the listed performance metrics of the two designs are obtained from their synthesis reports for fair comparison, since CIRCNN reports synthesis results.


Figure 10: Perform on-the-fly transform using well-designed working SRAM read access scheme. (The figure shows the transform (Eqn. (10)) from V3 to V3': the elements of V3 are mapped across 16 component SRAMs, RAM11 to RAM44, organized into Groups 1 to 4, and the group-based read access (Scheme-2) assembles, over Cycles 1 to 8, exactly the rows of V3' that the mathematical demand specifies before distributing them to the PEs.)

Table 4: Information of evaluated benchmarks.

Layer           Size            d    n                m               r                  Compression Ratio   Tasks
VGG-FC6         (4096, 25088)   6    [2,7,8,8,7,4]    [4,4,4,4,4,4]   [1,4,4,4,4,4,1]    50972×              CNN model for image classification
VGG-FC7         (4096, 4096)    6    [4,4,4,4,4,4]    [4,4,4,4,4,4]   [1,4,4,4,4,4,1]    14564×              CNN model for image classification
LSTM-UCF1       (57600, 256)    4    [8,20,20,18]     [4,4,4,4]       [1,4,4,4,1]        4954×               RNN model for video classification
LSTM-Youtube2   (57600, 256)    4    [4,20,20,36]     [4,4,4,4]       [1,4,4,4,1]        4608×               RNN model for video classification
1 Input-to-hidden layer of the LSTM proposed in [77] on the UCF11 dataset.
2 Input-to-hidden layer of the LSTM proposed in [77] on the Youtube Celebrities Face Data dataset.
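The compression-ratio column can be reproduced from the listed settings, assuming the standard TT-layer parameter count sum_k(n_k · m_k · r_{k-1} · r_k) against the dense count prod(n_k) × prod(m_k); the sketch below works this out for VGG-FC6.

```python
from math import prod

# TT settings of VGG-FC6 from Table 4
n = [2, 7, 8, 8, 7, 4]
m = [4, 4, 4, 4, 4, 4]
r = [1, 4, 4, 4, 4, 4, 1]

dense_params = prod(n) * prod(m)                                      # 25088 * 4096
tt_params = sum(n[k] * m[k] * r[k] * r[k + 1] for k in range(len(n)))
print(round(dense_params / tt_params))                                # ~50972, as reported
```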

Figure 11: Layout and performance metrics of TIE.

Table 5: Design configuration information.

PE Parameter        Multiplier       Accumulator
Amount              16               16
Width               16-bit           24-bit

Memory Parameter    Weight SRAM      Working SRAM
Capacity            16KB             768KB

TIE Parameter       Amount of PEs    Quantization
Value               16               16-bit

Meanwhile, due to the lack of area information for CIRCNN, we compare the overall throughput (in terms of TOPS) and energy efficiency (in terms of TOPS/W) of the two designs. After projecting the performance of CIRCNN to the same 28nm technology for fair comparison, it is seen that TIE achieves 5.96× and 4.56× higher throughput and energy efficiency than CIRCNN, respectively.
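For reference, the sketch below reproduces these two ratios directly from the Table 8 figures; the small deviation in the throughput ratio comes from rounding in the projected CIRCNN value.

```python
# Ratios behind the CIRCNN comparison, using the projected values in Table 8.
circnn_tops, circnn_tops_per_w = 1.28, 16.0   # CIRCNN projected to 28nm
tie_tops, tie_tops_per_w = 7.64, 72.90        # TIE

print(round(tie_tops / circnn_tops, 2))              # ~5.97 (reported as 5.96x)
print(round(tie_tops_per_w / circnn_tops_per_w, 2))  # 4.56x
```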

Comparison with Eyeriss. Table 9 summarizes the design parameters and hardware performance of Eyeriss and TIE on the CONV layers of VGG. For fair comparison, the clock frequency, silicon area, and power consumption of Eyeriss are also projected to the same 28nm technology.

Table 6: Power and area breakdowns.

Component        Power (mW)       Area (mm2)
Memory           60.8 (39.28%)    1.29 (73.93%)
Register         10.9 (7.04%)     0.019 (1.11%)
Combinational    54 (34.88%)      0.082 (4.70%)
Clock Network    29.1 (18.80%)    0.0035 (0.02%)
Other            --               0.35 (20.06%)
Total            154.8            1.744

Table 7: Comparisons of EIE and TIE.

Design             EIE                                                  TIE
CMOS Tech.         45nm (reported)      28nm (projected)                28nm
Frequency (MHz)    800                  1285                            1000
Memory             SRAM                                                 SRAM
Quantization       4-bit for weight index, 16-bit for shared weight     16-bit
Area (mm2)         40.8                 15.7                            1.74
Power (mW)         590                  590                             154.8

Table 8: Comparisons of CIRCNN and TIE.

Design                        CIRCNN                                  TIE
CMOS Tech.                    45nm (reported)     28nm (projected)    28nm
Freq (MHz)                    200                 320                 1000
Quantization                  16-bit                                  16-bit
Area (mm2)                    N/A                 N/A                 1.40
Power (mW)                    80                  80                  104.8
Throughput (TOPS)             0.8                 1.28                7.64 (5.96×)
Energy Efficiency (TOPS/W)    10.0                16.0                72.90 (4.56×)



Figure 12: Performance comparison between EIE and TIE on different benchmarks.

5.4 Flexibility
TIE is designed to provide sufficient flexibility to support the needs of different TT models having different layer sizes and decomposition settings. As illustrated in Fig. 13, different workloads with different d, m, n, and r can be executed efficiently on the same TIE accelerator hardware. In addition, we also investigate the throughput under different r's for the same workload (see Fig. 13). Here the change of r is specifically studied since it is an important parameter that provides flexible control over the compression and acceleration effect of TT decomposition. From the figure we can see that TIE exhibits great flexibility with respect to this important TT decomposition parameter.

Table 9: Comparisons of Eyeriss and TIE on VGG CONV layers.6

Design                           Eyeriss                                 TIE
CMOS Tech.                       65nm (reported)     28nm (projected)    28nm
Freq (MHz)                       200                 464                 1000
Quantization                     16-bit                                  16-bit
Area (mm2)                       12.25               2.27                1.74
Power (mW)                       236                 236                 170
Throughput (Frame/s)             0.8                 1.86                6.72 (3.61×)
Area Efficiency (Frame/s/mm2)    0.065               0.82                3.86 (4.71×)
Energy Efficiency (Frame/s/W)    3.39                7.89                39.5 (5.01×)
6 We used the core area and processing latency of Eyeriss instead of the chip area and total latency for fair comparison with TIE.

6 RELATED WORKS
Designing specialized DNN accelerators has become an active research topic in the architecture community. In the last four years, a series of DNN processors and corresponding instruction sets have been proposed, including the DianNao family [10, 11, 19, 42, 80] and Google's TPU [34]. Also, in [4, 37, 43], efficient and flexible dataflows are investigated and proposed for DNN accelerators to reduce memory access and hence improve energy efficiency. On the other hand, some works [13, 22, 36, 39, 58, 71] focus on improving the efficiency of the memory system in DNN accelerators.

The performance of DNN hardware can also be improved by designing accelerators [3, 6, 12, 16, 31, 38, 55, 78] that utilize the sparsity of the network model. Besides, some other computation- and memory-reducing approaches are investigated in [1, 53, 60, 68] to facilitate efficient hardware design.

Figure 13: Flexibility of TIE on different decomposition ranks.

Bit-serial computation-based accelerators [2, 20, 30, 35, 37, 57, 63, 76] are another type of energy-efficient DNN hardware that provides scalable performance with different precisions.

Besides the aforementioned ASIC designs, FPGA-based implementations are widely investigated in [40, 48, 49, 54, 62, 64, 65, 82]. Also, analog or mixed-signal designs, proposed in [7, 8, 21, 41, 61, 66, 70], are another potential solution for energy-efficient DNN accelerators. In particular, [32] proposes an RRAM-based TT-format DNN accelerator, which is the work most relevant to TIE. Recently, considering that the training phase of DNNs is very time-consuming, [14, 45, 59, 67, 69, 72] investigate efficient hardware solutions for DNN training.

7 CONCLUSION
This paper develops a computation-efficient inference scheme for TT-format DNNs, and accordingly develops TIE, a TT-format compressed DNN-targeted inference engine. The highly flexible architecture completely eliminates redundant computations and memory accesses. A 16-processing-element (PE) prototype is implemented using CMOS 28nm technology. Operating at 1000 MHz, the TIE accelerator occupies 1.74 mm2 and consumes 154.8 mW. Compared with EIE, TIE achieves 7.22× ∼ 10.66× better area efficiency and 3.03× ∼ 4.48× better energy efficiency on different workloads, respectively. Compared with CIRCNN, TIE achieves 5.96× and 4.56× higher throughput and energy efficiency, respectively. The results show that TIE exhibits significant advantages over state-of-the-art solutions.


REFERENCES
[1] V Aklaghi, Amir Yazdanbakhsh, Kambiz Samadi, H Esmaeilzadeh, and RK Gupta. 2018. Snapea: Predictive early activation for reducing computation in deep convolutional neural networks. ISCA.

[2] Jorge Albericio, Alberto Delmás, Patrick Judd, Sayeh Sharify, Gerard O'Leary, Roman Genov, and Andreas Moshovos. 2017. Bit-pragmatic deep neural network computing. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 382–394.
[3] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Cnvlutin: Ineffectual-neuron-free deep neural network computing. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 1–13.
[4] Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 22.
[5] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. 2016. Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning. 173–182.
[6] P. Angshuman, R. Minsoo, M. Anurag, P. Antonio, V. Rangharajan, K. Brucek, E. Joel, S. W. Keckler, and W. J. Dally. 2017. Scnn: An accelerator for compressed-sparse convolutional neural networks. In ACM SIGARCH Computer Architecture News, Vol. 45. ACM, 27–40.
[7] Daniel Bankman, Lita Yang, Bert Moons, Marian Verhelst, and Boris Murmann. 2018. An always-on 3.8 µJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28nm CMOS. In Solid-State Circuits Conference (ISSCC), 2018 IEEE International. IEEE, 222–224.
[8] Mahdi Nazm Bojnordi and Engin Ipek. 2016. Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning. In High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on. IEEE, 1–13.
[9] Ren Chen and Viktor K Prasanna. 2013. Energy-efficient architecture for stride permutation on streaming data. In Reconfigurable Computing and FPGAs (ReConFig), 2013 International Conference on. IEEE, 1–7.
[10] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM Sigplan Notices 49, 4 (2014), 269–284.
[11] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 609–622. https://doi.org/10.1109/MICRO.2014.58
[12] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 367–379.
[13] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. Prime: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 27–39.
[14] Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. 2017. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In ACM SIGARCH Computer Architecture News, Vol. 45. ACM, 561–574.
[15] C. Deng, S. Liao, and B. Yuan. 2019. Reduced-complexity Deep Neural Network-aided Channel Code Decoder: A Case Study for BCH Decoder. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1468–1472. https://doi.org/10.1109/ICASSP.2019.8682871
[16] C. Deng, S. Liao, Y. Xie, K. K. Parhi, X. Qian, and B. Yuan. 2018. PermDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 189–202. https://doi.org/10.1109/MICRO.2018.00024
[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
[18] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, Xiaolong Ma, Yipeng Zhang, Jian Tang, Qinru Qiu, Xue Lin, and Bo Yuan. 2017. CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-circulant Weight Matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50 '17). ACM, New York, NY, USA, 395–408. https://doi.org/10.1145/3123939.3124552
[19] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting vision processing closer to the sensor. In ACM SIGARCH Computer Architecture News, Vol. 43. ACM, 92–104.
[20] Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, and Reetuparna Das. 2018. Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks. arXiv preprint arXiv:1805.03718 (2018).

[21] Ben Feinberg, Shibo Wang, and Engin Ipek. 2018. Making memristive neural network accelerators reliable. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 52–65.
[22] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. 2017. Tetris: Scalable and efficient neural network acceleration with 3D memory. ACM SIGOPS Operating Systems Review 51, 2 (2017), 751–764.
[23] T. Garipov, D. Podoprikhin, A. Novikov, and D. Vetrov. 2016. Ultimate tensorization: compressing convolutional and FC layers alike. arXiv preprint arXiv:1611.03214 (2016).
[24] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. 2014. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115 (2014).
[25] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. 2016. Deep learning. Vol. 1. MIT Press, Cambridge.
[26] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE, 243–254.
[27] S. Han, H. Mao, and W. J. Dally. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. International Conference on Learning Representations (2016).
[28] Babak Hassibi and David G Stork. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems. 164–171.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[30] Kartik Hegde, Jiyong Yu, Rohit Agrawal, Mengjia Yan, Michael Pellauer, and Christopher W Fletcher. 2018. UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition. arXiv preprint arXiv:1804.06508 (2018).
[31] Parker Hill, Animesh Jain, Mason Hill, Babak Zamirai, Chang-Hong Hsu, Michael A Laurenzano, Scott Mahlke, Lingjia Tang, and Jason Mars. 2017. Deftnn: Addressing bottlenecks for DNN execution on GPUs via synapse vector elimination and near-compute data fission. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 786–799.
[32] Hantao Huang, Leibin Ni, Kanwen Wang, Yuangang Wang, and Hao Yu. 2017. A highly parallel and energy efficient three-dimensional multilayer CMOS-RRAM accelerator for tensorized neural network. IEEE Transactions on Nanotechnology 17, 4 (2017), 645–656.
[33] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).
[34] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 1–12.
[35] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M Aamodt, and Andreas Moshovos. 2016. Stripes: Bit-serial deep neural network computing. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1–12.
[36] Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. 2016. Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE, 380–392.
[37] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 461–475.
[38] Ching-En Lee, Yakun Sophia Shao, Jie-Fang Zhang, Angshuman Parashar, Joel Emer, Stephen W Keckler, and Zhengya Zhang. [n. d.]. Stitch-X: An Accelerator Architecture for Exploiting Unstructured Sparsity in Deep Neural Networks. ([n. d.]).
[39] Shuangchen Li, Dimin Niu, Krishna T Malladi, Hongzhong Zheng, Bob Brennan, and Yuan Xie. 2017. Drisa: A DRAM-based reconfigurable in-situ accelerator. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 288–301.
[40] S. Liao, A. Samiee, C. Deng, Y. Bai, and B. Yuan. 2019. Compressing Deep Neural Networks Using Toeplitz Matrix: Algorithm Design and FPGA Implementation. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1443–1447. https://doi.org/10.1109/ICASSP.2019.8683556


[41] Robert LiKamWa, Yunhui Hou, Julian Gao, Mia Polansky, and Lin Zhong. 2016. RedEye: Analog ConvNet image sensor architecture for continuous mobile vision. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 255–266.
[42] Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen. 2016. Cambricon: An instruction set architecture for neural networks. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 393–405.
[43] Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. 2017. Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks. In High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on. IEEE, 553–564.
[44] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
[45] Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. 2016. Tabla: A unified template-based framework for accelerating statistical machine learning. In High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on. IEEE, 14–26.
[46] J. Max, V. Andrea, and Z. Andrew. 2014. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 (2014).
[47] R. Mohammad, O. Vicente, R. Joseph, and F. Ali. 2016. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision. Springer, 525–542.
[48] Duncan JM Moss, Srivatsan Krishnan, Eriko Nurvitadhi, Piotr Ratuszniak, Chris Johnson, Jaewoong Sim, Asit Mishra, Debbie Marr, Suchit Subhaschandra, and Philip HW Leong. 2018. A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform: A Deep Learning Case Study. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 107–116.
[49] Duncan JM Moss, Eriko Nurvitadhi, Jaewoong Sim, Asit Mishra, Debbie Marr, Suchit Subhaschandra, and Philip HW Leong. 2017. High performance binary neural networks on the Xeon+FPGA platform. In Field Programmable Logic and Applications (FPL), 2017 27th International Conference on. IEEE, 1–4.
[50] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov. 2015. Tensorizing neural networks. In Advances in Neural Information Processing Systems. 442–450.
[51] Enrique G Ortiz, Alan Wright, and Mubarak Shah. 2013. Face recognition in movie trailers via mean sequence sparse representation-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3531–3538.
[52] Ivan V Oseledets. 2011. Tensor-train decomposition. SIAM Journal on Scientific Computing 33, 5 (2011), 2295–2317.
[53] Eunhyeok Park, Dongyoung Kim, and Sungjoo Yoo. 2018. Energy-efficient neural network accelerator based on outlier-aware low-precision computation. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 688–698.
[54] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, et al. 2016. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 26–35.
[55] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. 2016. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 267–278.
[56] Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[57] Ao Ren, Zhe Li, Caiwen Ding, Qinru Qiu, Yanzhi Wang, Ji Li, Xuehai Qian, and Bo Yuan. 2017. Sc-dcnn: Highly-scalable deep convolutional neural network using stochastic computing. ACM SIGOPS Operating Systems Review 51, 2 (2017), 405–418.
[58] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 18.
[59] Minsoo Rhu, Mike O'Connor, Niladrish Chatterjee, Jeff Pool, Youngeun Kwon, and Stephen W Keckler. 2018. Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on. IEEE, 78–91.
[60] Marc Riera, Jose-Maria Arnau, and Antonio González. 2018. Computation reuse in DNNs by exploiting input similarity. In Proceedings of the 45th Annual International Symposium on Computer Architecture. IEEE Press, 57–68.

[61] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News 44, 3 (2016), 14–26.
[62] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 17.
[63] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and Hadi Esmaeilzadeh. 2018. Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 764–775.
[64] Yongming Shen, Michael Ferdman, and Peter Milder. 2016. Overcoming resource underutilization in spatial CNN accelerators. In Field Programmable Logic and Applications (FPL), 2016 26th International Conference on. IEEE, 1–4.
[65] Yongming Shen, Michael Ferdman, and Peter Milder. 2017. Maximizing CNN accelerator efficiency through resource partitioning. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 535–547.
[66] Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. 2017. Pipelayer: A pipelined ReRAM-based accelerator for deep learning. In High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on. IEEE, 541–552.
[67] Mingcong Song, Jiaqi Zhang, Huixiang Chen, and Tao Li. 2018. Towards Efficient Microarchitectural Design for Accelerating Unsupervised GAN-Based Deep Learning. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on. IEEE, 66–77.
[68] Mingcong Song, Jiechen Zhao, Yang Hu, Jiaqi Zhang, and Tao Li. 2018. Prediction Based Execution on Deep Neural Networks. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 752–763.
[69] Mingcong Song, Kan Zhong, Jiaqi Zhang, Yang Hu, Duo Liu, Weigong Zhang, Jing Wang, and Tao Li. 2018. In-Situ AI: Towards Autonomous and Incremental Deep Learning for IoT Systems. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 92–103.
[70] Prakalp Srivastava, Mingu Kang, Sujan K Gonugondla, Sungmin Lim, Jungwook Choi, Vikram Adve, Nam Sung Kim, and Naresh Shanbhag. 2018. PROMISE: An end-to-end design of a programmable mixed-signal accelerator for machine-learning algorithms. In Proceedings of the 45th Annual International Symposium on Computer Architecture. IEEE Press, 43–56.
[71] Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, and Shaojun Wei. 2018. RANA: Towards efficient neural acceleration with refresh-optimized embedded DRAM. In Proceedings of the 45th Annual International Symposium on Computer Architecture. IEEE Press, 340–352.
[72] Swagath Venkataramani, Ashish Ranjan, Subarno Banerjee, Dipankar Das, Sasikanth Avancha, Ashok Jagannathan, Ajaya Durg, Dheemanth Nagaraj, Bharat Kaul, Pradeep Dubey, and Anand Raghunathan. 2017. ScaleDeep: A Scalable Compute Architecture for Learning and Evaluating Deep Networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 13–26. https://doi.org/10.1145/3079856.3080244
[73] Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869 (2015).
[74] Wenqi Wang, Yifan Sun, Brian Eriksson, Wenlin Wang, and Vaneet Aggarwal. 2018. Wide Compression: Tensor Ring Nets. learning 14, 15 (2018), 13–31.
[75] Y. Wang, J. Lin, and Z. Wang. 2017. An Energy-Efficient Architecture for Binary Weight Convolutional Neural Networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2017).
[76] Y. Xie, C. Deng, S. Liao, and B. Yuan. 2018. Area-efficient K-Nearest Neighbor Design using Stochastic Computing. In 2018 52nd Asilomar Conference on Signals, Systems, and Computers. 782–786. https://doi.org/10.1109/ACSSC.2018.8645529
[77] Yinchong Yang, Denis Krompass, and Volker Tresp. 2017. Tensor-train recurrent neural networks for video classification. arXiv preprint arXiv:1707.01786 (2017).
[78] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. In ACM SIGARCH Computer Architecture News, Vol. 45. ACM, 548–560.
[79] Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146 (2016).
[80] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-x: An accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 20.
[81] Qibin Zhao, Guoxu Zhou, Shengli Xie, Liqing Zhang, and Andrzej Cichocki. 2016. Tensor ring decomposition. arXiv preprint arXiv:1606.05535 (2016).
[82] Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-Hau Lin, Mani Srivastava, Rajesh Gupta, and Zhiru Zhang. 2017. Accelerating binarized convolutional neural networks with software-programmable FPGAs. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 15–24.