Deep Compression and EIE: Efficient Inference Engine on Compressed Deep Neural Network
Song Han*, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, Bill Dally
Stanford University


Page 1

Deep Compression and EIE: Efficient Inference Engine on Compressed Deep Neural Network

Song Han*, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, Bill Dally

Stanford University

Page 2

Network      Original Size   Compressed Size   Compression Ratio   Original Accuracy   Compressed Accuracy
AlexNet      240MB           6.9MB             35x                 80.27%              80.30%
VGGNet       550MB           11.3MB            49x                 88.68%              89.09%
GoogleNet    28MB            2.8MB             10x                 88.90%              88.92%
SqueezeNet   4.8MB           0.47MB            10x                 80.32%              80.35%

Our Prior Work: Deep Compression
• Memory reference is expensive.
• Small DNN models are critical.

[Slide figures: pruning and weight sharing]

Published as a conference paper at ICLR 2016

Figure 2: Representing the matrix sparsity with relative index. Padding with a filler zero to prevent overflow.

Weight matrix (32-bit float):
 2.09  -0.98   1.48   0.09
 0.05  -0.14  -1.08   2.12
-0.91   1.92   0.00  -1.03
 1.87   0.00   1.53   1.49

Cluster index (2-bit uint):
3 0 2 1
1 1 0 3
0 3 1 0
3 1 2 2

Centroids: 3: 2.00, 2: 1.50, 1: 0.00, 0: -1.00

Figure 3: Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom).

We store the sparse structure that results from pruning using compressed sparse row (CSR) or compressed sparse column (CSC) format, which requires 2a+n+1 numbers, where a is the number of non-zero elements and n is the number of rows or columns.

To compress further, we store the index difference instead of the absolute position, and encode this difference in 8 bits for conv layers and 5 bits for fc layers. When we need an index difference larger than the bound, we use the zero-padding solution shown in Figure 2: when the difference exceeds 8, the largest 3-bit (as an example) unsigned number, we add a filler zero.
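As a concrete sketch of this relative-index encoding (the function name and the 3-bit bound of 8 come from the Figure 2 example; everything else is illustrative, not the paper's code):

```python
def encode_relative(nonzeros, max_jump=8):
    """Encode sorted (absolute_position, value) pairs as (relative_index, value).
    Relative indices of 1..max_jump fit in 3 bits under the Figure 2 convention;
    gaps larger than max_jump are bridged with filler zeros.
    Illustrative sketch, not the authors' implementation."""
    encoded = []
    prev = -1  # position "before" the first entry
    for pos, val in nonzeros:
        gap = pos - prev
        while gap > max_jump:                 # difference exceeds the bound:
            encoded.append((max_jump, 0.0))   # insert a filler zero
            gap -= max_jump
        encoded.append((gap, val))
        prev = pos
    return encoded

# A non-zero at position 0 and the next at position 15: the gap of 15
# is split into a filler zero (jump 8) followed by the real entry (jump 7).
print(encode_relative([(0, 1.7), (15, 0.5)]))
# [(1, 1.7), (8, 0.0), (7, 0.5)]
```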

3 TRAINED QUANTIZATION AND WEIGHT SHARING

Network quantization and weight sharing further compresses the pruned network by reducing the number of bits required to represent each weight. We limit the number of effective weights we need to store by having multiple connections share the same weight, and then fine-tune those shared weights.

Weight sharing is illustrated in Figure 3. Suppose we have a layer with 4 input neurons and 4 output neurons, so the weight matrix is 4×4. On the top left is the 4×4 weight matrix, and on the bottom left is the 4×4 gradient matrix. The weights are quantized to 4 bins (denoted with 4 colors); all the weights in the same bin share the same value, so for each weight we only need to store a small index into a table of shared weights. During the update, all the gradients are grouped by color and summed together, multiplied by the learning rate, and subtracted from the shared centroids of the last iteration. For pruned AlexNet, we are able to quantize to 8 bits (256 shared weights) for each CONV layer and 5 bits (32 shared weights) for each FC layer without any loss of accuracy.
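A minimal numpy sketch of this weight-sharing step: k-means clusters the weights of one layer into a small codebook, and fine-tuning groups the gradients by cluster index to update the shared centroids. The function names, the plain k-means loop, and the bare SGD update are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def quantize_weights(W, k=4, iters=20):
    """Cluster the entries of weight matrix W into k shared values.
    Returns (codebook, indices) so that W is approximated by codebook[indices]."""
    w = W.ravel()
    # Linear initialization over the weight range (one of the schemes discussed in the paper)
    codebook = np.linspace(w.min(), w.max(), k)
    for _ in range(iters):
        indices = np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)
        for c in range(k):
            if np.any(indices == c):
                codebook[c] = w[indices == c].mean()
    return codebook, indices.reshape(W.shape)

def shared_weight_update(codebook, indices, grad, lr=0.01):
    """Fine-tune the shared centroids: group the gradients by cluster index,
    sum them, and apply one SGD step to each centroid."""
    for c in range(len(codebook)):
        codebook[c] -= lr * grad[indices == c].sum()
    return codebook

# Example with the 4x4 matrix from Figure 3
W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]])
codebook, idx = quantize_weights(W, k=4)
print(codebook)   # roughly [-1.0, 0.0, 1.5, 2.0]
print(idx)        # 2-bit indices; matches the cluster-index matrix in the figure above
```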

To calculate the compression rate, given k clusters, we only need log2(k) bits to encode the index. In general, for a network with n connections, each represented with b bits, constraining the connections to have only k shared weights will result in a compression rate of:

$r = \dfrac{nb}{n\log_2(k) + kb}$   (1)
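Plugging the Figure 3 example into this formula as a quick sanity check (n = 16 weights, b = 32 bits, k = 4 clusters):

```python
from math import log2

n, b, k = 16, 32, 4                  # connections, bits per weight, shared clusters
r = (n * b) / (n * log2(k) + k * b)  # 512 bits / (32 + 128) bits
print(r)                             # 3.2x compression for this toy layer
```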

For example, Figure 3 shows the weights of a single-layer neural network with four input units and four output units. There are 4×4 = 16 weights originally, but there are only 4 shared weights: similar weights are grouped together to share the same value. Originally we need to store 16 weights each


Figure 2: Three-Step Training Pipeline: Train Connectivity → Prune Connections → Train Weights.

Figure 3: Synapses and neurons before and after pruning (pruning synapses and pruning neurons).

3 Learning Connections in Addition to Weights

Our pruning method employs a three-step process, as illustrated in Figure 2, which begins by learning the connectivity via normal network training. Unlike conventional training, however, we are not learning the final values of the weights, but rather we are learning which connections are important.

The second step is to prune the low-weight connections. All connections with weights below a threshold are removed from the network, converting a dense network into a sparse network, as shown in Figure 3. The final step retrains the network to learn the final weights for the remaining sparse connections. This step is critical: if the pruned network is used without retraining, accuracy is significantly impacted.
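A minimal numpy sketch of this prune-then-retrain loop: weights below a threshold are masked to zero, and the mask is re-applied after every retraining step so pruned connections never return. The threshold choice and the bare SGD update below are illustrative assumptions, not the paper's training setup:

```python
import numpy as np

def prune_by_threshold(W, threshold):
    """Zero out all weights whose magnitude is below `threshold`.
    Returns (pruned_weights, mask); the mask marks surviving connections."""
    mask = np.abs(W) >= threshold
    return W * mask, mask

def masked_sgd_step(W, grad, mask, lr=0.01):
    """One retraining step on the surviving connections only: the mask is
    re-applied so pruned weights stay at zero and never come back."""
    return (W - lr * grad) * mask

# Example: prune ~90% of a random 256x256 layer
W = np.random.randn(256, 256).astype(np.float32)
t = np.quantile(np.abs(W), 0.90)        # keep only the largest 10% by magnitude
W_sparse, mask = prune_by_threshold(W, t)
print(1.0 - mask.mean())                # ~0.9 fraction of weights removed
```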

3.1 Regularization

Choosing the correct regularization impacts the performance of pruning and retraining. L1 regularization penalizes non-zero parameters, resulting in more parameters near zero. This gives better accuracy after pruning, but before retraining. However, the remaining connections are not as good as with L2 regularization, resulting in lower accuracy after retraining. Overall, L2 regularization gives the best pruning results. This is further discussed in the experiments section.

3.2 Dropout Ratio Adjustment

Dropout [23] is widely used to prevent over-fitting, and this also applies to retraining. During retraining, however, the dropout ratio must be adjusted to account for the change in model capacity. In dropout, each parameter is probabilistically dropped during training, but comes back during inference. In pruning, parameters are dropped forever after pruning and have no chance to come back during either training or inference. As the parameters get sparse, the classifier will select the most informative predictors and thus have much less prediction variance, which reduces over-fitting. As pruning has already reduced model capacity, the retraining dropout ratio should be smaller.

Quantitatively, let $C_i$ be the number of connections in layer i, $C_{io}$ for the original network, $C_{ir}$ for the network after retraining, and $N_i$ be the number of neurons in layer i. Since dropout works on neurons, and $C_i$ varies quadratically with $N_i$ according to Equation 1, the dropout ratio after pruning the parameters should follow Equation 2, where $D_o$ represents the original dropout rate and $D_r$ represents the dropout rate during retraining.

$C_i = N_i N_{i-1}$   (1)

$D_r = D_o \sqrt{\dfrac{C_{ir}}{C_{io}}}$   (2)
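Equation 2 as a small helper (illustrative; the connection counts would come from the actual original and pruned layers):

```python
from math import sqrt

def retrain_dropout(d_o, c_ir, c_io):
    """Equation 2: scale the original dropout rate D_o by the square root of
    the fraction of connections that survive pruning (C_ir / C_io)."""
    return d_o * sqrt(c_ir / c_io)

# A layer pruned to 10% of its connections, originally trained with dropout 0.5
print(retrain_dropout(0.5, c_ir=410_000, c_io=4_100_000))  # ~0.16
```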

3.3 Local Pruning and Parameter Co-adaptation

During retraining, it is better to retain the weights from the initial training phase for the connections that survived pruning than it is to re-initialize the pruned layers. CNNs contain fragile co-adapted features [24]: gradient descent is able to find a good solution when the network is initially trained, but not after re-initializing some layers and retraining them. So when we retrain the pruned layers, we should keep the surviving parameters instead of re-initializing them.


[1] Han et al., NIPS 2015. [2] Han et al., ICLR 2016 (best paper award).

Page 3

EIE: First Accelerator for Sparse DNN

• Fully fits in SRAM: 120x less energy than DRAM
• Sparse Vector: 70% dynamic sparsity in the activations, 3x less computation
• Sparse Matrix: 90% static sparsity in the weights, 10x less computation, 5x less memory footprint
• Weight Sharing: 4-bit weights, 8x less memory footprint

• Deep Compression solves the model size problem.
• But it creates another problem: irregular computation pattern.
• CPU/GPU are only good at dense linear algebra.
• So we create EIE that supports: static-sparse M, dynamic-sparse V, indirect indexing, weight sharing (sketched below).

Savings are multiplicative: 5 × 3 × 8 × 120 = 14,400x theoretical energy improvement.
Dally, NIPS tutorial 2015; Han et al., ISCA 2016
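A functional Python sketch of the kernel EIE accelerates, under these assumptions: the weight matrix is stored column-wise keeping only non-zeros (static sparsity) as small codebook indices (weight sharing), and columns whose input activation is zero are skipped entirely (dynamic sparsity). This models the arithmetic only, not EIE's PEs, queues, or indexing hardware:

```python
import numpy as np

def sparse_shared_matvec(n_rows, col_ptr, row_idx, code_idx, codebook, x):
    """Compute y = W @ x with a compressed W.
    W is stored column-wise: the non-zeros of column j occupy positions
    col_ptr[j]:col_ptr[j+1] of row_idx (their rows) and code_idx (small
    indices into the shared-weight codebook, e.g. 16 entries for 4-bit
    weights). Columns whose activation x[j] is zero are skipped."""
    y = np.zeros(n_rows, dtype=np.float32)
    for j, xj in enumerate(x):
        if xj == 0.0:                       # skip zero activations from ReLU
            continue
        for p in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[p]] += codebook[code_idx[p]] * xj
    return y

# Tiny example: a 3x3 matrix with 3 non-zeros and a 2-entry codebook
codebook = np.array([0.5, -1.0], dtype=np.float32)
col_ptr  = [0, 1, 1, 3]          # column 1 has no non-zeros
row_idx  = [0, 1, 2]
code_idx = [0, 1, 0]             # decoded weights: 0.5, -1.0, 0.5
x        = np.array([2.0, 5.0, 0.0], dtype=np.float32)  # x[2] = 0, so column 2 is skipped
print(sparse_shared_matvec(3, col_ptr, row_idx, code_idx, codebook, x))
# [1. 0. 0.]
```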

Page 4


Technology         45 nm
# PEs              64
On-chip SRAM       8 MB
Max Model Size     84 Million
Static Sparsity    10x
Dynamic Sparsity   3x
Quantization       4-bit
ALU Width          16-bit
Area               40.8 mm^2
MxV Throughput     81,967 layers/s
Power              586 mW

Notes: 1. Post-layout result. 2. Throughput measured on AlexNet FC-7.

EIE: Efficient Inference Engine on Compressed Deep Neural Network

Song Han*, Xingyu Liu*, Huizi Mao*, Jing Pu*, Ardavan Pedram*, Mark A. Horowitz*, William J. Dally*†

*Stanford University, †NVIDIA
{songhan,xyl,huizi,jingpu,perdavan,horowitz,dally}@stanford.edu

Abstract—State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power.

Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; exploiting sparsity saves 10×; weight sharing gives 8×; skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88×10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.

Keywords-Deep Learning; Model Compression; Hardware Acceleration; Algorithm-Hardware co-Design; ASIC;

I. INTRODUCTION

Neural networks have become ubiquitous in applications including computer vision [1]–[3], speech recognition [4], and natural language processing [4]. In 1998, Lecun et al. classified handwritten digits with less than 1M parameters [5], while in 2012, Krizhevsky et al. won the ImageNet competition with 60M parameters [1]. Deepface classified human faces with 120M parameters [6]. Neural Talk [7] automatically converts images to natural language with 130M CNN parameters and 100M RNN parameters. Coates et al. scaled up a network to 10 billion parameters on HPC systems [8].

Large DNN models are very powerful but consume large amounts of energy because the model must be stored in external DRAM, and fetched every time for each image,

[Figure 1. Efficient inference engine that works on the compressed deep neural network model for machine learning applications. The figure shows the compressed DNN model (encoded weights and relative indices in sparse format), a weight look-up that expands 4-bit virtual weights into 16-bit real weights, an index accumulator that turns 4-bit relative indices into 16-bit absolute indices, and the ALU/memory datapath from input image to prediction result.]

word, or speech sample. For embedded mobile applications, these resource demands become prohibitive. Table I shows the energy cost of basic arithmetic and memory operations in a 45nm CMOS process [9]. It shows that the total energy is dominated by the required memory access if there is no data reuse. The energy cost per fetch ranges from 5pJ for 32b coefficients in on-chip SRAM to 640pJ for 32b coefficients in off-chip LPDDR2 DRAM. Large networks do not fit in on-chip storage and hence require the more costly DRAM accesses. Running a 1G connection neural network, for example, at 20Hz would require (20Hz)(1G)(640pJ) = 12.8W just for DRAM accesses, which is well beyond the power envelope of a typical mobile device.

Previous work has used specialized hardware to accelerate DNNs [10]–[12]. However, these efforts focus on accelerating dense, uncompressed models, limiting their utility to small models or to cases where the high energy cost of external DRAM access can be tolerated. Without model compression, it is only possible to fit very small neural networks, such as Lenet-5, in on-chip SRAM [12].

Efficient implementation of convolutional layers in CNN has been intensively studied, as its data reuse and manipulation is quite suitable for customized hardware [10]–[15]. However, it has been found that fully-connected (FC) layers, widely used in RNN and LSTMs, are bandwidth limited on large networks [14]. Unlike CONV layers, there is no parameter reuse in FC layers. Data batching has become an efficient solution when training networks on CPUs or GPUs; however, it is unsuitable for real-time applications with latency requirements.

Network compression via pruning and weight sharing [16] makes it possible to fit modern networks such as AlexNet (60M parameters, 240MB) and VGG-16 (130M parameters, 520MB) in on-chip SRAM. Processing these

arXiv:1602.01528v2 [cs.CV] 3 May 2016

EIE: First Accelerator for Sparse DNN

Dally. NIPS tutorial 2015; Han et al. ISCA 2016

Page 5

FC Layers: Speedup / Energy Efficiency

[Figure 6 bar chart: speedup relative to the CPU Dense baseline (1x) for CPU Compressed, GPU Dense, GPU Compressed, mGPU Dense, mGPU Compressed, and EIE on Alex-6, Alex-7, Alex-8, VGG-6, VGG-7, VGG-8, NT-We, NT-Wd, NT-LSTM, and their geometric mean; EIE's geometric-mean speedup is 189x.]

Figure 6. Speedups of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases.

[Figure 7 bar chart: energy efficiency relative to the CPU Dense baseline (1x) for the same platforms and benchmarks; EIE's geometric-mean energy efficiency is about 24,000x.]

Figure 7. Energy efficiency of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases.

corner. We placed and routed the PE using the Synopsys IC compiler (ICC). We used Cacti [25] to get SRAM area and energy numbers. We annotated the toggle rate from the RTL simulation to the gate-level netlist, which was dumped to switching activity interchange format (SAIF), and estimated the power using Prime-Time PX.

Comparison Baseline. We compare EIE with three different off-the-shelf computing units: CPU, GPU and mobile GPU.

1) CPU. We use the Intel Core i-7 5930k CPU, a Haswell-E class processor that has been used in the NVIDIA Digits Deep Learning Dev Box, as our CPU baseline. To run the benchmark on CPU, we used MKL CBLAS GEMV to implement the original dense model and MKL SPBLAS CSRMV for the compressed sparse model. CPU socket and DRAM power are as reported by the pcm-power utility provided by Intel.

2) GPU. We use the NVIDIA GeForce GTX Titan X GPU, a state-of-the-art GPU for deep learning, as our baseline, using the nvidia-smi utility to report the power. To run the benchmark, we used cuBLAS GEMV to implement the original dense layer. For the compressed sparse layer, we stored the sparse matrix in CSR format and used the cuSPARSE CSRMV kernel, which is optimized for sparse matrix-vector multiplication on GPUs.

3) Mobile GPU. We use the NVIDIA Tegra K1, which has 192 CUDA cores, as our mobile GPU baseline. We used cuBLAS GEMV for the original dense model and cuSPARSE CSRMV for the compressed sparse model. Tegra K1 doesn't have a software interface to report power consumption, so we measured the total power consumption with a power meter, then assumed 15% AC to DC conversion loss, 85% regulator efficiency and 15% power consumed by peripheral components [26], [27] to report the AP+DRAM power for Tegra K1.
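As a rough software analogue of this dense-versus-sparse baseline methodology (numpy/scipy stand-ins only; the paper's actual baselines use MKL CBLAS/SPBLAS and cuBLAS/cuSPARSE, and whether SpMV wins on a CPU depends on the density and the BLAS):

```python
import time
import numpy as np
import scipy.sparse as sp

# FC-7-sized layer (4096x4096) pruned to roughly 9% non-zeros, as in Table III
n = 4096
W = np.random.randn(n, n).astype(np.float32)
W[np.abs(W) < np.quantile(np.abs(W), 0.91)] = 0.0   # keep ~9% of the weights
W_csr = sp.csr_matrix(W)                            # compressed sparse row storage
x = np.random.randn(n).astype(np.float32)

t0 = time.perf_counter(); y_dense = W @ x; t1 = time.perf_counter()
y_sparse = W_csr @ x; t2 = time.perf_counter()
print(f"dense GEMV {1e3*(t1-t0):.2f} ms, CSR SpMV {1e3*(t2-t1):.2f} ms")
assert np.allclose(y_dense, y_sparse, rtol=1e-3, atol=1e-3)
```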

Benchmarks. We compare the performance on two sets of models: the uncompressed DNN model and the compressed DNN model.

Table III. Benchmark from state-of-the-art DNN models

Layer    Size          Weight%  Act%    FLOP%   Description
Alex-6   9216, 4096    9%       35.1%   3%      Compressed AlexNet [1] for
Alex-7   4096, 4096    9%       35.3%   3%      large scale image
Alex-8   4096, 1000    25%      37.5%   10%     classification
VGG-6    25088, 4096   4%       18.3%   1%      Compressed VGG-16 [3] for
VGG-7    4096, 4096    4%       37.5%   2%      large scale image classification
VGG-8    4096, 1000    23%      41.1%   9%      and object detection
NT-We    4096, 600     10%      100%    10%     Compressed NeuralTalk [7]
NT-Wd    600, 8791     11%      100%    11%     with RNN and LSTM for
NT-LSTM  1201, 2400    10%      100%    11%     automatic image captioning

The uncompressed DNN model is obtained from the Caffe model zoo [28] and the NeuralTalk model zoo [7]; the compressed DNN model is produced as described in [16], [23]. The benchmark networks have 9 layers in total, obtained from AlexNet, VGGNet, and NeuralTalk. We use the ImageNet dataset [29] and the Caffe [28] deep learning framework as the golden model to verify the correctness of the hardware design.

VI. EXPERIMENTAL RESULT

Figure 5 shows the layout (after place-and-route) of an EIE processing element. The power/area breakdown is shown in Table II. We brought the critical path delay down to 1.15ns by introducing 4 pipeline stages to update one activation: codebook lookup and address accumulation (in parallel), output activation read and input activation multiply (in parallel), shift and add, and output activation write. Activation read and write access a local register, and activation bypassing is employed to avoid a pipeline hazard. Using 64 PEs running at 800MHz yields a performance of 102

Compared to CPU and GPU: 189x and 13x faster; 24,000x and 3,400x more energy efficient.

Page 6

Beyond EIE: a Multi-Dimension Sparse Recipe for Deep Learning


[1] Han et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015.
[2] Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", Deep Learning Symposium 2015, ICLR 2016 (best paper award).
[3] Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA 2016.
[4] Han et al., "DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow", arXiv 2016.
[5] Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016.
[6] Yao, Han, et al., "Hardware-friendly convolutional neural network with even-number filter size", ICLR workshop 2016.

• Higher Accuracy: DSD regularization
• Smaller Size: Deep Compression, SqueezeNet++
• Faster Speed: EIE accelerator
(slide diagram centered on sparsity)