Deep Compression and EIE: Deep Neural Network Model Compression and Efficient Inference Engine

Song Han
CVA group, Stanford University, Apr 7, 2016
Intro about me and my advisor

Song Han
• Fourth-year PhD student with Prof. Bill Dally at Stanford.
• Research interest: deep learning model compression and hardware acceleration, to make inference more efficient for deployment.
• Recent work on "Deep Compression" and "EIE: Efficient Inference Engine", covered by TheNextPlatform, O'Reilly, TechEmergence, and Hacker News.

Bill Dally
• Professor at Stanford University and former chairman of the CS department; leads the Concurrent VLSI Architecture Group.
• Chief Scientist of NVIDIA.
• Member of the National Academy of Engineering, Fellow of the American Academy of Arts & Sciences, Fellow of the IEEE, Fellow of the ACM, and recipient of numerous other awards.
Thanks to my collaborators
• NVIDIA: Jeff Pool, John Tran, Bill Dally
• Stanford: Xingyu Liu, Jing Pu, Ardavan Pedram, Mark Horowitz, Bill Dally
• Tsinghua: Huizi Mao, Song Yao, Yu Wang
• Berkeley: Forrest Iandola, Matthew Moskewicz, Khalid Ashraf, Kurt Keutzer
You’ll be interested in his GTC talk: S6417 - FireCaffe
This Talk:
• Deep Compression[1,2]: A Deep Neural Network Model Compression Pipeline.
• EIE Accelerator[3]: Efficient Inference Engine that Accelerates the Compressed Deep Neural Network Model.
• SqueezeNet++[4,5]: ConvNet Architecture Design Space Exploration
[1] Han et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015.
[2] Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", ICLR 2016.
[3] Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA 2016.
[4] Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016.
[5] Yao, Han, et al., "Hardware-friendly convolutional neural network with even-number filter size", ICLR'16 workshop.
Deep Learning: the Next Wave of AI
Image Recognition
Speech Recognition
Natural Language Processing
Applications
App developers suffer from the model size

"At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file. As a mobile-first company, we frequently update various apps via different app stores. We're very sensitive to the size of our binary files, and a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB." —Andrew Ng
The Problem: If Running DNN on Mobile…
Hardware engineers suffer from the model size (embedded systems, limited resources)

The Problem: If Running DNN on Mobile…
Intelligent but Inefficient
Network Delay

Power Budget

User Privacy

The Problem: If Running DNN on the Cloud…
Deep Compression
Smaller Size: compresses mobile app size by 35x-50x

Accuracy: no loss of accuracy, even improved accuracy

Speedup: makes inference faster
Problem 1: Model Size => Solution 1: Deep Compression
EIE Accelerator
Offline: no dependency on network connection

Real Time: no network delay, high frame rate

Low Power: high energy efficiency that preserves battery
Problem 2: Latency, Power, Energy => Solution 2: ASIC accelerator
Part1: Deep Compression
• AlexNet: 35×, 240MB => 6.9MB => 0.47MB (510x)
• VGG16: 49×, 552MB => 11.3MB
• With no loss of accuracy on ImageNet-2012

• Weights fit in on-chip SRAM, taking 120x less energy than DRAM

1. Han et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015.
2. Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", ICLR 2016.
3. Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", ECCV submission.
1. Pruning
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Pruning: Motivation
• Trillions of synapses are generated in the human brain during the first few months after birth.

• At 1 year old, the synapse count peaks at 1,000 trillion.

• Pruning then begins to occur.

• By 10 years old, a child has nearly 500 trillion synapses.

• This 'pruning' mechanism removes redundant connections in the brain.
[1] Christopher A Walsh. Peter huttenlocher (1931-2013). Nature, 502(7470):172–172, 2013.
Retrain to Recover Accuracy
[Plot: accuracy loss (y-axis, -4.5% to 0.5%) vs. parameters pruned away (x-axis, 40% to 100%). Curves: L2 regularization w/o retrain, L1 regularization w/o retrain, L1 regularization w/ retrain, L2 regularization w/ retrain, and L2 regularization w/ iterative prune and retrain, which recovers accuracy best.]
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
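Magnitude pruning with retraining is straightforward to prototype. Below is a minimal NumPy sketch (my own illustration, not the paper's code); `retrain_fn` is a hypothetical stand-in for a framework training loop that multiplies gradients by the mask so pruned weights stay zero:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    threshold = np.percentile(np.abs(weights), sparsity * 100.0)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

def iterative_prune(weights, target_sparsity, steps, retrain_fn):
    """Prune gradually and retrain after each step (the best curve above)."""
    for step in range(1, steps + 1):
        weights, mask = prune_by_magnitude(weights, target_sparsity * step / steps)
        weights = retrain_fn(weights, mask)  # gradients must be masked
    return weights
```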
Pruning: Results on 4 ConvNets
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
AlexNet & VGGNet
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Mask Visualization
Visualization of the first FC layer's sparsity pattern in LeNet-300-100. It has a banded structure repeated 28 times, which corresponds to the unpruned parameters in the center of the images: since the digits are written in the center, the central weights survive pruning.
Pruning NeuralTalk and LSTM
(Background slide from CS231n Lecture 10, Fei-Fei Li, Andrej Karpathy & Justin Johnson: Mao et al., "Explain Images with Multimodal Recurrent Neural Networks"; Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions"; Vinyals et al., "Show and Tell: A Neural Image Caption Generator"; Donahue et al., "Long-term Recurrent Convolutional Networks for Visual Recognition and Description"; Chen and Zitnick, "Learning a Recurrent Visual Representation for Image Caption Generation")
Image Captioning
Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions"
• Pruning away 90% of the parameters in NeuralTalk doesn't hurt the BLEU score with proper retraining
• Original: a basketball player in a white uniform is playing with a ball
• Pruned 90%: a basketball player in a white uniform is playing with a basketball
• Original : a brown dog is running through a grassy field • Pruned 90%: a brown dog is running through a grassy area
• Original: a soccer player in red is running in the field
• Pruned 95%: a man in a red shirt and black and white black shirt is running through a field
Pruning NeuralTalk and LSTM
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Pruning Neural Machine Translation
Abi See, “CS224N Final Project: Exploiting the Redundancy in Neural Machine Translation”
Pruning Neural Machine Translation
Dark means zero (redundant); white means non-zero (useful)
Word Embedding:
LSTM:
Abi See, “CS224N Final Project: Exploiting the Redundancy in Neural Machine Translation”
Speedup (FC layer)
• Intel Core i7 5930K: MKL CBLAS GEMV, MKL SPBLAS CSRMV
• NVIDIA GeForce GTX Titan X: cuBLAS GEMV, cuSPARSE CSRMV
• NVIDIA Tegra K1: cuBLAS GEMV, cuSPARSE CSRMV
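These comparisons pit dense GEMV against sparse CSRMV. For reference, a minimal NumPy sketch of what a CSR matrix-vector product computes (libraries such as MKL SPBLAS and cuSPARSE implement the same idea with heavy optimization):

```python
import numpy as np

def dense_to_csr(W):
    """Store only the non-zeros: values, column indices, row pointers."""
    vals, cols, rowptr = [], [], [0]
    for row in W:
        nz = np.nonzero(row)[0]
        vals.extend(row[nz])
        cols.extend(nz)
        rowptr.append(len(vals))
    return np.array(vals), np.array(cols), np.array(rowptr)

def csr_matvec(vals, cols, rowptr, x):
    """y = W @ x, touching only the stored non-zeros."""
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(y)):
        lo, hi = rowptr[i], rowptr[i + 1]
        y[i] = vals[lo:hi] @ x[cols[lo:hi]]
    return y
```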
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
Energy Efficiency (FC layer)
• Intel Core i7 5930K: CPU socket and DRAM power are reported by pcm-power utility
• NVIDIA GeForce GTX Titan X: reported by nvidia-smi utility
• NVIDIA Tegra K1: measured total power with a power meter; assuming 15% AC-to-DC conversion loss, 85% regulator efficiency, and 15% of power consumed by peripheral components, about 60% of the measured power goes to the AP+DRAM
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
2. Weight Sharing (Trained Quantization)
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
Weight Sharing: Overview
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
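The idea: cluster each layer's weights so that many connections share one value, then store a small codebook plus per-weight indices. A minimal one-dimensional k-means (Lloyd) sketch, assuming linear centroid initialization; real layers would need a memory-friendlier implementation:

```python
import numpy as np

def share_weights(weights, bits):
    """Cluster a layer's weights into 2**bits shared values (trained
    quantization sketch); returns per-weight indices plus the codebook."""
    k = 2 ** bits
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), k)  # linear init
    for _ in range(20):  # a few Lloyd iterations
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centroids[j] = members.mean()
    return idx.reshape(weights.shape), centroids

# reconstructed layer: centroids[idx] approximates `weights` with 2**bits values
```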
Finetune Centroids
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
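During fine-tuning, the gradient of each shared weight is the sum of the gradients of all weights assigned to that centroid, and the codebook is updated by SGD. A hedged sketch using the `idx` assignment from the clustering sketch above; `lr` is a hypothetical learning rate:

```python
import numpy as np

def centroid_gradient(dL_dW, idx, k):
    """Accumulate each weight's gradient into its centroid's slot."""
    g = np.zeros(k)
    np.add.at(g, idx.ravel(), dL_dW.ravel())
    return g

# one fine-tuning step on the shared codebook:
# centroids -= lr * centroid_gradient(dL_dW, idx, len(centroids))
```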
Accuracy ~ #Bits on 5 Conv Layers + 3 FC Layers
Weight Sharing: Result
• 16 million weights => 2^4 = 16 shared weights (4-bit indices)

• 8/5-bit (conv/FC) quantization results in no accuracy loss

• 8/4-bit quantization results in no top-5 accuracy loss and 0.1% top-1 accuracy loss

• 4/2-bit quantization results in a 1.99% top-1 and 2.60% top-5 accuracy loss; not that bad :-)
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
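The saving from weight sharing alone follows from counting bits: n weights stored as ceil(log2 k)-bit indices plus a codebook of k full-precision values. A small sketch consistent with the compression-rate formula in the ICLR'16 paper:

```python
import math

def compression_rate(n, k, orig_bits=32):
    """n weights, k shared centroids: each index costs ceil(log2 k) bits,
    and the codebook costs k full-precision values."""
    index_bits = math.ceil(math.log2(k))
    return n * orig_bits / (n * index_bits + k * orig_bits)

print(compression_rate(16_000_000, 16))  # 16M weights, 2**4 centroids -> ~8x
```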
Pruning and Quantization Work Well Together
Figure 6: Accuracy vs. compression rate under different compression methods. Pruning and quantization work best when combined.

Figure 7: Pruning doesn't hurt quantization. Dashed: quantization on the unpruned network. Solid: quantization on the pruned network. Accuracy begins to drop at the same number of quantization bits whether or not the network has been pruned. Although pruning reduces the number of parameters, quantization still works as well as on the unpruned network, or even better (the 3-bit case in the left figure).

Figure 8: Accuracy of different initialization methods. Left: top-1 accuracy. Right: top-5 accuracy. Linear initialization gives the best result.
6.2 CENTROID INITIALIZATION
Figure 8 compares the accuracy of the three different initialization methods with respect to top-1 accuracy (left) and top-5 accuracy (right). The network is quantized to 2-8 bits as shown on the x-axis. Linear initialization outperforms density and random initialization in all cases except at 3 bits.

The initial centroids of linear initialization spread equally across the x-axis, from the min value to the max value. That helps maintain the large weights, which play a more important role than smaller ones, as also shown in network pruning (Han et al., 2015). Neither random nor density-based initialization retains large centroids: with these methods, large weights are clustered to small centroids because there are few large weights. In contrast, linear initialization gives large weights a better chance to form a large centroid.
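The three initializations are easy to state precisely. A minimal sketch (density-based initialization approximated here by quantile/CDF spacing):

```python
import numpy as np

def init_centroids(flat_weights, k, method="linear"):
    """Three centroid initializations compared in Figure 8."""
    if method == "linear":    # evenly spaced from min to max; best in the paper
        return np.linspace(flat_weights.min(), flat_weights.max(), k)
    if method == "random":    # k randomly sampled weight values
        return np.random.choice(flat_weights, size=k, replace=False)
    if method == "density":   # equal probability mass between centroids
        return np.quantile(flat_weights, (np.arange(k) + 0.5) / k)
    raise ValueError(method)
```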
6.3 SPEEDUP AND ENERGY EFFICIENCY
Deep Compression targets extremely latency-focused applications running on mobile, which require real-time inference, such as pedestrian detection on an embedded processor inside an autonomous vehicle. Waiting for a batch to assemble significantly adds latency, so when benchmarking we consider the case of batch size 1.
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
3. Huffman Coding
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
Huffman Coding

A Huffman code is an optimal prefix code commonly used for lossless data compression. It produces a variable-length code table for encoding source symbols, derived from the occurrence probability of each symbol. As in other entropy-coding methods, more common symbols are represented with fewer bits than less common symbols, saving space overall.
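Because the quantized weight indices are distributed very non-uniformly, Huffman coding shrinks their storage further at no accuracy cost. A minimal sketch that derives a prefix code from symbol frequencies with a heap:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Map each symbol to a bit string; rarer symbols get longer codes."""
    freq = Counter(symbols)
    # heap entries: [frequency, tie-breaker, {symbol: code-so-far}]
    heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, [f1 + f2, tie, merged])
        tie += 1
    return heap[0][2]

# huffman_code([0, 0, 0, 1, 1, 2]) assigns code lengths 1, 2, 2 to 0, 1, 2
```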
Deep Compression Results on 4 ConvNets
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
Result: AlexNet
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
AlexNet: Breakdown
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
The Big Gun: New Network Topology + Deep Compression
Iandola, Han,et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size” arXiv 2016
New Network Topology + Deep Compression
[Bar chart: number of parameters per SqueezeNet layer (conv1; the squeeze conv1x1 and expand conv1x1/conv3x3 sublayers of fire2 through fire9; conv_final), split into remaining parameters and parameters pruned away.]
Fig 2. Deep Compression is compatible with even an extremely efficient network architecture such as SqueezeNet: it can be pruned 3x and quantized to 6 bits without loss of accuracy.
Fig 1: SqueezeNet architecture
[Diagram: vanilla Fire module. Input (64 channels) → 1x1 conv squeeze (16) → parallel 1x1 conv expand (64) and 3x3 conv expand (64) → concat output (128 channels).]
470KB model, AlexNet-accuracy
Efficient model + pruning + weight sharing
Table 2. Comparing SqueezeNet to model compression approaches. By model size, we mean the number of bytes required to store all of the parameters in the trained model.

| CNN architecture | Compression approach | Data type | Original -> compressed model size | Reduction in model size vs. AlexNet | Top-1 ImageNet accuracy | Top-5 ImageNet accuracy |
| AlexNet | None (baseline) | 32 bit | 240MB | 1x | 57.2% | 80.3% |
| AlexNet | SVD [3] | 32 bit | 240MB -> 48MB | 5x | 56.0% | 79.4% |
| AlexNet | Network Pruning [4] | 32 bit | 240MB -> 27MB | 9x | 57.2% | 80.3% |
| AlexNet | Deep Compression [5] | 5-8 bit | 240MB -> 6.9MB | 35x | 57.2% | 80.3% |
| SqueezeNet (ours) | None | 32 bit | 4.8MB | 50x | 57.5% | 80.3% |
| SqueezeNet (ours) | Deep Compression | 8 bit | 4.8MB -> 0.66MB | 363x | 57.5% | 80.3% |
| SqueezeNet (ours) | Deep Compression | 6 bit | 4.8MB -> 0.47MB | 510x | 57.5% | 80.3% |
Applying Deep Compression [5] to SqueezeNet, using 33% sparsity (see footnote 6 below) and 8-bit quantization, yields a 0.66 MB model (363x smaller than 32-bit AlexNet) with equivalent accuracy to AlexNet. Further, applying Deep Compression with 6-bit quantization and 33% sparsity on SqueezeNet produces a 0.47MB model (510x smaller than 32-bit AlexNet) with equivalent accuracy. It appears that our small model is indeed amenable to compression.

In addition, these results demonstrate that Deep Compression [5] not only works well on CNN architectures with many parameters (e.g. AlexNet and VGG), but is also able to compress the already compact, fully convolutional SqueezeNet architecture: Deep Compression compressed SqueezeNet by 10x while preserving the baseline accuracy. In summary: by combining CNN architectural innovation (SqueezeNet) with state-of-the-art compression techniques (Deep Compression), we achieved a 510x reduction in model size with no decrease in accuracy compared to the baseline.
5 CNN Microarchitecture Design Space Exploration
So far, we have proposed architectural design strategies for small models, followed these principles to create SqueezeNet, and discovered that SqueezeNet is 50x smaller than AlexNet with equivalent accuracy. However, SqueezeNet and other models reside in a broad and largely unexplored design space of CNN architectures. Now, in Sections 5, 6, and 7, we explore several aspects of the design space. We divide this exploration into three main topics: microarchitectural exploration (per-module layer dimensions and configurations), macroarchitectural exploration (high-level end-to-end organization of modules and other layers), and exploration of model compression configurations.

In this section, we design and execute experiments with the goal of providing intuition about the shape of the microarchitectural design space with respect to…

Footnote 6: Note that, due to the storage overhead of storing sparse matrix indices, 33% sparsity leads to somewhat less than a 3x decrease in model size.
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size” arXiv 2016
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size” arXiv 2016
470KB model, AlexNet-accuracy
https://github.com/songhan/SqueezeNet_compressed
Smaller DNN means…
(1) Less communication across servers during distributed training.
(2) Easier to download from the App Store.
(3) Less bandwidth to update the model on an autonomous car.
(4) Easier to deploy on embedded hardware with limited memory.
A Model Compression Tool for App Developers
• Easy Version (done):
  ✓ No training needed
  ✓ Fast (3 minutes)
  ✗ 5x-10x compression rate
  ✗ 1% loss of accuracy

• Advanced Version (todo):
  ✓ 35x-50x compression rate
  ✓ No loss of accuracy
  ✗ Training is needed
  ✗ Slow
deepcompression.net is under construction
DeepCompression.net
Provides a trial account for GTC attendees:
• Username: deepcompression
• Password: songhan
Welcome your feedback!
Conclusion
• We have presented a method to compress neural networks without affecting accuracy by finding the right connections and quantizing the weights.

• Prune the unimportant connections => quantize the network and enforce weight sharing => apply Huffman coding.

• We highlight our experiments on ImageNet: weight storage reduced by 35x for AlexNet and 49x for VGG-16, without loss of accuracy.

• Now the weights can fit in on-chip cache.
Part 2: SqueezeNet++: CNN Design Space Exploration
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, submitted to ECCV
Song Han CVA group, Stanford University
Apr 7, 2016
Motivation: How to choose so many architectural dimensions?
Table 1. SqueezeNet architectural dimensions. (The formatting of this table was inspired by the Inception2 paper [11].)

| layer name/type | output size | filter size / stride (if not a fire layer) | depth | s1x1 (#1x1 squeeze) | e1x1 (#1x1 expand) | e3x3 (#3x3 expand) | sparsity (s1x1 / e1x1 / e3x3) | # bits | # params before pruning | # params after pruning |
| input image | 224x224x3 | - | - | | | | | | | |
| conv1 | 111x111x96 | 7x7/2 (x96) | 1 | | | | 100% (7x7) | 6 bit | 14,208 | 14,208 |
| maxpool1 | 55x55x96 | 3x3/2 | 0 | | | | | | | |
| fire2 | 55x55x128 | | 2 | 16 | 64 | 64 | 100% / 100% / 33% | 6 bit | 11,920 | 5,746 |
| fire3 | 55x55x128 | | 2 | 16 | 64 | 64 | 100% / 100% / 33% | 6 bit | 12,432 | 6,258 |
| fire4 | 55x55x256 | | 2 | 32 | 128 | 128 | 100% / 100% / 33% | 6 bit | 45,344 | 20,646 |
| maxpool4 | 27x27x256 | 3x3/2 | 0 | | | | | | | |
| fire5 | 27x27x256 | | 2 | 32 | 128 | 128 | 100% / 100% / 33% | 6 bit | 49,440 | 24,742 |
| fire6 | 27x27x384 | | 2 | 48 | 192 | 192 | 100% / 50% / 33% | 6 bit | 104,880 | 44,700 |
| fire7 | 27x27x384 | | 2 | 48 | 192 | 192 | 50% / 100% / 33% | 6 bit | 111,024 | 46,236 |
| fire8 | 27x27x512 | | 2 | 64 | 256 | 256 | 100% / 50% / 33% | 6 bit | 188,992 | 77,581 |
| maxpool8 | 13x13x512 | 3x3/2 | 0 | | | | | | | |
| fire9 | 13x13x512 | | 2 | 64 | 256 | 256 | 50% / 100% / 30% | 6 bit | 197,184 | 77,581 |
| conv10 | 13x13x1000 | 1x1/1 (x1000) | 1 | | | | 20% (3x3) | 6 bit | 513,000 | 103,400 |
| avgpool10 | 1x1x1000 | 13x13/1 | 0 | | | | | | 1,248,424 (total) | 421,098 (total) |
4 Evaluation of SqueezeNet

We now turn our attention to evaluating SqueezeNet. In each of the CNN model compression papers reviewed in Section 2.1, the goal was to compress an AlexNet [25] model trained to classify images using the ImageNet [14] (ILSVRC 2012) dataset. Therefore, we use AlexNet (specifically, bvlc_alexnet from the Caffe codebase [26]) and the associated model compression results as a basis for comparison when evaluating SqueezeNet.

In Table 2, we review SqueezeNet in the context of recent model compression results. The SVD-based approach is able to compress a pretrained AlexNet model by a factor of 5x, while diminishing top-1 accuracy to 56.0% [3]. Network Pruning achieves a 9x reduction in model size while maintaining the baseline of 57.2% top-1 and 80.3% top-5 accuracy on ImageNet [4]. Deep Compression achieves a 35x reduction in model size while still maintaining the baseline accuracy level [5]. Now, with SqueezeNet, we achieve a 50x reduction in model size compared to AlexNet, while meeting or exceeding the top-1 and top-5 accuracy of AlexNet. We summarize all of the aforementioned results in Table 2.

It appears that we have surpassed the state-of-the-art results from the model compression community: even when using uncompressed 32-bit values to represent the model, SqueezeNet has a 1.4x smaller model size than the best efforts from the model compression community while maintaining or exceeding the baseline accuracy. Until now, an open question has been: are small models amenable to compression, or do small models "need" all of the representational power afforded by dense floating-point values? To find out, we applied Deep Compression to SqueezeNet, as described above.
Micro-Architecture DSX: use sensitivity analysis

• Micro-architecture: how to size the layers: 64? 128? 256?
• Sensitivity: prune 50% of the weights in a single layer and measure the accuracy.
• => Sensitivity analysis helps size the number of parameters in a layer; a minimal sketch follows.
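A hedged sketch of the sensitivity procedure; `evaluate` and `prune_layer` are hypothetical helpers for whatever framework holds the model:

```python
def layer_sensitivity(model, layer_names, evaluate, prune_layer):
    """Prune 50% of one layer at a time and record the accuracy drop."""
    baseline = evaluate(model)
    drops = {}
    for name in layer_names:
        pruned = prune_layer(model, name, sparsity=0.5)  # only this layer
        drops[name] = baseline - evaluate(pruned)
    return drops

# layers with a small drop are over-parameterized and can be sized down;
# sensitive layers (large drop) deserve more parameters
```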
Table 4. Improving accuracy by using sensitivity analysis to determine where to add parameters in a CNN.

| Architecture | Top-1 Accuracy | Top-5 Accuracy | Model Size |
| SqueezeNet | 57.5% | 80.3% | 4.8MB |
| SqueezeNet++ | 59.5% | 81.5% | 7.1MB |

Table 5. Improving accuracy with dense→sparse→dense (DSD) training.

| Architecture | Top-1 Accuracy | Top-5 Accuracy | Model Size |
| SqueezeNet | 57.5% | 80.3% | 4.8MB |
| SqueezeNet (DSD) | 61.8% | 83.5% | 4.8MB |
Once the network arrives at a local minimum under the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum. So far, we trained in just three stages of density (dense→sparse→dense), but regularizing models by intermittently pruning parameters throughout training (somewhat like Dropout, which prunes activations instead of parameters) would be an interesting area of future work.

8 Conclusions

We have presented SqueezeNet, a CNN architecture that has 50x fewer parameters than AlexNet and maintains AlexNet-level accuracy on ImageNet. We also compressed SqueezeNet to less than 0.5MB, or 510x smaller than AlexNet without compression. In this paper, we focused on ImageNet as a target dataset. However, it has become common practice to apply ImageNet-trained CNN representations to a variety of applications such as fine-grained object recognition, logo identification in images, and generating sentences about images. ImageNet-trained CNNs have also been applied to a number of applications pertaining to autonomous driving, including pedestrian and vehicle detection as well as segmenting the shape of the road. We think SqueezeNet will be a good candidate CNN architecture for a variety of applications, especially those in which small model size is of importance.

SqueezeNet is one of several new CNNs that we have discovered while broadly exploring the design space of CNN architectures. We hope that SqueezeNet will inspire the reader to consider and explore the broad range of possibilities in the design space of CNN architectures.
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, submitted to ECCV
"labrador retriever
dog"
conv1
96
fire2
128
fire3
128
fire4
256
fire5256
fire6
384
fire7
384
fire8
512
fire9512
conv101000
softmax
maxpool/2
maxpool/2
maxpool/2
globalavgpool
conv1
96
fire2
128
fire3
128
fire4
256
fire5256
fire6
384
fire7
384
fire8
512
fire9512
conv101000
softmax
maxpool/2
maxpool/2
maxpool/2
globalavgpool
conv196
fire2
128
fire3
128
fire4
256
fire5256
fire6
384
fire7
384
fire8
512
fire9512
conv101000
softmax
maxpool/2
maxpool/2
maxpool/2
globalavgpool
conv1x1
conv1x1
conv1x1
conv1x1
96
Macro-Architecture DSX
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, submitted to ECCV
Macro-Architecture DSX: vanilla Fire module, simple bypass, and complex bypass with 1x1 Conv
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, submitted to ECCV
[Diagrams: vanilla Fire module (input 64 channels → 1x1 conv squeeze 16 → parallel 1x1 conv expand 64 and 3x3 conv expand 64 → concat 128); Fire module with simple bypass (the 128-channel input is added elementwise to the 128-channel concat output); Fire module with complex bypass (a 1x1 conv bypass maps the 64-channel input to 128 channels for the elementwise add).]
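A self-contained NumPy sketch of the three variants (my own illustration; channel counts follow the diagrams above):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    """x: (H, W, Cin), w: (Cin, Cout) -> pointwise convolution."""
    return x @ w

def conv3x3(x, w):
    """x: (H, W, Cin), w: (3, 3, Cin, Cout), zero padding of 1."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            out += xp[dy:dy + H, dx:dx + W] @ w[dy, dx]
    return out

def fire(x, w_squeeze, w_exp1, w_exp3, bypass=None):
    """Fire module: 1x1 squeeze, parallel 1x1/3x3 expand, channel concat.
    `bypass` is None (vanilla), x itself (simple), or a 1x1 conv of x (complex)."""
    s = relu(conv1x1(x, w_squeeze))                       # 64 -> 16 channels
    out = np.concatenate([relu(conv1x1(s, w_exp1)),       # 16 -> 64
                          relu(conv3x3(s, w_exp3))],      # 16 -> 64
                         axis=-1)                         # 128 output channels
    return out if bypass is None else out + bypass        # elementwise add
```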
Table 3. SqueezeNet accuracy and model size using different macroarchitectures.

| Architecture | Top-1 Accuracy | Top-5 Accuracy | Model Size |
| Vanilla SqueezeNet | 57.5% | 80.3% | 4.8MB |
| SqueezeNet + Simple Bypass | 60.4% | 82.5% | 4.8MB |
| SqueezeNet + Complex Bypass | 58.8% | 82.0% | 7.7MB |
At the macroarchitecture level, we consider the high-level connections among Fire modules. Inspired by ResNet [16], we explored three different architectures:

• Vanilla SqueezeNet (as per the prior sections).
• SqueezeNet with simple bypass connections between some Fire modules.
• SqueezeNet with complex bypass connections between the remaining Fire modules.

We illustrate these three variants of SqueezeNet in Figure 2.

Our simple bypass architecture adds bypass connections around Fire modules 3, 5, 7, and 9, requiring these modules to learn a residual function between input and output. As in ResNet, to implement a bypass connection around Fire3, we set the input to Fire4 equal to (output of Fire2 + output of Fire3), where the + operator is elementwise addition. This changes the regularization applied to the parameters of these Fire modules and, as per ResNet, can improve the final accuracy and/or the ability to train the full model.

One limitation is that, in the straightforward case, the number of input channels and the number of output channels must be the same; as a result, only half of the Fire modules can have simple bypass connections, as shown in the middle diagram of Fig 2. When the "same number of channels" requirement can't be met, we use a complex bypass connection, as illustrated on the right of Figure 2. While a simple bypass is "just a wire," we define a complex bypass as a bypass that includes a 1x1 convolution layer with the number of filters set equal to the number of output channels that are needed. Note that complex bypass connections add extra parameters to the model, while simple bypass connections do not.

In addition to changing the regularization, it is intuitive to us that adding bypass connections would help to alleviate the representational bottleneck introduced by squeeze layers. In SqueezeNet, the squeeze ratio (SR) is 0.125, meaning that every squeeze layer has 8x fewer output channels than the accompanying expand layer. Due to this severe dimensionality reduction, a limited amount of information can pass through squeeze layers. However, by adding bypass connections to SqueezeNet, we open up avenues for information to flow around the squeeze layers.

We trained SqueezeNet with the above three macroarchitectures and compared the accuracy and model size in Table 3. We fixed the microarchitecture to match SqueezeNet as described in Table 1 throughout the macroarchitecture exploration. Interestingly, adding complex bypass layers degraded accuracy relative to simple bypass, although both bypass variants improved on the vanilla architecture (Table 3).
DSD Training (Dense-Sparse-Dense) improves accuracy
• Dense→sparse→dense (DSD) training yielded 4.3% higher accuracy.
• Sparsity is a form of regularization. Once the network arrives at a local minimum under the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.
• Regularizing models by intermittently pruning parameters throughout training would be an interesting area of future work.
Dense → Sparse → Dense (constrain, then relax)
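A hedged sketch of the DSD schedule; `train` and `prune_by_magnitude` are hypothetical framework helpers, with `train` masking gradients when given a mask:

```python
def dsd_train(model, train, prune_by_magnitude, sparsity=0.5):
    model = train(model)                      # Dense: train to convergence
    model, mask = prune_by_magnitude(model, sparsity)
    model = train(model, grad_mask=mask)      # Sparse: retrain under the mask
    model = train(model)                      # Dense: free the zeros, retrain
    return model
```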
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, submitted to ECCV
Design Space Exploration Conclusion
• SqueezeNet++: sizing the layers with sensitivity analysis
• Use even-number filter sizes [5]
• SqueezeNet + simple and complex bypass layers
• DSD training: improves accuracy by 4.3%
Part 3: EIE: Efficient Inference Engine on Compressed Deep Neural Network

Song Han
CVA group, Stanford University, Apr 7, 2016
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
ASIC Accelerator on Compressed DNN
Offline: no dependency on network connection

Real Time: no network delay, high frame rate

Low Power: high energy efficiency that preserves battery

Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA 2016

• EIE: a sparse, indirectly indexed, weight-shared MxV accelerator.
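Functionally, each EIE PE computes the loop below (a software sketch of the dataflow, not the hardware itself): the matrix is stored column-wise (CSC) as codebook indices, each weight is decoded through a small lookup table, and any column whose input activation is zero is skipped outright:

```python
import numpy as np

def eie_matvec(n_rows, colptr, rowidx, widx, codebook, a):
    """Compressed MxV: CSC-stored sparse matrix, 4-bit weight indices
    decoded through a codebook, zero activations skipped."""
    y = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0.0:
            continue  # dynamic activation sparsity: skip the whole column
        for p in range(colptr[j], colptr[j + 1]):
            y[rowidx[p]] += codebook[widx[p]] * aj  # indirect, shared weight
    return y
```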
Distribute Storage and Processing
[Diagram: 16 processing elements (PEs) in a 4x4 array with central control.]
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Evaluation
1. Cycle-accurate C++ simulator with two abstract methods, Propagate and Update; used for design space exploration and verification.
2. RTL in Verilog; verified its output against the golden model in ModelSim.
3. Synthesized EIE using the Synopsys Design Compiler (DC) under the TSMC 45nm GP standard-VT library at the worst-case PVT corner.
4. Placed and routed the PE using the Synopsys IC Compiler (ICC); used CACTI to get SRAM area and energy numbers.
5. Annotated the toggle rate from the RTL simulation onto the gate-level netlist, dumped it to switching activity interchange format (SAIF), and estimated power using PrimeTime PX.
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Layout of an EIE PE
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Baseline and Benchmark

• CPU: Intel Core i7-5930K

• GPU: NVIDIA TitanX

• Mobile GPU: NVIDIA Jetson TK1
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Result: Speedup / Energy Efficiency
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Comparison with other Platforms
Where are the savings from?
• Three factors for energy saving:
• The matrix is compressed by 35x: less work to do, fewer bricks to carry.

• DRAM => SRAM, no need to go off-chip: 120x; like carrying bricks from Stanford to Palo Alto instead of from Stanford to Berkeley.

• Sparse activations: 3x; lighter bricks to carry.
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Load Balancing and Scalability
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Media Coverage
TheNextPlatform
http://www.nextplatform.com/2015/12/08/emergent-chip-vastly-accelerates-deep-neural-networks/
O’Reilly
https://www.oreilly.com/ideas/compressed-representations-in-the-age-of-big-data
TechEmergence
http://techemergence.com/a-limitless-pill-for-deep-neural-networks/
Hacker News
https://news.ycombinator.com/item?id=10881683
Conclusion
• We present EIE, an energy-efficient engine optimized to operate on compressed deep neural networks.
• By leveraging sparsity in both the activations and the weights, EIE reduces the energy needed to compute a typical FC layer by 3,000×.
• With wrapper logic on top of EIE, 1x1 and 3x3 convolutions are also possible.
Hardware for Deep Learning
PC → Mobile Computation → Intelligent Mobile Computation
Recap

Model Compression:
[1] Han et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015.
[2] Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", ICLR 2016; Deep Learning Symposium, NIPS 2015.

Hardware Acceleration:
[3] Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA 2016.

CNN Architecture Design Space Exploration:
[4] Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016.
[5] Yao, Han, et al., "Hardware-friendly convolutional neural network with even-number filter size", ICLR 2016 workshop.
Thank you! [email protected]

[1] Han et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015.
[2] Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", ICLR 2016.
[3] Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA 2016.
[4] Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", ECCV'16 submission.
[5] Yao, Han, et al., "Hardware-friendly convolutional neural network with even-number filter size", ICLR'16 workshop.