Deep Compression and EIE: Deep Neural Network Model Compression and Efficient Inference Engine

Song Han
CVA group, Stanford University, Apr 7, 2016
Intro about me and my advisor

Song Han
• Fourth-year PhD student with Prof. Bill Dally at Stanford.
• Research interest: deep learning model compression and hardware acceleration, to make inference more efficient for deployment.
• Recent work on "Deep Compression" and "EIE: Efficient Inference Engine", covered by TheNextPlatform, O'Reilly, TechEmergence, and Hacker News.

Bill Dally
• Professor at Stanford University and former chairman of the CS department; leads the Concurrent VLSI Architecture Group.
• Chief Scientist of NVIDIA.
• Member of the National Academy of Engineering, Fellow of the American Academy of Arts & Sciences, Fellow of the IEEE, Fellow of the ACM, and recipient of numerous other awards.
Thanks to my collaborators
• NVIDIA: Jeff Pool, John Tran, Bill Dally
• Stanford: Xingyu Liu, Jing Pu, Ardavan Pedram, Mark Horowitz, Bill Dally
• Tsinghua: Huizi Mao, Song Yao, Yu Wang
• Berkeley: Forrest Iandola, Matthew Moskewicz, Khalid Ashraf, Kurt Keutzer
You’ll be interested in his GTC talk: S6417 - FireCaffe
This Talk:
• Deep Compression[1,2]: A Deep Neural Network Model Compression Pipeline.
• EIE Accelerator[3]: Efficient Inference Engine that Accelerates the Compressed Deep Neural Network Model.
• SqueezeNet++[4,5]: ConvNet Architecture Design Space Exploration
[1] Han et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015.
[2] Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", ICLR 2016.
[3] Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA 2016.
[4] Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016.
[5] Yao, Han, et al., "Hardware-friendly convolutional neural network with even-number filter size", ICLR'16 workshop.
Deep Learning: the Next Wave of AI
Image Recognition
Speech Recognition
Natural Language Processing
Applications
App developers suffer from the model size

"At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file. As a mobile-first company, we frequently update various apps via different app stores. We're very sensitive to the size of our binary files, and a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB." —Andrew Ng
The Problem: If Running DNN on Mobile…
Hardware engineers suffer from the model size (embedded systems, limited resources)

The Problem: If Running DNN on Mobile…
Intelligent but Inefficient
Network Delay

Power Budget

User Privacy

The Problem: If Running DNN on the Cloud…
Deep Compression
Smaller Size: compresses mobile app size by 35x-50x

Accuracy: no loss of accuracy, even improved accuracy

Speedup: makes inference faster
Problem 1: Model Size => Solution 1: Deep Compression
EIE Accelerator
Offline: no dependency on network connection

Real Time: no network delay, high frame rate

Low Power: high energy efficiency that preserves battery
Problem 2: Latency, Power, Energy => Solution 2: ASIC accelerator
Part1: Deep Compression
• AlexNet: 35×, 240MB => 6.9MB => 0.47MB (510x)
• VGG16: 49×, 552MB => 11.3MB
• With no loss of accuracy on ImageNet-2012

• Weights fit in on-chip SRAM, taking 120x less energy than DRAM

1. Han et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015.
2. Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", ICLR 2016.
3. Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", ECCV submission.
1. Pruning
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Pruning: Motivation
• Trillions of synapses are generated in the human brain during the first few months after birth.

• At 1 year old, the synapse count peaks at 1,000 trillion.

• Pruning then begins to occur.

• By 10 years old, a child has nearly 500 trillion synapses.

• This 'pruning' mechanism removes redundant connections in the brain.
[1] Christopher A Walsh. Peter huttenlocher (1931-2013). Nature, 502(7470):172–172, 2013.
Retrain to Recover Accuracy
[Plot: accuracy loss (y-axis, -4.5% to 0.5%) vs. parameters pruned away (x-axis, 40% to 100%). Curves: L2 regularization w/o retrain, L1 regularization w/o retrain, L1 regularization w/ retrain, L2 regularization w/ retrain, and L2 regularization w/ iterative prune and retrain, which recovers accuracy best.]
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
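Magnitude pruning with retraining is straightforward to prototype. Below is a minimal NumPy sketch (my own illustration, not the paper's code); `retrain_fn` is a hypothetical stand-in for a framework training loop that multiplies gradients by the mask so pruned weights stay zero:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    threshold = np.percentile(np.abs(weights), sparsity * 100.0)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

def iterative_prune(weights, target_sparsity, steps, retrain_fn):
    """Prune gradually and retrain after each step (the best curve above)."""
    for step in range(1, steps + 1):
        weights, mask = prune_by_magnitude(weights, target_sparsity * step / steps)
        weights = retrain_fn(weights, mask)  # gradients must be masked
    return weights
```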
Pruning: Results on 4 ConvNets
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
AlexNet & VGGNet
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Mask Visualization
Visualization of the first FC layer's sparsity pattern in LeNet-300-100. It has a banded structure repeated 28 times, which corresponds to the unpruned parameters in the center of the images: since the digits are written in the center, the central weights survive pruning.
Pruning NeuralTalk and LSTM
(Background slide from CS231n Lecture 10, Fei-Fei Li, Andrej Karpathy & Justin Johnson: Mao et al., "Explain Images with Multimodal Recurrent Neural Networks"; Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions"; Vinyals et al., "Show and Tell: A Neural Image Caption Generator"; Donahue et al., "Long-term Recurrent Convolutional Networks for Visual Recognition and Description"; Chen and Zitnick, "Learning a Recurrent Visual Representation for Image Caption Generation")
Image Captioning
Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions"
• Pruning away 90% of the parameters in NeuralTalk doesn't hurt the BLEU score with proper retraining
• Original: a basketball player in a white uniform is playing with a ball
• Pruned 90%: a basketball player in a white uniform is playing with a basketball
• Original : a brown dog is running through a grassy field • Pruned 90%: a brown dog is running through a grassy area
• Original: a soccer player in red is running in the field
• Pruned 95%: a man in a red shirt and black and white black shirt is running through a field
Pruning NeuralTalk and LSTM
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Pruning Neural Machine Translation
Abi See, “CS224N Final Project: Exploiting the Redundancy in Neural Machine Translation”
Pruning Neural Machine Translation
Dark means zero (redundant); white means non-zero (useful)
Word Embedding:
LSTM:
Abi See, “CS224N Final Project: Exploiting the Redundancy in Neural Machine Translation”
Speedup (FC layer)
• Intel Core i7 5930K: MKL CBLAS GEMV, MKL SPBLAS CSRMV
• NVIDIA GeForce GTX Titan X: cuBLAS GEMV, cuSPARSE CSRMV
• NVIDIA Tegra K1: cuBLAS GEMV, cuSPARSE CSRMV
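These comparisons pit dense GEMV against sparse CSRMV. For reference, a minimal NumPy sketch of what a CSR matrix-vector product computes (libraries such as MKL SPBLAS and cuSPARSE implement the same idea with heavy optimization):

```python
import numpy as np

def dense_to_csr(W):
    """Store only the non-zeros: values, column indices, row pointers."""
    vals, cols, rowptr = [], [], [0]
    for row in W:
        nz = np.nonzero(row)[0]
        vals.extend(row[nz])
        cols.extend(nz)
        rowptr.append(len(vals))
    return np.array(vals), np.array(cols), np.array(rowptr)

def csr_matvec(vals, cols, rowptr, x):
    """y = W @ x, touching only the stored non-zeros."""
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(y)):
        lo, hi = rowptr[i], rowptr[i + 1]
        y[i] = vals[lo:hi] @ x[cols[lo:hi]]
    return y
```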
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
Energy Efficiency (FC layer)
• Intel Core i7 5930K: CPU socket and DRAM power are reported by pcm-power utility
• NVIDIA GeForce GTX Titan X: reported by nvidia-smi utility
• NVIDIA Tegra K1: measured total power with a power meter; assuming 15% AC-to-DC conversion loss, 85% regulator efficiency, and 15% of power consumed by peripheral components, about 60% of the measured power goes to the AP+DRAM
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
2. Weight Sharing (Trained Quantization)
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
Weight Sharing: Overview
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
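The idea: cluster each layer's weights so that many connections share one value, then store a small codebook plus per-weight indices. A minimal one-dimensional k-means (Lloyd) sketch, assuming linear centroid initialization; real layers would need a memory-friendlier implementation:

```python
import numpy as np

def share_weights(weights, bits):
    """Cluster a layer's weights into 2**bits shared values (trained
    quantization sketch); returns per-weight indices plus the codebook."""
    k = 2 ** bits
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), k)  # linear init
    for _ in range(20):  # a few Lloyd iterations
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centroids[j] = members.mean()
    return idx.reshape(weights.shape), centroids

# reconstructed layer: centroids[idx] approximates `weights` with 2**bits values
```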
Finetune Centroids
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
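During fine-tuning, the gradient of each shared weight is the sum of the gradients of all weights assigned to that centroid, and the codebook is updated by SGD. A hedged sketch using the `idx` assignment from the clustering sketch above; `lr` is a hypothetical learning rate:

```python
import numpy as np

def centroid_gradient(dL_dW, idx, k):
    """Accumulate each weight's gradient into its centroid's slot."""
    g = np.zeros(k)
    np.add.at(g, idx.ravel(), dL_dW.ravel())
    return g

# one fine-tuning step on the shared codebook:
# centroids -= lr * centroid_gradient(dL_dW, idx, len(centroids))
```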
Accuracy ~ #Bits on 5 Conv Layers + 3 FC Layers
Weight Sharing: Result
• 16 million weights => 2^4 = 16 shared weights (4-bit indices)

• 8/5-bit (conv/FC) quantization results in no accuracy loss

• 8/4-bit quantization results in no top-5 accuracy loss and 0.1% top-1 accuracy loss

• 4/2-bit quantization results in a 1.99% top-1 and 2.60% top-5 accuracy loss; not that bad :-)
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
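The saving from weight sharing alone follows from counting bits: n weights stored as ceil(log2 k)-bit indices plus a codebook of k full-precision values. A small sketch consistent with the compression-rate formula in the ICLR'16 paper:

```python
import math

def compression_rate(n, k, orig_bits=32):
    """n weights, k shared centroids: each index costs ceil(log2 k) bits,
    and the codebook costs k full-precision values."""
    index_bits = math.ceil(math.log2(k))
    return n * orig_bits / (n * index_bits + k * orig_bits)

print(compression_rate(16_000_000, 16))  # 16M weights, 2**4 centroids -> ~8x
```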
Pruning and Quantization Work Well Together
Figure 6: Accuracy vs. compression rate under different compression methods. Pruning and quantization work best when combined.

Figure 7: Pruning doesn't hurt quantization. Dashed: quantization on the unpruned network. Solid: quantization on the pruned network. Accuracy begins to drop at the same number of quantization bits whether or not the network has been pruned. Although pruning reduces the number of parameters, quantization still works as well as on the unpruned network, or even better (the 3-bit case in the left figure).

Figure 8: Accuracy of different initialization methods. Left: top-1 accuracy. Right: top-5 accuracy. Linear initialization gives the best result.
6.2 CENTROID INITIALIZATION
Figure 8 compares the accuracy of the three different initialization methods with respect to top-1 accuracy (left) and top-5 accuracy (right). The network is quantized to 2-8 bits as shown on the x-axis. Linear initialization outperforms density and random initialization in all cases except at 3 bits.

The initial centroids of linear initialization spread equally across the x-axis, from the min value to the max value. That helps maintain the large weights, which play a more important role than smaller ones, as also shown in network pruning (Han et al., 2015). Neither random nor density-based initialization retains large centroids: with these methods, large weights are clustered to small centroids because there are few large weights. In contrast, linear initialization gives large weights a better chance to form a large centroid.
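The three initializations are easy to state precisely. A minimal sketch (density-based initialization approximated here by quantile/CDF spacing):

```python
import numpy as np

def init_centroids(flat_weights, k, method="linear"):
    """Three centroid initializations compared in Figure 8."""
    if method == "linear":    # evenly spaced from min to max; best in the paper
        return np.linspace(flat_weights.min(), flat_weights.max(), k)
    if method == "random":    # k randomly sampled weight values
        return np.random.choice(flat_weights, size=k, replace=False)
    if method == "density":   # equal probability mass between centroids
        return np.quantile(flat_weights, (np.arange(k) + 0.5) / k)
    raise ValueError(method)
```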
6.3 SPEEDUP AND ENERGY EFFICIENCY
Deep Compression targets extremely latency-focused applications running on mobile, which require real-time inference, such as pedestrian detection on an embedded processor inside an autonomous vehicle. Waiting for a batch to assemble significantly adds latency, so when benchmarking we consider the case of batch size 1.
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
3. Huffman Coding
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
Huffman Coding

A Huffman code is an optimal prefix code commonly used for lossless data compression. It produces a variable-length code table for encoding source symbols, derived from the occurrence probability of each symbol. As in other entropy-coding methods, more common symbols are represented with fewer bits than less common symbols, saving space overall.
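Because the quantized weight indices are distributed very non-uniformly, Huffman coding shrinks their storage further at no accuracy cost. A minimal sketch that derives a prefix code from symbol frequencies with a heap:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Map each symbol to a bit string; rarer symbols get longer codes."""
    freq = Counter(symbols)
    # heap entries: [frequency, tie-breaker, {symbol: code-so-far}]
    heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, [f1 + f2, tie, merged])
        tie += 1
    return heap[0][2]

# huffman_code([0, 0, 0, 1, 1, 2]) assigns code lengths 1, 2, 2 to 0, 1, 2
```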
Deep Compression Results on 4 ConvNets
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
Result: AlexNet
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
AlexNet: Breakdown
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
The Big Gun: New Network Topology + Deep Compression
Iandola, Han,et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size” arXiv 2016
New Network Topology + Deep Compression
[Bar chart: number of parameters per SqueezeNet layer (conv1; the squeeze conv1x1 and expand conv1x1/conv3x3 sublayers of fire2 through fire9; conv_final), split into remaining parameters and parameters pruned away.]
Fig 2. Deep Compression is compatible with even an extremely efficient network architecture such as SqueezeNet: it can be pruned 3x and quantized to 6 bits without loss of accuracy.
Fig 1: SqueezeNet architecture
[Diagram: vanilla Fire module. Input (64 channels) → 1x1 conv squeeze (16) → parallel 1x1 conv expand (64) and 3x3 conv expand (64) → concat output (128 channels).]
470KB model, AlexNet-accuracy
Efficient model + pruning + weight sharing
Table 2. Comparing SqueezeNet to model compression approaches. By model size, we mean the number of bytes required to store all of the parameters in the trained model.

| CNN architecture | Compression approach | Data type | Original -> compressed model size | Reduction in model size vs. AlexNet | Top-1 ImageNet accuracy | Top-5 ImageNet accuracy |
| AlexNet | None (baseline) | 32 bit | 240MB | 1x | 57.2% | 80.3% |
| AlexNet | SVD [3] | 32 bit | 240MB -> 48MB | 5x | 56.0% | 79.4% |
| AlexNet | Network Pruning [4] | 32 bit | 240MB -> 27MB | 9x | 57.2% | 80.3% |
| AlexNet | Deep Compression [5] | 5-8 bit | 240MB -> 6.9MB | 35x | 57.2% | 80.3% |
| SqueezeNet (ours) | None | 32 bit | 4.8MB | 50x | 57.5% | 80.3% |
| SqueezeNet (ours) | Deep Compression | 8 bit | 4.8MB -> 0.66MB | 363x | 57.5% | 80.3% |
| SqueezeNet (ours) | Deep Compression | 6 bit | 4.8MB -> 0.47MB | 510x | 57.5% | 80.3% |
Applying Deep Compression [5] to SqueezeNet, using 33% sparsity (see footnote 6 below) and 8-bit quantization, yields a 0.66 MB model (363x smaller than 32-bit AlexNet) with equivalent accuracy to AlexNet. Further, applying Deep Compression with 6-bit quantization and 33% sparsity on SqueezeNet produces a 0.47MB model (510x smaller than 32-bit AlexNet) with equivalent accuracy. It appears that our small model is indeed amenable to compression.

In addition, these results demonstrate that Deep Compression [5] not only works well on CNN architectures with many parameters (e.g. AlexNet and VGG), but is also able to compress the already compact, fully convolutional SqueezeNet architecture: Deep Compression compressed SqueezeNet by 10x while preserving the baseline accuracy. In summary: by combining CNN architectural innovation (SqueezeNet) with state-of-the-art compression techniques (Deep Compression), we achieved a 510x reduction in model size with no decrease in accuracy compared to the baseline.
5 CNN Microarchitecture Design Space Exploration
So far, we have proposed architectural design strategies for small models, followed these principles to create SqueezeNet, and discovered that SqueezeNet is 50x smaller than AlexNet with equivalent accuracy. However, SqueezeNet and other models reside in a broad and largely unexplored design space of CNN architectures. Now, in Sections 5, 6, and 7, we explore several aspects of the design space. We divide this exploration into three main topics: microarchitectural exploration (per-module layer dimensions and configurations), macroarchitectural exploration (high-level end-to-end organization of modules and other layers), and exploration of model compression configurations.

In this section, we design and execute experiments with the goal of providing intuition about the shape of the microarchitectural design space with respect to…

Footnote 6: Note that, due to the storage overhead of storing sparse matrix indices, 33% sparsity leads to somewhat less than a 3x decrease in model size.
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size” arXiv 2016
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size” arXiv 2016
470KB model, AlexNet-accuracy
https://github.com/songhan/SqueezeNet_compressed
Smaller DNN means…
(1) Less communication across servers during distributed training.
(2) Easier to download from the App Store.
(3) Less bandwidth to update the model on an autonomous car.
(4) Easier to deploy on embedded hardware with limited memory.
A Model Compression Tool for App Developers
• Easy Version (done):
  ✓ No training needed
  ✓ Fast (3 minutes)
  ✗ 5x-10x compression rate
  ✗ 1% loss of accuracy

• Advanced Version (todo):
  ✓ 35x-50x compression rate
  ✓ No loss of accuracy
  ✗ Training is needed
  ✗ Slow
deepcompression.net is under construction
DeepCompression.net
Provides a trial account for GTC attendees:
• Username: deepcompression
• Password: songhan
Welcome your feedback!
Conclusion
• We have presented a method to compress neural networks without affecting accuracy by finding the right connections and quantizing the weights.

• Prune the unimportant connections => quantize the network and enforce weight sharing => apply Huffman coding.

• We highlight our experiments on ImageNet: weight storage reduced by 35x for AlexNet and 49x for VGG-16, without loss of accuracy.

• Now the weights can fit in on-chip cache.
Part 2: SqueezeNet++: CNN Design Space Exploration
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, submitted to ECCV
Song Han CVA group, Stanford University
Apr 7, 2016
Motivation: How to choose so many architectural dimensions?
Table 1. SqueezeNet architectural dimensions. (The formatting of this table was inspired by the Inception2 paper [11].)

| layer name/type | output size | filter size / stride (if not a fire layer) | depth | s1x1 (#1x1 squeeze) | e1x1 (#1x1 expand) | e3x3 (#3x3 expand) | sparsity (s1x1 / e1x1 / e3x3) | # bits | # params before pruning | # params after pruning |
| input image | 224x224x3 | - | - | | | | | | | |
| conv1 | 111x111x96 | 7x7/2 (x96) | 1 | | | | 100% (7x7) | 6 bit | 14,208 | 14,208 |
| maxpool1 | 55x55x96 | 3x3/2 | 0 | | | | | | | |
| fire2 | 55x55x128 | | 2 | 16 | 64 | 64 | 100% / 100% / 33% | 6 bit | 11,920 | 5,746 |
| fire3 | 55x55x128 | | 2 | 16 | 64 | 64 | 100% / 100% / 33% | 6 bit | 12,432 | 6,258 |
| fire4 | 55x55x256 | | 2 | 32 | 128 | 128 | 100% / 100% / 33% | 6 bit | 45,344 | 20,646 |
| maxpool4 | 27x27x256 | 3x3/2 | 0 | | | | | | | |
| fire5 | 27x27x256 | | 2 | 32 | 128 | 128 | 100% / 100% / 33% | 6 bit | 49,440 | 24,742 |
| fire6 | 27x27x384 | | 2 | 48 | 192 | 192 | 100% / 50% / 33% | 6 bit | 104,880 | 44,700 |
| fire7 | 27x27x384 | | 2 | 48 | 192 | 192 | 50% / 100% / 33% | 6 bit | 111,024 | 46,236 |
| fire8 | 27x27x512 | | 2 | 64 | 256 | 256 | 100% / 50% / 33% | 6 bit | 188,992 | 77,581 |
| maxpool8 | 13x13x512 | 3x3/2 | 0 | | | | | | | |
| fire9 | 13x13x512 | | 2 | 64 | 256 | 256 | 50% / 100% / 30% | 6 bit | 197,184 | 77,581 |
| conv10 | 13x13x1000 | 1x1/1 (x1000) | 1 | | | | 20% (3x3) | 6 bit | 513,000 | 103,400 |
| avgpool10 | 1x1x1000 | 13x13/1 | 0 | | | | | | 1,248,424 (total) | 421,098 (total) |
4 Evaluation of SqueezeNet

We now turn our attention to evaluating SqueezeNet. In each of the CNN model compression papers reviewed in Section 2.1, the goal was to compress an AlexNet [25] model trained to classify images using the ImageNet [14] (ILSVRC 2012) dataset. Therefore, we use AlexNet (specifically, bvlc_alexnet from the Caffe codebase [26]) and the associated model compression results as a basis for comparison when evaluating SqueezeNet.

In Table 2, we review SqueezeNet in the context of recent model compression results. The SVD-based approach is able to compress a pretrained AlexNet model by a factor of 5x, while diminishing top-1 accuracy to 56.0% [3]. Network Pruning achieves a 9x reduction in model size while maintaining the baseline of 57.2% top-1 and 80.3% top-5 accuracy on ImageNet [4]. Deep Compression achieves a 35x reduction in model size while still maintaining the baseline accuracy level [5]. Now, with SqueezeNet, we achieve a 50x reduction in model size compared to AlexNet, while meeting or exceeding the top-1 and top-5 accuracy of AlexNet. We summarize all of the aforementioned results in Table 2.

It appears that we have surpassed the state-of-the-art results from the model compression community: even when using uncompressed 32-bit values to represent the model, SqueezeNet has a 1.4x smaller model size than the best efforts from the model compression community while maintaining or exceeding the baseline accuracy. Until now, an open question has been: are small models amenable to compression, or do small models "need" all of the representational power afforded by dense floating-point values? To find out, we applied Deep Compression to SqueezeNet, as described above.
Micro-Architecture DSX: use sensitivity analysis

• Micro-architecture: how to size the layers: 64? 128? 256?
• Sensitivity: prune 50% of the weights in a single layer and measure the accuracy.
• => Sensitivity analysis helps size the number of parameters in a layer; a minimal sketch follows.
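A hedged sketch of the sensitivity procedure; `evaluate` and `prune_layer` are hypothetical helpers for whatever framework holds the model:

```python
def layer_sensitivity(model, layer_names, evaluate, prune_layer):
    """Prune 50% of one layer at a time and record the accuracy drop."""
    baseline = evaluate(model)
    drops = {}
    for name in layer_names:
        pruned = prune_layer(model, name, sparsity=0.5)  # only this layer
        drops[name] = baseline - evaluate(pruned)
    return drops

# layers with a small drop are over-parameterized and can be sized down;
# sensitive layers (large drop) deserve more parameters
```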
Table 4. Improving accuracy by using sensitivity analysis to determine where to add parameters in a CNN.

| Architecture | Top-1 Accuracy | Top-5 Accuracy | Model Size |
| SqueezeNet | 57.5% | 80.3% | 4.8MB |
| SqueezeNet++ | 59.5% | 81.5% | 7.1MB |

Table 5. Improving accuracy with dense→sparse→dense (DSD) training.

| Architecture | Top-1 Accuracy | Top-5 Accuracy | Model Size |
| SqueezeNet | 57.5% | 80.3% | 4.8MB |
| SqueezeNet (DSD) | 61.8% | 83.5% | 4.8MB |
Once the network arrives at a local minimum under the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum. So far, we trained in just three stages of density (dense→sparse→dense), but regularizing models by intermittently pruning parameters throughout training (somewhat like Dropout, which prunes activations instead of parameters) would be an interesting area of future work.

8 Conclusions

We have presented SqueezeNet, a CNN architecture that has 50x fewer parameters than AlexNet and maintains AlexNet-level accuracy on ImageNet. We also compressed SqueezeNet to less than 0.5MB, or 510x smaller than AlexNet without compression. In this paper, we focused on ImageNet as a target dataset. However, it has become common practice to apply ImageNet-trained CNN representations to a variety of applications such as fine-grained object recognition, logo identification in images, and generating sentences about images. ImageNet-trained CNNs have also been applied to a number of applications pertaining to autonomous driving, including pedestrian and vehicle detection as well as segmenting the shape of the road. We think SqueezeNet will be a good candidate CNN architecture for a variety of applications, especially those in which small model size is of importance.

SqueezeNet is one of several new CNNs that we have discovered while broadly exploring the design space of CNN architectures. We hope that SqueezeNet will inspire the reader to consider and explore the broad range of possibilities in the design space of CNN architectures.
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, submitted to ECCV
"labrador retriever
dog"
conv1
96
fire2
128
fire3
128
fire4
256
fire5256
fire6
384
fire7
384
fire8
512
fire9512
conv101000
softmax
maxpool/2
maxpool/2
maxpool/2
globalavgpool
conv1
96
fire2
128
fire3
128
fire4
256
fire5256
fire6
384
fire7
384
fire8
512
fire9512
conv101000
softmax
maxpool/2
maxpool/2
maxpool/2
globalavgpool
conv196
fire2
128
fire3
128
fire4
256
fire5256
fire6
384
fire7
384
fire8
512
fire9512
conv101000
softmax
maxpool/2
maxpool/2
maxpool/2
globalavgpool
conv1x1
conv1x1
conv1x1
conv1x1
96
Macro-Architecture DSX
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, submitted to ECCV
Macro-Architecture DSX: vanilla Fire module, simple bypass, and complex bypass with 1x1 Conv
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, submitted to ECCV
[Diagrams: vanilla Fire module (input 64 channels → 1x1 conv squeeze 16 → parallel 1x1 conv expand 64 and 3x3 conv expand 64 → concat 128); Fire module with simple bypass (the 128-channel input is added elementwise to the 128-channel concat output); Fire module with complex bypass (a 1x1 conv bypass maps the 64-channel input to 128 channels for the elementwise add).]
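A self-contained NumPy sketch of the three variants (my own illustration; channel counts follow the diagrams above):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    """x: (H, W, Cin), w: (Cin, Cout) -> pointwise convolution."""
    return x @ w

def conv3x3(x, w):
    """x: (H, W, Cin), w: (3, 3, Cin, Cout), zero padding of 1."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            out += xp[dy:dy + H, dx:dx + W] @ w[dy, dx]
    return out

def fire(x, w_squeeze, w_exp1, w_exp3, bypass=None):
    """Fire module: 1x1 squeeze, parallel 1x1/3x3 expand, channel concat.
    `bypass` is None (vanilla), x itself (simple), or a 1x1 conv of x (complex)."""
    s = relu(conv1x1(x, w_squeeze))                       # 64 -> 16 channels
    out = np.concatenate([relu(conv1x1(s, w_exp1)),       # 16 -> 64
                          relu(conv3x3(s, w_exp3))],      # 16 -> 64
                         axis=-1)                         # 128 output channels
    return out if bypass is None else out + bypass        # elementwise add
```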
Table 3. SqueezeNet accuracy and model size using different macroarchitectures.

| Architecture | Top-1 Accuracy | Top-5 Accuracy | Model Size |
| Vanilla SqueezeNet | 57.5% | 80.3% | 4.8MB |
| SqueezeNet + Simple Bypass | 60.4% | 82.5% | 4.8MB |
| SqueezeNet + Complex Bypass | 58.8% | 82.0% | 7.7MB |
At the macroarchitecture level, we consider the high-level connections among Fire modules. Inspired by ResNet [16], we explored three different architectures:

• Vanilla SqueezeNet (as per the prior sections).
• SqueezeNet with simple bypass connections between some Fire modules.
• SqueezeNet with complex bypass connections between the remaining Fire modules.

We illustrate these three variants of SqueezeNet in Figure 2.

Our simple bypass architecture adds bypass connections around Fire modules 3, 5, 7, and 9, requiring these modules to learn a residual function between input and output. As in ResNet, to implement a bypass connection around Fire3, we set the input to Fire4 equal to (output of Fire2 + output of Fire3), where the + operator is elementwise addition. This changes the regularization applied to the parameters of these Fire modules and, as per ResNet, can improve the final accuracy and/or the ability to train the full model.

One limitation is that, in the straightforward case, the number of input channels and the number of output channels must be the same; as a result, only half of the Fire modules can have simple bypass connections, as shown in the middle diagram of Fig 2. When the "same number of channels" requirement can't be met, we use a complex bypass connection, as illustrated on the right of Figure 2. While a simple bypass is "just a wire," we define a complex bypass as a bypass that includes a 1x1 convolution layer with the number of filters set equal to the number of output channels that are needed. Note that complex bypass connections add extra parameters to the model, while simple bypass connections do not.

In addition to changing the regularization, it is intuitive to us that adding bypass connections would help to alleviate the representational bottleneck introduced by squeeze layers. In SqueezeNet, the squeeze ratio (SR) is 0.125, meaning that every squeeze layer has 8x fewer output channels than the accompanying expand layer. Due to this severe dimensionality reduction, a limited amount of information can pass through squeeze layers. However, by adding bypass connections to SqueezeNet, we open up avenues for information to flow around the squeeze layers.

We trained SqueezeNet with the above three macroarchitectures and compared the accuracy and model size in Table 3. We fixed the microarchitecture to match SqueezeNet as described in Table 1 throughout the macroarchitecture exploration. Interestingly, adding complex bypass layers degraded accuracy relative to simple bypass, although both bypass variants improved on the vanilla architecture (Table 3).
DSD Training (Dense-Sparse-Dense) improves accuracy
• Dense→sparse→dense (DSD) training yielded 4.3% higher accuracy.
• Sparsity is a form of regularization. Once the network arrives at a local minimum under the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.
• Regularizing models by intermittently pruning parameters throughout training would be an interesting area of future work.
Dense → Sparse → Dense (constrain, then relax)
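A hedged sketch of the DSD schedule; `train` and `prune_by_magnitude` are hypothetical framework helpers, with `train` masking gradients when given a mask:

```python
def dsd_train(model, train, prune_by_magnitude, sparsity=0.5):
    model = train(model)                      # Dense: train to convergence
    model, mask = prune_by_magnitude(model, sparsity)
    model = train(model, grad_mask=mask)      # Sparse: retrain under the mask
    model = train(model)                      # Dense: free the zeros, retrain
    return model
```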
Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, submitted to ECCV
Design Space Exploration Conclusion
• SqueezeNet++: sizing the layers with sensitivity analysis
• Use even-number filter sizes [5]
• SqueezeNet + simple and complex bypass layers
• DSD training: improves accuracy by 4.3%
Part 3: EIE: Efficient Inference Engine on Compressed Deep Neural Network

Song Han
CVA group, Stanford University, Apr 7, 2016
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
ASIC Accelerator on Compressed DNN
Offline: no dependency on network connection

Real Time: no network delay, high frame rate

Low Power: high energy efficiency that preserves battery

Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA 2016

• EIE: a sparse, indirectly indexed, weight-shared MxV accelerator.
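Functionally, each EIE PE computes the loop below (a software sketch of the dataflow, not the hardware itself): the matrix is stored column-wise (CSC) as codebook indices, each weight is decoded through a small lookup table, and any column whose input activation is zero is skipped outright:

```python
import numpy as np

def eie_matvec(n_rows, colptr, rowidx, widx, codebook, a):
    """Compressed MxV: CSC-stored sparse matrix, 4-bit weight indices
    decoded through a codebook, zero activations skipped."""
    y = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0.0:
            continue  # dynamic activation sparsity: skip the whole column
        for p in range(colptr[j], colptr[j + 1]):
            y[rowidx[p]] += codebook[widx[p]] * aj  # indirect, shared weight
    return y
```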
Distribute Storage and Processing
[Diagram: 16 processing elements (PEs) in a 4x4 array with central control.]
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Evaluation
1. Cycle-accurate C++ simulator with two abstract methods, Propagate and Update; used for design space exploration and verification.
2. RTL in Verilog; verified its output against the golden model in ModelSim.
3. Synthesized EIE using the Synopsys Design Compiler (DC) under the TSMC 45nm GP standard-VT library at the worst-case PVT corner.
4. Placed and routed the PE using the Synopsys IC Compiler (ICC); used CACTI to get SRAM area and energy numbers.
5. Annotated the toggle rate from the RTL simulation onto the gate-level netlist, dumped it to switching activity interchange format (SAIF), and estimated power using PrimeTime PX.
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Layout of an EIE PE
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Baseline and Benchmark

• CPU: Intel Core i7-5930K

• GPU: NVIDIA TitanX

• Mobile GPU: NVIDIA Jetson TK1
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Result: Speedup / Energy Efficiency
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Comparison with other Platforms
Where are the savings from?
• Three factors for energy saving:
• The matrix is compressed by 35x: less work to do, fewer bricks to carry.

• DRAM => SRAM, no need to go off-chip: 120x; like carrying bricks from Stanford to Palo Alto instead of from Stanford to Berkeley.

• Sparse activations: 3x; lighter bricks to carry.
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Load Balancing and Scalability
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
Media Coverage
TheNextPlatform
http://www.nextplatform.com/2015/12/08/emergent-chip-vastly-accelerates-deep-neural-networks/
O’Reilly
https://www.oreilly.com/ideas/compressed-representations-in-the-age-of-big-data
TechEmergence
http://techemergence.com/a-limitless-pill-for-deep-neural-networks/
Hacker News
https://news.ycombinator.com/item?id=10881683
Conclusion
• We present EIE, an energy-efficient engine optimized to operate on compressed deep neural networks.
• By leveraging sparsity in both the activations and the weights, EIE reduces the energy needed to compute a typical FC layer by 3,000×.
• With wrapper logic on top of EIE, 1x1 and 3x3 convolutions are also possible.
Hardware for Deep Learning
PC → Mobile Computation → Intelligent Mobile Computation
Recap

Model Compression:
[1] Han et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015.
[2] Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", ICLR 2016; Deep Learning Symposium, NIPS 2015.

Hardware Acceleration:
[3] Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA 2016.

CNN Architecture Design Space Exploration:
[4] Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016.
[5] Yao, Han, et al., "Hardware-friendly convolutional neural network with even-number filter size", ICLR 2016 workshop.
Thank you! [email protected]

[1] Han et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015.
[2] Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", ICLR 2016.
[3] Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA 2016.
[4] Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", ECCV'16 submission.
[5] Yao, Han, et al., "Hardware-friendly convolutional neural network with even-number filter size", ICLR'16 workshop.