
Optimizing inference efficiency for tiny DNNs


Page 1
Page 2

Optimizing inference efficiency for tiny DNNs
Harris Teague, Principal Engineer, Qualcomm AI Research

Qualcomm Technologies, Inc.

12 February 2020 tinyML Summit 2020, San Jose, CA

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

Page 3

This is the age where AI can live in your hand instead of the cloud

Page 4

Mobile

Extended reality (XR)

Computing

IP Cameras

Automotive

Audio

Drones

Robotics

Qualcomm® Artificial Intelligence (AI) Engine

Broadening on-device intelligence

Qualcomm Artificial Intelligence Engine is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.

Page 5

Efficiency strategies for improving network inference with “tinyML-level” resource constraints

• DNN optimizing compilers and schedulers

• Architecture optimization
  ◦ Human experts
  ◦ Automated search (NAS)

• Network factoring, pruning, compression

• Better training (get better performance from an existing “constrained” model)
  ◦ Distillation, hyperparameter optimization/search

• Quantization

• Specialized digital accelerators

• New HW: in-memory computation

Page 6

Page 7

Schedule optimization, especially for irregularly connected networks common in NAS

• Reducing peak local-memory utilization
• Other cost metrics can also be used in this framework (e.g., memory bandwidth)

Example irregularly wired networks: RandWire, SwiftNet

Recently accepted paper from Qualcomm AI Research: Ahn et al., “Memory-Aware Scheduling of Irregularly Wired Neural Networks,” MLSys 2020
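To make the scheduling objective concrete, here is a minimal illustrative sketch (not the algorithm from the MLSys 2020 paper): for a tiny, hypothetical irregularly wired graph, enumerate topological orders by brute force and keep the one with the lowest peak activation memory. Real schedulers prune this search; the graph and tensor sizes below are made up.

```python
# Illustrative only: brute-force search for a peak-memory-minimizing schedule
# of a tiny irregularly wired graph (hypothetical sizes in KB).
from itertools import permutations

GRAPH = {            # node -> (output size in KB, list of producer nodes)
    "in":  (16, []),
    "a":   (32, ["in"]),
    "b":   (32, ["in"]),
    "c":   (8,  ["a", "b"]),
    "d":   (8,  ["a"]),
    "out": (4,  ["c", "d"]),
}

def is_topological(order):
    seen = set()
    for node in order:
        if any(src not in seen for src in GRAPH[node][1]):
            return False
        seen.add(node)
    return True

def peak_memory(order):
    # A tensor can be freed once its last consumer in this order has executed.
    last_use = {}
    for step, node in enumerate(order):
        for src in GRAPH[node][1]:
            last_use[src] = step
    live = peak = 0
    for step, node in enumerate(order):
        live += GRAPH[node][0]            # allocate this node's output
        peak = max(peak, live)
        for src in GRAPH[node][1]:
            if last_use[src] == step:     # no later consumer: free the input
                live -= GRAPH[src][0]
    return peak

best = min((o for o in permutations(GRAPH) if is_topological(o)), key=peak_memory)
print("best order:", best, "peak memory (KB):", peak_memory(best))
```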

Page 8

Page 9

Compression for automated optimization of model latency and size

• Reduce costly manual re-engineering

• Apply to models that you already have

• Allow flexibility for rapid tradeoff between accuracy and complexity

Diagram: a three-layer original model and its transformations under spatial SVD, channel pruning, and the two techniques combined
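To illustrate the spatial-SVD step above, here is a hedged NumPy sketch (assumed kernel layout [out, in, kh, kw] and a user-chosen rank): a k×k convolution kernel is factored into a vertical kh×1 kernel followed by a horizontal 1×kw kernel via a truncated SVD. Channel pruning is not shown.

```python
import numpy as np

def spatial_svd(W, rank):
    """Factor a [out, in, kh, kw] conv kernel into a (kh x 1) kernel followed by
    a (1 x kw) kernel with `rank` intermediate channels (illustrative sketch)."""
    out_c, in_c, kh, kw = W.shape
    # Arrange as a 2-D matrix: rows index (in, kh), columns index (out, kw).
    M = W.transpose(1, 2, 0, 3).reshape(in_c * kh, out_c * kw)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    U = U[:, :rank] * np.sqrt(S[:rank])                 # absorb singular values
    Vt = np.sqrt(S[:rank])[:, None] * Vt[:rank]
    W_v = U.reshape(in_c, kh, rank).transpose(2, 0, 1)[..., None]        # [rank, in, kh, 1]
    W_h = Vt.reshape(rank, out_c, kw).transpose(1, 0, 2)[:, :, None, :]  # [out, rank, 1, kw]
    return W_v, W_h

# Toy check: compose the two factors and measure the approximation error.
W = np.random.default_rng(0).normal(size=(16, 8, 3, 3))
W_v, W_h = spatial_svd(W, rank=8)
W_rec = np.einsum("rih,orw->oihw", W_v[..., 0], W_h[:, :, 0, :])
print("relative error:", np.linalg.norm(W_rec - W) / np.linalg.norm(W))
```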

Page 10
Page 11

Page 12

Quantization features

These features make it easy for you or your customer to quantize a model without requiring a training setup and training datasets.

Supported today for feedforward neural network models with 8-bit quantization

Cross-layer equalization: shift scaling around the network to make quantization easier and more accurate

Batch-norm folding: combining layers in a network to reduce compute

High-bias absorption: shift constants between layers to improve quantization fidelity

Bias correction: quantify and compensate for biases produced as a by-product of quantization
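As an example of one of these features, batch-norm folding can be sketched in a few lines; this is a generic NumPy illustration (not tied to any particular toolchain) that folds a BatchNorm following a conv or linear layer into that layer's weights and bias.

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm that follows a conv/linear layer into that layer.
    W: [out, ...] weights, b: [out] bias; gamma/beta/mean/var: [out] BN params."""
    scale = gamma / np.sqrt(var + eps)                     # per-output-channel scale
    W_f = W * scale.reshape(-1, *([1] * (W.ndim - 1)))     # scale each output channel
    b_f = beta + scale * (b - mean)
    return W_f, b_f

# Quick numeric check with a toy linear layer.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)
x = rng.normal(size=3)
ref = gamma * ((W @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
W_f, b_f = fold_batchnorm(W, b, gamma, beta, mean, var)
assert np.allclose(W_f @ x + b_f, ref)
```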

Page 13

Quantization feature benefits: 8-bit quantization at near-floating-point accuracy, even for challenging models

MobilenetV2, ImageNet classification (top-1 %): floating point 71.7; conventional 8-bit 0.1; 8-bit + CLE 69.9; 8-bit + CLE + bias absorb 71.0; 8-bit + CLE + bias absorb + bias correct 71.2

DeeplabV3+, semantic segmentation on Pascal VOC (mIOU): floating point 72.9; conventional 8-bit 41.4; 8-bit + CLE 69.7; 8-bit + CLE + bias absorb 70.9; 8-bit + CLE + bias absorb + bias correct 71.9

“Data-Free Quantization Through Weight Equalization and Bias Correction,” Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling; The IEEE International Conference on Computer Vision (ICCV), 2019, pp. 1325-1334
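For reference, the cross-layer equalization step behind much of this recovery can be sketched as below, following the per-channel scaling rule from the cited paper: channel i of the first layer is divided by s_i = sqrt(r1_i / r2_i) and the matching input channel of the second layer is multiplied by s_i, where r1_i and r2_i are the per-channel weight ranges. The NumPy code and the simple linear-layer shapes are illustrative assumptions.

```python
import numpy as np

def cross_layer_equalize(W1, b1, W2):
    """Rescale two consecutive layers so their per-channel weight ranges match
    (illustrative sketch; assumes a ReLU between the layers, which is
    scale-equivariant for positive scales). W1: [c, in], b1: [c], W2: [out, c]."""
    r1 = np.abs(W1).max(axis=1)      # range of each output channel of layer 1
    r2 = np.abs(W2).max(axis=0)      # range of each input channel of layer 2
    s = np.sqrt(r1 / r2)             # per-channel equalization scale
    return W1 / s[:, None], b1 / s, W2 * s[None, :]

# The rescaled pair computes the same function as the original pair.
rng = np.random.default_rng(1)
W1, b1, W2 = rng.normal(size=(6, 4)), rng.normal(size=6), rng.normal(size=(3, 6))
x = rng.normal(size=4)
relu = lambda v: np.maximum(v, 0.0)
W1e, b1e, W2e = cross_layer_equalize(W1, b1, W2)
assert np.allclose(W2e @ relu(W1e @ x + b1e), W2 @ relu(W1 @ x + b1))
```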

Page 14

Page 15

Potential benefits of In-memory compute

• 10x improvement in efficiency and throughput

• However, there are still challenges in scaling HW implementations to large area

from Naveen Verma, “Advances and Prospects for In-memory Computing,” 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2), Dec 2019

Page 16

Technology choices for in-memory compute: among many options, how do you choose an initial focus?

Tradeoffs are complex and will be different for each application and company

Domain | Alternatives | Factors
Process tech | NVM, SRAM, DRAM | Fab capabilities, compatibility
Analog measurable | Charge, current | Power/error/size tradeoffs
Bitcell type | AND, XNOR | Bitcell design optimization
Bitcell combining | Switched cap, charge share | Leakage optimization
Array arch | Shared, non-shared | Flexibility to model size/type
Bit-op support | Binary, nBit | Simplicity vs. wide applicability
Analog scope | Full analog, mixed signal | Stochastic error vs. power

Page 17

In-memory Binary Neural Network Modeling

• Model binary network operation in-memory

• Pick a model architecture, binarize and train.

• Fine-tune for robustness to in-memory error sources

CIFAR10 small VGG, Multi-keyword Spotting, Single-keyword wakeup

CIM-NN example (for single-keyword wakeup)

Binary in-memory convolution
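A minimal way to model the binary in-memory operation is sketched below: weights and activations are binarized to +-1 and the analog accumulation is represented as an ideal dot product plus additive Gaussian noise standing in for in-memory error sources. The function and noise model are illustrative assumptions, not a description of the actual hardware.

```python
import numpy as np

def binary_matvec_in_memory(W, x, noise_std=0.0, rng=None):
    """Model one binary (+-1) matrix-vector product as a compute-in-memory array
    would perform it, with optional Gaussian noise on the analog accumulation."""
    rng = rng or np.random.default_rng()
    Wb = np.where(W >= 0, 1.0, -1.0)     # binarized weights stored in the array
    xb = np.where(x >= 0, 1.0, -1.0)     # binarized activations on the inputs
    acc = Wb @ xb                        # ideal accumulation
    if noise_std > 0:
        acc = acc + rng.normal(scale=noise_std, size=acc.shape)  # analog non-idealities
    return acc

W = np.random.default_rng(0).normal(size=(8, 32))
x = np.random.default_rng(1).normal(size=32)
print(binary_matvec_in_memory(W, x, noise_std=1.0))
```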

Page 18

Binary Neural Network studies

• In-memory errors degrade performance, but fine-tuning for robustness can recover accuracy

• CIM-NN performs better and recovers better, but this architecture is tailored specifically for in-memory implementation and training (rather than “straight binarization”)

• Compelling power efficiency gains

CIFAR10 small VGG, Multi-keyword Spotting, Single-keyword wakeup
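Fine-tuning for robustness amounts to injecting the same error model during training while updating latent full-precision weights through a straight-through estimator; the toy loop below (random data, made-up noise level) illustrates only the mechanism, not the networks or results reported above.

```python
import numpy as np

# Toy robustness fine-tuning: forward pass uses binarized weights plus injected
# accumulation noise; the gradient is applied to latent full-precision weights
# (straight-through estimator).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(10, 64))        # latent full-precision weights
X = rng.normal(size=(256, 64))                  # toy inputs
Y = rng.integers(0, 10, size=256)               # toy labels
lr, noise_std = 0.01, 2.0

for step in range(200):
    Wb = np.where(W >= 0, 1.0, -1.0)            # binarize for the forward pass
    logits = X @ Wb.T + rng.normal(scale=noise_std, size=(len(X), 10))
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(Y)), Y] -= 1.0              # d(cross-entropy)/d(logits)
    W -= lr * (p.T @ X) / len(Y)                # STE: gradient applied to latent W
```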

Page 19

Next steps and work-in-progress

• Expanding to multi-bit design
  ◦ Applicable to networks that are not easily binarized
  ◦ Further optimize the TOPS/W per model accuracy

• HW/Algorithm co-design
  ◦ ADC design and resolution is a key factor for optimization
  ◦ Further SRAM bitcell improvements to reduce power and analog errors
  ◦ Model training/fine-tuning enhancements to improve accuracy and repeatability

• Compiler and scheduler optimization to automate mapping new networks to this type of accelerator

Page 20

For more information, visit us at www.qualcomm.com and www.qualcomm.com/blog

Thank you

Nothing in these materials is an offer to sell any of the components or devices referenced herein.

©2018-2019 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.

Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners.

References in this presentation to “Qualcomm” may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, all of Qualcomm’s engineering, research and development functions, and all of its product and services businesses, including its semiconductor business, QCT.

Page 21

Abstract

• In this talk, I will explore some of the ways that we are working on improving model inference efficiency for tiny devices – where power, area, memory, and compute resources are limited. I will present results for a few of these: compute scheduling optimization, model compression, quantized inference, and in-memory computing. Finally, I will discuss our plans for next research steps to further understand and develop the technology.

Page 22

Copyright Notice

The presentation(s) in this publication comprise the proceedings of tinyML® Summit 2020. The content reflects the opinion of the authors and their respective companies. This version of the presentation may differ from the version that was presented at the tinyML Summit. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors.

There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.

tinyML is a registered trademark of the tinyML Foundation.

www.tinyML.org