Optimizing inference efficiency for tiny DNNs
Harris Teague, Principal Engineer, Qualcomm AI Research
Qualcomm Technologies, Inc.
12 February 2020 tinyML Summit 2020, San Jose, CA
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
This is the age where AI can live in your hand instead of the cloud
Mobile
Extended reality (XR)
Computing
IP Cameras
Automotive
Audio
Drones
Robotics
Qualcomm® Artificial Intelligence (AI) Engine
Broadening on-device intelligence
Qualcomm Artificial Intelligence Engine is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
Efficiency strategies
for improving network inference with "tinyML-level" resource constraints
• DNN optimizing compilers and schedulers
• Architecture optimization
  ◦ Human experts
  ◦ Automated search (NAS)
• Network factoring, pruning, compression
• Better training (get better performance from an existing "constrained" model)
  ◦ Distillation, hyperparameter optimization/search
• Quantization
• Specialized digital accelerators
• New HW: in-memory computation
Schedule optimization
especially for irregularly connected networks common in NAS
• Reducing peak local memory utilization (a small sketch of this objective follows below)
• Other cost metrics can also be used in this framework (e.g. memory bandwidth)
[Diagram: example irregularly wired networks, RandWire and SwiftNet]
Recently accepted paper from Qualcomm AI Research: Ahn et al., "Memory-Aware Scheduling of Irregularly Wired Neural Networks," MLSys 2020
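As a rough illustration of the objective (not the algorithm from the MLSys 2020 paper), the sketch below enumerates topological orders of a tiny, made-up DAG and picks the one with the lowest peak activation memory; the node names and sizes are illustrative assumptions.

```python
from itertools import permutations

# Hypothetical DAG and per-node activation sizes (bytes); both are made up.
edges = {"x": ["b", "c"], "b": ["e"], "c": ["f"], "e": ["g"], "f": ["g"], "g": []}
size = {"x": 16, "b": 100, "c": 10, "e": 16, "f": 10, "g": 4}

def peak_memory(order):
    """Peak sum of live activations when nodes execute in `order`."""
    executed, live, peak = set(), {}, 0
    for n in order:
        live[n] = size[n]                     # output of n becomes live
        executed.add(n)
        peak = max(peak, sum(live.values()))
        for p in list(live):                  # free nodes whose consumers have all run
            if edges[p] and all(c in executed for c in edges[p]):
                del live[p]
    return peak

def is_topological(order):
    pos = {n: i for i, n in enumerate(order)}
    return all(pos[u] < pos[v] for u in edges for v in edges[u])

best = min((o for o in permutations(edges) if is_topological(o)), key=peak_memory)
print("best order:", best, "peak memory:", peak_memory(best))
```

Brute force only works for toy graphs; the point is simply that different valid execution orders of an irregular graph can have very different peak-memory footprints, which is what a memory-aware scheduler exploits.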
Compression
for automated optimization of model latency and size
• Reduce costly manual re-engineering
• Apply to models that you already have
• Allow flexibility for rapid tradeoff between accuracy and complexity
[Diagram: an original three-layer model, the same model after spatial SVD (each layer factored into two), after channel pruning, and with both techniques combined; a small SVD factoring sketch follows below]
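A minimal sketch of the idea behind the spatial SVD step: replace one weight matrix with two thinner factors whose product approximates it, trading a small reconstruction error for fewer parameters and MACs. The layer shape and rank here are illustrative assumptions, and a real spatial SVD would factor a k×k convolution into a k×1 and a 1×k convolution in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))            # original layer: 512 -> 256

def factor(W, rank):
    """Return (A, B) with A @ B ~= W, keeping only the top `rank` singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]                 # shape (256, rank)
    B = Vt[:rank, :]                           # shape (rank, 512)
    return A, B

A, B = factor(W, rank=64)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"params: {W.size} -> {A.size + B.size}, relative error {rel_err:.3f}")
```

Sweeping the rank gives the rapid accuracy/complexity tradeoff mentioned above, and channel pruning composes with it by removing whole input/output channels before or after the factoring.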
Quantization features
These features make it easy for you or your customer to quantize a model without requiring a training setup or training datasets.
Supported today for feedforward neural network models, 8-bit quantization.
• Cross-layer equalization: shift scaling around the network to make quantization easier and more accurate
• Batch-norm folding: combine layers in a network to reduce compute (see the sketch below)
• High-bias absorption: shift constants between layers to improve quantization fidelity
• Bias correction: quantify and compensate for biases produced as a by-product of quantization
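A minimal sketch of batch-norm folding, one of the features listed above: fold a BatchNorm that follows a linear or convolutional layer into that layer's weights and bias, so inference (and quantization) sees a single layer. A dense layer is shown for brevity; for a convolution the same per-output-channel scale is broadcast over the kernel dimensions. All shapes and values are illustrative.

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN(y) = gamma*(y-mean)/sqrt(var+eps) + beta into y = W@x + b."""
    scale = gamma / np.sqrt(var + eps)        # per-output-channel BN scale
    return W * scale[:, None], (b - mean) * scale + beta

# Check: layer followed by BN equals the folded layer (using BN inference statistics).
rng = np.random.default_rng(0)
W, b = rng.standard_normal((8, 16)), rng.standard_normal(8)
gamma, beta = rng.standard_normal(8), rng.standard_normal(8)
mean, var = rng.standard_normal(8), rng.random(8) + 0.5
x = rng.standard_normal(16)

y = W @ x + b
bn_out = gamma * (y - mean) / np.sqrt(var + 1e-5) + beta
Wf, bf = fold_bn(W, b, gamma, beta, mean, var)
assert np.allclose(bn_out, Wf @ x + bf)
```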
Quantization feature benefits
8-bit quantization at near-floating-point accuracy, even for challenging models.
[Bar charts: MobileNetV2 top-1% ImageNet classification and DeeplabV3+ mIOU semantic segmentation on Pascal VOC, each comparing floating point, conventional 8-bit, 8-bit + CLE, 8-bit + CLE + bias absorption, and 8-bit + CLE + bias absorption + bias correction. Conventional 8-bit quantization collapses accuracy on both models; adding CLE, bias absorption, and bias correction recovers it to near the floating-point baseline. A small bias-correction sketch follows below.]
Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, "Data-Free Quantization Through Weight Equalization and Bias Correction," IEEE International Conference on Computer Vision (ICCV), 2019, pp. 1325–1334.
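A minimal sketch of the bias-correction idea behind these results: weight quantization shifts a layer's expected output by E[(Q(W) − W)·x]; estimating E[x] from a few calibration batches (or from batch-norm statistics, in the data-free variant) lets that shift be subtracted from the layer bias. The per-tensor quantizer, shapes, and calibration data below are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_weights(W, bits=8):
    """Symmetric per-tensor weight quantizer (a stand-in for a real 8-bit pipeline)."""
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(W / scale), -(2 ** (bits - 1) - 1), 2 ** (bits - 1) - 1)
    return q * scale

W, b = rng.standard_normal((32, 64)), rng.standard_normal(32)
X = np.abs(rng.standard_normal((1024, 64)))     # post-ReLU-like calibration activations

Wq = quantize_weights(W)
expected_x = X.mean(axis=0)                     # E[x] over calibration data
bias_shift = (Wq - W) @ expected_x              # E[(Q(W) - W) x], the induced bias
b_corrected = b - bias_shift                    # compensate in the layer bias

shift_before = ((X @ Wq.T + b) - (X @ W.T + b)).mean(axis=0)
shift_after = ((X @ Wq.T + b_corrected) - (X @ W.T + b)).mean(axis=0)
print(np.abs(shift_before).mean(), np.abs(shift_after).mean())  # second value ~0
```

The corrected layer still has per-sample quantization noise, but its expected output matches the floating-point layer, which is what removes the systematic accuracy loss.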
Potential benefits of in-memory compute
• 10x improvement in efficiency and throughput
• However, there are still challenges in scaling HW implementations to large array areas
from Naveen Verma, "Advances and Prospects for In-memory Computing," 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2), Dec 2019
Technology choices for in-memory
among many options, how do you choose an initial focus?
Tradeoffs are complex and will be different for each application/company.

Domain            | Alternatives                 | Factors
Process tech      | NVM, SRAM, DRAM              | Fab capabilities, compatibility
Analog measurable | Charge, current              | Power/error/size tradeoffs
Bitcell type      | AND, XNOR                    | Bitcell design optimization
Bitcell combining | Switched cap, charge share   | Leakage optimization
Array arch        | Shared, non-shared           | Flexibility to model size/type
Bit-op support    | Binary, n-bit                | Simplicity vs. wide applicability
Analog scope      | Full analog, mixed signal    | Stochastic error vs. power
In-memory binary neural network modeling
• Model binary network operation in-memory
• Pick a model architecture, binarize, and train
• Fine-tune for robustness to in-memory error sources
Workloads: CIFAR10 small VGG, multi-keyword spotting, single-keyword wakeup
[Diagrams: CIM-NN example (for single-keyword wakeup); binary in-memory convolution — see the sketch below]
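A minimal sketch of modeling one binary in-memory multiply-accumulate (not the CIM-NN design itself): weights and activations are ±1, the dot product reduces to an XNOR/popcount, and analog non-idealities are modeled as noise on the accumulated column sum. The Gaussian noise model and its magnitude are illustrative assumptions; injecting this kind of error during fine-tuning is one way to train for robustness.

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize(x):
    """Map real values to +/-1 (sign binarization)."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def in_memory_binary_dot(w_bin, x_bin, noise_sigma=0.0):
    """Ideal result is sum(w*x); Gaussian noise mimics analog readout error."""
    ideal = np.dot(w_bin.astype(np.int32), x_bin.astype(np.int32))
    return ideal + rng.normal(0.0, noise_sigma * np.sqrt(len(w_bin)))

w = binarize(rng.standard_normal(256))   # one flattened binary filter column
x = binarize(rng.standard_normal(256))   # binarized input patch

clean = in_memory_binary_dot(w, x)
noisy = in_memory_binary_dot(w, x, noise_sigma=0.05)
print("ideal:", clean, "with analog error:", round(float(noisy), 1))
```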
Binary neural network studies
Workloads: CIFAR10 small VGG, multi-keyword spotting, single-keyword wakeup
• In-memory errors degrade performance, but fine-tuning for robustness can recover accuracy
• CIM-NN performs better and recovers better, but this architecture is tailored specifically for in-memory implementation and training (rather than "straight binarization")
• Compelling power efficiency gains
Next steps and work-in-progress
• Expanding to multi-bit design
  ◦ Applicable to networks that are not easily binarized
  ◦ Further optimize the TOPS/W per model accuracy
• HW/algorithm co-design
  ◦ ADC design and resolution is a key factor for optimization
  ◦ Further SRAM bitcell improvements to reduce power and analog errors
  ◦ Model training/fine-tuning enhancements to improve accuracy and repeatability
• Compiler and scheduler optimization to automate mapping new networks to this type of accelerator
For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog
Thank you
Nothing in these materials is an offer to sell any of the components or devices referenced herein.
©2018-2019 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.
Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners.
References in this presentation to “Qualcomm” may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, all of Qualcomm’s engineering, research and development functions, and all of its product and services businesses, including its semiconductor business, QCT.
Abstract
In this talk, I will explore some of the ways that we are working on improving model inference efficiency for tiny devices, where power, area, memory, and compute resources are limited. I will present results for a few of these: compute scheduling optimization, model compression, quantized inference, and in-memory computing. Finally, I will discuss our plans for next research steps to further understand and develop the technology.
Copyright Notice
The presentation(s) in this publication comprise the proceedings of tinyML® Summit 2020. The content reflects the opinion of the authors and their respective companies. This version of the presentation may differ from the version that was presented at the tinyML Summit. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors.
There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.
tinyML is a registered trademark of the tinyML Foundation.
www.tinyML.org