“Enabling Neural Networks at the Low Power Edge: A Neural Network Compiler for Hardware-Constrained Embedded Systems”
Chao Xu – Eta Compute
November 24, 2020
tinyML Talks Sponsors
Additional Sponsorships available – contact [email protected] for info
tinyML Strategic Partner
Arm: The Software and Hardware Foundation for tinyML
Optimized models for embedded:
• Application
• Runtime (e.g. TensorFlow Lite Micro)
• Optimized low-level NN libraries (i.e. CMSIS-NN)
• Arm Cortex-M CPUs and microNPUs
1. Connect to high-level frameworks
2. Supported by end-to-end tooling: profiling and debugging tooling such as Arm Keil MDK; RTOS such as Mbed OS
3. Connect to runtime
AI Ecosystem Partners
Resources: developer.arm.com/solutions/machine-learning-on-arm
Stay Connected
@ArmSoftwareDevelopers
@ArmSoftwareDev
©2020 Deeplite, All Rights Reserved
BECOME BETA USER bit.ly/testdeeplite
WE USE AI TO MAKE OTHER AI FASTER, SMALLER AND MORE POWER EFFICIENT
Automatically compress SOTA models like MobileNet to <200KB with
little to no drop in accuracy for inference on resource-limited MCUs
Reduce model optimization trial & error from weeks to days using
Deeplite's design space exploration
Deploy more models to your device without sacrificing performance or
battery life with our easy-to-use software
Copyright © EdgeImpulse Inc.
TinyML for all developers
Get your free account at http://edgeimpulse.com
[Workflow diagram: Dataset → Impulse → Edge Device → Test]
• Acquire valuable training data securely
• Enrich data and train ML algorithms
• Test impulse with real-time device data flows
• Embedded and edge compute deployment options
Real sensors in real time. Open source SDK.
Maxim Integrated: Enabling Edge Intelligence (www.maximintegrated.com/ai)
Sensors and Signal Conditioning
Health sensors measure PPG and ECG signals critical to understanding vital signs. Signal chain products enable measuring even the most sensitive signals.
Low Power Cortex M4 Micros
The biggest (3MB flash and 1MB SRAM) and the smallest (256KB flash and 96KB SRAM) Cortex M4 microcontrollers enable algorithms and neural networks to run at wearable power levels
Advanced AI Acceleration
The new MAX78000 implements AI inferences at over 100x lower energy than other embedded options. Now the edge can see and hear like never before.
QEEXO AUTOML: END-TO-END MACHINE LEARNING PLATFORM
Qeexo AutoML for Embedded AI – an automated machine learning platform that builds tinyML solutions for the Edge using sensor data.
Key Features:
• Wide range of ML methods: GBM, XGBoost, Random Forest, Logistic Regression, Decision Tree, SVM, CNN, RNN, CRNN, ANN, Local Outlier Factor, and Isolation Forest
• Easy-to-use interface for labeling, recording, validating, and visualizing time-series sensor data
• On-device inference optimized for low latency, low power consumption, and a small memory footprint
• Supports Arm® Cortex™-M0 to M4 class MCUs
• Automates complex and labor-intensive processes of a typical ML workflow – no coding or ML expertise required!
Target Markets/Applications:
• Industrial Predictive Maintenance
• Smart Home
• Wearables
• Automotive
• Mobile
• IoT
For a limited time, sign up to use Qeexo AutoML at automl.qeexo.com for FREE to bring intelligence to your devices!
Reality AI Tools® software is for building products:
• Automated Data Assessment
• Automated Feature Exploration and Model Generation
• Bill-of-Materials Optimization
• Edge AI / TinyML code for the smallest MCUs
Reality AI solutions:
• Automotive sound recognition & localization
• Indoor/outdoor sound event recognition
• RealityCheck™ voice anti-spoofing
[email protected] | @SensorAI | Reality AI | https://reality.ai
SynSense builds ultra-low-power (sub-mW) sensing and inference hardware for embedded, mobile and edge devices. We design systems for real-time always-on smart sensing, for audio, vision, IMUs, bio-signals and more.
https://SynSense.ai
Next tinyML Talks
Tuesday, December 8:
• Ian Campbell, CEO, OnScale: "Training Embedded AI/ML Using Synthetic Data"
• Sakyasingha Dasgupta, Founder, CEO & Managing Director, Edgecortix Inc.: "Using AI to design energy-efficient AI accelerators for the edge"
Webcast start time is 8 am Pacific time. Each presentation is approximately 30 minutes in length.
Please contact [email protected] if you are interested in presenting
Reminders
youtube.com/tinyml
Slides & Videos will be posted tomorrow
tinyml.org/forums
Please use the Q&A window for your questions
Chao Xu
Chao brings more than 20 years of experience in advanced signal processing and machine learning, networking semiconductors, and silicon photonics. Prior to Eta Compute, Dr. Xu served as senior director of communication systems and of the computing and storage platform at Inphi Corporation. Prior to Inphi, he held senior R&D positions at Integrated Device Technology and PMC-Sierra. Dr. Xu has over 30 pending and awarded patents. He received his Ph.D. from the University of Pennsylvania, and a master of science degree and a bachelor of engineering degree in electrical engineering from the University of Science and Technology of China. His research areas include speech recognition, noise robustness, feature extraction, and other general machine learning methods.
Enabling Neural Networks at the Ultra Low Power Edge: A Neural Network Compiler for Hardware-Constrained Embedded Systems
Presented by: Chao Xu
Agenda
• Introduction – AI/ML in ultra low power edge devices
• Challenges of developing AI/ML applications in ultra low power edge devices
• Status of deploying AI/ML applications in ultra low power edge devices
• Streamline AI/ML design from idea to firmware – TENSAI® Flow: an integrated approach to minimize the barrier to designing neural networks for ultra-low power edge devices
• TENSAI® Compiler – a neural network compiler for ultra low power edge devices
• Summary
Neural networks continue to gain interest for deployment in IoT and other mobile and edge devices, across voice, sound, and image applications.
Ultra Low Power Neural Sensor Platform – develop and deploy:
• Neural Sensor Processor
• Software Development Kit
• Trained Neural Net Libraries
• Reference Designs
• Development Tools
Challenges of developing AI/ML applications in ultra low power edge devices
• Limited embedded memory (128KB or 256KB)
• Limited storage (256KB or 512KB)
• Limited computing speed (50MHz++)
• Limited power (mW++)
• Fragmented technology landscape:
⎯ TensorFlow / TensorFlow Lite / TensorFlow Micro
⎯ PyTorch
⎯ MCU or DSP or NPU
Status of deploying AI/ML applications in ultra low power edge devices
• Very long development time:
⎯ Multiple development frameworks
⎯ Floating point / fixed point
⎯ Hardware constraints
• Trade-off between power and accuracy:
⎯ Small model size limits use cases which require high accuracy
• Due to limited accuracy, many practical use cases are impossible in ultra low power edge devices
Streamline Design from Idea to Firmware with TENSAI® Flow
[Flow diagram: training data and network generation in TensorFlow yield a trained model; TF Lite network generation feeds the TENSAI® Compiler, which quantizes and optimizes against the design goals (memory, speed/power, accuracy); the output runs on the Eta Compute TENSAI® Flow runtime (DSP libs, NN libs, executors, FreeRTOS, HAL, libraries) as the embedded application on the Eta Compute Neural Sensor Processor.]
TENSAI® Flow Overview
Workflow: data collection from a sensor board or other methods → collect dataset → train → quantize and optimize → validate → release.
Components (Eta Compute generated and partner supplied): TENSAI Compiler, TENSAI NN Zoo with qualified algorithms, trained NN model, application, cloud connectivity, and TENSAI middleware (AI kernel, FreeRTOS, HAL, project generator, sensor drivers).
• Minimal embedded coding required
• Variety of networks: CNN, MobileNets, Inception, ResNet, RNN (see the example sketch after this list)
• Automatic multicore software, no intervention required
• Optimized kernels for peak performance and efficiency
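The MobileNet-class vision networks listed above can be expressed in a few lines of Keras. The sketch below is a hypothetical illustration only, not Eta Compute's actual model: a MobileNetV1 backbone with a reduced width multiplier, sized for the 96x96 person/no-person task benchmarked later in this deck.

```python
# A minimal sketch (not Eta Compute's actual network): a MobileNetV1-style
# classifier sized for a 96x96 "person / no person" task. Assumes TensorFlow 2.x.
import tensorflow as tf

def build_person_detector(input_shape=(96, 96, 3), num_classes=2):
    # alpha=0.25 shrinks every layer's channel count to suit small MCU memories.
    backbone = tf.keras.applications.MobileNet(
        input_shape=input_shape,
        alpha=0.25,
        weights=None,        # train from scratch on your own dataset
        include_top=False,
        pooling="avg",
    )
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(backbone.output)
    return tf.keras.Model(backbone.input, outputs)

model = build_person_detector()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # check the parameter count against the target's flash/RAM budget
```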
TENSAI® Compiler – a Neural Network Compiler for ultra low power edge devices
[Diagram: TFLite model file → TENSAI® Compiler → automatically generated inference code, optimized NN kernels, executors, and FreeRTOS on resource-constrained hardware.]
• Allows users to train the network in floating point in TensorFlow
• Quantize to a TFLite model using the standard flow (see the sketch after this list)
• The TENSAI® compiler then generates embedded code directly from the TFLite model for ultra low power edge devices
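For reference, the "standard flow" mentioned above is TensorFlow Lite's post-training full-integer quantization. A minimal sketch, assuming a trained Keras `model` and an array of representative input images `rep_images` from your own pipeline; the resulting .tflite file is what would then be handed to a downstream compiler such as the TENSAI® Compiler.

```python
# Standard TensorFlow Lite post-training int8 quantization flow.
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield ~100 typical input samples so the converter can calibrate
    # activation ranges for full-integer quantization.
    for image in rep_images[:100]:
        yield [np.expand_dims(image.astype(np.float32), axis=0)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # fully integer in/out for MCU deployment
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Quantized model size: {len(tflite_model) / 1024:.1f} kB")
```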
TENSAI® Compiler Pipelines
• Training level optimization:
⎯ Hyperparameter optimization (AutoML), pruning, weight reduction, and quantization are done at the network level during training (see the sketch after this list).
• TENSAI® Flow optimization:
⎯ Optimizes the trained neural network for best execution on TENSAI Neural Sensor Processor(s)
⎯ Optimizes memory
⎯ Optimizes performance and efficiency
⎯ Enables large neural networks
• Partnerships with innovators to develop more efficient networks:
⎯ Binarized networks
⎯ Feature extraction to reduce neural network size
⎯ Traditional machine learning – less compute intensive, complements neural networks
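As a generic illustration of the training-level optimizations above, the sketch below uses the open-source tensorflow-model-optimization toolkit for magnitude pruning followed by quantization-aware training. It is not Eta Compute's tooling, and `model`, `train_ds`, and `val_ds` are assumed to come from your own pipeline.

```python
# Pruning and quantization-aware training with tensorflow-model-optimization.
import tensorflow_model_optimization as tfmot

# Phase 1: magnitude pruning, ramping to 50% sparsity while fine-tuning.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=2000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)
pruned.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])
pruned.fit(train_ds, validation_data=val_ds, epochs=3,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Phase 2: quantization-aware training on the pruned network, so the weights
# learn to tolerate int8 conversion with little accuracy loss.
qat = tfmot.quantization.keras.quantize_model(
    tfmot.sparsity.keras.strip_pruning(pruned))
qat.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
qat.fit(train_ds, validation_data=val_ds, epochs=2)
```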
TENSAI® Compiler Optimization
• Memory optimization:
⎯ Operation pipelining, redundancy reduction
⎯ Dynamic allocation and deallocation
⎯ Buffer sharing between input and output channels
• Performance and efficiency optimization:
⎯ Optimized kernels for DSP and M3
⎯ Multicore workload balancing
• Large neural network support:
⎯ Weight compression: efficient LZ4, up to 2x more weights (see the sketch after this list)
⎯ Scale-out ECM3532: the compiler splits the model across multiple chips
⎯ Weights in external flash: middleware makes storage transparent; pipelined operations mask transfer delays
Example: 416 kB -> 176 kB
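A quick way to gauge what LZ4 weight compression might buy is to run an off-the-shelf LZ4 implementation over a quantized model file. This is only a rough proxy for the compiler's scheme described above (which compresses the weights, not the whole flatbuffer), and the file name is illustrative.

```python
# Rough estimate of the "up to 2x more weights" LZ4 claim using the
# open-source `lz4` package (pip install lz4); ratios will differ from
# Eta Compute's own compressor.
import lz4.frame

with open("model_int8.tflite", "rb") as f:
    raw = f.read()

packed = lz4.frame.compress(raw)
print(f"original:   {len(raw) / 1024:.0f} kB")
print(f"compressed: {len(packed) / 1024:.0f} kB")
print(f"ratio:      {len(raw) / len(packed):.2f}x")
```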
Compiler Efficiency - Vision, Sound, Motion

Area | Task | Model | Dataset | Image Size | Inferences per second | Power (mW) | Energy per inference (mJ)
Vision | Image classification | Eta Compute CNN | CIFAR10 | 32x32 | 20 | 2.7 | 0.14
Vision | Person detection | MobileNet V1 | COCO | 96x96 | 1 / 3 | 1.3 / 4.2 | 1.3 / 1.5
Vision | Person, object counting | MobileNet V1 | COCO | 256x256 | 1 | 4.1 | 4.1
Audio | Command recognition | Eta Compute GRU | Google | NA | 2 to 5 | 1.2 | 0.4
Motion | Activity detection | Eta Compute CNN | Motion Sense for Human Activity | NA | 50 | 1.1 | 0.02
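The last three columns of the table are related by a simple identity: energy per inference (mJ) equals average power (mW) divided by inference rate (inferences/s), since 1 mW = 1 mJ/s. A quick check against the rows above; the command-recognition rate of "2 to 5" is taken as 3 for illustration.

```python
# Sanity check: energy per inference (mJ) = power (mW) / inference rate (1/s).
rows = {
    "Image classification (CIFAR10)":  (2.7, 20),   # (power_mW, inferences_per_second)
    "Person detection, 96x96":         (4.2, 3),
    "Person/object counting, 256x256": (4.1, 1),
    "Command recognition":             (1.2, 3),    # table says "2 to 5"; 3 assumed
    "Activity detection":              (1.1, 50),
}
for task, (power_mw, rate) in rows.items():
    print(f"{task}: {power_mw / rate:.2f} mJ per inference")
```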
Compiler Efficiency - Visual Keyword (Human/No Human)

Hardware | Neural Net | Kernels | Core | Secs/Inf | Accuracy | Frequency | Energy (mJ)/Inf
Vendor A | Google's | ARM CMSIS | M4 | 1.8 | Ditto | 96 MHz | 16
ECM3531 | Google's | ARM CMSIS | M3 | 5.2 | Ditto | 60 MHz | 21
ECM3532 | Google's | Eta Compiler | M3 | 0.8 | Ditto | 60 MHz | 3
ECM3532 | Google's | Eta Compiler | M3 + DSP | 0.36 | Ditto | 60 MHz | 1.5
ECM3532 | Google's | Eta Compiler | M3 + DSP | 1 | Ditto | 20 MHz | 1.3

Row 1 vs Row 2: hardware comparison with the same kernels; ARM CMSIS M4 kernels are more efficient than M3 kernels.
Row 2 vs Row 3: same hardware, different kernels, still M3 only; a factor of 7x from software alone.
Row 3 vs Row 4: add the DSP; a factor of 2 in speed and 0.5x in energy.
Row 4 vs Row 5: drop the frequency to further reduce the energy.
CIFAR10 - Image Categorization
Network: CNN; Ops/Inf (M): 25; Weights (kB): 87; Image Size: 32x32

Hardware | Neural Net | Kernels | Secs/Inf | Energy (mJ)/Inf
Vendor B | CNN | ARM CMSIS | 0.1 | 30
ECM3532 | Eta Compute CNN | Eta Gen2 | 0.05 | 0.15

200x lower energy per inference.
Compiler Efficiency – Customer Measurements
[Bar charts: time, power, and energy versus Vendor C for two customer networks, Network 1 (smaller) and Network 2 (larger); the gains are 4x faster with 6x lower energy for one network and 10x faster with 10x lower energy for the other.]
Both networks are M3 only – too small for efficient DSP use.
People Counting in Ultra Low Power Edge Devices
Example deployments:
• Entrance of the conference room
• Entrance/exit of the building (counting)
• Cafeteria & corridors
• Social distancing detection
People Counting Prototyping: ECM3532 EVB + AI Vision Extension
ECM3532 EVB with ECM3532 NSP, plus the AI Vision Extension board:
• QVGA camera (Himax HM0360)
• PIR sensor (Panasonic EKMB1101112)
• Serial flash, 64 Mbits (Macronix MX25R6435F)
• ToF sensor (STM VL53L3CX)
• Ambient light sensor (TI OPT3001)
• Microphone #1 (TDK ICS-41350)
• Microphone #2 (TDK ICS-41350 PDM)
People Counting Performance Measurements
IOU | Confidence | Accuracy | Precision | Recall
0.3 | 0.35 | 0.943 | 0.969 | 0.969

Task | Model | Dataset | Image Size | Inf/sec | Power (mW) | Energy/Inf (mJ)
Person detection | MobileNet SSD V1 | COCO | 96x96 | 1 / 3 | 1.3 / 4.2 | 1.3 / 1.5
Person, object counting | MobileNet SSD V1 | COCO | 256x256 | 1 | 4.1 | 4.1
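For context on the metrics above: the IOU and confidence values are presumably matching thresholds, so a detection counts as a true positive when its confidence is at least 0.35 and its intersection-over-union (IoU) with a ground-truth box is at least 0.3. A minimal sketch of that criterion, with boxes assumed to be (x1, y1, x2, y2):

```python
# Minimal sketch of the matching criterion implied by the IOU and confidence
# thresholds above; boxes are assumed to be (x1, y1, x2, y2) in pixels.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, pred_score, gt_box, iou_thresh=0.3, conf_thresh=0.35):
    # A detection is correct when it is confident enough and overlaps a
    # ground-truth person box by at least the IoU threshold.
    return pred_score >= conf_thresh and iou(pred_box, gt_box) >= iou_thresh
```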
Summary
• Neural networks continue to gain interest for deployment in IoT and other mobile and edge devices. Yet enabling a NN in a hardware constrained embedded system such as an ultra low power edge device presents many challenges.
• We showed how Eta Compute took an integrated approach to minimize the barrier to designing neural networks for ultra low power devices:
⎯ Neural network design and optimization for the embedded world: memory, compute power, and accuracy
⎯ Hardware and software co-optimization to improve energy efficiency
⎯ Automatic inference code generation based on the model graph by a proprietary hardware-aware compiler tool
World’s Most Energy Efficient AI Platform for the Extreme Edge
Eta (η) is the seventh letter of the Greek alphabet. In electronics, η represents efficiency of circuits.
Copyright Notice
This presentation in this publication was presented as a tinyML® Talks webcast. The content reflects the opinion of the author(s) and their respective companies. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors.
There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.
tinyML is a registered trademark of the tinyML Foundation.
www.tinyML.org