“Enabling Neural Networks at the Low Power Edge: A Neural Network Compiler for Hardware-Constrained Embedded Systems”
Chao Xu – Eta Compute
November 24, 2020
tinyML Talks Sponsors
Additional Sponsorships available – contact [email protected] for info
tinyML Strategic Partner
Arm: The Software and Hardware Foundation for tinyML
Optimized models for embedded:
• Application
• Runtime (e.g. TensorFlow Lite Micro)
• Optimized low-level NN libraries (i.e. CMSIS-NN)
• Arm Cortex-M CPUs and microNPUs
1. Connect to high-level frameworks
2. Supported by end-to-end tooling: profiling and debugging tooling such as Arm Keil MDK; RTOS such as Mbed OS
3. Connect to runtime
AI Ecosystem Partners
Resources: developer.arm.com/solutions/machine-learning-on-arm
Stay Connected
@ArmSoftwareDevelopers
@ArmSoftwareDev
©2020 Deeplite, All Rights Reserved
BECOME BETA USER bit.ly/testdeeplite
WE USE AI TO MAKE OTHER AI FASTER, SMALLER AND MORE POWER EFFICIENT
Automatically compress SOTA models like MobileNet to <200KB with
little to no drop in accuracy for inference on resource-limited MCUs
Reduce model optimization trial & error from weeks to days using
Deeplite's design space exploration
Deploy more models to your device without sacrificing performance or
battery life with our easy-to-use software
Copyright © EdgeImpulse Inc.
TinyML for all developers
Get your free account at http://edgeimpulse.com
[Workflow diagram: Dataset → Impulse → Edge Device → Test]
• Acquire valuable training data securely
• Enrich data and train ML algorithms
• Test impulse with real-time device data flows
• Embedded and edge compute deployment options
Real sensors in real time. Open source SDK.
Maxim Integrated: Enabling Edge Intelligence (www.maximintegrated.com/ai)
Sensors and Signal Conditioning
Health sensors measure PPG and ECG signals critical to understanding vital signs. Signal chain products enable measuring even the most sensitive signals.
Low Power Cortex M4 Micros
The biggest (3MB flash and 1MB SRAM) and the smallest (256KB flash and 96KB SRAM) Cortex M4 microcontrollers enable algorithms and neural networks to run at wearable power levels
Advanced AI Acceleration
The new MAX78000 implements AI inferences at over 100x lower energy than other embedded options. Now the edge can see and hear like never before.
QEEXO AUTOML: END-TO-END MACHINE LEARNING PLATFORM
Qeexo AutoML for Embedded AI – an automated machine learning platform that builds tinyML solutions for the Edge using sensor data.
Key Features:
• Wide range of ML methods: GBM, XGBoost, Random Forest, Logistic Regression, Decision Tree, SVM, CNN, RNN, CRNN, ANN, Local Outlier Factor, and Isolation Forest
• Easy-to-use interface for labeling, recording, validating, and visualizing time-series sensor data
• On-device inference optimized for low latency, low power consumption, and a small memory footprint
• Supports Arm® Cortex™-M0 to M4 class MCUs
• Automates complex and labor-intensive processes of a typical ML workflow – no coding or ML expertise required!
Target Markets/Applications:
• Industrial Predictive Maintenance
• Smart Home
• Wearables
• Automotive
• Mobile
• IoT
For a limited time, sign up to use Qeexo AutoML at automl.qeexo.com for FREE to bring intelligence to your devices!
Reality AI Tools® software is for building products:
• Automated Data Assessment
• Automated Feature Exploration and Model Generation
• Bill-of-Materials Optimization
• Edge AI / TinyML code for the smallest MCUs
Reality AI solutions:
• Automotive sound recognition & localization
• Indoor/outdoor sound event recognition
• RealityCheck™ voice anti-spoofing
[email protected] | @SensorAI | Reality AI | https://reality.ai
SynSense builds ultra-low-power (sub-mW) sensing and inference hardware for embedded, mobile and edge devices. We design systems for real-time always-on smart sensing, for audio, vision, IMUs, bio-signals and more.
https://SynSense.ai
Next tinyML Talks
Tuesday, December 8:
• Ian Campbell, CEO, OnScale: "Training Embedded AI/ML Using Synthetic Data"
• Sakyasingha Dasgupta, Founder, CEO & Managing Director, Edgecortix Inc.: "Using AI to design energy-efficient AI accelerators for the edge"
Webcast start time is 8 am Pacific time. Each presentation is approximately 30 minutes in length.
Please contact [email protected] if you are interested in presenting
Reminders
youtube.com/tinyml
Slides & Videos will be posted tomorrow
tinyml.org/forums
Please use the Q&A window for your questions
Chao Xu
Chao brings more than 20 years of experience in advanced signal processing and machine learning, networking semiconductors, and silicon photonics. Prior to Eta Compute, Dr. Xu served as senior director of communication systems and of the computing and storage platform at Inphi Corporation. Prior to Inphi, he held senior R&D positions at Integrated Device Technology and PMC-Sierra. Dr. Xu has over 30 pending and awarded patents. He received his Ph.D. from the University of Pennsylvania, and a master of science degree and a bachelor of engineering degree in electrical engineering from the University of Science and Technology of China. His research areas include speech recognition, noise robustness, feature extraction, and other general machine learning methods.
Enabling Neural Networks at the Ultra Low Power Edge: A Neural Network Compiler for Hardware-Constrained Embedded Systems
Presented by: Chao Xu
Agenda
• Introduction – AI/ML in ultra low power edge devices
• Challenges of developing AI/ML applications in ultra low power edge devices
• Status of deploying AI/ML applications in ultra low power edge devices
• Streamline AI/ML design from idea to firmware – TENSAI® Flow: an integrated approach to minimize the barrier to designing neural networks for ultra-low power edge devices
• TENSAI® Compiler – a neural network compiler for ultra low power edge devices
• Summary
Neural networks continue to gain interest for deployment in IoT and other mobile and edge devices, across voice, sound, and image applications.
Ultra Low Power Neural Sensor Platform – develop and deploy:
• Neural Sensor Processor
• Software Development Kit
• Trained Neural Net Libraries
• Reference Designs
• Development Tools
Challenges of developing AI/ML applications in ultra low power edge devices
• Limited embedded memory (128KB or 256KB)
• Limited storage (256KB or 512KB)
• Limited computing speed (50MHz++)
• Limited power (mW++)
• Fragmented technology landscape:
⎯ TensorFlow / TensorFlow Lite / TensorFlow Micro
⎯ PyTorch
⎯ MCU or DSP or NPU
Status of deploying AI/ML applications in ultra low power edge devices
• Very long development time:
⎯ Multiple development frameworks
⎯ Floating point / fixed point
⎯ Hardware constraints
• Trade-off between power and accuracy:
⎯ Small model size limits use cases which require high accuracy
• Due to limited accuracy, many practical use cases are impossible in ultra low power edge devices
Streamline Design from Idea to Firmware with TENSAI® Flow
[Flow diagram: training data and network generation in TensorFlow yield a trained model; TF Lite network generation feeds the TENSAI® Compiler, which quantizes and optimizes against the design goals (memory, speed/power, accuracy); the output runs on the Eta Compute TENSAI® Flow runtime (DSP libs, NN libs, executors, FreeRTOS, HAL, libraries) as the embedded application on the Eta Compute Neural Sensor Processor.]
TENSAI® Flow Overview
Workflow: data collection from a sensor board or other methods → collect dataset → train → quantize and optimize → validate → release.
Components (Eta Compute generated and partner supplied): TENSAI Compiler, TENSAI NN Zoo with qualified algorithms, trained NN model, application, cloud connectivity, and TENSAI middleware (AI kernel, FreeRTOS, HAL, project generator, sensor drivers).
• Minimal embedded coding required
• Variety of networks: CNN, MobileNets, Inception, ResNet, RNN (see the example sketch after this list)
• Automatic multicore software, no intervention required
• Optimized kernels for peak performance and efficiency
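The MobileNet-class vision networks listed above can be expressed in a few lines of Keras. The sketch below is a hypothetical illustration only, not Eta Compute's actual model: a MobileNetV1 backbone with a reduced width multiplier, sized for the 96x96 person/no-person task benchmarked later in this deck.

```python
# A minimal sketch (not Eta Compute's actual network): a MobileNetV1-style
# classifier sized for a 96x96 "person / no person" task. Assumes TensorFlow 2.x.
import tensorflow as tf

def build_person_detector(input_shape=(96, 96, 3), num_classes=2):
    # alpha=0.25 shrinks every layer's channel count to suit small MCU memories.
    backbone = tf.keras.applications.MobileNet(
        input_shape=input_shape,
        alpha=0.25,
        weights=None,        # train from scratch on your own dataset
        include_top=False,
        pooling="avg",
    )
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(backbone.output)
    return tf.keras.Model(backbone.input, outputs)

model = build_person_detector()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # check the parameter count against the target's flash/RAM budget
```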
TENSAI® Compiler – a Neural Network Compiler for ultra low power edge devices
[Diagram: TFLite model file → TENSAI® Compiler → automatically generated inference code, optimized NN kernels, executors, and FreeRTOS on resource-constrained hardware.]
• Allows users to train the network in floating point in TensorFlow
• Quantize to a TFLite model using the standard flow (see the sketch after this list)
• The TENSAI® compiler then generates embedded code directly from the TFLite model for ultra low power edge devices
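For reference, the "standard flow" mentioned above is TensorFlow Lite's post-training full-integer quantization. A minimal sketch, assuming a trained Keras `model` and an array of representative input images `rep_images` from your own pipeline; the resulting .tflite file is what would then be handed to a downstream compiler such as the TENSAI® Compiler.

```python
# Standard TensorFlow Lite post-training int8 quantization flow.
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield ~100 typical input samples so the converter can calibrate
    # activation ranges for full-integer quantization.
    for image in rep_images[:100]:
        yield [np.expand_dims(image.astype(np.float32), axis=0)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # fully integer in/out for MCU deployment
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Quantized model size: {len(tflite_model) / 1024:.1f} kB")
```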
TENSAI® Compiler Pipelines
• Training level optimization:
⎯ Hyperparameter optimization (AutoML), pruning, weight reduction, and quantization are done at the network level during training (see the sketch after this list).
• TENSAI® Flow optimization:
⎯ Optimizes the trained neural network for best execution on TENSAI Neural Sensor Processor(s)
⎯ Optimizes memory
⎯ Optimizes performance and efficiency
⎯ Enables large neural networks
• Partnerships with innovators to develop more efficient networks:
⎯ Binarized networks
⎯ Feature extraction to reduce neural network size
⎯ Traditional machine learning – less compute intensive, complements neural networks
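As a generic illustration of the training-level optimizations above, the sketch below uses the open-source tensorflow-model-optimization toolkit for magnitude pruning followed by quantization-aware training. It is not Eta Compute's tooling, and `model`, `train_ds`, and `val_ds` are assumed to come from your own pipeline.

```python
# Pruning and quantization-aware training with tensorflow-model-optimization.
import tensorflow_model_optimization as tfmot

# Phase 1: magnitude pruning, ramping to 50% sparsity while fine-tuning.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=2000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)
pruned.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])
pruned.fit(train_ds, validation_data=val_ds, epochs=3,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Phase 2: quantization-aware training on the pruned network, so the weights
# learn to tolerate int8 conversion with little accuracy loss.
qat = tfmot.quantization.keras.quantize_model(
    tfmot.sparsity.keras.strip_pruning(pruned))
qat.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
qat.fit(train_ds, validation_data=val_ds, epochs=2)
```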
TENSAI® Compiler Optimization
• Memory optimization:
⎯ Operation pipelining, redundancy reduction
⎯ Dynamic allocation and deallocation
⎯ Buffer sharing between input and output channels
• Performance and efficiency optimization:
⎯ Optimized kernels for DSP and M3
⎯ Multicore workload balancing
• Large neural network support:
⎯ Weight compression: efficient LZ4, up to 2x more weights (see the sketch after this list)
⎯ Scale-out ECM3532: the compiler splits the model across multiple chips
⎯ Weights in external flash: middleware makes storage transparent; pipelined operations mask transfer delays
Example: 416 kB -> 176 kB
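A quick way to gauge what LZ4 weight compression might buy is to run an off-the-shelf LZ4 implementation over a quantized model file. This is only a rough proxy for the compiler's scheme described above (which compresses the weights, not the whole flatbuffer), and the file name is illustrative.

```python
# Rough estimate of the "up to 2x more weights" LZ4 claim using the
# open-source `lz4` package (pip install lz4); ratios will differ from
# Eta Compute's own compressor.
import lz4.frame

with open("model_int8.tflite", "rb") as f:
    raw = f.read()

packed = lz4.frame.compress(raw)
print(f"original:   {len(raw) / 1024:.0f} kB")
print(f"compressed: {len(packed) / 1024:.0f} kB")
print(f"ratio:      {len(raw) / len(packed):.2f}x")
```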
Compiler Efficiency - Vision, Sound, Motion

Area | Task | Model | Dataset | Image Size | Inferences per second | Power (mW) | Energy per inference (mJ)
Vision | Image classification | Eta Compute CNN | CIFAR10 | 32x32 | 20 | 2.7 | 0.14
Vision | Person detection | MobileNet V1 | COCO | 96x96 | 1 / 3 | 1.3 / 4.2 | 1.3 / 1.5
Vision | Person, object counting | MobileNet V1 | COCO | 256x256 | 1 | 4.1 | 4.1
Audio | Command recognition | Eta Compute GRU | Google | NA | 2 to 5 | 1.2 | 0.4
Motion | Activity detection | Eta Compute CNN | Motion Sense for Human Activity | NA | 50 | 1.1 | 0.02
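The last three columns of the table are related by a simple identity: energy per inference (mJ) equals average power (mW) divided by inference rate (inferences/s), since 1 mW = 1 mJ/s. A quick check against the rows above; the command-recognition rate of "2 to 5" is taken as 3 for illustration.

```python
# Sanity check: energy per inference (mJ) = power (mW) / inference rate (1/s).
rows = {
    "Image classification (CIFAR10)":  (2.7, 20),   # (power_mW, inferences_per_second)
    "Person detection, 96x96":         (4.2, 3),
    "Person/object counting, 256x256": (4.1, 1),
    "Command recognition":             (1.2, 3),    # table says "2 to 5"; 3 assumed
    "Activity detection":              (1.1, 50),
}
for task, (power_mw, rate) in rows.items():
    print(f"{task}: {power_mw / rate:.2f} mJ per inference")
```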
Compiler Efficiency - Visual Keyword (Human/No Human)

Hardware | Neural Net | Kernels | Core | Secs/Inf | Accuracy | Frequency | Energy (mJ)/Inf
Vendor A | Google's | ARM CMSIS | M4 | 1.8 | Ditto | 96 MHz | 16
ECM3531 | Google's | ARM CMSIS | M3 | 5.2 | Ditto | 60 MHz | 21
ECM3532 | Google's | Eta Compiler | M3 | 0.8 | Ditto | 60 MHz | 3
ECM3532 | Google's | Eta Compiler | M3 + DSP | 0.36 | Ditto | 60 MHz | 1.5
ECM3532 | Google's | Eta Compiler | M3 + DSP | 1 | Ditto | 20 MHz | 1.3

Row 1 vs Row 2: hardware comparison with the same kernels; ARM CMSIS M4 kernels are more efficient than M3 kernels.
Row 2 vs Row 3: same hardware, different kernels, still M3 only; a factor of 7x from software alone.
Row 3 vs Row 4: add the DSP; a factor of 2 in speed and 0.5x in energy.
Row 4 vs Row 5: drop the frequency to further reduce the energy.
CIFAR10 - Image Categorization
Network: CNN; Ops/Inf (M): 25; Weights (kB): 87; Image Size: 32x32

Hardware | Neural Net | Kernels | Secs/Inf | Energy (mJ)/Inf
Vendor B | CNN | ARM CMSIS | 0.1 | 30
ECM3532 | Eta Compute CNN | Eta Gen2 | 0.05 | 0.15

200x lower energy per inference.
Compiler Efficiency – Customer Measurements
[Bar charts: time, power, and energy versus Vendor C for two customer networks, Network 1 (smaller) and Network 2 (larger); the gains are 4x faster with 6x lower energy for one network and 10x faster with 10x lower energy for the other.]
Both networks are M3 only – too small for efficient DSP use.
People Counting in Ultra Low Power Edge Devices
Example deployments:
• Entrance of the conference room
• Entrance/exit of the building (counting)
• Cafeteria & corridors
• Social distancing detection
People Counting Prototyping: ECM3532 EVB + AI Vision Extension
ECM3532 EVB with ECM3532 NSP, plus the AI Vision Extension board:
• QVGA camera (Himax HM0360)
• PIR sensor (Panasonic EKMB1101112)
• Serial flash, 64 Mbits (Macronix MX25R6435F)
• ToF sensor (STM VL53L3CX)
• Ambient light sensor (TI OPT3001)
• Microphone #1 (TDK ICS-41350)
• Microphone #2 (TDK ICS-41350 PDM)
People Counting Performance Measurements
IOU | Confidence | Accuracy | Precision | Recall
0.3 | 0.35 | 0.943 | 0.969 | 0.969

Task | Model | Dataset | Image Size | Inf/sec | Power (mW) | Energy/Inf (mJ)
Person detection | MobileNet SSD V1 | COCO | 96x96 | 1 / 3 | 1.3 / 4.2 | 1.3 / 1.5
Person, object counting | MobileNet SSD V1 | COCO | 256x256 | 1 | 4.1 | 4.1
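For context on the metrics above: the IOU and confidence values are presumably matching thresholds, so a detection counts as a true positive when its confidence is at least 0.35 and its intersection-over-union (IoU) with a ground-truth box is at least 0.3. A minimal sketch of that criterion, with boxes assumed to be (x1, y1, x2, y2):

```python
# Minimal sketch of the matching criterion implied by the IOU and confidence
# thresholds above; boxes are assumed to be (x1, y1, x2, y2) in pixels.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, pred_score, gt_box, iou_thresh=0.3, conf_thresh=0.35):
    # A detection is correct when it is confident enough and overlaps a
    # ground-truth person box by at least the IoU threshold.
    return pred_score >= conf_thresh and iou(pred_box, gt_box) >= iou_thresh
```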
Summary
• Neural networks continue to gain interest for deployment in IoT and other mobile and edge devices. Yet enabling a NN in a hardware constrained embedded system such as an ultra low power edge device presents many challenges.
• We showed how Eta Compute took an integrated approach to minimize the barrier to designing neural networks for ultra low power devices:
⎯ Neural network design and optimization for the embedded world: memory, compute power, and accuracy
⎯ Hardware and software co-optimization to improve energy efficiency
⎯ Automatic inference code generation based on the model graph by a proprietary hardware-aware compiler tool
World’s Most Energy Efficient AI Platform for the Extreme Edge
Eta (η) is the seventh letter of the Greek alphabet. In electronics, η represents efficiency of circuits.
Copyright Notice
This presentation in this publication was presented as a tinyML® Talks webcast. The content reflects the opinion of the author(s) and their respective companies. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors.
There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.
tinyML is a registered trademark of the tinyML Foundation.
www.tinyML.org