On-Device AI for Mobile and Consumer Devices: From DNN Model Compression to Domain-Specific Accelerators

Daehyun Kim

Samsung AI at CES 2021

Robots

JetBot 90 AI+, Bot Care & Bot Handy

TVs

Neo Quantum Processor, Smart Trainer

Home Appliances

SmartThings Cooking, Family Hub

Mobile Phones

Single Take Camera, Bixby Vision

Our AI Vision

Why On-device AI?

Two ways to implement AI: Cloud vs. On-device

On-device AI can enhance cloud AI in user experience

Centralized AI vs. On-device AI

Privacy Protection

Reliable AI

Context-aware

Fast Response

Enhancing User Experience : Privacy

Important information about users can be processed on the device

“Read the most recent message”

Device: Voice Assistant; HW sensor: Microphone (voice)

No need to:

- send the voice to the cloud

- send the ‘most recent message’ to the cloud

- send private information to the cloud

Enhancing User Experience : Reliable AI

No matter what happens to the network connection or servers, it works!

Examples: in an elevator, in flight, or during a server error (404 Page Not Found)

Enhancing User Experience : Context-aware

The device knows you fairly well, so AI can suggest the best option

1. Launch Camera
2. Select More
3. Select Super Slow Motion
4. Start to Record

Recommendation: You can use this with “Record a Super slow motion video”

Enhancing User Experience : Fast response

Without network and server latency, it responds considerably faster

No Network Latency

No Server Latency

On-device AI Research

On-device AI Challenges

On-device AI is required to run on limited resources, compared to the cloud

Resource Gap: Cloud vs. Devices

Computing

Memory

Power

Our On-Device AI Approach

① Neural Network Model Optimization

② AI System Software

③ AI HW Accelerator

NN Compiler/Runtime: Model Analysis, Scheduling, Memory Optimization, …

NN Pipeline: Multi-Model Pipeline, Multi-Device Pipeline

Neural Processor: Power-Efficient Specialized Accelerator HW IP

AI + SW + HW Co-Design

Model Compression: Pruning, Quantization, …

Model Architecture: NAS, Multi-tasking, …

NN Training: On-Device Training

FleXOR, BiQGEMM

nnStreamer, nnTrainer

SNP

FleXOR

FleXOR: Trainable Fractional Quantization

D. Lee, S. Kwon, B. Kim, Y. Jeon, B. Park, J. Yun, NeurIPS 2020

FleXOR

• A flexible encryption algorithm/architecture (called “FleXOR”) that enables a fractional, sub-1-bit number of bits to represent each weight (a rough sketch of the idea follows the contribution list below)

• Contributions

– XOR-based encryption of quantized bits enhances compression ratio

– XOR-aware training algorithm learns encrypted weights

– High model accuracy with sub 1-bit quantization
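To make the compression concrete, here is a minimal NumPy sketch of the inference-time view, assuming the XOR "encryption" network can be written as a fixed binary matrix over GF(2); the sizes, random wiring, and omitted scale factor are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_in, N_out = 4, 5                  # 4 stored bits expand into 5 weight bits (0.8 bit/weight)
M = rng.integers(0, 2, size=(N_out, N_in))        # fixed XOR wiring shared across chunks

encrypted_bits = rng.integers(0, 2, size=N_in)    # the bits actually stored on the device
weight_bits = M.dot(encrypted_bits) % 2           # XOR of selected bits = GF(2) matrix product
weights = np.where(weight_bits == 1, +1.0, -1.0)  # binary weights (scale factor alpha omitted)
print(encrypted_bits, "->", weights)
```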

Differentiable XOR-gate Network

[Figure: Differentiable XOR-gate network. In the existing STE (straight-through estimator) path, a real-valued weight $w_i$ passes through $\mathrm{sign}(\cdot)$ to yield a binary weight $w_i^b = \alpha\, b_i$. In FleXOR, encrypted weights $w_k^e, w_l^e$ ($1 \le k, l \le N_{in}$) feed an XOR-gate network whose outputs are the binary weights $w_i^b$ ($1 \le i \le N_{out}$); the sign functions inside the gates are smoothed so that $\partial\tanh$ terms carry gradients through the network.]

FleXOR should be able to select the best out of $2^{N_{in}}$ possible outputs, which are randomly selected from the larger $2^{N_{out}}$ search space.

The two-input smooth XOR is
$$y = \tanh(x_1) \times \bigl(-\tanh(x_2)\bigr) \quad \text{vs.} \quad y = \tanh(\mathbf{10}\,x_1) \times \bigl(-\tanh(\mathbf{10}\,x_2)\bigr),$$
where the scaling factor (shown in bold) sharpens the outputs toward $\pm 1$ while keeping the function differentiable.
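As an illustration of this smoothing, here is a minimal PyTorch sketch (the function name, scale value, and example inputs are ours, not the paper's): the tanh-based XOR equals the hard XOR for ±1 inputs yet stays differentiable, so gradients reach the encrypted weights.

```python
import torch

def soft_xor(x1, x2, scale=10.0):
    # y = tanh(scale*x1) * (-tanh(scale*x2)); equals XOR (in the {-1, +1} encoding)
    # for hard inputs, and stays differentiable for real-valued encrypted weights.
    return torch.tanh(scale * x1) * (-torch.tanh(scale * x2))

x1 = torch.tensor(0.3, requires_grad=True)   # a real-valued "encrypted" weight
x2 = torch.tensor(-0.8, requires_grad=True)
y = soft_xor(x1, x2)
y.backward()                                  # gradients flow through both tanh branches
print(y.item(), x1.grad.item(), x2.grad.item())
```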

Experiments

- FleXOR allows a reduced memory footprint and bandwidth, which are critical for energy-efficient inference designs.

- Even though achieving the best accuracy at 1.0 bit/weight is not the main purpose (e.g., the XOR gate may be redundant for Nin = Nout), FleXOR shows the minimum accuracy drop.

BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-based Quantized DNNs

Y. Jeon, B. Park, S. Kwon, B. Kim, J. Yun, D. Lee, SC20

BiQGEMM

• Previous Binary-Code Quantization Implementation

• Special H/W for MATMUL operation with Quantized weights

- For non-uniform quantization, CPU/GPU/NPU needs to perform on-chip dequantization in practice.

- Binary coding then only reduces memory requirements (not latency) without special H/W

• BiQGEMM

• Dedicated binary-coding-based matrix multiplication kernel design using lookup tables

• CPUs and GPUs can utilize quantized matrices to improve performance

[Figure: a small {+1, −1} binary weight matrix multiplied by activations a00 … a03. Binary weights are transferred from storage (HDD, SSD, …) and SDRAM in ‘byte’ units, but the CPU has to access the binary data at ‘bit’ level.]

BiQGEMM

[Figure: W ≈ A0·B0 + A1·B1 + A2·B2, where B0, B1, B2 ∈ {−1, +1}^(36×4) are binary matrices with corresponding scale terms A0, A1, A2, multiplied by the input X ∈ ℝ^(4×3).]

• For this MatMul, many computations involving X are redundant (e.g., the activation values 0.03 and −0.17 are accessed repeatedly).

• Computations using X and the possible B bit patterns are performed in advance and stored in lookup tables (a small sketch of the W ≈ Σ Ai·Bi decomposition follows).
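For illustration, a simple greedy residual binary-coding quantizer in NumPy that produces such a W ≈ Σ Ai·Bi decomposition (a sketch under our own assumptions; the fitting procedure used in practice, e.g. alternating optimization, may differ):

```python
import numpy as np

def binary_code_quantize(W, num_bits=3):
    """Greedy residual binary coding: W ~= alpha_0*B_0 + alpha_1*B_1 + ...,
    with each B_i in {-1, +1} and scalar scales alpha_i."""
    residual = W.astype(np.float64).copy()
    alphas, Bs = [], []
    for _ in range(num_bits):
        B = np.where(residual >= 0, 1.0, -1.0)   # binary matrix matching the residual's sign
        alpha = np.abs(residual).mean()          # scale minimizing the L2 error for this B
        alphas.append(alpha)
        Bs.append(B)
        residual -= alpha * B                    # quantize what is left with the next bit
    return alphas, Bs

W = np.random.default_rng(0).standard_normal((36, 4)) * 0.2
alphas, Bs = binary_code_quantize(W)
W_hat = sum(a * B for a, B in zip(alphas, Bs))
print("relative reconstruction error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```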

BiQGEMM

[Figure: the same product with B0 stored as a {0, 1}^(36×4) bit matrix (−1 encoded as 0) and X ∈ ℝ^(4×3).]

• MatMul computations are replaced with pre-computation and lookup-table accesses, as sketched after the example table below

• Lookup-table size is determined empirically

• In practice, 8 bits are used as an index, giving 256 entries

• No redundant computations

• The number of float multiplications is greatly reduced

• The B matrix is now accessed at ‘byte’ level (no bit-level operations)

Lookup table (partial sums for the activation pair 0.03 and −0.17):

0 0 → −(0.03) − (−0.17)
0 1 → −(0.03) + (−0.17)
1 0 → +(0.03) − (−0.17)
1 1 → +(0.03) + (−0.17)

Each partial result is simply retrieved from this lookup table.


** -1 in 𝑩𝟎 is stored as 0 in memory
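Below is a small NumPy sketch of the lookup-table kernel itself (our own illustration, not the SC20 implementation): for one binary matrix B, the activation chunks are pre-summed into per-chunk tables, and every output row is then assembled from byte-level table lookups instead of float multiplications.

```python
import numpy as np

def biqgemm_1bit(B_bits, x, mu=8):
    """y = B.x for B in {-1,+1}^(m x n) stored as bits (0 encodes -1, as in the slide),
    using one 2^mu-entry lookup table per mu-element chunk of the activation vector x."""
    m, n = B_bits.shape
    assert n % mu == 0
    signs = np.array([-1.0, 1.0])                      # bit 0 -> -1, bit 1 -> +1
    y = np.zeros(m)
    for g in range(n // mu):
        xg = x[g * mu:(g + 1) * mu]
        # Pre-compute all 2^mu signed partial sums of this activation chunk once.
        lut = np.empty(1 << mu)
        for idx in range(1 << mu):
            bits = (idx >> np.arange(mu)) & 1
            lut[idx] = signs[bits].dot(xg)
        # Each output row only looks up its packed (byte-level) bit pattern.
        Bg = B_bits[:, g * mu:(g + 1) * mu]
        idxs = Bg.dot(1 << np.arange(mu))              # pack mu bits into an integer index
        y += lut[idxs]
    return y

# Reference check against a dense float multiply (scale factors / extra bit-planes omitted).
rng = np.random.default_rng(0)
B_bits = rng.integers(0, 2, size=(16, 32))
x = rng.standard_normal(32)
assert np.allclose(biqgemm_1bit(B_bits, x), (2.0 * B_bits - 1.0).dot(x))
```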

Experimental Results

• Speedup over Eigen using 1-thread. Matrix size is given as m-by-1K. Output size (m) and batch size are annotated along the horizontal axis.

(a) PC (i7-7700)

(b) Mobile (Cortex-A76)

nnStreamer

Linux Foundation AI Project for Efficient Machine Learning Pipeline Development and Execution

M. Ham, J. Moon, G. Lim, S. Woo, W. Song, J. Jung, H. Ahn, P. Kapoor, D. Chae, G. Jang, Y. Ahn, J. Lee

https://nnstreamer.ai/

https://github.com/nnstreamer/nnstreamer

Neural Network Pipeline

Efficient and flexible pipelines for Neural Networks

1000s of lines of code → 10s of lines of pipeline description

Manual parallelization → Automatic pipeline parallelization

Direct media/hardware optimization → Reusable modules for media/hardware

[Figure: example AR application pipeline with pre-process, queues feeding depth and segmentation models, mux, synthesis, the app, and post-process stages.]

Do Not Reinvent the Wheel

GStreamer – https://gstreamer.freedesktop.org

– Open source multimedia pipeline framework

– Library for constructing graphs of media-handling components

nnStreamer – but perfect the wheel!

– Extension of GStreamer for AI processing

Neural network as another media filter

Neural network data as tensor stream


Neural Network to GStreamer Pipeline

Code: src ! sink

(Pipeline: src → sink)

Code: src ! tensor_filter mode=tensorflow-lite model=abc.tflite ! sink

(Pipeline: src → tensor_filter [Tensorflow-lite, abc.tflite] → sink)

Code: src ! tensor_filter mode=tensorflow model=def.pb ! sink

(Pipeline: src → tensor_filter [Tensorflow, def.pb] → sink)

Code: src ! tensor_filter mode=caffe2 model=ghi.pb ! sink

(Pipeline: src → tensor_filter [Caffe2, ghi.pb] → sink)

Data Conversion

Code: src ! tensor_filter mode=tensorflow-lite model=abc.tflite ! sink

(Problem: src outputs 1920x1080 h264, but the model expects a 3x300x300 float32 tensor, so data conversion is needed in between.)

Code: src ! decodebin ! videoconvert ! videoscale ! video/x-raw,format=RGB,width=300,height=300 ! tensor_convert ! tensor_transform mode=typechg option=float32 ! tensor_filter mode=tensorflow-lite model=abc.tflite ! sink

(Pipeline: src [1920x1080 h264] → decodebin [1920x1080 raw yuv] → videoconvert [1920x1080 RGB] → videoscale [300x300 RGB] → capsfilter [ensure 300x300 RGB] → tensor_converter [3x300x300 uint8] → tensor_transform [3x300x300 float32] → tensor_filter [Tensorflow-lite, abc.tflite] → sink)
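For reference, a minimal Python sketch of launching such a pipeline through the GStreamer bindings. It assumes nnStreamer and a TFLite model are installed; the tensor_* element and property names follow the nnStreamer documentation (tensor_converter, mode=typecast, framework=) rather than the slide's shorthand, and filesrc/fakesink plus the file and model paths are placeholders.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)
# Same structure as the slide's pipeline, with placeholder source/sink elements.
pipeline = Gst.parse_launch(
    "filesrc location=video.mp4 ! decodebin ! videoconvert ! videoscale "
    "! video/x-raw,format=RGB,width=300,height=300 "
    "! tensor_converter "
    "! tensor_transform mode=typecast option=float32 "
    "! tensor_filter framework=tensorflow-lite model=abc.tflite "
    "! fakesink"
)
pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()   # keep the stream running until interrupted
```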

Example: Activity Recognition Sensors

• ~1000 lines → 16 lines of code

• 90.4% → 51.4% CPU @ 30 FPS input

• ~40 MiB → ~17 MiB memory (RSS)

nnTrainer

nnTrainer: Towards On-Device Learning for Personalization

Submitted to ATC '21. https://github.com/nnstreamer/nntrainer

Light-Weight On-Device Training Framework

nnTrainer – a software framework to train neural networks on embedded devices

Personalization – as users keep using AI applications, they get

Faster (ex. 100ms to 50ms)

More accurate (ex. 88% to 95%)

Personalized (ex. A Dog to My Dog)

– While providing privacy

Personal data stays on user devices

Challenges

– Small data for training

– Limited compute/memory resources

nnTrainer Overview

Peak memory consumption: PyTorch 1.2 GiB, TensorFlow 1.02 GiB, nnTrainer 0.37 GiB

Optimization of memory usage and training time

Transfer learning & Meta-learning

TFLite / PyTorch model-level compatibility

Easy to implement custom operators

Supports Android, Tizen, Linux

nnTrainer System Architecture

Few-Shot Learning: SimpleShot

SimpleShot Implementation / SimpleShot Inference Results
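As a rough illustration of what a SimpleShot-style implementation does (this sketch and its names are ours; features from a frozen backbone are assumed to be extracted elsewhere): nearest-centroid classification after mean subtraction and L2 normalization (CL2N).

```python
import numpy as np

def simpleshot_predict(support_feats, support_labels, query_feats):
    """Nearest-centroid few-shot classification in the spirit of SimpleShot."""
    mean = support_feats.mean(axis=0)

    def cl2n(f):                                   # centering + L2 normalization
        f = f - mean
        return f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)

    s, q = cl2n(support_feats), cl2n(query_feats)
    classes = np.unique(support_labels)
    centroids = np.stack([s[support_labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(q[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# Toy usage: 2 classes x 5 personal examples, 64-dim features.
rng = np.random.default_rng(0)
support = rng.standard_normal((10, 64)) + np.repeat([[0.0], [2.0]], 5, axis=0)
labels = np.array([0] * 5 + [1] * 5)
queries = rng.standard_normal((3, 64)) + 2.0
print(simpleshot_predict(support, labels, queries))   # expected: class 1 for all queries
```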

Example: HandMoji

SNP

Streaming Line Processing Architecture with a Winograd Convolution Array for 4K 60 fps Super-Resolution

Applications

Work-in-Progress

Stream Processing for TVs

Non-Streaming Environment vs. Streaming Environment

Super-Resolution Neural Network Scaling

Accelerator Architecture

Architecture Overview / Area Breakdown

Implementation & Evaluation

FPGA Demonstration / Comparisons of CNN Hardware Accelerators

AI-SW-HW Co-Design

Efficient Neural Network: Voice

A 13x smaller neural network shows similar performance in speech recognition

91.6% accuracy @ 530 MB → 91.1% accuracy @ 38 MB

[Reference] Attention based on-device streaming speech recognition with large speech corpus (Interspeech 2019)

Acceleration of Neural Network: Specialized H/W

Voice Recognition NPU: about 4x less power consumption than CPU

ASR Accelerator Architecture

Power consumption: CPU 982 mW vs. ASR acceleration (NPU-based) 276 mW

Performance Comparison

* Measured under xRT (real-time factor) < 1

On-device AI Deployment

Real FLOPS matter!

Big challenge: how to turn the “peak” FLOPS into “real” FLOPS?

[Chart: MAC-array utilization (%) for CONV vs. DECONV layers.]

Results from a commercial NPU IP:

• Conv: 116.5 MACs/clk (30% utilization)

• Deconv: 1.05 MACs/clk (0.2% utilization)
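A back-of-the-envelope check of those figures (a sketch; the IP's peak MACs/clk is not stated on the slide, so it is inferred from the Conv numbers):

```python
# 116.5 MACs/clk at ~30% utilization implies a peak of roughly 388 MACs/clk.
implied_peak = 116.5 / 0.30
for name, real_macs_per_clk in [("Conv", 116.5), ("Deconv", 1.05)]:
    print(f"{name}: {real_macs_per_clk / implied_peak:.1%} utilization")
# Conv -> 30.0%, Deconv -> ~0.3% (the slide reports ~0.2%)
```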

Exynos 2100: 26 TOPS (peak); Snapdragon 888: 26 TOPS (peak)

On-device AI will be embedded in various devices

Visual Display, Digital Appliances, Mobile Communications

Big challenge: how to provide the “same” AI experience on a variety of devices?

Thank you
