
Systems | Fueling future disruptions

Research Faculty Summit 2018


Efficient Edge Computing for Deep Neural Networks and Beyond

Vivienne Sze

Massachusetts Institute of Technology

In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang, Sertac Karaman, Luca Carlone, Amr Suleiman, Zhengdong Zhang

Contact Info: Email: sze@mit.edu, Website: www.rle.mit.edu/eems


Outline

•  Limitations of Existing Efficient DNN Approaches

•  Looking Beyond the DNN Accelerator for Acceleration

•  Looking Beyond DNNs: Other forms of inference at the edge

Slide 2


Limitations of Existing Efficient DNN Approaches

Slide 3

Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, "Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks," SysML 2018.


Energy-Efficient Processing of DNNs

There has been a significant amount of algorithm and hardware research on energy-efficient processing of DNNs.

We identified various limitations of existing approaches.

V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, Dec. 2017. eyeriss.mit.edu/tutorial.html

Slide 4


Design of Efficient DNN Algorithms

•  Popular efficient DNN algorithm approaches

•  Focus on reducing number of MACs and weights

•  Does it translate to energy savings?

[Figure: Network Pruning (pruning neurons, pruning synapses; before vs. after pruning) and Compact Network Architectures (examples: SqueezeNet, MobileNet); reduced precision is also used]

Slide 5


Data Movement is Expensive

Energy of weight depends on memory hierarchy and dataflow

[Figure: memory hierarchy traversed to fetch data to run a MAC at the ALU: DRAM, Global Buffer (100 – 500 kB), PE array (NoC: 200 – 1000 PEs), RF (0.5 – 1.0 kB), ALU. Normalized energy cost*: ALU 1× (reference), RF → ALU 1×, PE → ALU 2×, DRAM → ALU 200×]

* measured from a commercial 65nm process Slide 6


Energy-Evaluation Methodology

Energy estimation tool available at eyeriss.mit.edu

[Figure: energy-estimation flow. Inputs: the DNN Shape Configuration (# of channels, # of filters, etc.) and the DNN Weights and Input Data (e.g., [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]). Memory-access optimization yields the # of accesses at each memory level (level 1 … level n), and a separate calculation yields the # of MACs; combined with the hardware energy costs of each MAC and memory access, these give Edata and Ecomp, which sum to the DNN energy consumption per layer (L1, L2, L3, …)]

[Yang et al., CVPR 2017]
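As a rough, hedged sketch of this estimation methodology (not the released tool at eyeriss.mit.edu), the energy of a layer can be written as Ecomp + Edata; the RF/PE/DRAM costs below echo the normalized numbers from the "Data Movement is Expensive" slide, and the access counts in the example are illustrative placeholders.

```python
# Hedged sketch: Ecomp = (# of MACs) x (energy per MAC);
# Edata = sum over memory levels of (# of accesses at that level) x (energy per access).

def layer_energy(num_macs, accesses_per_level, e_mac=1.0, e_access=None):
    if e_access is None:
        # Normalized per-access costs (from the "Data Movement is Expensive" slide).
        e_access = {"RF": 1.0, "PE": 2.0, "DRAM": 200.0}
    e_comp = num_macs * e_mac                     # Ecomp
    e_data = sum(n * e_access[level]              # Edata
                 for level, n in accesses_per_level.items())
    return e_comp + e_data

# Example with made-up access counts: a layer with 1M MACs served mostly from the RF.
print(layer_energy(1_000_000, {"RF": 2_000_000, "PE": 300_000, "DRAM": 40_000}))
```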

Slide 7


Key Observations

•  Number of weights alone is not a good metric for energy

•  All data types should be considered

[Figure: energy consumption of GoogLeNet by data type: output feature map 43%, input feature map 25%, weights 22%, computation 10%]

[Yang et al., CVPR 2017] Slide 8


Energy-Aware Pruning

•  Sort layers based on energy and prune the layers that consume the most energy first (see the sketch below)

•  EAP reduces AlexNet energy by 3.7x and outperforms the previous work that uses magnitude-based pruning by 1.7x

[Figure: normalized energy of AlexNet (×10^9) for the original network (Ori.), magnitude-based pruning (DC), and energy-aware pruning (EAP): 2.1x and 3.7x energy reduction, respectively] [Yang et al., CVPR 2017]

Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings.
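A minimal sketch of the energy-aware pruning loop described above, under the assumption that helpers such as estimate_layer_energy, prune_smallest_weights, and evaluate_accuracy exist; it illustrates the idea, not the CVPR 2017 implementation.

```python
# Hedged sketch of energy-aware pruning: order layers by their estimated energy,
# prune the most energy-hungry layers first, and stop pruning a layer once accuracy
# drops below the allowed budget. estimate_layer_energy, prune_smallest_weights and
# evaluate_accuracy are hypothetical helpers, not part of the published code.

def energy_aware_prune(model, val_data, accuracy_drop_budget=0.01, prune_step=0.05):
    baseline = evaluate_accuracy(model, val_data)
    # Sort layers by estimated energy, highest first, and prune those layers first.
    for layer in sorted(model.layers, key=estimate_layer_energy, reverse=True):
        for _ in range(int(1.0 / prune_step)):                  # at most prune the whole layer
            prune_smallest_weights(layer, fraction=prune_step)  # prune within the layer
            if evaluate_accuracy(model, val_data) < baseline - accuracy_drop_budget:
                # A real implementation would restore the last step and fine-tune here.
                break
    return model
```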

Slide 9


NetAdapt: Platform-Aware DNN Adaptation

•  Automatically adapt a DNN to a mobile platform to reach a target latency or energy budget

•  Use empirical measurements to guide optimization (avoid modeling of tool chain or platform architecture)

[Figure: NetAdapt loop. A pretrained network goes into NetAdapt, which generates network proposals (A, B, C, D, … Z); each proposal is measured empirically on the platform, and the results are checked against the budget to produce the adapted network]

Empirical measurements:
Metric | Proposal A | … | Proposal Z
Latency | 15.6 | … | 14.3
Energy | 41 | … | 46

Budget:
Metric | Budget
Latency | 3.8
Energy | 10.5

[Yang et al., ECCV 2018] In collaboration with Google’s Mobile Vision Team Slide 10
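The loop in the figure can be sketched roughly as follows; generate_proposals, measure_on_platform, and fine_tune_and_score are hypothetical stand-ins for the boxes in the figure, not the authors' code, and the budget and step values are illustrative.

```python
# Hedged sketch of a NetAdapt-style loop: tighten the budget a little each iteration,
# generate simplified network proposals, measure each proposal empirically on the
# target platform (no platform modeling), and keep the most accurate proposal that
# meets the tightened budget. All helper functions are hypothetical.

def netadapt(pretrained_net, latency_budget_ms, step_ms=1.0):
    net = pretrained_net
    latency = measure_on_platform(net)["latency_ms"]     # empirical measurement
    while latency > latency_budget_ms:
        target = latency - step_ms                       # gradually tightened budget
        proposals = generate_proposals(net)              # e.g., prune each layer a bit
        feasible = [p for p in proposals
                    if measure_on_platform(p)["latency_ms"] <= target]
        if not feasible:
            break                                        # cannot tighten further
        net = max(feasible, key=fine_tune_and_score)     # best short-term accuracy wins
        latency = measure_on_platform(net)["latency_ms"]
    return net
```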


Improved Latency vs. Accuracy Tradeoff

•  NetAdapt boosts the real inference speed of MobileNet by up to 1.7x with higher accuracy

[Figure: latency vs. accuracy tradeoff: NetAdapt operating points achieve +0.3% accuracy at 1.7x faster and +0.3% accuracy at 1.6x faster]

*Tested on the ImageNet dataset and a Google Pixel 1 CPU

Slide 11


Many Efficient DNN Design Approaches

[Figure: efficient DNN design approaches: Network Pruning (pruning neurons, pruning synapses; before vs. after pruning), Compact Network Architectures, and Reduced Precision (32-bit float, 8-bit fixed, binary)]

No guarantee that the DNN algorithm designer will use a given approach.

Need flexible hardware!

Slide 12


Existing DNN Architectures

•  Specialized DNN hardware often relies on certain properties of the DNN in order to achieve high energy efficiency

•  Example: Reduce memory access by amortizing across MAC array

[Figure: a MAC array fed by a weight memory and an activation memory; weight reuse and activation reuse across the array amortize memory accesses]

Slide 13


Limitation of Existing DNN Architectures

•  Example: Reuse and array utilization depend on # of channels, feature map/batch size
  –  Not efficient across all network architectures (e.g., compact DNNs)
  –  Less efficient as array scales up in size
  –  Can be challenging to exploit sparsity

[Figure: MAC array with spatial accumulation: array dimensions are number of filters (output channels) × number of input channels. MAC array with temporal accumulation: array dimensions are number of filters (output channels) × feature map or batch size]
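To make the utilization point concrete, here is a small illustrative calculation (assumptions, not measurements): a fixed spatially-accumulating array whose rows map to input channels and columns to output channels is underutilized by layers with few channels, such as depthwise layers in compact DNNs.

```python
# Hedged sketch: utilization of a spatially-accumulating MAC array whose rows are fed
# by input channels and whose columns are fed by filters (output channels). Layers
# with few channels leave most of a large array idle. Array size and layer shapes
# below are illustrative only.

def spatial_array_utilization(in_channels, out_channels, rows=32, cols=32):
    active_pes = min(in_channels, rows) * min(out_channels, cols)
    return active_pes / (rows * cols)

print(spatial_array_utilization(256, 256))  # large CONV layer: 1.0 (fully utilized)
print(spatial_array_utilization(1, 32))     # depthwise layer: ~0.03 on a 32x32 array
```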

Slide 14


Eyeriss v2: Balancing Flexibility and Efficiency

eyeriss.mit.edu

[Chen et al., arXiv 2018] Over an order of magnitude faster and more energy efficient than Eyeriss v1

Efficiently supports:
•  Wide range of filter shapes
  –  Large and compact
•  Different layers
  –  CONV, FC, depthwise, etc.
•  Wide range of sparsity
  –  Dense and sparse
•  Scalable architecture

Slide 15


Eyeriss v2: Balancing Flexibility and Efficiency

•  Flexible dataflow, called Row-Stationary Plus (RS+), that enables the spatial mapping of data from all dimensions for high PE array utilization and data reuse for various layer shapes and sizes

eyeriss.mit.edu [Chen et al., arXiv 2018]

[Figure: active vs. idle PEs under Row Stationary vs. Row Stationary Plus. RS maps filter width (S1) and output fmap width (F1) spatially; RS+ additionally maps channel groups, giving S1×G1 by F1, where filter width, output fmap width, and # channel groups are tiling parameters] Slide 16


Eyeriss v2: Balancing Flexibility and Efficiency

•  Flexible dataflow, called Row-Stationary Plus (RS+), that enables the spatial mapping of data from all dimensions for high PE array utilization and data reuse for various layer shapes and sizes

•  Flexible NoC to support RS+ that can operate in different modes for different requirements
  –  Utilizes multicast to exploit spatial data reuse
  –  Utilizes unicast for high bandwidth for weights in FC layers, and for weights & activations in compact network architectures

•  Processes data in both compressed and raw format to minimize data movement for both CONV and FC layers
  –  Exploits sparsity in both weights and activations

eyeriss.mit.edu [Chen et al., arXiv 2018]

Slide 17


Looking Beyond the DNN Accelerator for Acceleration

Slide 18

Z. Zhang, V. Sze, “FAST: A Framework to Accelerate Super-Resolution Processing on Compressed Videos,” CVPRW 2017


Super-Resolution on Mobile Devices

Use super-resolution to improve the viewing experience of lower-resolution content (reduce communication bandwidth)

Screens are getting larger

[Figure: low resolution streaming (transmit low resolution for lower bandwidth) followed by high resolution playback on the device]

Slide 19


Complexity of Super Resolution Algorithms

State-of-the-art super resolution algorithms use DNNs → computationally expensive, especially at high resolutions (HD or 4K)

SRCNN (Dong et al., ECCV 2014)

8032 MACs/pixel → ~500 GMAC/s for HD @ 30 fps
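The throughput number follows from simple arithmetic, sketched below assuming a 1920×1080 output frame.

```python
# Back-of-the-envelope check of the figure on this slide:
# 8032 MACs/pixel on 1920x1080 (HD) frames at 30 fps.
macs_per_pixel = 8032
pixels_per_frame = 1920 * 1080
fps = 30
gmac_per_s = macs_per_pixel * pixels_per_frame * fps / 1e9
print(f"{gmac_per_s:.0f} GMAC/s")  # ~500 GMAC/s
```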

Slide 20


FAST: A Framework to Accelerate Super Resolution

A framework that accelerates any SR algorithm by up to 15x when running on compressed videos

[Zhang et al., CVPRW 2017]

[Figure: compressed video fed to FAST + SR algorithm runs 15x faster than the SR algorithm alone, enabling real-time operation]

Slide 21


Free Information in Compressed Videos

This representation can help accelerate super-resolution

[Figure: decoding a compressed video yields the pixels (video as a stack of pixels), but the compressed representation also carries block structure and motion compensation information for free]

Slide 22


Transfer is Lightweight

[Figure: SR runs on one low-res frame to produce a high-res frame; that output is then transferred to subsequent low-res frames using the fractional interpolation, bicubic interpolation, and skip flag information from the compressed video, instead of running SR on every frame]

The complexity of the transfer is comparable to bicubic interpolation. Transfer allows SR to run on only a subset of frames: transfer to N frames, accelerate by N.
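A rough sketch of the transfer idea; run_sr, decode_motion_vectors, and warp_blocks are hypothetical helpers standing in for the SR network and the decoder metadata, not the released FAST code.

```python
# Hedged sketch of FAST-style transfer: run the expensive SR DNN on one anchor frame
# out of every n, then transfer that high-res result to the following frames using the
# block structure / motion vectors already present in the compressed video. The warp
# step costs roughly as much as bicubic interpolation, so acceleration is ~n.

def fast_super_resolution(low_res_frames, compressed_metadata, n=4):
    high_res_frames = []
    high_res = None
    for i, frame in enumerate(low_res_frames):
        if i % n == 0:
            high_res = run_sr(frame)                          # expensive: 1 of every n frames
        else:
            mv = decode_motion_vectors(compressed_metadata, i)
            high_res = warp_blocks(high_res, mv)              # cheap transfer
        high_res_frames.append(high_res)
    return high_res_frames
```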

Slide 23


Evaluation: Accelerating SRCNN

Examples of videos in the test set (20 videos for HEVC development): PartyScene, RaceHorse, BasketballPass

4x acceleration with NO PSNR LOSS; 16x acceleration with 0.2 dB loss of PSNR. Slide 24


Visual Evaluation

[Figure: visual comparison of SRCNN, FAST + SRCNN, and bicubic interpolation]

Code released at www.rle.mit.edu/eems/fast Slide 25


Beyond Deep Neural Networks

Slide 26

A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, "Navion: A Fully Integrated Energy-Efficient Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones," Symposium on VLSI 2018.


Energy-Efficient Autonomous Navigation of Nano Drones

[Zhang et al., RSS 2017], [Suleiman et al., VLSI 2018]

Navion chip: localization and mapping at < 30 mW (full integration on-chip)

http://navion.mit.edu In collaboration with Sertac Karaman (AeroAstro) Luca Carlone (AeroAstro) Slide 27


Localization and Mapping Using VIO*

[Figure: an image sequence and an IMU (Inertial Measurement Unit) feed Visual-Inertial Odometry (VIO), which performs localization and mapping]

*Subset of SLAM algorithm (Simultaneous Localization And Mapping) Slide 28


VIO: Backend uses Factor Graph to Infer State of Drone

[Figure: the camera feeds the Vision Frontend (VFE), which produces feature tracks; the IMU feeds the IMU Frontend (IFE), which produces estimated states. The Backend (BE) builds a factor graph with 4000+ factors (IMU factors, vision factors, other factors) and outputs the updated states (xi) and a sparse 3D map]

Non-linear least squares factor graph optimization

Exploit sparsity for 5.4x memory reduction and 7.2x speed up
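As a generic illustration of that kind of backend (not the Navion implementation), one Gauss-Newton step of a factor-graph least-squares problem can use a sparse Jacobian so the solver exploits the graph's sparsity:

```python
# Hedged sketch: one Gauss-Newton step for a non-linear least-squares factor graph.
# Each factor contributes a small residual block, so the stacked Jacobian J is sparse;
# solving the normal equations with a sparse solver is what sparsity-aware backends
# exploit for memory and speed. Generic illustration, not the Navion design.
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def gauss_newton_step(J, r, damping=1e-6):
    """J: sparse (num_residuals x num_states) Jacobian; r: residual vector."""
    J = sp.csr_matrix(J)
    H = (J.T @ J) + damping * sp.identity(J.shape[1])  # sparse approximate Hessian
    g = J.T @ r                                        # gradient
    delta = spsolve(H.tocsc(), -g)                     # exploits sparsity in the solve
    return delta                                       # state update: x <- x + delta
```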

Slide 29


Summary

•  Design considerations for deep learning at the edge
  –  Incorporate direct metrics into algorithm design for improved efficiency
  –  Use a flexible dataflow and NoC to exploit data reuse for energy efficiency and increase PE utilization for speed

•  Accelerate deep learning by looking beyond the accelerator
  –  Exploit data representation for FAST Super-Resolution

•  Other forms of inference at the edge beyond deep learning
  –  Graphical models for localization and mapping in nanodrones

For more info: www.rle.mit.edu/eems Slide 30

Thank you!
