Systems | Fueling future disruptions Research Faculty Summit 2018

Symposia on VLSI Technology and Circuits

Efficient Edge Computing for Deep Neural Networks and Beyond

Vivienne Sze

Massachusetts Institute of Technology

In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang, Sertac Karaman, Luca Carlone, Amr Suleiman, Zhengdong Zhang

Contact Info

Email: [email protected]

Website: www.rle.mit.edu/eems

Outline

•  Limitations of Existing Efficient DNN Approaches

•  Looking Beyond the DNN Accelerator for Acceleration

•  Looking Beyond DNNs: Other forms of inference at the edge

Slide 2

Limitations of Existing Efficient DNN Approaches

Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, “Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks,” SysML 2018.

Slide 3

Energy-Efficient Processing of DNNs

A significant amount of algorithm and hardware research targets energy-efficient processing of DNNs

We identified various limitations to existing approaches

V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, Dec. 2017

eyeriss.mit.edu/tutorial.html

Slide 4

Design of Efficient DNN Algorithms

•  Popular efficient DNN algorithm approaches

•  Focus on reducing number of MACs and weights

•  Does it translate to energy savings?

Network Pruning

[Figure: network before pruning and after pruning; pruning synapses, pruning neurons]

Compact Network Architectures

[Figure: filter shapes with dimensions R, S, C and 1]

Examples: SqueezeNet, MobileNet

... also reduced precision

Slide 5

Data Movement is Expensive

Energy of weight depends on memory hierarchy and dataflow

[Diagram: DRAM → Global Buffer (100 – 500 kB) → PE array (NoC: 200 – 1000 PEs) → RF (0.5 – 1.0 kB) → ALU; fetch data to run a MAC at the ALU]

Normalized Energy Cost*

ALU: 1× (Reference)

RF → ALU: 1×

PE → ALU: 2×

Buffer → ALU

DRAM → ALU: 200×

* measured from a commercial 65nm process Slide 6

Energy-Evaluation Methodology

Energy estimation tool available at eyeriss.mit.edu

Inputs:

•  DNN Shape Configuration (# of channels, # of filters, etc.)

•  DNN Weights and Input Data, e.g. [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]

•  Hardware Energy Costs of each MAC and Memory Access

Flow:

•  Memory Accesses Optimization → # acc. at mem. level 1, # acc. at mem. level 2, …, # acc. at mem. level n → Edata

•  # of MACs Calculation → # of MACs → Ecomp

Output: DNN Energy Consumption per layer (L1, L2, L3, …), from Ecomp and Edata

[Yang et al., CVPR 2017]

Slide 7
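The flow on this slide can be sketched as a small function: count MACs and memory accesses per level, then weight each count by a hardware energy cost. This is only an illustrative sketch, not the tool published at eyeriss.mit.edu. The function name `estimate_energy`, the level names, and the access counts are assumptions made for the example; the per-access costs are the normalized values listed on the "Data Movement is Expensive" slide, with the global buffer left out because its cost is not legible in this copy.

```python
def estimate_energy(num_macs, accesses, cost_per_access, mac_cost=1.0):
    """Return (e_comp, e_data) in units of one MAC's energy.

    accesses:        dict, memory level name -> number of accesses
    cost_per_access: dict, memory level name -> normalized energy per access
    """
    # Computation energy: one unit of mac_cost per MAC.
    e_comp = num_macs * mac_cost
    # Data energy: accesses at each memory level weighted by that level's cost.
    e_data = sum(count * cost_per_access[level] for level, count in accesses.items())
    return e_comp, e_data


# Normalized costs from the "Data Movement is Expensive" slide.
# The global-buffer level is omitted here (its cost is not legible in this copy).
COSTS = {"RF": 1.0, "PE": 2.0, "DRAM": 200.0}

# Hypothetical access counts for a single layer (illustration only).
accesses = {"RF": 3000, "PE": 500, "DRAM": 10}

e_comp, e_data = estimate_energy(num_macs=1000, accesses=accesses, cost_per_access=COSTS)
print(e_comp, e_data)  # 1000.0 6000.0
```

In this made-up layer, only 10 DRAM accesses contribute 2000 of the 6000 data-energy units, which is the sense in which data movement, rather than the MAC count, tends to dominate.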

Key Observations

•  Number of weights alone is not a good metric for energy

•  All data types should be considered

Energy Consumption of GoogLeNet

Output Feature Map   43%
Input Feature Map    25%
Weights              22%
Computation          10%

[Yang et al., CVPR 2017] Slide 8

Energy-Aware Pruning

•  Sort layers based on energy and prune layers that consume most energy first

•  EAP reduces AlexNet energy by 3.7x and outperforms the previous work that uses magnitude-based pruning by 1.7x

[Bar chart: Normalized Energy (AlexNet), ×10^9, for Ori., DC (Magnitude Based Pruning, 2.1x reduction) and EAP (Energy Aware Pruning, 3.7x reduction)]

[Yang et al., CVPR 2017]

Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings

Slide 9
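The layer-ordering idea in the first bullet can be sketched as follows. This is a minimal illustration, not the method of [Yang et al., CVPR 2017]: the function names, the per-layer energy numbers, and the weights are hypothetical, and the per-layer step shown here is plain magnitude pruning used as a stand-in, since the slide does not describe how weights inside a layer are selected.

```python
def pruning_order(layer_energy):
    """Return layer names sorted so the most energy-hungry layer comes first."""
    return sorted(layer_energy, key=layer_energy.get, reverse=True)


def prune_smallest(weights, fraction):
    """Stand-in per-layer step: zero the given fraction of smallest-magnitude weights."""
    k = int(len(weights) * fraction)
    # Indices ordered by absolute value, smallest first.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in by_magnitude[:k]:
        pruned[i] = 0.0
    return pruned


# Hypothetical per-layer energy estimates (arbitrary units).
layer_energy = {"conv1": 4.0, "conv2": 9.5, "fc1": 2.5}
order = pruning_order(layer_energy)
print(order)  # ['conv2', 'conv1', 'fc1']

# Hypothetical weights for the first layer in that order.
print(prune_smallest([0.3, -0.05, -0.4, 0.7, 0.02, 0.1], fraction=0.5))
# [0.3, 0.0, -0.4, 0.7, 0.0, 0.0]
```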

NetAdapt: Platform-Aware DNN Adaptation

•  Automatically adapt DNN to a mobile platform to reach a target latency or energy budget

•  Use empirical measurements to guide optimization (avoid modeling of tool chain or platform architecture)

[Diagram: Pretrained Network + Budget → NetAdapt → Adapted Network; NetAdapt sends Network Proposals (A, B, C, D, …, Z) to the Platform to Measure and receives Empirical Measurements]

Budget

Metric    Budget
Latency   3.8
Energy    10.5

Empirical Measurements

Metric    Proposal A   …   Proposal Z
Latency   15.6         …   14.3
Energy    41           …   46

[Yang et al., ECCV 2018] In collaboration with Google’s Mobile Vision Team Slide 10
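The measure-and-select loop in the diagram might be sketched like this. It is a toy sketch under stated assumptions, not the NetAdapt implementation: a network is reduced to a tuple of layer widths, `measure` stands in for an on-device latency measurement, `accuracy` is a made-up proxy, and each proposal simply removes two filters from one layer. Any fine-tuning of proposals is left out.

```python
import math


def measure(network):
    """Stand-in for an empirical latency measurement on the target platform."""
    return float(sum(network))


def accuracy(network):
    """Hypothetical accuracy proxy; earlier layers are weighted as more important."""
    importance = (3.0, 2.0, 1.0)
    return sum(w * math.sqrt(n) for w, n in zip(importance, network))


def propose(network, shrink=2):
    """One proposal per layer: the same network with that layer made narrower."""
    proposals = []
    for i, width in enumerate(network):
        if width - shrink >= 1:
            proposals.append(network[:i] + (width - shrink,) + network[i + 1:])
    return proposals


def adapt(network, budget, step):
    """Shrink the network step by step until the measured metric meets the budget."""
    current = network
    while measure(current) > budget:
        target = measure(current) - step
        candidates = [p for p in propose(current) if measure(p) <= target]
        if not candidates:
            break  # no proposal reaches the intermediate target
        # Keep the proposal that preserves the most (proxy) accuracy.
        current = max(candidates, key=accuracy)
    return current


adapted = adapt((8, 8, 8), budget=18.0, step=2.0)
print(adapted, measure(adapted))  # (8, 8, 2) 18.0
```

Because the loop only ever calls `measure`, it needs no model of the tool chain or platform architecture, which is the point made in the second bullet.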

Improved Latency vs. Accuracy Tradeoff

•  NetAdapt boosts the real inference speed of MobileNet by up to 1.7x with higher accuracy

+0.3% accuracy 1.7x faster

+0.3% accuracy 1.6x faster

*Tested on the ImageNet dataset and a Google Pixel 1 CPU

Slide 11

Many Efficient DNN Design Approaches

Network Pruning

[Figure: network before pruning and after pruning; pruning synapses, pruning neurons]

Compact Network Architectures

[Figure: filter shapes with dimensions R, S, C and 1]

Reduce Precision

32-bit float   10100101 00000000 01010000 00000100
8-bit fixed    01100110
Binary         0

No guarantee that DNN algorithm designer will use a given approach.

Need flexible hardware!

Slide 12

Existing DNN Architectures

•  Specialized DNN hardware often relies on certain properties of DNNs in order to achieve high energy-efficiency

•  Example: Reduce memory access by amortizing across MAC array

[Diagram: MAC array fed by Weight Memory and Activation Memory, exploiting weight reuse and activation reuse]

Slide 13

Limitation of Existing DNN Architectures

•  Example: Reuse and array utilization depend on # of channels, feature map/batch size

–  Not efficient across all network architectures (e.g., compact DNNs)

–  Less efficient as array scales up in size

–  Can be challenging to exploit sparsity

[Diagram: MAC array (spatial accumulation), with axes Number of filters (output channels) and Number of input channels]

[Diagram: MAC array (temporal accumulation), with axes Number of filters (output channels) and feature map or batch size]

Slide 14
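A rough way to see the utilization problem for the spatial-accumulation array is to count how many MAC units a layer can keep busy. The sketch below is a simplification under stated assumptions: one array axis is mapped to input channels and the other to filters, only a single pass is considered, and the function name and the 16×16 array size are made up for the example.

```python
def array_utilization(rows, cols, in_channels, filters):
    """Fraction of a rows x cols MAC array kept busy in one pass.

    Assumes rows are mapped to input channels and columns to filters.
    """
    used = min(rows, in_channels) * min(cols, filters)
    return used / (rows * cols)


# Hypothetical 16x16 array.
print(array_utilization(16, 16, in_channels=64, filters=64))  # 1.0
print(array_utilization(16, 16, in_channels=3, filters=64))   # 0.1875
print(array_utilization(16, 16, in_channels=1, filters=64))   # 0.0625
```

Under this mapping, a layer with few input channels (the last case resembles a depthwise layer in a compact DNN) leaves most of the array idle, and the idle fraction grows as the array gets larger.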

Eyeriss v2: Balancing Flexibility and Efficiency

Over an order of magnitude faster and more energy efficient than Eyeriss v1

Efficiently supports

•  Wide range of filter shapes

–  Large and Compact

•  Different Layers

–  CONV, FC, depthwise, etc.

•  Wide range of sparsity

–  Dense and Sparse

•  Scalable architecture

eyeriss.mit.edu [Chen et al., arXiv 2018]

Slide 15

Eyeriss v2: Balancing Flexibility and Efficiency

•  Flexible dataflow, called Row-Stationary Plus (RS+), that enables the spatial mapping of data from all dimensions for high PE array utilization and data reuse for various layer shapes and sizes

eyeriss.mit.edu [Chen et al., arXiv 2018]

[Figure: PE array mapping, showing Active PEs and Idle PEs]

Row Stationary: array axes F1 (Output fmap width*) and S1 (Filter width*)

Row Stationary Plus: array axes F1 (Output fmap width*) and S1×G1 (Filter width* × # channel groups*)

*tiling parameters Slide 16

Eyeriss v2: Balancing Flexibility and Efficiency

•  Flexible dataflow, called Row-Stationary Plus (RS+), that enables the spatial mapping of data from all dimensions for high PE array utilization and data reuse for various layer shapes and sizes

•  Flexible NoC to support RS+ that can operate in different modes for different requirements

–  Utilizes multicast to exploit spatial data reuse

–  Utilizes unicast for high BW for weights for FC and weights & activations for compact network architectures

•  Processes data in both compressed and raw format to minimize data movement for both CONV and FC layers

–  Exploit sparsity in both weights and activations

eyeriss.mit.edu [Chen et al., arXiv 2018]

Slide 17

Looking Beyond the DNN Accelerator for Acceleration

Z. Zhang, V. Sze, “FAST: A Framework to Accelerate Super-Resolution Processing on Compressed Videos,” CVPRW 2017

Slide 18

Super-Resolution on Mobile Devices

Use super-resolution to improve the viewing experience of lower-resolution content (reduce communication bandwidth)

Screens are getting larger

Low Resolution Streaming

Transmit low resolution for lower bandwidth

High Resolution Playback

Slide 19

Complexity of Super Resolution Algorithms

State-of-the-art super resolution algorithms use DNNs → computationally expensive, especially at high resolutions (HD or 4K)

SRCNN [Dong et al., ECCV 2014]

8032 MACs/pixel → ~500 GMAC/s for HD @ 30 fps

Slide 20
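The ~500 GMAC/s figure can be checked with a few lines of arithmetic. The only assumption is that "HD" here means a 1920×1080 output frame; the MACs/pixel and frame rate are taken from the slide.

```python
MACS_PER_PIXEL = 8032        # SRCNN figure from the slide
WIDTH, HEIGHT = 1920, 1080   # assumption: "HD" means 1920x1080 output pixels
FPS = 30

macs_per_second = MACS_PER_PIXEL * WIDTH * HEIGHT * FPS
gmacs_per_second = macs_per_second / 1e9
print(f"{gmacs_per_second:.1f} GMAC/s")  # 499.7 GMAC/s
```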

FAST: A Framework to Accelerate Super Resolution

A framework that accelerates any SR algorithm by up to 15x when running on compressed videos

[Zhang et al., CVPRW 2017]

[Diagram: Compressed video + SR algorithm → FAST → Real-time SR, 15x faster]

Slide 21

Free Information in Compressed Videos

This representation can help accelerate super-resolution

[Diagram: Compressed video → Decode → Pixels (video as a stack of pixels)]

Representation from compressed video: Block-structure, Motion-compensation

Slide 22

Transfer is Lightweight

[Diagram: without transfer, SR maps every low-res video frame to a high-res video frame (SR, SR, SR, SR); with transfer, SR runs on one low-res frame and Transfer produces the remaining high-res frames]

Transfer components: Fractional Interpolation, Bicubic Interpolation, Skip Flag

Transfer allows SR to run on only a subset of frames

The complexity of the transfer is comparable to bicubic interpolation. Transfer N frames, accelerate by N

Slide 23
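One way to read "Transfer N frames, accelerate by N" is through a simple per-group cost model. The model below is an assumption made for illustration, not taken from the FAST paper: SR runs on one frame of each group, every other frame in the group is produced by transfer, and the cost numbers are arbitrary units.

```python
def fast_speedup(group_size, sr_cost, transfer_cost):
    """Speedup of SR-plus-transfer over running SR on every frame of a group.

    Assumed model: SR on one frame, transfer on the other (group_size - 1) frames.
    """
    full = group_size * sr_cost
    fast = sr_cost + (group_size - 1) * transfer_cost
    return full / fast


# If transfer were free, the speedup would equal the group size.
print(fast_speedup(4, sr_cost=100.0, transfer_cost=0.0))   # 4.0

# With a small but non-zero transfer cost, the speedup falls below the group size.
print(round(fast_speedup(16, sr_cost=100.0, transfer_cost=1.0), 2))  # 13.91
```

Under this model, the speedup approaches the group size only when transfer is much cheaper than SR, which is why the slide stresses that transfer costs about as much as bicubic interpolation.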

Evaluation: Accelerating SRCNN

Examples of videos in the test set (20 videos for HEVC development): PartyScene, RaceHorse, BasketballPass

4x acceleration with NO PSNR LOSS. 16x acceleration with 0.2 dB loss of PSNR

Slide 24

Visual Evaluation

[Image panels: SRCNN, FAST + SRCNN, Bicubic]

Code released at www.rle.mit.edu/eems/fast Slide 25

Beyond Deep Neural Networks

A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, “Navion: A Fully Integrated Energy-Efficient Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones,” Symposium on VLSI 2018

Slide 26

Energy-Efficient Autonomous Navigation of NanoDrones

[Zhang et al., RSS 2017], [Suleiman et al., VLSI 2018]

Navion Chip: Localization and Mapping at < 30 mW (full integration on-chip)

http://navion.mit.edu

In collaboration with Sertac Karaman (AeroAstro) and Luca Carlone (AeroAstro)

Slide 27

Localization and Mapping Using VIO*

[Diagram: Image sequence + IMU (Inertial Measurement Unit) → Visual-Inertial Odometry (VIO) → Localization, Mapping]

*Subset of SLAM algorithm (Simultaneous Localization And Mapping) Slide 28

VIO: Backend uses Factor Graph to Infer State of Drone

[Diagram: Camera → Vision Frontend (VFE) → Feature Tracks → Backend (BE); IMU → IMU Frontend (IFE) → Estimated States → Backend (BE); Backend (BE) → Updated States (xi) & Sparse 3D Map]

Factor Graph: 4000+ factors (IMU Factors, Vision Factors, Other Factors)

Non-linear least squares factor graph optimization

Exploit sparsity for 5.4x memory reduction and 7.2x speed up

Slide 29
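To make the backend step concrete, here is a toy least-squares solve over a tiny factor graph. It is a sketch under heavy simplifications and not the Navion backend: the states are three 1-D positions instead of full drone states, every factor is linear (so one Gauss-Newton step already reaches the minimum, whereas the real problem is non-linear and needs iteration), and the measurements are invented.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


# Each factor: (state indices it touches, Jacobian entries, measurement).
# residual = sum(jac[k] * x[idx[k]]) - measurement
FACTORS = [
    ((0,), (1.0,), 0.0),          # prior: x0 = 0
    ((0, 1), (-1.0, 1.0), 1.0),   # odometry-like: x1 - x0 = 1.0
    ((1, 2), (-1.0, 1.0), 1.0),   # odometry-like: x2 - x1 = 1.0
    ((0, 2), (-1.0, 1.0), 2.2),   # hypothetical extra measurement: x2 - x0 = 2.2
]


def gauss_newton_step(x, factors):
    """One Gauss-Newton step: solve (J^T J) dx = -J^T r and return x + dx."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]  # J^T J
    g = [0.0] * n                      # J^T r
    for idx, jac, z in factors:
        r = sum(j * x[i] for i, j in zip(idx, jac)) - z
        for a, ja in zip(idx, jac):
            g[a] += ja * r
            for b, jb in zip(idx, jac):
                H[a][b] += ja * jb
    dx = solve_linear(H, [-v for v in g])
    return [xi + di for xi, di in zip(x, dx)]


estimate = gauss_newton_step([0.0, 0.0, 0.0], FACTORS)
print([round(v, 4) for v in estimate])  # [0.0, 1.0667, 2.1333]
```

Each factor only adds entries to the rows and columns of the states it touches. In a graph with thousands of factors and many states, most entries of the matrix stay zero, which is the kind of sparsity the slide refers to.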

Summary

•  Design considerations for deep learning at the edge

–  Incorporate direct metrics into algorithm design for improved efficiency

–  Use a flexible dataflow and NoC to exploit data reuse for energy efficiency and increase PE utilization for speed

•  Accelerate deep learning by looking beyond the accelerator

–  Exploit data representation for FAST Super-Resolution

•  Other forms of inference at the edge beyond deep learning

–  Graphical models for localization and mapping in nanodrones

For more info: www.rle.mit.edu/eems

Slide 30

Thank you!