Elastic Processor
for Deep Learning
Zhiwei Xu
Institute of Computing Technology (ICT)
Chinese Academy of Sciences (CAS)
http://novel.ict.ac.cn/zxu/
This research is supported in part by the CAS Strategic Priority Program (Grants
XDA06010403, XDB02040009) and the NSF of China (Grants 61473275, 61532016)
Cambricon (中科寒武纪)
Machine-Learning Applications
• Platforms: data centers, smart phones, embedded devices, supercomputers
• Applications: audio recognition, automatic translation, drug design, image analysis, business analytics, ad prediction, robotics, consumer electronics
What Can AI Learn from HPC?
• Architecture
• Execution model
• Programming model
Look at the Past to See the Future
[Timeline: 1976 to 2025 and beyond]
Xu et al., "High-Performance Computing Environment: A Review of Twenty Years of Experiments in China", National Science Review 3(1): 36-48 (2016)
Flops Growth

         1976 to 1995                            1995 to 2015
         Speed Increase   Annual Growth Rate     Speed Increase      Annual Growth Rate
China    600 times        40%                    28 million times    136%
World    1550 times       47%                    0.2 million times   84%
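As a quick consistency check of the table (my arithmetic, not from the slide), the annual growth rate is the period-length root of the total speed increase; for China over 1995 to 2015:

\[ \left(2.8 \times 10^{7}\right)^{1/20} - 1 \approx 1.36 \approx 136\%\ \text{per year} \]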
[Figure: flops of leading HPC systems (log scale, 10^6 to 10^17), 1976–2015; series: Top1 in Top500, China HPCE (the environment approach), China Top1 (the machine approach)]
Growth trends of top HPC systems in China and in the world: before and after 1995
• Flops
• Ecosystem
  – Stacks
  – Community
• Direction
  – Goal
  – Objectives
  – Priority
• 2035: ??
High-End Computing: Three Phases
• Priority has gone through two phases so far
  – Speed (flops), a.k.a. performance
  – Scalability: market scalability, problem scalability
• Efficiency is emerging as the third phase (see the timeline below)
[Timeline, 1976 to ~2025: Speed First → Scalability First → Efficiency First]
Fundamental Challenge: First Time in 70 Years
• Energy efficiency growth lags behind speed growth
• Big data computing is especially bad: 15-150 KOPJ (kilo-operations per joule)
[Figure: Operations/s, Operations/kWh, and Watts (log scale up to 10^18), 1945–2015, for ENIAC, UNIVAC, CDC6600, Cray-1, Cray-2, NWT, BG/L, and TH-2; annotations: 50 GOPJ, 1 TOPJ by 2025, 0.12 MOPJ (Spark), 16 KOPJ (Hadoop)]
Most Important Efficiency Metric
• Energy Efficiency: GOPS/W ≈ GOPJ (giga-operations per second per watt, i.e. giga-operations per joule)
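The equivalence is just unit cancellation (a standard identity, written out here for completeness):

\[ \frac{1\ \text{GOPS}}{1\ \text{W}} = \frac{10^{9}\ \text{op/s}}{1\ \text{J/s}} = 10^{9}\ \text{op/J} = 1\ \text{GOPJ} \]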
[Figure: MFlops/W of leading HPC systems, 1993–2014, log scale from 0.1 to 10,000; labeled points/programs include ASCI, Earth Simulator, HPCS, K-Computer, and 863HPC]
Hadoop Efficiency Is Low
• Lacks a high-performance communication substrate
  – Uses HTTP, RPC, and direct sockets over TCP/IP to communicate
  – Can MPI be used for big data?
               Speed Efficiency     Energy Efficiency
               (Sustained/Peak)     (Operations per Joule)
Payload        0.002%               1.55x10^4
Linpack        94.5%                7.26x10^8
Total op       4.22%                2.20x10^7
Instruction    4.72%                2.45x10^7
Jim Gray's Data Challenge: Four Phases
1. TeraSort challenge raised (sort 1 TB in 1 minute)
2. Speed growth: TB/minute reached in 2009 (ad hoc methods)
3. Data size growth: TB → PB → 10 PB (MapReduce)
4. Efficiency: 1 PB sorted on 190 AWS instances in 2014
[Figure: execution time in minutes (log scale, 1 to 1000) vs. year, 1996–2014, for 1 TB, 1 PB, and 10 PB sorts; data points include 151, 49, 33, 4.95, 3.48, 1.03, 362, 975, 234, 387, and 33 minutes; system sizes: N=3658 and N=1000 (Hadoop), N=8000, N=190 (Spark); annotation: efficiency becomes the top priority around 2020]
Makimoto’s Wave
• Semiconductor technology will soon enter another phase change. But what is it?
[Figure: Makimoto's Wave, 1957–2027, alternating between standardization (discrete circuits; memory and µP; FPGA; ? for 2017–2027) and customization (IC/LSI; ASIC; SoC/SiP)]
Makimoto’s Wave
• HFSI: Highly Flexible Super Integration
• Redundant circuits can be shut off when not in use
[Figure: Makimoto's Wave again, with the 2017–2027 standardization phase labeled HFSI and marked EP (Elastic Processor)]
Our Answer: Elastic Processor
Elastic Processor
• A new architecture style: FISC (Function Instruction Set Computer)
  – Featuring function instructions executed by programmable ASIC accelerators (a hypothetical sketch of the idea follows the comparison table below)
  – Targeting 1000 GOPS/W = 1 TOPJ
                CISC (Intel X86)    RISC (ARM)    FISC (Function Instruction Set Computer)
Chip types      10s                 1K            10K
Power           10~100W             1~10W         0.1~1W
Apps per chip   10M                 100K          10K
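To make the idea concrete, here is a toy, hypothetical sketch (the opcode names, operand format, and interpreter are my illustration, not the actual FISC or Cambricon instruction set): a single function instruction invokes a whole accelerator routine, such as a matrix-vector product or an activation pass, so fetch/decode overhead is amortized over thousands of arithmetic operations.

```python
import numpy as np

def fisc_execute(program, state):
    """Toy interpreter for a hypothetical function-instruction-set machine:
    each instruction names a coarse-grained routine that an ASIC accelerator
    would execute, plus the operand buffers it reads and writes."""
    ops = {
        "MATVEC":  lambda W, x: W @ x,                  # whole dense layer in one instruction
        "SIGMOID": lambda x: 1.0 / (1.0 + np.exp(-x)),  # whole activation pass in one instruction
    }
    for opcode, dst, *srcs in program:
        state[dst] = ops[opcode](*(state[s] for s in srcs))
    return state

# Two function instructions replace the millions of scalar instructions a
# RISC/CISC core would issue for the same small network layer.
state = {"W1": np.random.randn(16, 16), "x": np.random.randn(16)}
program = [("MATVEC", "h", "W1", "x"),
           ("SIGMOID", "y", "h")]
result = fisc_execute(program, state)["y"]
```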
DianNao: A Neural Network Accelerator
• Supports multiple neural network algorithms, such as DNN, CNN, MLP, and SOM
• Pre-tape-out simulation results: 0.98 GHz, 452 GOPS, 0.485 W, 931.96 GOPS/W @ 65 nm
• ASPLOS 2014 Best Paper
• ~700x speedup over an Intel Core i7
[Figures: architecture diagram and IC layout in GDSII]
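As a functional sketch of the computation such an accelerator performs (a simplification of the ASPLOS'14 paper's description of streaming neurons and synapses through a tiled multiplier/adder-tree/non-linear pipeline; the 16x16 tile size and the NBin/SB/NBout buffer names follow the paper, everything else here is illustrative):

```python
import numpy as np

Ti, Tn = 16, 16  # inputs and outputs processed per step (assumed tile sizes)

def classifier_layer(weights, inputs):
    """Software model of a fully connected (classifier) layer computed the way
    a DianNao-style datapath would: tile over outputs, stream input tiles."""
    n_out, n_in = weights.shape
    outputs = np.zeros(n_out)
    for o in range(0, n_out, Tn):              # tile of output neurons
        partial = np.zeros(min(Tn, n_out - o))
        for i in range(0, n_in, Ti):           # stream tiles of input neurons
            w_tile = weights[o:o+Tn, i:i+Ti]   # synapses from the synapse buffer (SB)
            x_tile = inputs[i:i+Ti]            # neurons from the input buffer (NBin)
            partial += w_tile @ x_tile         # multipliers + adder trees
        outputs[o:o+Tn] = np.tanh(partial)     # non-linear stage
    return outputs                             # written back to the output buffer (NBout)

y = classifier_layer(np.random.randn(64, 128), np.random.randn(128))
```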
Three More Accelerators Showing Potential for TOPJ
• DaDianNao (大电脑, "big computer"): an NN supercomputer containing up to 64 chips
• PuDianNao (普电脑, "general computer"): a polyvalent machine learning accelerator
• ShiDianNao (视电脑, "vision computer"): a vision accelerator for embedded devices (cameras)

                DianNao (电脑)           DaDianNao              PuDianNao          ShiDianNao
Efficiency      931 GOPS/W               100-250 GOPS/W         300-1200 GOPS/W    2-4 TOPS/W
Publication     ASPLOS'14 Best Paper     MICRO'14 Best Paper    ASPLOS'15          ISCA'15
DaDianNao: An NN Supercomputer
• On average, 450x speedup and 150x energy saving over a K20 GPU for a 64-chip system
• Speedup: 46.38x vs. CPU, 28.94x vs. GPU, 1.87x vs. DianNao
• Energy saving: 4688.13x vs. GPU, 63.48x vs. DianNao
ShiDianNao: A Vision Accelerator for Embedded Devices
PuDianNao
• Area: 3.51 mm²
• Power: 596 mW
• Frequency: 1 GHz
• Supports many ML algorithms: CNN/DNN, LR, k-Means, SVM, NB, KNN, CT, …

DianNao
• Area: 3.02 mm²
• Power: 485 mW
• Frequency: 0.98 GHz
• Supports CNN/DNN
Interesting Tradeoff
[Figure: efficiency vs. flexibility trade-off; plotted points: CPU, GPU, DianNao, PuDianNao, ASIC, and an "ideally" target combining high efficiency with high flexibility]
Design Strategies of the ML Accelerator
• Considering:
  – Diverse algorithms: different styles, targets, objects, models
  – Function unit efficiency: detailed analysis of the common computation primitives of various machine learning algorithms
  – Memory system efficiency: detailed analysis of the common memory access patterns of various machine learning algorithms
Analysis Summary
• Computation primitives: distance of vectors, dot product of vectors, counting, non-linear functions, sorting
• Memory access patterns: "three pillars"
Architecture of PuDianNao
• Machine Learning Unit (MLU) supporting the five primitives: distance of vectors, dot product of vectors, counting, non-linear functions, sorting (a software sketch of these primitives follows)
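A minimal software sketch of what each primitive computes (my NumPy illustration of the operations, not the MLU's hardware pipeline; function names and the example algorithms in the comments are assumptions added for clarity):

```python
import numpy as np

def distance(a, b):            # distance of vectors (e.g. k-Means, KNN)
    return np.sqrt(np.sum((a - b) ** 2))

def dot_product(a, b):         # dot product of vectors (e.g. LR, SVM, DNN layers)
    return np.sum(a * b)

def counting(labels, k):       # counting class/feature occurrences (e.g. Naive Bayes)
    return np.bincount(labels, minlength=k)

def sigmoid(x):                # non-linear functions (e.g. DNN activations)
    return 1.0 / (1.0 + np.exp(-x))

def top_k(distances, k):       # sorting / partial sort (e.g. picking KNN neighbors)
    return np.argsort(distances)[:k]
```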
Experimental Results
• TSMC 65 nm, area: 3.51 mm²
• Performance: 1.14x w.r.t. a K20 GPU
• Energy saving: 128.41x w.r.t. a K20 GPU
References (http://novel.ict.ac.cn/zxu/#PAPERS)
• Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam, “DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning”, CACM, to appear.
• Zhiwei Xu, Xuebin Chi, Nong Xiao, “High-Performance Computing Environment: A Review of Twenty Years of Experiments in China”, National Science Review 3(1): 36-48 (2016).
• Tianshi Chen, Qi Guo, Olivier Temam, Yue Wu, Yungang Bao, Zhiwei Xu, Yunji Chen, “Statistical Performance Comparisons of Computers”, IEEE Trans. Computers 64(5): 1442-1455 (2015).
• Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen, "DianNaoYu: An Instruction Set Architecture for Neural Networks", ISCA'16.
• Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam, "ShiDianNao: Shifting Vision Processing Closer to the Sensor", ISCA'15.
• Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Temam, Xiaobing Feng, Xuehai Zhou, and Yunji Chen, "PuDianNao: A Polyvalent Machine Learning Accelerator", ASPLOS'15.
• Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam, "DaDianNao: A Machine-Learning Supercomputer", MICRO'14.
• Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam, "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning", ASPLOS'14.
• Zhiwei Xu, “High-Performance Techniques for Big Data Computing in Internet Services”, Invited speech at SC12, SC Companion 2012: 1861-1895.
• Zhiwei Xu, “Measuring Green IT in Society”, IEEE Computer 45(5): 83-85 (2012).
• Zhiwei Xu, “How Much Power Is Needed for a Billion-Thread High-Throughput Server?”, Frontiers of Computer Science 6(4): 339-346 (2012).
http://novel.ict.ac.cn/zxu/
谢谢! Thank you! 诚邀加盟! We are hiring!
Zhiwei Xu 徐志伟 Institute of Computing Technology (ICT)
Chinese Academy of Sciences (CAS)
http://novel.ict.ac.cn/zxu/
www.cambricon.com
Cambricon (中科寒武纪)