Elastic Processor
for Deep Learning
Zhiwei Xu
Institute of Computing Technology (ICT)
Chinese Academy of Sciences (CAS)
http://novel.ict.ac.cn/zxu/
This research is supported in part by the CAS Strategic Priority Program (Grants
XDA06010403, XDB02040009) and the NSF of China (Grants 61473275, 61532016)
Cambricon (中科寒武纪)
Machine-Learning Applications
• Platforms: data centers, smart phones, embedded devices, supercomputers
• Applications: audio recognition, automatic translation, drug design, image analysis, business analytics, ad prediction, robotics, consumer electronics
What Can AI Learn from HPC?
• Architecture
• Execution model
• Programming model
Look at the Past to See the Future
[Timeline: 1976 to 2025 and beyond]
Xu et al., "High-Performance Computing Environment: A Review of Twenty Years of Experiments in China", National Science Review 3(1): 36-48 (2016)
Flops Growth

         1976 to 1995                            1995 to 2015
         Speed Increase   Annual Growth Rate     Speed Increase      Annual Growth Rate
China    600 times        40%                    28 million times    136%
World    1550 times       47%                    0.2 million times   84%
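As a quick consistency check of the table (my arithmetic, not from the slide), the annual growth rate is the period-length root of the total speed increase; for China over 1995 to 2015:

\[ \left(2.8 \times 10^{7}\right)^{1/20} - 1 \approx 1.36 \approx 136\%\ \text{per year} \]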
[Figure: flops of leading HPC systems (log scale, 10^6 to 10^17), 1976–2015; series: Top1 in Top500, China HPCE (the environment approach), China Top1 (the machine approach)]
Growth trends of top HPC systems in China and in the world: before and after 1995
• Flops
• Ecosystem
  – Stacks
  – Community
• Direction
  – Goal
  – Objectives
  – Priority
• 2035: ??
High-End Computing: Three Phases
• Priority has gone through two phases so far
  – Speed (flops), a.k.a. performance
  – Scalability: market scalability, problem scalability
• Efficiency is emerging as the third phase (see the timeline below)
[Timeline, 1976 to ~2025: Speed First → Scalability First → Efficiency First]
Fundamental Challenge: First Time in 70 Years
• Energy efficiency growth lags behind speed growth
• Big data computing is especially bad: 15-150 KOPJ (kilo-operations per joule)
[Figure: Operations/s, Operations/kWh, and Watts (log scale up to 10^18), 1945–2015, for ENIAC, UNIVAC, CDC6600, Cray-1, Cray-2, NWT, BG/L, and TH-2; annotations: 50 GOPJ, 1 TOPJ by 2025, 0.12 MOPJ (Spark), 16 KOPJ (Hadoop)]
Most Important Efficiency Metric
• Energy Efficiency: GOPS/W ≈ GOPJ (giga-operations per second per watt, i.e. giga-operations per joule)
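The equivalence is just unit cancellation (a standard identity, written out here for completeness):

\[ \frac{1\ \text{GOPS}}{1\ \text{W}} = \frac{10^{9}\ \text{op/s}}{1\ \text{J/s}} = 10^{9}\ \text{op/J} = 1\ \text{GOPJ} \]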
[Figure: MFlops/W of leading HPC systems, 1993–2014, log scale from 0.1 to 10,000; labeled points/programs include ASCI, Earth Simulator, HPCS, K-Computer, and 863HPC]
Hadoop Efficiency Is Low
• Lacks a high-performance communication substrate
  – Uses HTTP, RPC, and direct sockets over TCP/IP to communicate
  – Can MPI be used for big data?
               Speed Efficiency     Energy Efficiency
               (Sustained/Peak)     (Operations per Joule)
Payload        0.002%               1.55x10^4
Linpack        94.5%                7.26x10^8
Total op       4.22%                2.20x10^7
Instruction    4.72%                2.45x10^7
Jim Gray's Data Challenge: Four Phases
1. TeraSort challenge raised (sort 1 TB in 1 minute)
2. Speed growth: TB/minute reached in 2009 (ad hoc methods)
3. Data size growth: TB → PB → 10 PB (MapReduce)
4. Efficiency: 1 PB sorted on 190 AWS instances in 2014
[Figure: execution time in minutes (log scale, 1 to 1000) vs. year, 1996–2014, for 1 TB, 1 PB, and 10 PB sorts; data points include 151, 49, 33, 4.95, 3.48, 1.03, 362, 975, 234, 387, and 33 minutes; system sizes: N=3658 and N=1000 (Hadoop), N=8000, N=190 (Spark); annotation: efficiency becomes the top priority around 2020]
Makimoto’s Wave
• Semiconductor technology will soon enter another phase change. But what is it?
[Figure: Makimoto's Wave, 1957–2027, alternating between standardization (discrete circuits; memory and µP; FPGA; ? for 2017–2027) and customization (IC/LSI; ASIC; SoC/SiP)]
Makimoto’s Wave
• HFSI: Highly Flexible Super Integration
• Redundant circuits can be shut off when not in use
[Figure: Makimoto's Wave again, with the 2017–2027 standardization phase labeled HFSI and marked EP (Elastic Processor)]
Our Answer: Elastic Processor
Elastic Processor
• A new architecture style: FISC (Function Instruction Set Computer)
  – Featuring function instructions executed by programmable ASIC accelerators (a hypothetical sketch of the idea follows the comparison table below)
  – Targeting 1000 GOPS/W = 1 TOPJ
                CISC (Intel X86)    RISC (ARM)    FISC (Function Instruction Set Computer)
Chip types      10s                 1K            10K
Power           10~100W             1~10W         0.1~1W
Apps per chip   10M                 100K          10K
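To make the idea concrete, here is a toy, hypothetical sketch (the opcode names, operand format, and interpreter are my illustration, not the actual FISC or Cambricon instruction set): a single function instruction invokes a whole accelerator routine, such as a matrix-vector product or an activation pass, so fetch/decode overhead is amortized over thousands of arithmetic operations.

```python
import numpy as np

def fisc_execute(program, state):
    """Toy interpreter for a hypothetical function-instruction-set machine:
    each instruction names a coarse-grained routine that an ASIC accelerator
    would execute, plus the operand buffers it reads and writes."""
    ops = {
        "MATVEC":  lambda W, x: W @ x,                  # whole dense layer in one instruction
        "SIGMOID": lambda x: 1.0 / (1.0 + np.exp(-x)),  # whole activation pass in one instruction
    }
    for opcode, dst, *srcs in program:
        state[dst] = ops[opcode](*(state[s] for s in srcs))
    return state

# Two function instructions replace the millions of scalar instructions a
# RISC/CISC core would issue for the same small network layer.
state = {"W1": np.random.randn(16, 16), "x": np.random.randn(16)}
program = [("MATVEC", "h", "W1", "x"),
           ("SIGMOID", "y", "h")]
result = fisc_execute(program, state)["y"]
```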
DianNao: A Neural Network Accelerator
• Supports multiple neural network algorithms, such as DNN, CNN, MLP, and SOM
• Pre-tape-out simulation results: 0.98 GHz, 452 GOPS, 0.485 W, 931.96 GOPS/W @ 65 nm
• ASPLOS 2014 Best Paper
• ~700x speedup over an Intel Core i7
[Figures: architecture diagram and IC layout in GDSII]
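As a functional sketch of the computation such an accelerator performs (a simplification of the ASPLOS'14 paper's description of streaming neurons and synapses through a tiled multiplier/adder-tree/non-linear pipeline; the 16x16 tile size and the NBin/SB/NBout buffer names follow the paper, everything else here is illustrative):

```python
import numpy as np

Ti, Tn = 16, 16  # inputs and outputs processed per step (assumed tile sizes)

def classifier_layer(weights, inputs):
    """Software model of a fully connected (classifier) layer computed the way
    a DianNao-style datapath would: tile over outputs, stream input tiles."""
    n_out, n_in = weights.shape
    outputs = np.zeros(n_out)
    for o in range(0, n_out, Tn):              # tile of output neurons
        partial = np.zeros(min(Tn, n_out - o))
        for i in range(0, n_in, Ti):           # stream tiles of input neurons
            w_tile = weights[o:o+Tn, i:i+Ti]   # synapses from the synapse buffer (SB)
            x_tile = inputs[i:i+Ti]            # neurons from the input buffer (NBin)
            partial += w_tile @ x_tile         # multipliers + adder trees
        outputs[o:o+Tn] = np.tanh(partial)     # non-linear stage
    return outputs                             # written back to the output buffer (NBout)

y = classifier_layer(np.random.randn(64, 128), np.random.randn(128))
```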
Three More Accelerators Showing Potential for TOPJ
• DaDianNao (大电脑, "big computer"): an NN supercomputer containing up to 64 chips
• PuDianNao (普电脑, "general computer"): a polyvalent machine learning accelerator
• ShiDianNao (视电脑, "vision computer"): a vision accelerator for embedded devices (cameras)

                DianNao (电脑)           DaDianNao              PuDianNao          ShiDianNao
Efficiency      931 GOPS/W               100-250 GOPS/W         300-1200 GOPS/W    2-4 TOPS/W
Publication     ASPLOS'14 Best Paper     MICRO'14 Best Paper    ASPLOS'15          ISCA'15
DaDianNao: An NN Supercomputer
• On average, 450x speedup and 150x energy saving over a K20 GPU for a 64-chip system
• Speedup: 46.38x vs. CPU, 28.94x vs. GPU, 1.87x vs. DianNao
• Energy saving: 4688.13x vs. GPU, 63.48x vs. DianNao
ShiDianNao: A Vision Accelerator for Embedded Devices
PuDianNao
• Area: 3.51 mm²
• Power: 596 mW
• Frequency: 1 GHz
• Supports many ML algorithms: CNN/DNN, LR, k-Means, SVM, NB, KNN, CT, …

DianNao
• Area: 3.02 mm²
• Power: 485 mW
• Frequency: 0.98 GHz
• Supports CNN/DNN
Interesting Tradeoff
[Figure: efficiency vs. flexibility trade-off; plotted points: CPU, GPU, DianNao, PuDianNao, ASIC, and an "ideally" target combining high efficiency with high flexibility]
Design Strategies of the ML Accelerator
• Considering:
  – Diverse algorithms: different styles, targets, objects, models
  – Function unit efficiency: detailed analysis of the common computation primitives of various machine learning algorithms
  – Memory system efficiency: detailed analysis of the common memory access patterns of various machine learning algorithms
Analysis Summary
• Computation primitives: distance of vectors, dot product of vectors, counting, non-linear functions, sorting
• Memory access patterns: "three pillars"
Architecture of PuDianNao
• Machine Learning Unit (MLU) supporting the five primitives: distance of vectors, dot product of vectors, counting, non-linear functions, sorting (a software sketch of these primitives follows)
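A minimal software sketch of what each primitive computes (my NumPy illustration of the operations, not the MLU's hardware pipeline; function names and the example algorithms in the comments are assumptions added for clarity):

```python
import numpy as np

def distance(a, b):            # distance of vectors (e.g. k-Means, KNN)
    return np.sqrt(np.sum((a - b) ** 2))

def dot_product(a, b):         # dot product of vectors (e.g. LR, SVM, DNN layers)
    return np.sum(a * b)

def counting(labels, k):       # counting class/feature occurrences (e.g. Naive Bayes)
    return np.bincount(labels, minlength=k)

def sigmoid(x):                # non-linear functions (e.g. DNN activations)
    return 1.0 / (1.0 + np.exp(-x))

def top_k(distances, k):       # sorting / partial sort (e.g. picking KNN neighbors)
    return np.argsort(distances)[:k]
```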
Experimental Results
• TSMC 65 nm, area: 3.51 mm²
• Performance: 1.14x w.r.t. a K20 GPU
• Energy saving: 128.41x w.r.t. a K20 GPU
References (http://novel.ict.ac.cn/zxu/#PAPERS)
• Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam, “DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning”, CACM, to appear.
• Zhiwei Xu, Xuebin Chi, Nong Xiao, “High-Performance Computing Environment: A Review of Twenty Years of Experiments in China”, National Science Review 3(1): 36-48 (2016).
• Tianshi Chen, Qi Guo, Olivier Temam, Yue Wu, Yungang Bao, Zhiwei Xu, Yunji Chen, “Statistical Performance Comparisons of Computers”, IEEE Trans. Computers 64(5): 1442-1455 (2015).
• Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen, "DianNaoYu: An Instruction Set Architecture for Neural Networks", ISCA'16.
• Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam, "ShiDianNao: Shifting Vision Processing Closer to the Sensor", ISCA'15.
• Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Temam, Xiaobing Feng, Xuehai Zhou, and Yunji Chen, "PuDianNao: A Polyvalent Machine Learning Accelerator", ASPLOS'15.
• Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam, "DaDianNao: A Machine-Learning Supercomputer", MICRO'14.
• Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam, "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning", ASPLOS'14.
• Zhiwei Xu, “High-Performance Techniques for Big Data Computing in Internet Services”, Invited speech at SC12, SC Companion 2012: 1861-1895.
• Zhiwei Xu, “Measuring Green IT in Society”, IEEE Computer 45(5): 83-85 (2012).
• Zhiwei Xu, “How Much Power Is Needed for a Billion-Thread High-Throughput Server?”, Frontiers of Computer Science 6(4): 339-346 (2012).
http://novel.ict.ac.cn/zxu/
谢谢! Thank you! 诚邀加盟! We are hiring!
Zhiwei Xu 徐志伟 Institute of Computing Technology (ICT)
Chinese Academy of Sciences (CAS)
http://novel.ict.ac.cn/zxu/
www.cambricon.com
Cambricon (中科寒武纪)