

  • Elastic Processor for Deep Learning

    Zhiwei Xu
    Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)
    http://novel.ict.ac.cn/zxu/
    [email protected]

    This research is supported in part by the CAS Strategic Priority Program (Grants XDA06010403, XDB02040009) and the NSF of China (Grants 61473275, 61532016).

    Cambricon (中科寒武纪)

  • Machine-Learning Applications

    • Platforms: data centers, smart phones, embedded devices, supercomputers

    • Applications: audio recognition, automatic translation, drug design, image analysis, business analytics, ad prediction, robotics, consumer electronics

  • What Can AI Learn from HPC?

    • Architecture

    • Execution model

    • Programming model

    [Timeline: 1976 → 2015 → 2020 → 2025 → ??]

  • Look at the Past to See the Future

    Xu et al., “High-Performance Computing Environment: A Review of Twenty Years of Experiments in China”, National Science Review 3(1): 36-48 (2016)

    Flops growth of top HPC systems:

                1976 to 1995                        1995 to 2015
                Speed Increase   Annual Growth      Speed Increase     Annual Growth
    China       600 times        40%                28 million times   136%
    World       1550 times       47%                0.2 million times  84%

    [Log-scale chart, 1E+06 to 1E+17 flops, 1976 to 2015: Top1 in Top500, China HPCE, China Top1; the Environment Approach vs. the Machine Approach. Caption: Growth trends of top HPC systems in China and in the World, before and after 1995]

    • Ecosystem
      – Stacks
      – Community
    • Direction
      – Goal
      – Objectives
      – Priority
    • 2035: ??

  • High-End Computing: Three Phases

    • Priority went through two phases
      – Speed (flops), aka performance
      – Scalability: market scalability, problem scalability

    [Timeline: 1976 → 2015 → 2020 → 2025 → ??, with phases Speed First, then Scalability First, then Efficiency First]

  • Fundamental Challenge: First Time in 70 Years

    • Energy efficiency growth lags behind speed growth
    • Big data computing is especially bad: 15-150 KOPJ
      – 0.12 MOPJ (Spark)
      – 16 KOPJ (Hadoop)

    [Log-scale chart, 1945 to 2015: operations/s, operations/kWh, and watts for ENIAC, UNIVAC, CDC6600, Cray-1, Cray-2, NWT, BG/L, and TH-2; annotations mark 50 GOPJ today and a 1 TOPJ target toward 2025]

  • Most Important Efficiency Metric

    • Energy efficiency: GOPS/W ≈ GOPJ

    [Log-scale chart of MFlops/W, 1993 to 2014, marking HPCS, ASCI, Earth Simulator, K-Computer, and 863HPC systems]
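The identity on this slide is plain unit arithmetic: a watt is a joule per second, so dividing giga-operations per second by watts leaves giga-operations per joule. A minimal sketch (the helper name `gopj` is mine, not from the talk):

```python
# GOPS/W and GOPJ name the same quantity:
#   (10^9 ops / second) / (joules / second) = 10^9 ops / joule.
def gopj(gops: float, watts: float) -> float:
    """Giga-operations per joule for a device sustaining `gops` at `watts`."""
    return gops / watts

# A machine sustaining 1000 GOPS within a 1 W budget reaches the
# 1 TOPJ (1000 GOPJ) target this talk sets for elastic processors.
print(gopj(1000.0, 1.0))  # 1000.0
```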

  • Hadoop Efficiency Is Low

    • Lacks a high-performance communication substrate
      – Uses HTTP, RPC, and direct sockets over TCP/IP to communicate
      – Can MPI be used for big data?

                  Speed Efficiency (Sustained/Peak)   Energy Efficiency (Operations per Joule)
    Payload       0.002%                              1.55x10^4
    Linpack       94.5%                               7.26x10^8
    Total op      4.22%                               2.20x10^7
    Instruction   4.72%                               2.45x10^7

  • Jim Gray’s Data Challenge: Four Phases

    1. Terasort challenge raised (sort 1 TB in 1 minute)
    2. Speed growth: TB/minute reached in 2009 (ad hoc methods)
    3. Data size growth: TB → PB → 10 PB (MapReduce)
    4. Efficiency: 1 PB sorted on 190 AWS instances in 2014

    [Execution-time chart (minutes, log scale), 1996 to 2014, for 1 TB, 1 PB, and 10 PB sorts. 1 TB: 151, 49, 33, 4.95, 3.48, 1.03; 1 PB and 10 PB: 362, 975, 33, 234, 387 across Hadoop (N=3658), Google (N=1000), Google (N=8000), and Spark (N=190). Annotation: by around 2020, efficiency becomes the top priority]
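Gray’s original target can be restated as a throughput requirement: sorting 1 TB in one minute means moving roughly 17 GB every second end to end, and in practice at least twice that in I/O, since every record is both read and written. A quick sanity check:

```python
# Gray's Terasort target restated as sustained throughput.
TB = 10**12          # sort benchmarks use decimal terabytes
seconds = 60         # "sort 1 TB in 1 minute"
gb_per_s = TB / seconds / 1e9
print(round(gb_per_s, 1))  # 16.7
```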

  • Makimoto’s Wave

    • Semiconductor technology will soon enter another phase change. But what is it?

    [Makimoto’s Wave chart, 1957 to 2027, swinging between standardization and customization: discrete circuits, IC, LSI, memory, ASIC, μP, SoC/SiP, FPGA, then “?”]

  • Makimoto’s Wave

    • HFSI: Highly Flexible Super Integration
    • Redundant circuits can be shut off when not in use

    [Same chart with the “?” filled in as HFSI. Our answer: EP, the Elastic Processor]

  • Elastic Processor

    • A new architecture style (FISC)
      – Featuring function instructions executed by programmable ASIC accelerators
      – Targeting 1000 GOPS/W = 1 TOPJ

                 CISC (Intel x86)   RISC (ARM)   FISC (Function Instruction Set Computer)
    Chip types   10s                1K           10K
    Power        10~100 W           1~10 W       0.1~1 W
    Apps/chip    10M                100K         10K
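The FISC idea can be caricatured in a few lines of Python: instead of compiling a workload down to scalar CISC/RISC instructions, a single “function instruction” names a whole operation that a programmable ASIC accelerator executes. This is only a conceptual sketch under my own naming (`FunctionISA`, `register`, and `dispatch` are all hypothetical), not Cambricon’s actual ISA:

```python
# Conceptual sketch: a "function instruction" names a whole operation
# (e.g. a dot product or a convolution) that is dispatched to an
# accelerator, rather than being expanded into scalar instructions.

class FunctionISA:
    def __init__(self):
        self._units = {}

    def register(self, name, unit):
        """Bind a function instruction to an accelerator implementation."""
        self._units[name] = unit

    def dispatch(self, name, *operands):
        """Execute one function instruction on its accelerator."""
        return self._units[name](*operands)

isa = FunctionISA()
isa.register("dot", lambda a, b: sum(x * y for x, y in zip(a, b)))
print(isa.dispatch("dot", [1, 2, 3], [4, 5, 6]))  # 32
```

The efficiency argument is that each dispatched function amortizes instruction fetch and decode over many arithmetic operations, which is one reason the table above can project a 0.1~1 W power class.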

  • DianNao: A Neural Network Accelerator

    • Supports multiple neural network algorithms, such as DNN, CNN, MLP, and SOM
    • Pre-tape-out simulation results: 0.98 GHz, 452 GOPS, 0.485 W, i.e. 931.96 GOPS/W @ 65 nm
    • 700x speedup over an Intel Core i7
    • ASPLOS 2014 Best Paper

    [IC layout in GDSII; architecture diagram]
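The efficiency figure on this slide follows directly from the other two numbers; an arithmetic cross-check:

```python
# Cross-check of the DianNao slide numbers: 452 GOPS sustained at 0.485 W
# gives the quoted energy efficiency.
gops, watts = 452.0, 0.485
efficiency = gops / watts
print(round(efficiency, 2))  # 931.96 (GOPS/W, i.e. GOPJ)
```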

  • Three More Accelerators Showing Potential for TOPJ

    • DaDianNao (大电脑, “big computer”): an NN supercomputer containing up to 64 chips; 100-250 GOPS/W; MICRO’14 Best Paper
    • PuDianNao (普电脑, “general computer”): a polyvalent machine learning accelerator; 300-1200 GOPS/W; ASPLOS’15
    • ShiDianNao (视电脑, “vision computer”): a vision accelerator for embedded devices (cameras); 2-4 TOPS/W; ISCA’15
    • (For reference, DianNao itself, 电脑, “computer”: 931 GOPS/W; ASPLOS’14 Best Paper)

  • DaDianNao: An NN Supercomputer

    • On average, 450x speedup and 150x energy saving over a K20 GPU for a 64-chip system

  • ShiDianNao: A Vision Accelerator for Embedded Devices

    • Speedup: 46.38x vs. CPU, 28.94x vs. GPU, 1.87x vs. DianNao
    • Energy saving: 4688.13x vs. GPU, 63.48x vs. DianNao

  • PuDianNao

    • PuDianNao: area 3.51 mm2, power 596 mW, frequency 1 GHz; supports many ML algorithms: CNN/DNN, LR, k-means, SVM, NB, KNN, CT, …
    • DianNao: area 3.02 mm2, power 485 mW, frequency 0.98 GHz; supports CNN/DNN

  • Interesting Tradeoff

    [Flexibility vs. efficiency chart: CPU (most flexible, least efficient), then GPU, PuDianNao, DianNao, ASIC (least flexible, most efficient); “Ideally” marks the high-flexibility, high-efficiency corner]

  • Design Strategies of ML Accelerator

    • Considering
      – Diverse algorithms: different styles, targets, objects, models
      – Function unit efficiency: detailed analysis of the common computation primitives of various machine learning algorithms
      – Memory system efficiency: detailed analysis of the common memory access patterns of various machine learning algorithms

  • Analysis Summary

    • Computation primitives: distance of vectors, dot product of vectors, counting, non-linear functions, sorting
    • Memory access patterns: “three pillars”

  • Architecture of PuDianNao

  • Machine Learning Unit (MLU)

    • Implements the common primitives: distance of vectors, dot product of vectors, counting, non-linear function, sorting
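For intuition, each of the five primitives the MLU hardwires is a one-liner in software. A plain-Python sketch (function names are mine, and sigmoid and top-k stand in for the slide’s generic “non-linear function” and “sorting”):

```python
import math

def distance(a, b):
    """Distance of vectors (Euclidean), e.g. for k-means / KNN."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    """Dot product of vectors, e.g. for DNN layers, SVM, LR."""
    return sum(x * y for x, y in zip(a, b))

def count(xs, pred):
    """Counting, e.g. for naive Bayes and classification trees."""
    return sum(1 for x in xs if pred(x))

def sigmoid(x):
    """A non-linear function, e.g. a neuron activation."""
    return 1.0 / (1.0 + math.exp(-x))

def topk(xs, k):
    """(Partial) sorting, e.g. selecting the k nearest neighbors."""
    return sorted(xs)[:k]

print(dot([1, 2], [3, 4]))  # 11
```

Covering these five primitives is what lets one datapath serve k-means, KNN, SVM, naive Bayes, and DNNs alike.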

  • Experimental Results

    • TSMC 65 nm process; area: 3.51 mm2

  • Performance Results

    Performance: 1.14x w.r.t. K20 GPU

  • Energy Results

    Energy: 128.41x w.r.t. K20 GPU

  • References http://novel.ict.ac.cn/zxu/#PAPERS

    • Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam, “DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning”, CACM, to appear.

    • Zhiwei Xu, Xuebin Chi, Nong Xiao, “High-Performance Computing Environment: A Review of Twenty Years of Experiments in China”, National Science Review 3(1): 36-48 (2016).

    • Tianshi Chen, Qi Guo, Olivier Temam, Yue Wu, Yungang Bao, Zhiwei Xu, Yunji Chen, “Statistical Performance Comparisons of Computers”, IEEE Trans. Computers 64(5): 1442-1455 (2015).

    • Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen, "DianNaoYu: An Instruction Set Architecture for Neural Networks", ISCA'16.

    • Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam, "ShiDianNao: Shifting Vision Processing Closer to the Sensor", ISCA'15.

    • Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Temam, Xiaobing Feng, Xuehai Zhou, and Yunji Chen, "PuDianNao: A Polyvalent Machine Learning Accelerator", ASPLOS'15.

    • Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam, "DaDianNao: A Machine-Learning Supercomputer", MICRO'14.

    • Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam, "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning", ASPLOS'14.

    • Zhiwei Xu, “High-Performance Techniques for Big Data Computing in Internet Services”, Invited speech at SC12, SC Companion 2012: 1861-1895.

    • Zhiwei Xu, “Measuring Green IT in Society”, IEEE Computer 45(5): 83-85 (2012).

    • Zhiwei Xu, “How Much Power Is Needed for a Billion-Thread High-Throughput Server?”, Frontiers of Computer Science 6(4): 339-346 (2012).


  • Thank you! (谢谢!) We are hiring! (诚邀加盟!)

    Zhiwei Xu (徐志伟)
    Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)
    http://novel.ict.ac.cn/zxu/
    [email protected]
    www.cambricon.com
    HR: [email protected]

    Cambricon (中科寒武纪)