Deep Learning Processors: Data Flow & Domain Specific Architecture (Chixiao Chen)


Page 1: Deep Learning Processors Data Flow & Domain Specific

Deep Learning Processors: Data Flow & Domain Specific Architecture

Chixiao Chen

Page 2: Deep Learning Processors Data Flow & Domain Specific

Announcement

• We do not have class on 5/20 or 5/23. The 6/6 class is pending.

• HW 4 is merged with Final Project

• HW4 & the Final Project will be online this weekend.

• Option I: Implement a SIMD instruction to perform a NN (RTL) and propose a new technique (not mentioned in class) to further increase the performance (reference and slides)

• Option II: Implement an advanced instruction to realize the NN

Page 3: Deep Learning Processors Data Flow & Domain Specific

Lecture Last Time

• Vector Processor and SIMD

Page 4: Deep Learning Processors Data Flow & Domain Specific

Lecture Last Time

• Multithread Pipeline Architecture & GPU

Page 5: Deep Learning Processors Data Flow & Domain Specific

Overview

• Deep learning hardware not using instructions

• Reconfigurable hardware
  • FPGA
  • CGRA

• Data flows
  • Systolic arrays
  • Weight/output/row stationary

Page 6: Deep Learning Processors Data Flow & Domain Specific

End of the Instruction-based Processor?

• Moore’s Law, and especially Dennard Scaling, is ending.

• Voltage scaling slowed drastically, asymptotically approaching the threshold voltage. That is why we hit a power/cooling wall.

• Dynamic power: P = Ng · Cload · V² · f (Ng = CMOS gates/unit area, Cload = capacitive load/CMOS gate, f = clock frequency, V = supply voltage)

• In the good old days of Dennard Scaling, V and Cload shrank with each generation, keeping power density constant. Today Dennard Scaling is dead, and energy scaling ended around 2005.

(Data courtesy S. Borkar/Intel, CICC 2011; Moore, ISSCC Keynote, 2003)
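The power identity above can be checked numerically. A minimal sketch, assuming the classic scaling factor s = 0.7 per generation (all variable names here are illustrative, not from the slides):

```python
def dynamic_power(n_gates, c_load, v_dd, freq):
    """Dynamic power density: P = Ng * Cload * V^2 * f."""
    return n_gates * c_load * v_dd ** 2 * freq

s = 0.7  # linear feature-size shrink per generation

# Ideal Dennard scaling: gates/area grow by 1/s^2, Cload and V
# shrink by s (so V^2 shrinks by s^2), frequency grows by 1/s.
p0 = dynamic_power(1.0, 1.0, 1.0, 1.0)
p1 = dynamic_power(1.0 / s**2, s, s, 1.0 / s)  # stays ~constant

# Post-Dennard: V no longer scales, so power density grows 1/s^2.
p_post = dynamic_power(1.0 / s**2, s, 1.0, 1.0 / s)
```

With Dennard scaling the terms cancel exactly (constant power density); with V fixed, power density roughly doubles every generation, which is the power wall.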

Page 7: Deep Learning Processors Data Flow & Domain Specific

Accelerator without Instruction Arch

• Idea: leverage dark silicon to “fight” the utilization wall

• Insights:
  • Power is now more expensive than area
  • Specialized logic can improve energy efficiency by 10–1000x

• Accelerators are the basic specialization method

Page 8: Deep Learning Processors Data Flow & Domain Specific

Reconfigurable Computing Devices: FPGA

Page 9: Deep Learning Processors Data Flow & Domain Specific

Reconfigurable Computing and Architecture

• From traditional reconfigurable architecture and hardware

• The best-known reconfigurable architecture: FPGA
  • Configuration block: bit level
  • Data routing is the major challenge

• We also call this a fine-grained reconfigurable architecture.

Page 10: Deep Learning Processors Data Flow & Domain Specific

Look-up-table based Configurability

• Basic structure of a LUT: SRAM + switches/MUXes

• Mapping of “any” logic: generate the bit table
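The "generate the bit table" step can be sketched in a few lines: the SRAM stores one bit per input combination, and the inputs drive a MUX tree that selects a stored bit. A minimal model (function names are illustrative):

```python
def make_lut(fn, n_inputs):
    """Build the SRAM bit table for an arbitrary n-input Boolean
    function: one stored bit per input combination."""
    return [fn(*[(i >> b) & 1 for b in range(n_inputs)]) & 1
            for i in range(2 ** n_inputs)]

def lut_eval(table, inputs):
    """The MUX tree: the input bits form an address that selects
    one stored bit from the SRAM table."""
    addr = sum(bit << b for b, bit in enumerate(inputs))
    return table[addr]

# Map a 3-input majority function onto a 3-LUT (8-bit table).
maj = make_lut(lambda a, b, c: 1 if a + b + c >= 2 else 0, 3)
```

For example, `lut_eval(maj, [1, 1, 0])` returns 1: "any" logic of n inputs fits, since the table simply enumerates the truth table.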

Page 11: Deep Learning Processors Data Flow & Domain Specific

Cluster-based element

• CLB = Configurable Logic Block

• Locally, a CLB has a crossbar, LUTs, DFFs, and carry in/out logic

Page 12: Deep Learning Processors Data Flow & Domain Specific

Global Fabric

• Interconnection between CLBs:

• CB = connection boxes

• SB = switch boxes

Page 13: Deep Learning Processors Data Flow & Domain Specific

Modern FPGAs have more than CLBs

• CLBs are not efficient for data storage; FPGAs need something like SRAM (known as block RAM, or BRAM, in FPGAs)

• Multiply-and-accumulate is also not efficient in CLBs, hence dedicated DSP blocks

Page 14: Deep Learning Processors Data Flow & Domain Specific

DSP block in FPGA

• DSP48 IP in Xilinx FPGA

• Splitting for Deep Learning

Page 15: Deep Learning Processors Data Flow & Domain Specific

A second thought

• FPGAs use hard IP blocks for efficiency, so when we use an FPGA, we first want to maximize IP utilization

• Question: Is the FPGA really fine-grained reconfigurable hardware now?

Page 16: Deep Learning Processors Data Flow & Domain Specific

Case study – Microsoft Catapult

• Microsoft servers are no longer CPU/GPU only; they are CPU/GPU/FPGA based

• FPGAs are more efficient for domain-specific tasks

• The (unconfigured) FPGA has 3,962 18-bit ALUs and 5 MiB of on-chip memory

• Programmed in Verilog RTL; the shell is 23% of the FPGA

Page 17: Deep Learning Processors Data Flow & Domain Specific

Catapult for Deep Learning

• CNN accelerators can be mapped to multiple FPGAs

Page 18: Deep Learning Processors Data Flow & Domain Specific

Coarse-Grained Reconfigurable Architecture

• Computing blocks (PEs) and switch-based routing

• Less flexible than fine-grained reconfigurable architectures

• But easier to program

• Fast reconfiguration

Page 19: Deep Learning Processors Data Flow & Domain Specific

MorphoSys (Singh et al., 2000)

• Overall architecture

• Inside each reconfigurable cell

Page 20: Deep Learning Processors Data Flow & Domain Specific

In history

• CGRA was once a specific category for high-performance design, but it never succeeded

• But CGRA people never gave up: DARPA's Software Defined Hardware project in 2017

Page 21: Deep Learning Processors Data Flow & Domain Specific

CGRA / FPGA Today

• CGRA as a specific chip category no longer exists

• But most FPGA implementations, and even ASIC designs, use CGRA ideas (e.g., Thinker from Tsinghua University)

Catapult view with PEs

Page 22: Deep Learning Processors Data Flow & Domain Specific

Data Flows: Routing among PE Arrays

Page 23: Deep Learning Processors Data Flow & Domain Specific

Assembly code for MAC
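The slide's point is that naive MAC assembly pays for memory on every operation: two operand loads plus a load and store of the accumulator per multiply-accumulate. A sketch that models this trace (the counting scheme is an assumption about what the assembly on the slide shows):

```python
def naive_mac_dot(a, b):
    """Scalar dot product, modeled as naive load/compute/store
    assembly: every MAC issues 3 loads and 1 store when the
    accumulator lives in memory."""
    acc = 0
    loads = stores = 0
    for i in range(len(a)):
        x = a[i]; loads += 1   # load a[i]
        y = b[i]; loads += 1   # load b[i]
        loads += 1             # load the accumulator
        acc = acc + x * y      # multiply-accumulate
        stores += 1            # store the accumulator back
    return acc, loads, stores

result, n_loads, n_stores = naive_mac_dot([1, 2, 3], [4, 5, 6])
```

Here a 3-element dot product costs 9 loads and 3 stores for only 3 multiplies, which motivates the next slide's question about memory access counts.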

Page 24: Deep Learning Processors Data Flow & Domain Specific

Do we really need so many memory accesses?

• Assume we have two M × M matrices

• Questions: How many scalar multiplications do we need? How much data do we really need to read/write from/to memory?
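The two questions above have a simple closed-form answer, sketched below: a naive M × M multiplication needs M³ scalar multiplies, but only 2M² input reads and M² output writes are fundamentally required, so each input element can be reused M times.

```python
def matmul_counts(m):
    """Operation count vs. minimal memory traffic for a naive
    M x M by M x M matrix multiplication."""
    mults = m ** 3           # one multiply per (i, j, k) triple
    min_reads = 2 * m * m    # each input element read once
    min_writes = m * m       # each output element written once
    return mults, min_reads, min_writes

# For M = 256: ~16.8M multiplies but only ~197K element transfers.
ops, reads, writes = matmul_counts(256)
```

The gap between the compute count and the minimal traffic is exactly what data-flow architectures such as systolic arrays exploit.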

Page 25: Deep Learning Processors Data Flow & Domain Specific

Data Flow Example for Matrix Multiplication with Multiple PEs (CGRA)

Systolic Array
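The systolic-array animation on these slides can be captured as a small cycle-level simulation. A sketch of an output-stationary N × N array, assuming rows of A flow rightward, columns of B flow downward, and inputs are skewed so PE (i, j) sees A[i][k] and B[k][j] together at cycle i + j + k:

```python
def systolic_matmul(A, B):
    """Cycle-level model of an N x N output-stationary systolic
    array: each PE keeps its own C[i][j] and accumulates one
    product per cycle as operands march past."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for cycle in range(3 * n - 2):      # last PE finishes at 3n-3
        for i in range(n):
            for j in range(n):
                k = cycle - i - j       # skewed arrival schedule
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

Each input element is loaded from memory once and then flows through n PEs, giving the n-fold reuse computed on the previous slide without any extra memory reads.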


Page 30: Deep Learning Processors Data Flow & Domain Specific

Case Study: TPU using systolic array

• Google’s DNN ASIC

• 256 x 256 8-bit matrix-multiply unit

• Large software-managed scratchpad

• Coprocessor on the PCIe bus

Page 31: Deep Learning Processors Data Flow & Domain Specific

TPU Operation

Ref: Google TPU paper, ISCA 2017

Page 32: Deep Learning Processors Data Flow & Domain Specific

CNN

More chances for bandwidth saving:

1) Matrix Multiplication

2) Sliding window

Page 33: Deep Learning Processors Data Flow & Domain Specific

High-Dimensional CNN

More chances for bandwidth saving:

3) Filter sharing

4) Feature sharing
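The four reuse opportunities above can be put into rough numbers. A sketch, where the layer shape parameters (E = output map side, K = number of filters, N = batch size) are illustrative assumptions:

```python
def conv_reuse(E, K, N):
    """Rough reuse factors in a conv layer.

    filter_reuse:  each weight is reused at every sliding-window
                   position (E*E) and for every input in the batch (N).
    feature_reuse: each input activation feeds all K filters.
    """
    filter_reuse = E * E * N
    feature_reuse = K
    return filter_reuse, feature_reuse

# E.g. a 13x13 output map, 384 filters, batch of 4:
fr, ar = conv_reuse(13, 384, 4)
```

Every unit of reuse captured on-chip is one fewer off-chip access, which is how these sharing patterns relax the bandwidth requirement discussed next.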

Page 34: Deep Learning Processors Data Flow & Domain Specific

Data Reuse relaxes the bandwidth requirement


Page 36: Deep Learning Processors Data Flow & Domain Specific

Using the CGRA idea, but the routing is application-specific

Data flow

Page 37: Deep Learning Processors Data Flow & Domain Specific

The design methodology uses the memory-hierarchy idea, but needs no coherency

Page 38: Deep Learning Processors Data Flow & Domain Specific

Common data flows for multi PE designs
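Two of the common data flows can be contrasted as loop nests over the same 1-D convolution; which operand stays resident in the PE is the only difference. A sketch (stride 1; function names are illustrative):

```python
def weight_stationary_1d(x, w):
    """Weight stationary: each weight is loaded into a PE once and
    reused across all output positions."""
    n_out = len(x) - len(w) + 1
    y = [0] * n_out
    for k, wk in enumerate(w):        # weight stays put
        for i in range(n_out):        # activations stream by
            y[i] += wk * x[i + k]
    return y

def output_stationary_1d(x, w):
    """Output stationary: each partial sum stays in its PE until
    the output element is complete."""
    n_out = len(x) - len(w) + 1
    return [sum(w[k] * x[i + k] for k in range(len(w)))  # acc stays put
            for i in range(n_out)]
```

Both produce identical results; the choice only changes which operand's memory traffic is minimized, which is the axis along which the common data flows differ.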

Page 39: Deep Learning Processors Data Flow & Domain Specific

Common data flows for multi PE designs

Page 40: Deep Learning Processors Data Flow & Domain Specific

Row Stationary

A specific Data flow for CNNs stride = 1
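The row-stationary idea can be sketched functionally: each PE keeps one filter row and convolves it with one input row (a 1-D convolution), and a column of PEs sums its partial rows to form one output row. A minimal model of that decomposition (stride 1; names are illustrative, not Eyeriss's actual interface):

```python
def conv1d(row_x, row_w):
    """One PE's job in row stationary: 1-D convolution of a
    filter row with an input row, stride 1."""
    n = len(row_x) - len(row_w) + 1
    return [sum(row_w[k] * row_x[i + k] for k in range(len(row_w)))
            for i in range(n)]

def conv2d_row_stationary(X, W):
    """2-D convolution composed from per-row PEs: PE r holds
    filter row r, streams input row (out_row + r), and the PE
    column accumulates the partial output rows."""
    R = len(W)
    n_out = len(X) - R + 1
    Y = []
    for out_row in range(n_out):
        partials = [conv1d(X[out_row + r], W[r]) for r in range(R)]
        Y.append([sum(p[i] for p in partials)
                  for i in range(len(partials[0]))])
    return Y
```

Each filter row stays stationary in its PE across all output rows, while input rows are reused diagonally between neighboring PEs, capturing both filter and feature reuse at once.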


Page 51: Deep Learning Processors Data Flow & Domain Specific

Row Stationary Performance

Case study: Eyeriss from MIT, ISSCC/ISCA 2016

Page 52: Deep Learning Processors Data Flow & Domain Specific

Conclusions: FPGA, CGRA, Data Flow (Systolic Array)