In-Memory Data Parallel Processor
Daichi Fujiki, Scott Mahlke, Reetuparna Das
M-Bits Research Group
Data-Parallel Applications: CPU vs. GPU
• CPU: many-core, SIMD, out-of-order (OoO)
• GPU: many-thread, SIMT, SIMD
[Figure: cost of arithmetic vs. data communication; annotated 1000x and 40x]
"Data movement is what matters, not arithmetic" – Bill Dally
In-memory computing exposes parallelism while minimizing data-movement cost.
In-Memory Computing Reduces Data Movement
CPU / GPU / In-Memory:
• In-situ computing, massive parallelism
• SIMD slots over dense memory arrays
• High bandwidth, low data movement
In-Memory Computing: In-Situ Computing, Massive Parallelism
[Figure: crossbar computation primitives]
(a) Addition: both rows driven at Vdd/2; per-cell currents I11 = (Vdd/2)·C11 and I21 = (Vdd/2)·C21 sum on the bitline, I1 = (Vdd/2)·(C11 + C21)
(b) Dot-product: I11 = V1·C11, I12 = V1·C12, I21 = V2·C21, I22 = V2·C22; column currents I1 = I11 + I21 and I2 = I12 + I22
(c) Element-wise multiplication: I11 = (Vdd − V1)·C11, I12 = (Vdd − V2)·C12
(d) Subtraction: complementary drive negates the second row's current, I1 = (Vdd/2)·(C11 − C21)
In-Memory Computing Exposes Parallelism

                       CPU (2 sockets)      GPU                ReRAM
                       Intel Xeon E5-2597   NVIDIA TITAN Xp    (scaled from ISAAC*)
Area (mm^2)            912.24               471                494
TDP (W)                290                  250                416
On-chip memory (MB)    78.96                9.14               8,590
SIMD slots             448                  3,840              2,097,152
Freq (GHz)             3.6                  1.585              0.02
SIMD x Freq product    3,227                6,086              41,953
In-Memory Computing Today
ReRAM dot-product accelerators:
• PRIME [Chi 2016, ISCA]
• ISAAC [Shafiee 2016, ISCA]
• Dot-Product Engine [Hu 2016, DAC]
• PipeLayer [Song 2017, HPCA]
[Figure: crossbar multiplication + summation; I1 = I11 + I21, I2 = I12 + I22]
However, there has been no demonstration of general-purpose in-memory computing:
• No established programming model / execution model
• Limited computation primitives
How do we program it?
In-Memory Data Parallel Processor: Overview
• HW: Microarchitecture, ISA
• SW: Execution Model, Programming Model, Compiler
Processor Architecture (HW/SW): ISA, Execution Model, Programming Model, Compiler
• Memory ISA: ADD, DOT, MUL, SUB, MOV, MOVS, MOVI, MOVG, SHIFT{L/R}, MASK, LUT, REDUCE_SUM
• IMP compiler: decomposes a data-flow graph into Modules (DLP) and Instruction Blocks IB1, IB2, ... (ILP)
• Computation primitives: crossbar cells with conductances CA, CB (storing operands A, B) driven through DACs
Computation Primitives
Information is stored in analog form: a cell's conductance C (= 1/resistance) encodes its value, so a value A is written as conductance CA and read back electrically.
(a) Addition: Ohm's law multiplies, IA = (Vdd/2)·CA; Kirchhoff's current law adds, I = IA + IB
(b) Subtraction*: driving the second row with a complementary voltage negates its current, giving I = (Vdd/2)·(CA − CB)
*New primitive
Computation Primitives (continued)
(c) Dot-product: applying voltages VX, VY to rows with conductances [[CA, CC], [CB, CD]] gives per-cell currents IAX = VX·CA, IBY = VY·CB, ICX = VX·CC, IDY = VY·CD, and column currents I1 = IAX + IBY, I2 = ICX + IDY. In matrix form: (X Y) · [[A, C], [B, D]] = (AX + BY, CX + DY).
(d) Element-wise multiplication*: the multiplier is applied differentially as Vdd and Vdd − V, so each column sees only its own product: I11 = (Vdd − V1)·C11, I12 = (Vdd − V2)·C12. In vector form: (X Y) ⊙ (A C) = (AX, CY).
*New primitive
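The arithmetic behavior of these primitives can be checked with a small numerical model. This is only a sketch of the math, not of the devices; the function names and the Vdd normalization are mine, not from the talk.

```python
def dot_product(V, C):
    """Analog dot product on crossbar columns (numerical sketch).
    Each cell conducts I = V * C (Ohm's law); the currents on each
    shared bitline sum (Kirchhoff's current law):
    I_j = sum_i V[i] * C[i][j]."""
    cols = len(C[0])
    return [sum(v * row[j] for v, row in zip(V, C)) for j in range(cols)]

def add(ca, cb, vdd=1.0):
    """Addition: both rows driven at Vdd/2, so I = (Vdd/2)*(ca + cb)."""
    return (vdd / 2) * (ca + cb)

def sub(ca, cb, vdd=1.0):
    """Subtraction (the new primitive): the second row is driven with a
    complementary voltage, effectively negating its current, so
    I = (Vdd/2)*(ca - cb)."""
    return (vdd / 2) * (ca - cb)
```

Note how the multiply comes for free from Ohm's law and the accumulate from Kirchhoff's law, which is why dot products map so naturally onto a crossbar.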
Microarchitecture
• Chip: a grid of clusters connected by routers; each cluster holds several ReRAM processing units (PUs), a register file, and a LUT.
• Processing unit: RRAM crossbar (XB), sample-and-hold (S+H), DACs, ADCs, shift-and-add (S+A) unit, row decoder, and registers.
Microarchitecture Parameters
Array size           128 x 128
R/W latency          50 ns
Multi-level cell     2 bits/cell
ADC resolution       5 bits
ADC frequency        1.2 GSps
DAC resolution       2 bits
LUT size             256 x 8
Organization: 8 PUs per array; 128 4-byte registers per PU (a 512 B x 8 register file), with DACs feeding the rows, sample-and-hold on the columns, and shared ALUs.
Memory ISA

Opcode       Format                     Cycles     Category
ADD          <MASK> <DST>               3          In-situ computation
DOT          <MASK> <REG_MASK> <DST>    18         In-situ computation
MUL          <SRC> <SRC> <DST>          18         In-situ computation
SUB          <SRC> <SRC> <DST>          3          In-situ computation
MOV          <SRC> <DST>                3          Moves (R/W)
MOVS         <SRC> <DST> <MASK>         3          Moves (R/W)
MOVI         <SRC> <IMM>                1          Moves (R/W)
MOVG         <GADDR> <GADDR>            Variable   Moves (R/W)
SHIFT{L/R}   <SRC> <SRC> <IMM>          3          Misc
MASK         <SRC> <SRC> <IMM>          3          Misc
LUT          <SRC> <SRC>                4          Misc
REDUCE_SUM   <SRC> <GADDR>              Variable   Misc
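To illustrate how a few of these opcodes compose, here is a toy interpreter for a tiny subset of the ISA. The register names, operand encoding, and vector width are hypothetical; the real instructions operate on ReRAM crossbar rows, not Python lists.

```python
def run(program, regs):
    """Toy interpreter for a subset of the memory ISA (illustrative only).
    Registers are vectors: one lane per bitline/column."""
    for op, *args in program:
        if op == "MOVI":             # MOVI <SRC> <IMM>: broadcast an immediate
            reg, imm = args
            regs[reg] = [imm] * len(regs[reg])
        elif op == "ADD":            # ADD <MASK> <DST>: multi-operand add --
            mask, dst = args         # sum every row selected by MASK into DST
            regs[dst] = [sum(lanes) for lanes in zip(*(regs[r] for r in mask))]
        elif op == "MOV":            # MOV <SRC> <DST>: copy a register
            src, dst = args
            regs[dst] = list(regs[src])
        else:
            raise NotImplementedError(op)
    return regs
```

The multi-operand ADD is the detail worth noticing: a single instruction can sum an arbitrary set of masked rows, which the compiler later exploits in node merging.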
Programming Model
We need a programming language that merges the concepts of data-flow and SIMD in order to maximize parallelism.
Key observation:
• Data-flow: explicit data flow exposes instruction-level parallelism (ILP)
• SIMD: exposes data-level parallelism (DLP)
• Side-effect free: no dependence on shared-memory primitives
Execution Model
A program is a data-flow graph (DFG) over its inputs (e.g., input matrices A and B).
• DLP: the DFG is unrolled along its innermost dimension, decomposing it into many identical Modules (modularized execution flow applied to the innermost dimension).
• ILP: each Module is split into Instruction Blocks (IB1, IB2, ...). An IB is a partial execution sequence of a Module and is mapped to a single ReRAM array.
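The two-level decomposition above can be sketched as a few lines of bookkeeping. This is an illustrative model only; the parameter names and the fixed IBs-per-Module count are my assumptions, not the paper's.

```python
def decompose(n_elements, lanes_per_array, ibs_per_module=2):
    """Sketch of the execution-model hierarchy: unroll the innermost
    dimension into identical Modules (data-level parallelism), then
    split each Module into Instruction Blocks (IBs), each of which is
    mapped to a single ReRAM array (instruction-level parallelism)."""
    n_modules = -(-n_elements // lanes_per_array)  # ceil division
    return [
        {"module": m, "ibs": [f"IB{m}.{i}" for i in range(ibs_per_module)]}
        for m in range(n_modules)
    ]
```

For example, a 300-element innermost dimension on 128-lane arrays decomposes into three Modules, each carrying its own IB1/IB2 pair.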
Compilation Flow
Frontends in Python, C++, or Java emit a TensorFlow DFG (Protocol Buffer), which is the input to the IMP compiler.
IMP compiler stages:
• Semantic Analysis
• Optimization: Node Merging, IB Expansion, Pipelining
• Backend: Instruction Lowering, IB Scheduling, CodeGen (driven by a target machine model)
Optimization 1: Node Merging
Example: two Placeholder inputs feed an Add node whose result goes straight to a Reduce node; the pair is merged into a single Add+Reduce node.
• Exploits multi-operand ADD/SUB
• Reduces redundant writebacks of intermediates
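A minimal version of this pass can be written directly over a dictionary-shaped DFG. This is a sketch of the idea under my own graph representation, not the compiler's actual IR.

```python
def merge_nodes(dfg):
    """Toy node-merging pass: fuse an 'Add' whose sole consumer is a
    'Reduce' into one 'Add+Reduce' node, so the partial sum never has
    to be written back between the two operations.
    dfg: dict name -> {"op": str, "inputs": [name, ...]}."""
    consumers = {}
    for name, node in dfg.items():
        for src in node["inputs"]:
            consumers.setdefault(src, []).append(name)
    merged = dict(dfg)
    for name, node in dfg.items():
        if node["op"] == "Reduce" and len(node["inputs"]) == 1:
            src = node["inputs"][0]
            if merged.get(src, {}).get("op") == "Add" and consumers[src] == [name]:
                merged[name] = {"op": "Add+Reduce", "inputs": merged[src]["inputs"]}
                del merged[src]
    return merged
```

The single-consumer check matters: if the Add's output were also used elsewhere, eliminating its writeback would be unsound.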
Optimization 2: IB Expansion
Example: a wide Add is unpacked into several narrower Adds that run in parallel and are then repacked.
• Exposes more parallelism within a Module to the architecture.
Compiler Backend: Instruction Lowering
Instruction lowering transforms high-level TensorFlow instructions into the memory ISA.
Supported TF operation nodes: Add, Sub, Mul, Div, Sqrt, Exp, Sum, Conv2D, Less, ...
Example: the high-level Div node (q = a/b) lowers to a Newton-Raphson/Maclaurin sequence of multiplies and adds seeded from the LUT:
1. x0 = LUT(b)   (reciprocal seed)
2. q0 = a·x0
3. e0 = 1 − b·x0
4. q1 = q0 + e0·q0
5. e1 = e0²
6. q2 = q1 + e1·q1
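The sequence converges quadratically: each step squares the residual e. Here is a numeric sketch of it; the LUT granularity and iteration count are illustrative choices of mine, not the compiler's.

```python
def divide(a, b, lut_bits=4, iters=2):
    """Division lowered to multiplies/adds plus a LUT reciprocal seed,
    following the sequence on the slide:
        x0 = LUT(b); q0 = a*x0; e0 = 1 - b*x0
        q_{k+1} = q_k + e_k * q_k;  e_{k+1} = e_k ** 2
    """
    # Coarse LUT seed: quantize b to lut_bits of fraction and invert.
    x0 = (1 << lut_bits) / round(b * (1 << lut_bits))   # ~ 1/b
    q = a * x0
    e = 1.0 - b * x0
    for _ in range(iters):
        q = q + e * q
        e = e * e
    return q
```

With a seed accurate to a few bits, two iterations already push the relative error to roughly the seed error raised to the fourth power, which is why a small 256-entry LUT suffices.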
Compiler Backend: IB Scheduling
IB scheduling maps the DFG onto a target number of IBs.
• Target # of IBs = 1: everything serializes into a single IB1, so execution time is large.
• Target # of IBs = 2: the quality of the partition matters. A good split of the DFG into IB1 and IB2 shortens execution; a bad split keeps execution time large and adds network delay between the IBs.
Bottom-Up Greedy [Ellis 1986]
1. Collect candidate assignments.
2. Make final assignments, minimizing data-transfer latency by taking both operand and successor locations into consideration.
Example 1: IB1 is chosen because it is closer to the operand locations.
Example 2: IB2 is chosen because it has earlier slots available.
Example 3: IB1 is chosen because it gives better overlap of communication and computation.
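The first two criteria can be captured in a simplified greedy placer in the spirit of Bottom-Up Greedy. This is a sketch under my own assumptions (fixed network delay, one IB per placement step), not Ellis's full algorithm or the paper's scheduler.

```python
def schedule_ibs(ibs, arrays, transfer_cost=2):
    """Greedy IB placement: each IB goes to the array that lets it start
    earliest, accounting for both when the array is free and a fixed
    network delay for operands living on another array.
    ibs: list of (name, duration, operand_names) in dependence order."""
    free_at = {a: 0 for a in arrays}   # when each array next becomes idle
    placed, done_at = {}, {}           # ib -> array, ib -> finish time
    for name, duration, operands in ibs:
        best_start, best_array = None, None
        for a in arrays:
            # Operand arrival = producer finish + network delay if remote.
            ready = max(
                (done_at.get(o, 0)
                 + (transfer_cost if placed.get(o, a) != a else 0)
                 for o in operands),
                default=0,
            )
            start = max(free_at[a], ready)
            if best_start is None or start < best_start:
                best_start, best_array = start, a
        placed[name] = best_array
        done_at[name] = best_start + duration
        free_at[best_array] = done_at[name]
    return placed, done_at
```

Running it on two independent producer IBs and one consumer reproduces the behavior in the examples: independent IBs spread across arrays, and the consumer lands next to its operand to avoid the transfer.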
Evaluation Methodology
Benchmarks:
• PARSEC 3.0: Blackscholes, Canneal, Fluidanimate
• Rodinia: Backprop, Hotspot, Kmeans, Streamcluster

                          CPU (2 sockets)          GPU (1 card)           IMP
Processor                 Intel Xeon E5-2597 v3,   NVIDIA Titan Xp,       20 MHz ReRAM,
                          3.6 GHz, 28 cores,       1.6 GHz, 3840 CUDA     4096 tiles,
                          56 threads               cores                  64 ReRAM PUs/tile
On-chip memory            78.96 MB                 9.14 MB                8,590 MB
Off-chip memory           64 GB DRAM               12 GB DRAM             —
Profiler/simulator        Intel VTune Amplifier    NVPROF                 Cycle-accurate simulator
(performance)                                                             (BookSim integrated)
Profiler/simulator        Intel RAPL interface     NVIDIA System          Trace-based simulation
(power)                                            Management Interface
Offloaded Kernel / Application Speedup over CPU
• The capacity limitation of IMP sets the upper bound on the achievable performance improvement.
[Charts: normalized execution time per benchmark; offloaded-kernel speedup (log scale) annotated 41x; application speedup annotated 7.5x]
Kernel Speedup over GPU
[Chart: kernel speedup (log scale) annotated 763x]
• GPU benchmarks are able to exploit higher DLP, dot-product operations, and multi-row addition.
Summary
Contributions: an in-memory computing stack for general-purpose programming
• Used TensorFlow as the programming frontend
• Developed a compiler for in-memory computing on ReRAM
• Developed the ISA and computation primitives
(HW: microarchitecture, ISA; SW: execution model, programming model, compiler)
Results: 763x speedup and 440x energy efficiency over a server-class GPGPU

In-Memory Data Parallel Processor
Daichi Fujiki, Scott Mahlke, Reetuparna Das
M-Bits Research Group
Thank you!