In-Memory Data Parallel Processor
Daichi Fujiki, Scott Mahlke, Reetuparna Das
M-Bits Research Group
Data-Parallel Applications: CPU vs. GPU
• CPU: many-core, SIMD, out-of-order (OoO)
• GPU: many-thread, SIMT, SIMD
[Figure: cost of arithmetic vs. data communication; annotated 1000x and 40x]
"Data movement is what matters, not arithmetic" – Bill Dally
In-memory computing exposes parallelism while minimizing data-movement cost.
In-Memory Computing Reduces Data Movement
CPU / GPU / In-Memory:
• In-situ computing, massive parallelism
• SIMD slots over dense memory arrays
• High bandwidth, low data movement
In-Memory Computing: In-Situ Computing, Massive Parallelism
[Figure: crossbar computation primitives]
(a) Addition: both rows driven at Vdd/2; per-cell currents I11 = (Vdd/2)·C11 and I21 = (Vdd/2)·C21 sum on the bitline, I1 = (Vdd/2)·(C11 + C21)
(b) Dot-product: I11 = V1·C11, I12 = V1·C12, I21 = V2·C21, I22 = V2·C22; column currents I1 = I11 + I21 and I2 = I12 + I22
(c) Element-wise multiplication: I11 = (Vdd − V1)·C11, I12 = (Vdd − V2)·C12
(d) Subtraction: complementary drive negates the second row's current, I1 = (Vdd/2)·(C11 − C21)
In-Memory Computing Exposes Parallelism

                       CPU (2 sockets)      GPU                ReRAM
                       Intel Xeon E5-2597   NVIDIA TITAN Xp    (scaled from ISAAC*)
Area (mm^2)            912.24               471                494
TDP (W)                290                  250                416
On-chip memory (MB)    78.96                9.14               8,590
SIMD slots             448                  3,840              2,097,152
Freq (GHz)             3.6                  1.585              0.02
SIMD x Freq product    3,227                6,086              41,953
In-Memory Computing Today
ReRAM dot-product accelerators:
• PRIME [Chi 2016, ISCA]
• ISAAC [Shafiee 2016, ISCA]
• Dot-Product Engine [Hu 2016, DAC]
• PipeLayer [Song 2017, HPCA]
[Figure: crossbar multiplication + summation; I1 = I11 + I21, I2 = I12 + I22]
However, there has been no demonstration of general-purpose in-memory computing:
• No established programming model / execution model
• Limited computation primitives
How do we program it?
In-Memory Data Parallel Processor: Overview
• HW: Microarchitecture, ISA
• SW: Execution Model, Programming Model, Compiler
Processor Architecture (HW/SW): ISA, Execution Model, Programming Model, Compiler
• Memory ISA: ADD, DOT, MUL, SUB, MOV, MOVS, MOVI, MOVG, SHIFT{L/R}, MASK, LUT, REDUCE_SUM
• IMP compiler: decomposes a data-flow graph into Modules (DLP) and Instruction Blocks IB1, IB2, ... (ILP)
• Computation primitives: crossbar cells with conductances CA, CB (storing operands A, B) driven through DACs
Computation Primitives
Information is stored in analog form: a cell's conductance C (= 1/resistance) encodes its value, so a value A is written as conductance CA and read back electrically.
(a) Addition: Ohm's law multiplies, IA = (Vdd/2)·CA; Kirchhoff's current law adds, I = IA + IB
(b) Subtraction*: driving the second row with a complementary voltage negates its current, giving I = (Vdd/2)·(CA − CB)
*New primitive
Computation Primitives (continued)
(c) Dot-product: applying voltages VX, VY to rows with conductances [[CA, CC], [CB, CD]] gives per-cell currents IAX = VX·CA, IBY = VY·CB, ICX = VX·CC, IDY = VY·CD, and column currents I1 = IAX + IBY, I2 = ICX + IDY. In matrix form: (X Y) · [[A, C], [B, D]] = (AX + BY, CX + DY).
(d) Element-wise multiplication*: the multiplier is applied differentially as Vdd and Vdd − V, so each column sees only its own product: I11 = (Vdd − V1)·C11, I12 = (Vdd − V2)·C12. In vector form: (X Y) ⊙ (A C) = (AX, CY).
*New primitive
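The arithmetic behavior of these primitives can be checked with a small numerical model. This is only a sketch of the math, not of the devices; the function names and the Vdd normalization are mine, not from the talk.

```python
def dot_product(V, C):
    """Analog dot product on crossbar columns (numerical sketch).
    Each cell conducts I = V * C (Ohm's law); the currents on each
    shared bitline sum (Kirchhoff's current law):
    I_j = sum_i V[i] * C[i][j]."""
    cols = len(C[0])
    return [sum(v * row[j] for v, row in zip(V, C)) for j in range(cols)]

def add(ca, cb, vdd=1.0):
    """Addition: both rows driven at Vdd/2, so I = (Vdd/2)*(ca + cb)."""
    return (vdd / 2) * (ca + cb)

def sub(ca, cb, vdd=1.0):
    """Subtraction (the new primitive): the second row is driven with a
    complementary voltage, effectively negating its current, so
    I = (Vdd/2)*(ca - cb)."""
    return (vdd / 2) * (ca - cb)
```

Note how the multiply comes for free from Ohm's law and the accumulate from Kirchhoff's law, which is why dot products map so naturally onto a crossbar.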
Microarchitecture
• Chip: a grid of clusters connected by routers; each cluster holds several ReRAM processing units (PUs), a register file, and a LUT.
• Processing unit: RRAM crossbar (XB), sample-and-hold (S+H), DACs, ADCs, shift-and-add (S+A) unit, row decoder, and registers.
Microarchitecture Parameters
Array size           128 x 128
R/W latency          50 ns
Multi-level cell     2 bits/cell
ADC resolution       5 bits
ADC frequency        1.2 GSps
DAC resolution       2 bits
LUT size             256 x 8
Organization: 8 PUs per array; 128 4-byte registers per PU (a 512 B x 8 register file), with DACs feeding the rows, sample-and-hold on the columns, and shared ALUs.
Memory ISA

Opcode       Format                     Cycles     Category
ADD          <MASK> <DST>               3          In-situ computation
DOT          <MASK> <REG_MASK> <DST>    18         In-situ computation
MUL          <SRC> <SRC> <DST>          18         In-situ computation
SUB          <SRC> <SRC> <DST>          3          In-situ computation
MOV          <SRC> <DST>                3          Moves (R/W)
MOVS         <SRC> <DST> <MASK>         3          Moves (R/W)
MOVI         <SRC> <IMM>                1          Moves (R/W)
MOVG         <GADDR> <GADDR>            Variable   Moves (R/W)
SHIFT{L/R}   <SRC> <SRC> <IMM>          3          Misc
MASK         <SRC> <SRC> <IMM>          3          Misc
LUT          <SRC> <SRC>                4          Misc
REDUCE_SUM   <SRC> <GADDR>              Variable   Misc
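To illustrate how a few of these opcodes compose, here is a toy interpreter for a tiny subset of the ISA. The register names, operand encoding, and vector width are hypothetical; the real instructions operate on ReRAM crossbar rows, not Python lists.

```python
def run(program, regs):
    """Toy interpreter for a subset of the memory ISA (illustrative only).
    Registers are vectors: one lane per bitline/column."""
    for op, *args in program:
        if op == "MOVI":             # MOVI <SRC> <IMM>: broadcast an immediate
            reg, imm = args
            regs[reg] = [imm] * len(regs[reg])
        elif op == "ADD":            # ADD <MASK> <DST>: multi-operand add --
            mask, dst = args         # sum every row selected by MASK into DST
            regs[dst] = [sum(lanes) for lanes in zip(*(regs[r] for r in mask))]
        elif op == "MOV":            # MOV <SRC> <DST>: copy a register
            src, dst = args
            regs[dst] = list(regs[src])
        else:
            raise NotImplementedError(op)
    return regs
```

The multi-operand ADD is the detail worth noticing: a single instruction can sum an arbitrary set of masked rows, which the compiler later exploits in node merging.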
Programming Model
We need a programming language that merges the concepts of data-flow and SIMD in order to maximize parallelism.
Key observation:
• Data-flow: explicit data flow exposes instruction-level parallelism (ILP)
• SIMD: exposes data-level parallelism (DLP)
• Side-effect free: no dependence on shared-memory primitives
Execution Model
A program is a data-flow graph (DFG) over its inputs (e.g., input matrices A and B).
• DLP: the DFG is unrolled along its innermost dimension, decomposing it into many identical Modules (modularized execution flow applied to the innermost dimension).
• ILP: each Module is split into Instruction Blocks (IB1, IB2, ...). An IB is a partial execution sequence of a Module and is mapped to a single ReRAM array.
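The two-level decomposition above can be sketched as a few lines of bookkeeping. This is an illustrative model only; the parameter names and the fixed IBs-per-Module count are my assumptions, not the paper's.

```python
def decompose(n_elements, lanes_per_array, ibs_per_module=2):
    """Sketch of the execution-model hierarchy: unroll the innermost
    dimension into identical Modules (data-level parallelism), then
    split each Module into Instruction Blocks (IBs), each of which is
    mapped to a single ReRAM array (instruction-level parallelism)."""
    n_modules = -(-n_elements // lanes_per_array)  # ceil division
    return [
        {"module": m, "ibs": [f"IB{m}.{i}" for i in range(ibs_per_module)]}
        for m in range(n_modules)
    ]
```

For example, a 300-element innermost dimension on 128-lane arrays decomposes into three Modules, each carrying its own IB1/IB2 pair.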
Compilation Flow
Frontends in Python, C++, or Java emit a TensorFlow DFG (Protocol Buffer), which is the input to the IMP compiler.
IMP compiler stages:
• Semantic Analysis
• Optimization: Node Merging, IB Expansion, Pipelining
• Backend: Instruction Lowering, IB Scheduling, CodeGen (driven by a target machine model)
Optimization 1: Node Merging
Example: two Placeholder inputs feed an Add node whose result goes straight to a Reduce node; the pair is merged into a single Add+Reduce node.
• Exploits multi-operand ADD/SUB
• Reduces redundant writebacks of intermediates
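A minimal version of this pass can be written directly over a dictionary-shaped DFG. This is a sketch of the idea under my own graph representation, not the compiler's actual IR.

```python
def merge_nodes(dfg):
    """Toy node-merging pass: fuse an 'Add' whose sole consumer is a
    'Reduce' into one 'Add+Reduce' node, so the partial sum never has
    to be written back between the two operations.
    dfg: dict name -> {"op": str, "inputs": [name, ...]}."""
    consumers = {}
    for name, node in dfg.items():
        for src in node["inputs"]:
            consumers.setdefault(src, []).append(name)
    merged = dict(dfg)
    for name, node in dfg.items():
        if node["op"] == "Reduce" and len(node["inputs"]) == 1:
            src = node["inputs"][0]
            if merged.get(src, {}).get("op") == "Add" and consumers[src] == [name]:
                merged[name] = {"op": "Add+Reduce", "inputs": merged[src]["inputs"]}
                del merged[src]
    return merged
```

The single-consumer check matters: if the Add's output were also used elsewhere, eliminating its writeback would be unsound.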
Optimization 2: IB Expansion
Example: a wide Add is unpacked into several narrower Adds that run in parallel and are then repacked.
• Exposes more parallelism within a Module to the architecture.
Compiler Backend: Instruction Lowering
Instruction lowering transforms high-level TensorFlow instructions into the memory ISA.
Supported TF operation nodes: Add, Sub, Mul, Div, Sqrt, Exp, Sum, Conv2D, Less, ...
Example: the high-level Div node (q = a/b) lowers to a Newton-Raphson/Maclaurin sequence of multiplies and adds seeded from the LUT:
1. x0 = LUT(b)   (reciprocal seed)
2. q0 = a·x0
3. e0 = 1 − b·x0
4. q1 = q0 + e0·q0
5. e1 = e0²
6. q2 = q1 + e1·q1
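The sequence converges quadratically: each step squares the residual e. Here is a numeric sketch of it; the LUT granularity and iteration count are illustrative choices of mine, not the compiler's.

```python
def divide(a, b, lut_bits=4, iters=2):
    """Division lowered to multiplies/adds plus a LUT reciprocal seed,
    following the sequence on the slide:
        x0 = LUT(b); q0 = a*x0; e0 = 1 - b*x0
        q_{k+1} = q_k + e_k * q_k;  e_{k+1} = e_k ** 2
    """
    # Coarse LUT seed: quantize b to lut_bits of fraction and invert.
    x0 = (1 << lut_bits) / round(b * (1 << lut_bits))   # ~ 1/b
    q = a * x0
    e = 1.0 - b * x0
    for _ in range(iters):
        q = q + e * q
        e = e * e
    return q
```

With a seed accurate to a few bits, two iterations already push the relative error to roughly the seed error raised to the fourth power, which is why a small 256-entry LUT suffices.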
Compiler Backend: IB Scheduling
IB scheduling maps the DFG onto a target number of IBs.
• Target # of IBs = 1: everything serializes into a single IB1, so execution time is large.
• Target # of IBs = 2: the quality of the partition matters. A good split of the DFG into IB1 and IB2 shortens execution; a bad split keeps execution time large and adds network delay between the IBs.
Bottom-Up Greedy [Ellis 1986]
1. Collect candidate assignments.
2. Make final assignments, minimizing data-transfer latency by taking both operand and successor locations into consideration.
Example 1: IB1 is chosen because it is closer to the operand locations.
Example 2: IB2 is chosen because it has earlier slots available.
Example 3: IB1 is chosen because it gives better overlap of communication and computation.
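The first two criteria can be captured in a simplified greedy placer in the spirit of Bottom-Up Greedy. This is a sketch under my own assumptions (fixed network delay, one IB per placement step), not Ellis's full algorithm or the paper's scheduler.

```python
def schedule_ibs(ibs, arrays, transfer_cost=2):
    """Greedy IB placement: each IB goes to the array that lets it start
    earliest, accounting for both when the array is free and a fixed
    network delay for operands living on another array.
    ibs: list of (name, duration, operand_names) in dependence order."""
    free_at = {a: 0 for a in arrays}   # when each array next becomes idle
    placed, done_at = {}, {}           # ib -> array, ib -> finish time
    for name, duration, operands in ibs:
        best_start, best_array = None, None
        for a in arrays:
            # Operand arrival = producer finish + network delay if remote.
            ready = max(
                (done_at.get(o, 0)
                 + (transfer_cost if placed.get(o, a) != a else 0)
                 for o in operands),
                default=0,
            )
            start = max(free_at[a], ready)
            if best_start is None or start < best_start:
                best_start, best_array = start, a
        placed[name] = best_array
        done_at[name] = best_start + duration
        free_at[best_array] = done_at[name]
    return placed, done_at
```

Running it on two independent producer IBs and one consumer reproduces the behavior in the examples: independent IBs spread across arrays, and the consumer lands next to its operand to avoid the transfer.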
Evaluation Methodology
Benchmarks:
• PARSEC 3.0: Blackscholes, Canneal, Fluidanimate
• Rodinia: Backprop, Hotspot, Kmeans, Streamcluster

                          CPU (2 sockets)          GPU (1 card)           IMP
Processor                 Intel Xeon E5-2597 v3,   NVIDIA Titan Xp,       20 MHz ReRAM,
                          3.6 GHz, 28 cores,       1.6 GHz, 3840 CUDA     4096 tiles,
                          56 threads               cores                  64 ReRAM PUs/tile
On-chip memory            78.96 MB                 9.14 MB                8,590 MB
Off-chip memory           64 GB DRAM               12 GB DRAM             —
Profiler/simulator        Intel VTune Amplifier    NVPROF                 Cycle-accurate simulator
(performance)                                                             (BookSim integrated)
Profiler/simulator        Intel RAPL interface     NVIDIA System          Trace-based simulation
(power)                                            Management Interface
Offloaded Kernel / Application Speedup over CPU
• The capacity limitation of IMP sets the upper bound on the achievable performance improvement.
[Charts: normalized execution time per benchmark; offloaded-kernel speedup (log scale) annotated 41x; application speedup annotated 7.5x]
Kernel Speedup over GPU
[Chart: kernel speedup (log scale) annotated 763x]
• GPU benchmarks are able to exploit higher DLP, dot-product operations, and multi-row addition.
Summary
Contributions: an in-memory computing stack for general-purpose programming
• Used TensorFlow as the programming frontend
• Developed a compiler for in-memory computing on ReRAM
• Developed the ISA and computation primitives
(HW: microarchitecture, ISA; SW: execution model, programming model, compiler)
Results: 763x speedup and 440x energy efficiency over a server-class GPGPU

In-Memory Data Parallel Processor
Daichi Fujiki, Scott Mahlke, Reetuparna Das
M-Bits Research Group
Thank you!