Duality Cache for Data Parallel Acceleration
Daichi Fujiki, Scott Mahlke, Reetuparna Das
University of Michigan, M-Bits Research Group
Duality = Storage + Compute
Why compute in-cache?
Transforming caches into massively parallel vector ALUs

[Figure: an 18-core Xeon processor with a 45 MB LLC. Each 2.5 MB LLC slice (Ways 1-20, CBOX + TMU) is built from 32 kB data banks of 8 kB SRAM arrays. A bitline ALU at the sense amplifiers of each array computes A & B, A ^ B, ~A & ~B, and a full-adder sum S = A ^ B ^ C with a carry latch. With data stored transposed (Bit-Slices 3…0 along a bitline), each bitline pair can add two stored words: A + B.]

18 LLC slices → 360 ways → 5,760 arrays → 1,474,560 ALUs
Transforming caches into massively parallel vector ALUs (cont.)

[Figure: the same slice → bank → 8 kB SRAM array → bitline-ALU breakdown as the previous slide.]

Passive Last Level Cache transformed into ~1 million bit-serial active ALUs:
✓ Add ✓ Multiply ✓ Divide
Bit-serial operation @ 2.5 GHz, fixed point (configurable precision)
45 MB LLC → 18 LLC slices → 360 ways → 5,760 arrays → 1,474,560 ALUs
Neural Cache [ISCA '18]

CPU + SRAM: digital bit-serial operations @ 2.5 GHz, with 1 million bit-line ALUs toggling between cache mode and accelerator mode (35 MB LLC).
Supported operations: ✓ Integer operations
Programming model: ✓ Manually mapped CNN kernels (the CPU drives the neural network)
Duality Cache

CPU + SRAM: digital bit-serial operations @ 2.5 GHz, with 1 million bit-line ALUs toggling between cache mode and accelerator mode (35 MB LLC).
Supported operations: ✓ Integer operations ✓ Floating point operations ✓ Transcendental functions (sin, cos, log, etc.)
Programming model: ✓ SIMT & VLIW execution ✓ CUDA / OpenACC programs, targeting general data-parallel applications
Duality Cache Benefits

Baseline: CPU + SRAM + DDR4. A GPU system adds a GPU + GDDR5 behind PCIe; Duality Cache computes in the CPU's own SRAM.
1. Reduced data movement: ✓ no memcpy; efficient serial/parallel interleaving.
2. Cost: +471 mm², +250 W (GPU) vs. ✓ +30 mm², +6 W ≈ 3.5% (DC).
3. On-chip memory capacity: ✓ ~10× thread capacity; flexible cache allocation.
Outline
Background
Integer / Logical Operations
Floating Point Operations
Transcendental Functions
Execution Model
Programming Model
Compiler
Methodology / Results
Logical Operations In-SRAM

[Figure: an SRAM array with bitlines (BL0/BLB0 … BLn/BLBn), wordlines, and a row decoder. Changes over a standard array: an additional row decoder, and the differential sense amplifiers made reconfigurable into single-ended sense amplifiers compared against Vref. Activating two wordlines A and B simultaneously, the single-ended amplifier on BL senses A AND B, while the one on BLB senses A NOR B.]
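A minimal functional sketch of the bitline behavior above (my own illustration, not the paper's circuit; `in_sram_logic` is a name I chose): with both wordlines raised, BL stays high only if both cells hold 1, and BLB stays high only if both hold 0.

```python
# Sketch (not RTL): sensing two simultaneously activated SRAM rows.
# The single-ended sense amp on BL reads A AND B; the one on BLB
# reads A NOR B, per bitline.

def in_sram_logic(row_a, row_b):
    """Per-bitline result of activating rows A and B together."""
    and_bits = [a & b for a, b in zip(row_a, row_b)]        # sensed on BL
    nor_bits = [1 - (a | b) for a, b in zip(row_a, row_b)]  # sensed on BLB
    return and_bits, nor_bits

and_r, nor_r = in_sram_logic([0, 1, 1, 0], [0, 1, 0, 1])
print(and_r)  # [0, 1, 0, 0]
print(nor_r)  # [1, 0, 0, 0]
```

NOT and the remaining logic ops are composed from these two primitives plus copies.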
Bit-Serial Integer Operations: A + B

[Figure: operands stored transposed. Arrays A and B hold Words 0-3 as bit columns; the row decoders activate one bit-slice of A and one of B per cycle, and each bitline ALU latches a per-word carry while writing the sum bit S back.]

In-memory full-adder recurrence, one bit position per step:
s_i = a_i ⊕ b_i ⊕ c_{i−1}
c_i = a_i · b_i + (a_i ⊕ b_i) · c_{i−1}
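The recurrence above can be sketched as a functional model (my own illustration; `bitserial_add` is an assumed name): one outer iteration per bit position, with every SIMD lane's carry held in its own latch.

```python
# Sketch of bit-serial addition across SIMD lanes: one cycle per bit
# position, computing s_i = a_i ^ b_i ^ c and the carry
# c = a_i & b_i | (a_i ^ b_i) & c in every lane's bitline ALU.

def bitserial_add(lanes_a, lanes_b, nbits):
    """lanes_a / lanes_b: lists of unsigned ints, one per SIMD lane."""
    carry = [0] * len(lanes_a)          # per-lane carry latch
    out = [0] * len(lanes_a)
    for i in range(nbits):              # one bit-slice per cycle
        for lane, (a, b) in enumerate(zip(lanes_a, lanes_b)):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            s = ai ^ bi ^ carry[lane]
            carry[lane] = (ai & bi) | ((ai ^ bi) & carry[lane])
            out[lane] |= s << i
    return out

print(bitserial_add([9, 250, 7], [25, 5, 1], 9))  # [34, 255, 8]
```

The cycle count depends only on the bit width, not on the number of lanes, which is where the massive parallelism comes from.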
Bit-Serial Integer Operations: A × B

[Figure: same transposed layout; an extra tag row (initialized to 0) predicates the add, so the shifted multiplicand is accumulated only in words whose current multiplier bit is 1.]

Multiplication as a series of predicated shift-adds:
pp(i) = a_i × b × 2^i   (partial product, predicated by tag = a_i)
p(1) = pp(1); p(2) = p(1) + pp(2); …; p(n) = p(n−1) + pp(n)
where a_i := i-th bit of bit vector a, and x(i) := bit vector x after the i-th iteration.
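A sketch of the predicated shift-add loop above (my own model; `bitserial_mul` is an assumed name), including the Zero Tag Search shortcut that skips an iteration when no lane's multiplier bit is set:

```python
# Sketch of predicated bit-serial multiplication: iteration i adds the
# shifted multiplicand b << i only in lanes whose tag (multiplier bit
# a_i) is 1, i.e. p(i) = p(i-1) + pp(i) with pp(i) = a_i * b * 2^i.

def bitserial_mul(lanes_a, lanes_b, nbits):
    prod = [0] * len(lanes_a)
    for i in range(nbits):
        tags = [(a >> i) & 1 for a in lanes_a]   # predication tags
        if not any(tags):                        # Zero Tag Search: skip
            continue
        for lane, b in enumerate(lanes_b):
            if tags[lane]:
                prod[lane] += b << i             # add partial product
    return prod

print(bitserial_mul([6, 3, 0], [7, 10, 99], 8))  # [42, 30, 0]
```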
Bit-Serial Floating Point Operations: A + B

[Figure: two SIMD lanes, fields sgn | exp | mnt. Lane 1: (+9.25 × 10²) + (+2.50 × 10¹), |ediff| = 1, so a 1-bit shift denormalizes the smaller operand to +0.25 × 10². Lane 2: (−1.025 × 10⁴) + (+2.50 × 10²), |ediff| = 2, giving +0.025 × 10⁴ after a 2-bit shift.]

① Mantissa denormalization: align the smaller mantissa to the larger exponent.
② Convert mantissas into 2's-complement format (−1.025 × 10⁴ becomes 998.975 × 10⁴ in this decimal analogy).
Problems: up to mnt_bits × #lanes cycles for the shift ops, plus the signed → 2's-complement conversion cost.
Bit-Serial Floating Point Operations: A + B, optimizations

[Figure: the same two lanes, with two changes. (a) Preemptive conversion to 2's-complement format: mantissas are stored in complement form (the msb carries the sign), so no per-add conversion is needed. (b) Unique ediff enumeration using CAM search: uniq ediffs = {1, 2}, so each distinct shift amount is applied once across all matching lanes instead of once per lane.]
Bit-Serial Floating Point Operations: A + B, shift cost

[Figure: 256 SIMD lanes, each with its own |ediff| (1, 2, 2, …, 1). Naively, every lane costs an 8-cycle ediff read + 23-cycle mantissa shift, or 7,936 cycles total. With unique-ediff enumeration (✓ uniq ediffs = {1, 2}), only one 23-cycle mantissa shift per distinct ediff is needed, plus the CAM-search cycles.]
Bit-Serial Floating Point Operations: A + B, completing the add

③ Perform addition: lane 1: 9.25 + 0.25 = 10.0000 (× 10²); lane 2: 998.975 + 0.025 = 999.000 (× 10⁴).
④ Convert back to sign expression: +10.0000 × 10², −1.00000 × 10⁴.
⑤ Normalization: +1.0000 × 10³, −1.00000 × 10⁴.
Bit-Serial Floating Point Operations: A + B, reconversion

Same steps ③-⑤, but the conversion back from 2's complement to signed form (step ④) is compiler-directed: it is performed only where a later consumer needs the signed representation, rather than after every add.
Bit-Serial Floating Point Operations: A + B, fused shift+add

③ Perform shift+add in one pass: s_i = a_i + b_{i+ediff}, i.e. the denormalizing shift is folded into the addition's bit indexing instead of being a separate pass.
④ Partial normalization: lane 1: 10.0000 × 10² → 1.0000 × 10³; the 2's-complement lane (999.000 × 10⁴) is left as-is.
Bit-Serial Floating Point Addition In-Array

[Figure: transposed Arrays A and B, each word split into sgn | exp | mnt fields; the bitline ALUs compute S = A + B.]

1. Convert mantissas into 2's complement.
2. Calculate ediff = exp_a − exp_b; if ediff[i] < 0 then swap(A[i], B[i]).
3. Enumerate unique ediffs using search → uniq_ediff = {1}.
4. ARSHADD: for each uniq_ediff, compute A[i] + (B[i] >> ediff).
5. Normalize exp: if bit_overflow then exp_c = exp_a + 1; mnt_c >>= 1.
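The five-step flow above can be sketched as a small functional model (my own illustration with integer mantissas; `fp_add_lanes` and the (mantissa, exponent) representation are assumptions, not the paper's format):

```python
# Sketch of the in-array FP-add flow: swap so A has the larger exponent,
# enumerate the unique exponent differences (a CAM search in hardware),
# then perform one shift-add (ARSHADD) per distinct ediff across all
# matching lanes. Values are (integer mantissa, exponent) pairs.

def fp_add_lanes(lanes):
    """lanes: list of ((mnt_a, exp_a), (mnt_b, exp_b)) pairs."""
    # Steps 1-2: compute ediff, swapping so ediff >= 0.
    prepped = []
    for (ma, ea), (mb, eb) in lanes:
        if ea < eb:
            (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
        prepped.append((ma, ea, mb, ea - eb))
    # Step 3: unique-ediff enumeration.
    uniq = sorted({e for _, _, _, e in prepped})
    # Step 4: one ARSHADD pass per unique ediff, over all matching lanes.
    out = [None] * len(prepped)
    for ed in uniq:
        for i, (ma, ea, mb, e) in enumerate(prepped):
            if e == ed:
                out[i] = (ma + (mb >> ed), ea)   # A + (B >> ediff)
    return out

# Two lanes share ediff = 1, one has ediff = 3: only 2 shift-add passes.
print(fp_add_lanes([((12, 5), (8, 4)), ((6, 2), (10, 1)), ((5, 7), (16, 4))]))
# [(16, 5), (11, 2), (7, 7)]
```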
Latency Optimizations (Integer, FP)

Leading Zero Search (LZS): search for leading zeros across all SIMD lanes (e.g. 0000101101, 0000111101, …) to skip inner-loop iterations and reduce inner-loop cycles. Multiplication drops from O(n²) to O(n_a × n_b), where n_a and n_b are the significant bit lengths of the operands (001010 × 000101).

Zero Tag Search (ZTS): check whether the tag bits are all zero and, if so, skip the inner-loop iteration entirely. Useful for (constant) integer multiplications.
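A sketch of how LZS shrinks the multiplication loop (my own model; `significant_bits` and `mul_with_lzs` are assumed names):

```python
# Sketch of Leading Zero Search: one search across all lanes finds the
# leading zeros shared by the whole SIMD vector, shrinking the
# multiplication loop from n iterations to n_a significant bits.

def significant_bits(lanes):
    """n minus the leading zeros common to every lane (hardware: LZS)."""
    return max((v.bit_length() for v in lanes), default=0)

def mul_with_lzs(lanes_a, lanes_b, nbits):
    na = significant_bits(lanes_a)       # skip leading-zero iterations
    prod = [0] * len(lanes_a)
    for i in range(na):                  # n_a iterations instead of nbits
        for lane, (a, b) in enumerate(zip(lanes_a, lanes_b)):
            if (a >> i) & 1:
                prod[lane] += b << i
    return prod

print(mul_with_lzs([5, 3], [6, 6], 32))  # [30, 18], in 3 iterations not 32
```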
Optimizations with CAM Search

[Figure: a 2-cycle in-SRAM CAM search over transposed data. Cycle 1: activate the rows where the search string holds a 1 and reduce with AND. Cycle 2: activate the rows where it holds a 0 and reduce with NOR. A lane's tag is set only if both cycles match.]

Ediff enumeration:
1. Perform leading zero search to limit the search space for CAM search (e.g. ∀ediff < 8 ⇒ 4-bit CAM search).
2. Perform CAM search over the ediff vector for values 0 ≤ i ≤ 2ⁿ.
3. On any hit (wired-OR of the tags), perform ARSHADD.
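The 2-cycle search can be modeled functionally (my own sketch; `cam_search` is an assumed name): a lane matches the key only if its 1-columns AND-reduce true and its 0-columns NOR-reduce true.

```python
# Sketch of the 2-cycle in-SRAM CAM search on transposed data.
# Cycle 1: AND over the bit columns where the key has a 1.
# Cycle 2: NOR over the columns where the key has a 0.
# A lane's tag is set only if both cycles pass.

def cam_search(lanes, key, nbits):
    tags = []
    for v in lanes:
        ones_ok = all((v >> i) & 1 for i in range(nbits) if (key >> i) & 1)
        zeros_ok = not any((v >> i) & 1
                           for i in range(nbits) if not (key >> i) & 1)
        tags.append(int(ones_ok and zeros_ok))
    return tags                  # wired-OR of tags signals "any hit"

ediffs = [1, 2, 1, 5]
print(cam_search(ediffs, 1, 4))  # [1, 0, 1, 0] -> lanes with ediff == 1
print(cam_search(ediffs, 2, 4))  # [0, 1, 0, 0]
```

Searching each candidate ediff this way yields the unique-ediff set and the per-lane predication tags for ARSHADD in one pass.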
Supported In-Cache Operations

Operation     Type         Algorithm        Latency  Optimizations
add, sub      uint, int    Bit-serial [2]   O(n)     -
mul           uint, int    Bit-serial [2]   O(n²)    LZS (multiplicand), ZTS
div, rem      uint, int    Bit-serial [2]   O(n²)    LZS (dividend, divisor), ZTS, ZRS
and, or, xor  uint         [1]              O(n)     -
shl, shr      uint, int    Bit-serial       O(n²)    LZS, CAM search
add, sub      float        Bit-serial       O(n²)    LZS, CAM search
mul           float        Bit-serial       O(n²)    ZTS
div           float        Bit-serial       O(n²)    ZTS, ZRS
sin, cos      fixed point  CORDIC           O(nk)    Parallel CORDIC
exp           fixed point  CORDIC           O(nk)    Parallel CORDIC
log           fixed point  CORDIC           O(nk)    Parallel CORDIC
sqrt          fixed point  CORDIC           O(nk)    Parallel CORDIC
rsqrt         float        Fast Inv Sqrt    O(n²)    -

[1] Compute Caches (HPCA '17); [2] Neural Cache (ISCA '18)
n: data bit length; k: CORDIC iteration count
LZS: Leading Zero Search; ZTS: Zero Tag Search; ZRS: Zero Residue Search
Transcendental Functions Using CORDIC

Express an angle θ (0 ≤ θ < π/2) using add/sub of a fixed series: θ = Σᵢ ±αᵢ.

Plain vector rotation requires multiplications with tan(αᵢ):
[xᵢ, yᵢ]ᵀ = [cos αᵢ, −sin αᵢ; sin αᵢ, cos αᵢ] [xᵢ₋₁, yᵢ₋₁]ᵀ = cos(αᵢ) [1, −tan αᵢ; tan αᵢ, 1] [xᵢ₋₁, yᵢ₋₁]ᵀ

CORDIC performs a pseudo-rotation that chooses tan αᵢ = ±2⁻ⁱ, so each step reduces to shifts and adds:
xᵢ = K (xᵢ₋₁ ∓ yᵢ₋₁ · 2⁻ⁱ), yᵢ = K (yᵢ₋₁ ± xᵢ₋₁ · 2⁻ⁱ), θ = Σᵢ ±arctan 2⁻ⁱ

Benefits:
• No multiplications.
• Can be combined with the bit-serial algorithms.
• Constant static αᵢ series for all possible inputs; no LUT required.
• Highly parallelizable for a large SIMD processor (operation-level parallelism).
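A rotation-mode CORDIC sketch in floating point (my own illustration; the fabric runs it in fixed point, and `cordic_sincos` is an assumed name). Starting from (K, 0) folds the constant gain correction into the initial vector:

```python
import math

# Sketch of rotation-mode CORDIC: each iteration adds or subtracts
# arctan(2^-i) to drive the residual angle z to zero; the vector update
# uses only shifts and adds. K compensates for the cos(alpha_i) gains.

def cordic_sincos(theta, iters=24):
    """Return (sin(theta), cos(theta)) for 0 <= theta < pi/2."""
    K = 1.0
    for i in range(iters):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))   # constant gain correction
    x, y, z = K, 0.0, theta
    for i in range(iters):
        d = 1.0 if z >= 0 else -1.0             # steer residual angle to 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return y, x

s, c = cordic_sincos(math.pi / 6)
print(round(s, 4), round(c, 4))  # 0.5 0.866
```

Because d only selects add vs. subtract, every SIMD lane executes the identical shift-add sequence with per-lane predication.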
Outline
Background
Integer / Logical Operations
Floating Point Operations
Transcendental Functions
Execution Model
Programming Model
Compiler
Methodology / Results
Execution Model

[Figure: DualityCache maps onto the LLC. Each 2.5 MB slice (Ways 1-20, CBOX + TMU) contributes ways; each way consists of banks plus tag and C-Box logic. Within a bank, the cache subarrays hold 256 threads side by side, each providing 8 × 32-bit registers per thread.]
Execution Model (cont.)

[Figure: one thread owns 32 × 32-bit registers in total. With 256 threads per bank and 4-wide VLIW-style instruction issue (e.g. add | sub | load | nop), four banks together form a 1,024-thread, 4-wide VLIW SIMD processor.]
Execution Model: instruction issue

[Figure: instructions carry predicate tags (add //!1, sub //!2, …). A tag compare gates each of the four issue windows; each window holds an FSM and command selector driving the decoder, and all 256 threads of a bank execute the surviving slots in lock-step.]
Execution Model: control flow

[Figure: a program counter (PC+1 with jmp_en) indexes a 2,048-entry instruction store (entries 0-2047) feeding the four issue windows, adding branches on top of the predicated issue path.]
Execution Model: memory path

[Figure: loads and stores leave the bank through the C-Box, which contains the TMU (Transposing Memory Unit) and MSHRs, and reach the LLC / main memory. The TMU converts between the transposed bit-serial layout and the normal data layout.]
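The tag-predicated issue described above can be sketched as a tiny interpreter (my own model; `issue_bundle` and the register/tag structures are assumptions): an instruction annotated with a predicate commits only in threads whose tag matches, which is how divergence executes on a lock-step fabric.

```python
# Sketch of tag-predicated VLIW issue: every thread executes the bundle
# in lock-step, but a slot with a predicate tag becomes a nop in threads
# whose tag compare fails.

def issue_bundle(bundle, regs, tags):
    """bundle: list of (op, dst, src1, src2, pred_tag or None) slots."""
    for op, dst, s1, s2, pred in bundle:
        for t, rf in enumerate(regs):            # every thread, lock-step
            if pred is not None and not tags[t].get(pred, False):
                continue                         # predicated off: nop
            if op == 'add':
                rf[dst] = rf[s1] + rf[s2]
            elif op == 'sub':
                rf[dst] = rf[s1] - rf[s2]
    return regs

regs = [{'r0': 5, 'r1': 2, 'r2': 0}, {'r0': 5, 'r1': 2, 'r2': 0}]
tags = [{1: True}, {1: False}]          # thread 1 fails the tag compare
print(issue_bundle([('add', 'r2', 'r0', 'r1', 1)], regs, tags))
# [{'r0': 5, 'r1': 2, 'r2': 7}, {'r0': 5, 'r1': 2, 'r2': 0}]
```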
Programming Model

[Figure: SIMT hierarchy. A grid is made of independent CTAs (thread blocks); each block is made of warps of threads.]

// kernel
__global__ void vecadd(const float* A, const float* B, float* C, int n_el) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n_el)
        C[tid] = A[tid] + B[tid];
}
// kernel invocation
vecadd(A, B, C, n_el);
Programming Model: OpenACC

A simple and seamless directive-based parallel programming paradigm for heterogeneous architectures: directives manage data movement, mark parallel regions, and optimize loop mapping.
Target architectures: native x86, NVIDIA GPU, OpenCL. Compiler support: PGI, gcc 9.1, RIKEN Omni, OpenUH.
Programming Model: inter-thread communication

[Figure: independent CTAs (thread blocks) map onto banks; communication between threads in different banks or ways travels through the C-Boxes over a crossbar (XBar) connecting the DualityCache slices.]
Resource Comparison

[Figure: Xeon E5 35 MB LLC (DualityCache) vs. NVIDIA GP100 SMs. Register capacity: 4 DC banks ≥ one SM's register files. PU resources: 32 × 32-bit registers per thread (annotated: 1,024 threads, 32 PUs vs. 256 PUs); each SM pairs warp schedulers with 2 × 32-lane core groups plus shared mem / L1. At system scale (2-socket CPU + GPU): ×560 vs. ×28 of these units, giving DualityCache ~150× more PUs and ~10× more threads.]
Compiler

CUDA path: CUDA source → nvcc (NVIDIA CUDA Compiler) → CUDA binary (ELF, PTX, SASS) → CUDA Runtime.
OpenACC path: ompcc (RIKEN Omni OpenACC compiler frontend) lowers OpenACC programs into the same flow.
Duality Cache path: the Duality Cache Compiler (Kernel Analysis → PTX Optimizer → Instruction Scheduler → Resource Allocator) emits a DC binary (ELF, DC-PTX) for the DC Runtime. Built on the GPU Ocelot framework.
Compiler Optimization (1): Register-Pressure-Aware Instruction Scheduling

Instruction scheduling first: ✓ maximizes parallelism by exploiting the abundant shared register resources of the VLIW architecture, ! but results in frequent register spills.
Resource allocation first: ✓ optimizes (small) resource usage and reuse distance, ! but introduces many false dependencies (less parallelism).
⇒ Interactively perform resource allocation and instruction scheduling: a Bottom-Up-Greedy (BUG)-based scheduler cooperating with a linear-scan-based register allocator over the basic-block DFG and CFG.

[Figure: scheduling BB1-BB4 across PU1/PU2. A schedule that ignores register pressure incurs network delay and register spills (large execution time); the register-pressure-aware schedule balances alive-in/alive-out values across PUs.]
Compiler Optimization (1), cont.

Bottom-Up Greedy [Ellis 1986]: collect candidate assignments, then make final assignments, minimizing data-transfer latency by taking both operand and successor locations into consideration, while tracking register pressure (alive-in/alive-out values per PU over time).
Compiler Optimization (2): PTX Optimizations

I. AST balancing: fold associative expressions (a + b + c + d) and regenerate a balanced AST subtree targeting max ILP = 4, so the left-leaning chain ((a + b) + c) + d becomes (a + b) + (c + d).

II. Thread-independent variable isolation:

__device__ kernel {
    int bid = blockIdx.x;
    for (int i = 0; i < ITER_CNT; ++i) {
        // Do something
    }
}

bid and i are thread-independent variables ⇒ loop-unroll / constant-fold them, and do not store them in thread-private registers, reducing register pressure.
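The AST-balancing pass can be sketched in a few lines (my own model; `balance` and `depth` are assumed names): rebuilding a folded chain of associative terms as a balanced tree shortens the critical path, which is what exposes ILP to the 4-wide issue.

```python
# Sketch of AST balancing: a left-associated chain a + b + c + d has
# depth 3; regrouping it as (a + b) + (c + d) has depth 2, so the two
# inner adds can issue in parallel on a multi-wide VLIW slot.

def balance(terms):
    """Rebuild an associative expression chain as a balanced tuple-tree."""
    if len(terms) == 1:
        return terms[0]
    mid = len(terms) // 2
    return ('+', balance(terms[:mid]), balance(terms[mid:]))

def depth(node):
    if not isinstance(node, tuple):
        return 0
    return 1 + max(depth(node[1]), depth(node[2]))

left_assoc = ('+', ('+', ('+', 'a', 'b'), 'c'), 'd')   # ((a+b)+c)+d
balanced = balance(['a', 'b', 'c', 'd'])               # (a+b)+(c+d)
print(depth(left_assoc), depth(balanced))  # 3 2
```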
Outline
Background
Integer / Logical Operations
Floating Point Operations
Transcendental Functions
Execution Model
Programming Model
Compiler
Methodology / Results
Evaluation Methodology

♦ Benchmarks
− Rodinia: backprop, bfs, b+tree, dwt2d, hotspot, hotspot3D, hybridsort, nw, streamcluster, gaussian, heartwall, leukocyte, lud, nn
− PathScale OpenACC benchmark: divergence, gradient, lapgsrb, laplacian, tricubic, tricubic2, uxx1, vecadd, wave13pt, gameoflife, gaussblur, matvec, whispering

                      CPU (2 sockets)          GPU (1 card, PCIe v3)     Duality Cache
Processor             Intel Xeon E5-2597 v3,   NVIDIA GP100, 1.6 GHz,    35 MB Duality Cache,
                      2.6 GHz, 28 cores,       3840 CUDA cores           2.6 GHz
                      56 threads
On-chip memory        78.96 MB                 9.14 MB                   78.96 MB
Off-chip memory       64 GB DDR4               12 GB GDDR5 +             64 GB DDR4
                                               64 GB DDR4 (host)
Total system area     912 mm²                  1383 mm²                  942 mm²
TDP                   290 W                    640 W                     296 W
Perf. profiler/sim.   perf                     NVPROF                    GPU Ocelot + Ramulator
Energy profiler/sim.  Intel RAPL interface     NVIDIA System Mgmt.       Trace-based simulation
                                               Interface
Area / Power Overhead

                  Area (mm²)  Power (W)  Area overhead
CPU               456         145        -
Compute Cache
  Peripheral      3.15        2.96       0.69%
  TMU             5.32        0.06       1.17%
  Controller/FSM  6.16        0.33       1.35%
  MSHR            0.86        0.05       0.19%
  Local crossbar  0.28        0.01       0.06%
Total             471.77      148.4      3.50%

Duality Cache has only 3.5% area overhead.
System Speedup

CPU, DC: DDR4; GPU: GDDR5 + memcpy (DDR4 ↔ PCIe v3 ↔ GDDR5).

[Chart: speedup normalized to CPU for the Rodinia and OpenACC suites, bars for CPU / GPU / Duality Cache; annotated values 20×, 3.6×, 2.4×, 4.0×.]

Key performance factors: reduced memcpy time, massively parallel execution, compute / memory-access overlap, and flexible cache allocation. Duality Cache avoids the data transfer (memcpy) cost over the external bus (PCIe); a small reuse factor or large working set makes that memcpy costly for the GPU.
[Chart: kernel size (#CTAs) per benchmark across the Rodinia and OpenACC suites, with reference lines at 560 CTAs (2 sockets) and 280 CTAs (1 socket).]
Massively Parallel Execution

PU utilization: some kernels have a high degree of parallelism (e.g. backprop, b+tree, nn, gaussian, gaussblur). Duality Cache fits 280 CTAs per socket vs. 60 CTAs per GPU card, based on register capacity.

[Chart: speedup for Rodinia and OpenACC, bars for CPU / GPU / Duality Cache; annotated values 20×, 3.6×, 2.4×, 4.0×.]
Kernel Speedup

DC: GDDR5; GPU: GDDR5, no memcpy. Kernels with a high degree of parallelism (e.g. backprop, b+tree, nn, gaussian, gaussblur) significantly reduce execution time; memory-bound applications (hotspot3D, streamcluster) would benefit from GDDR5.

[Chart: kernel-only speedup for Rodinia and OpenACC, bars for CPU / GPU / Duality Cache; annotated values 20×, 3.6×, 2.4×, 4.0×.]
Operation Latency

[Chart: latency in cycles (0-2000), baseline vs. optimized, for mul, div, fpadd, fpmul, fpdiv.]

Integer multiplication is often used to compute addresses or values derived from induction variables, so its operands contain many leading zeros that our optimization can skip (13× faster). Floating-point addition has a small dynamic range, so the distribution of unique ediffs usually peaks at 1 (6.1× faster).
Conclusion

Contributions: an in-cache computing framework for general-purpose programming.
• Used the SIMT programming model for the programming frontend
• Developed a compiler for in-cache computing
• Enhanced computation primitives (all digital)

Results:
• Performance: 72× over CPU, 4× over GPU
• Energy: 20× over CPU, 5.8× over GPU
• Overall efficiency: 1450× over CPU, 52× over GPU
• Area: Duality Cache needs 15.7 mm² (3.5% over CPU); a GPU is 471 mm²
Backup Slides
DC + CPU Hybrid Execution

[Chart: normalized execution time for lud, Duality Cache vs. Duality Cache + CPU, split into DC compute, DC memory, and CPU time; hybrid execution is 67% better.]

Kernel launch patterns (time vs. #CTAs launched): BFS issues many small kernels, so execute the small kernels on CPU threads; LUD has enough parallelism to keep all PUs busy.
System Energy

Because of the reduced execution time, we achieve 5.34× energy efficiency compared to the GPU. One CPU core stays active during execution to serve instructions, which makes the CPU and DRAM accesses dominant in energy consumption.
Average Power

[Chart: average power (W) per benchmark, 0-140 W range. Mean: 88.8 W; max: 120.1 W.]
Transpose

[Figure: the Transposing Memory Unit (TMU) sits beside the CBOX of each slice. An array of 8-T transpose bit-cells supports regular read/write along one dimension and transposed read/write along the orthogonal one, converting words A0, A1, A2 / B0, B1, B2 / C0, C1, C2 between the normal layout and the bit-serial layout (MSB…LSB of each word laid out along a bitline).]
Duality Cache Operation Modes

• CPU-only mode: all slices (1-18) serve the cores as a normal LLC.
• Duality Cache-only mode: all slices compute.
• Hybrid mode, slice-partitioned: e.g. slices 1-6 serve non-DC threads while slices 7-18 compute.
• Hybrid mode, way-partitioned: e.g. 18 ways compute and 2 ways cache per slice; the way mask is implemented by CAT.

[Figure legend: cores running non-DC threads vs. cores running DC-related threads; LLC used for non-DC threads vs. LLC used for DC-related threads.]