Duality Cache for Data Parallel Acceleration
Daichi Fujiki, Scott Mahlke, Reetuparna Das
University of Michigan, M-Bits Research Group
Duality = Storage + Compute
Why compute in-cache?
Transforming caches into massively parallel vector ALUs

[Figure: an 18-core Xeon processor with a 45 MB LLC. Each 2.5 MB LLC slice (Ways 1-20, CBOX + TMU) is built from 32 kB data banks of 8 kB SRAM arrays. A bitline ALU at the sense amplifiers of each array computes A & B, A ^ B, ~A & ~B, and a full-adder sum S = A ^ B ^ C with a carry latch. With data stored transposed (Bit-Slices 3…0 along a bitline), each bitline pair can add two stored words: A + B.]

18 LLC slices → 360 ways → 5,760 arrays → 1,474,560 ALUs
Transforming caches into massively parallel vector ALUs (cont.)

[Figure: the same slice → bank → 8 kB SRAM array → bitline-ALU breakdown as the previous slide.]

Passive Last Level Cache transformed into ~1 million bit-serial active ALUs:
✓ Add ✓ Multiply ✓ Divide
Bit-serial operation @ 2.5 GHz, fixed point (configurable precision)
45 MB LLC → 18 LLC slices → 360 ways → 5,760 arrays → 1,474,560 ALUs
Neural Cache [ISCA '18]

CPU + SRAM: digital bit-serial operations @ 2.5 GHz, with 1 million bit-line ALUs toggling between cache mode and accelerator mode (35 MB LLC).
Supported operations: ✓ Integer operations
Programming model: ✓ Manually mapped CNN kernels (the CPU drives the neural network)
Duality Cache

CPU + SRAM: digital bit-serial operations @ 2.5 GHz, with 1 million bit-line ALUs toggling between cache mode and accelerator mode (35 MB LLC).
Supported operations: ✓ Integer operations ✓ Floating point operations ✓ Transcendental functions (sin, cos, log, etc.)
Programming model: ✓ SIMT & VLIW execution ✓ CUDA / OpenACC programs, targeting general data-parallel applications
Duality Cache Benefits

Baseline: CPU + SRAM + DDR4. A GPU system adds a GPU + GDDR5 behind PCIe; Duality Cache computes in the CPU's own SRAM.
1. Reduced data movement: ✓ no memcpy; efficient serial/parallel interleaving.
2. Cost: +471 mm², +250 W (GPU) vs. ✓ +30 mm², +6 W ≈ 3.5% (DC).
3. On-chip memory capacity: ✓ ~10× thread capacity; flexible cache allocation.
Outline
Background
Integer / Logical Operations
Floating Point Operations
Transcendental Functions
Execution Model
Programming Model
Compiler
Methodology / Results
Logical Operations In-SRAM

[Figure: an SRAM array with bitlines (BL0/BLB0 … BLn/BLBn), wordlines, and a row decoder. Changes over a standard array: an additional row decoder, and the differential sense amplifiers made reconfigurable into single-ended sense amplifiers compared against Vref. Activating two wordlines A and B simultaneously, the single-ended amplifier on BL senses A AND B, while the one on BLB senses A NOR B.]
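A minimal functional sketch of the bitline behavior above (my own illustration, not the paper's circuit; `in_sram_logic` is a name I chose): with both wordlines raised, BL stays high only if both cells hold 1, and BLB stays high only if both hold 0.

```python
# Sketch (not RTL): sensing two simultaneously activated SRAM rows.
# The single-ended sense amp on BL reads A AND B; the one on BLB
# reads A NOR B, per bitline.

def in_sram_logic(row_a, row_b):
    """Per-bitline result of activating rows A and B together."""
    and_bits = [a & b for a, b in zip(row_a, row_b)]        # sensed on BL
    nor_bits = [1 - (a | b) for a, b in zip(row_a, row_b)]  # sensed on BLB
    return and_bits, nor_bits

and_r, nor_r = in_sram_logic([0, 1, 1, 0], [0, 1, 0, 1])
print(and_r)  # [0, 1, 0, 0]
print(nor_r)  # [1, 0, 0, 0]
```

NOT and the remaining logic ops are composed from these two primitives plus copies.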
Bit-Serial Integer Operations: A + B

[Figure: operands stored transposed. Arrays A and B hold Words 0-3 as bit columns; the row decoders activate one bit-slice of A and one of B per cycle, and each bitline ALU latches a per-word carry while writing the sum bit S back.]

In-memory full-adder recurrence, one bit position per step:
s_i = a_i ⊕ b_i ⊕ c_{i−1}
c_i = a_i · b_i + (a_i ⊕ b_i) · c_{i−1}
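The recurrence above can be sketched as a functional model (my own illustration; `bitserial_add` is an assumed name): one outer iteration per bit position, with every SIMD lane's carry held in its own latch.

```python
# Sketch of bit-serial addition across SIMD lanes: one cycle per bit
# position, computing s_i = a_i ^ b_i ^ c and the carry
# c = a_i & b_i | (a_i ^ b_i) & c in every lane's bitline ALU.

def bitserial_add(lanes_a, lanes_b, nbits):
    """lanes_a / lanes_b: lists of unsigned ints, one per SIMD lane."""
    carry = [0] * len(lanes_a)          # per-lane carry latch
    out = [0] * len(lanes_a)
    for i in range(nbits):              # one bit-slice per cycle
        for lane, (a, b) in enumerate(zip(lanes_a, lanes_b)):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            s = ai ^ bi ^ carry[lane]
            carry[lane] = (ai & bi) | ((ai ^ bi) & carry[lane])
            out[lane] |= s << i
    return out

print(bitserial_add([9, 250, 7], [25, 5, 1], 9))  # [34, 255, 8]
```

The cycle count depends only on the bit width, not on the number of lanes, which is where the massive parallelism comes from.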
Bit-Serial Integer Operations: A × B

[Figure: same transposed layout; an extra tag row (initialized to 0) predicates the add, so the shifted multiplicand is accumulated only in words whose current multiplier bit is 1.]

Multiplication as a series of predicated shift-adds:
pp(i) = a_i × b × 2^i   (partial product, predicated by tag = a_i)
p(1) = pp(1); p(2) = p(1) + pp(2); …; p(n) = p(n−1) + pp(n)
where a_i := i-th bit of bit vector a, and x(i) := bit vector x after the i-th iteration.
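A sketch of the predicated shift-add loop above (my own model; `bitserial_mul` is an assumed name), including the Zero Tag Search shortcut that skips an iteration when no lane's multiplier bit is set:

```python
# Sketch of predicated bit-serial multiplication: iteration i adds the
# shifted multiplicand b << i only in lanes whose tag (multiplier bit
# a_i) is 1, i.e. p(i) = p(i-1) + pp(i) with pp(i) = a_i * b * 2^i.

def bitserial_mul(lanes_a, lanes_b, nbits):
    prod = [0] * len(lanes_a)
    for i in range(nbits):
        tags = [(a >> i) & 1 for a in lanes_a]   # predication tags
        if not any(tags):                        # Zero Tag Search: skip
            continue
        for lane, b in enumerate(lanes_b):
            if tags[lane]:
                prod[lane] += b << i             # add partial product
    return prod

print(bitserial_mul([6, 3, 0], [7, 10, 99], 8))  # [42, 30, 0]
```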
Bit-Serial Floating Point Operations: A + B

[Figure: two SIMD lanes, fields sgn | exp | mnt. Lane 1: (+9.25 × 10²) + (+2.50 × 10¹), |ediff| = 1, so a 1-bit shift denormalizes the smaller operand to +0.25 × 10². Lane 2: (−1.025 × 10⁴) + (+2.50 × 10²), |ediff| = 2, giving +0.025 × 10⁴ after a 2-bit shift.]

① Mantissa denormalization: align the smaller mantissa to the larger exponent.
② Convert mantissas into 2's-complement format (−1.025 × 10⁴ becomes 998.975 × 10⁴ in this decimal analogy).
Problems: up to mnt_bits × #lanes cycles for the shift ops, plus the signed → 2's-complement conversion cost.
Bit-Serial Floating Point Operations: A + B, optimizations

[Figure: the same two lanes, with two changes. (a) Preemptive conversion to 2's-complement format: mantissas are stored in complement form (the msb carries the sign), so no per-add conversion is needed. (b) Unique ediff enumeration using CAM search: uniq ediffs = {1, 2}, so each distinct shift amount is applied once across all matching lanes instead of once per lane.]
Bit-Serial Floating Point Operations: A + B, shift cost

[Figure: 256 SIMD lanes, each with its own |ediff| (1, 2, 2, …, 1). Naively, every lane costs an 8-cycle ediff read + 23-cycle mantissa shift, or 7,936 cycles total. With unique-ediff enumeration (✓ uniq ediffs = {1, 2}), only one 23-cycle mantissa shift per distinct ediff is needed, plus the CAM-search cycles.]
Bit-Serial Floating Point Operations: A + B, completing the add

③ Perform addition: lane 1: 9.25 + 0.25 = 10.0000 (× 10²); lane 2: 998.975 + 0.025 = 999.000 (× 10⁴).
④ Convert back to sign expression: +10.0000 × 10², −1.00000 × 10⁴.
⑤ Normalization: +1.0000 × 10³, −1.00000 × 10⁴.
Bit-Serial Floating Point Operations: A + B, reconversion

Same steps ③-⑤, but the conversion back from 2's complement to signed form (step ④) is compiler-directed: it is performed only where a later consumer needs the signed representation, rather than after every add.
Bit-Serial Floating Point Operations: A + B, fused shift+add

③ Perform shift+add in one pass: s_i = a_i + b_{i+ediff}, i.e. the denormalizing shift is folded into the addition's bit indexing instead of being a separate pass.
④ Partial normalization: lane 1: 10.0000 × 10² → 1.0000 × 10³; the 2's-complement lane (999.000 × 10⁴) is left as-is.
Bit-Serial Floating Point Addition In-Array

[Figure: transposed Arrays A and B, each word split into sgn | exp | mnt fields; the bitline ALUs compute S = A + B.]

1. Convert mantissas into 2's complement.
2. Calculate ediff = exp_a − exp_b; if ediff[i] < 0 then swap(A[i], B[i]).
3. Enumerate unique ediffs using search → uniq_ediff = {1}.
4. ARSHADD: for each uniq_ediff, compute A[i] + (B[i] >> ediff).
5. Normalize exp: if bit_overflow then exp_c = exp_a + 1; mnt_c >>= 1.
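The five-step flow above can be sketched as a small functional model (my own illustration with integer mantissas; `fp_add_lanes` and the (mantissa, exponent) representation are assumptions, not the paper's format):

```python
# Sketch of the in-array FP-add flow: swap so A has the larger exponent,
# enumerate the unique exponent differences (a CAM search in hardware),
# then perform one shift-add (ARSHADD) per distinct ediff across all
# matching lanes. Values are (integer mantissa, exponent) pairs.

def fp_add_lanes(lanes):
    """lanes: list of ((mnt_a, exp_a), (mnt_b, exp_b)) pairs."""
    # Steps 1-2: compute ediff, swapping so ediff >= 0.
    prepped = []
    for (ma, ea), (mb, eb) in lanes:
        if ea < eb:
            (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
        prepped.append((ma, ea, mb, ea - eb))
    # Step 3: unique-ediff enumeration.
    uniq = sorted({e for _, _, _, e in prepped})
    # Step 4: one ARSHADD pass per unique ediff, over all matching lanes.
    out = [None] * len(prepped)
    for ed in uniq:
        for i, (ma, ea, mb, e) in enumerate(prepped):
            if e == ed:
                out[i] = (ma + (mb >> ed), ea)   # A + (B >> ediff)
    return out

# Two lanes share ediff = 1, one has ediff = 3: only 2 shift-add passes.
print(fp_add_lanes([((12, 5), (8, 4)), ((6, 2), (10, 1)), ((5, 7), (16, 4))]))
# [(16, 5), (11, 2), (7, 7)]
```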
Latency Optimizations (Integer, FP)

Leading Zero Search (LZS): search for leading zeros across all SIMD lanes (e.g. 0000101101, 0000111101, …) to skip inner-loop iterations and reduce inner-loop cycles. Multiplication drops from O(n²) to O(n_a × n_b), where n_a and n_b are the significant bit lengths of the operands (001010 × 000101).

Zero Tag Search (ZTS): check whether the tag bits are all zero and, if so, skip the inner-loop iteration entirely. Useful for (constant) integer multiplications.
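A sketch of how LZS shrinks the multiplication loop (my own model; `significant_bits` and `mul_with_lzs` are assumed names):

```python
# Sketch of Leading Zero Search: one search across all lanes finds the
# leading zeros shared by the whole SIMD vector, shrinking the
# multiplication loop from n iterations to n_a significant bits.

def significant_bits(lanes):
    """n minus the leading zeros common to every lane (hardware: LZS)."""
    return max((v.bit_length() for v in lanes), default=0)

def mul_with_lzs(lanes_a, lanes_b, nbits):
    na = significant_bits(lanes_a)       # skip leading-zero iterations
    prod = [0] * len(lanes_a)
    for i in range(na):                  # n_a iterations instead of nbits
        for lane, (a, b) in enumerate(zip(lanes_a, lanes_b)):
            if (a >> i) & 1:
                prod[lane] += b << i
    return prod

print(mul_with_lzs([5, 3], [6, 6], 32))  # [30, 18], in 3 iterations not 32
```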
Optimizations with CAM Search

[Figure: a 2-cycle in-SRAM CAM search over transposed data. Cycle 1: activate the rows where the search string holds a 1 and reduce with AND. Cycle 2: activate the rows where it holds a 0 and reduce with NOR. A lane's tag is set only if both cycles match.]

Ediff enumeration:
1. Perform leading zero search to limit the search space for CAM search (e.g. ∀ediff < 8 ⇒ 4-bit CAM search).
2. Perform CAM search over the ediff vector for values 0 ≤ i ≤ 2ⁿ.
3. On any hit (wired-OR of the tags), perform ARSHADD.
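The 2-cycle search can be modeled functionally (my own sketch; `cam_search` is an assumed name): a lane matches the key only if its 1-columns AND-reduce true and its 0-columns NOR-reduce true.

```python
# Sketch of the 2-cycle in-SRAM CAM search on transposed data.
# Cycle 1: AND over the bit columns where the key has a 1.
# Cycle 2: NOR over the columns where the key has a 0.
# A lane's tag is set only if both cycles pass.

def cam_search(lanes, key, nbits):
    tags = []
    for v in lanes:
        ones_ok = all((v >> i) & 1 for i in range(nbits) if (key >> i) & 1)
        zeros_ok = not any((v >> i) & 1
                           for i in range(nbits) if not (key >> i) & 1)
        tags.append(int(ones_ok and zeros_ok))
    return tags                  # wired-OR of tags signals "any hit"

ediffs = [1, 2, 1, 5]
print(cam_search(ediffs, 1, 4))  # [1, 0, 1, 0] -> lanes with ediff == 1
print(cam_search(ediffs, 2, 4))  # [0, 1, 0, 0]
```

Searching each candidate ediff this way yields the unique-ediff set and the per-lane predication tags for ARSHADD in one pass.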
Supported In-Cache Operations

Operation     Type         Algorithm        Latency  Optimizations
add, sub      uint, int    Bit-serial [2]   O(n)     -
mul           uint, int    Bit-serial [2]   O(n²)    LZS (multiplicand), ZTS
div, rem      uint, int    Bit-serial [2]   O(n²)    LZS (dividend, divisor), ZTS, ZRS
and, or, xor  uint         [1]              O(n)     -
shl, shr      uint, int    Bit-serial       O(n²)    LZS, CAM search
add, sub      float        Bit-serial       O(n²)    LZS, CAM search
mul           float        Bit-serial       O(n²)    ZTS
div           float        Bit-serial       O(n²)    ZTS, ZRS
sin, cos      fixed point  CORDIC           O(nk)    Parallel CORDIC
exp           fixed point  CORDIC           O(nk)    Parallel CORDIC
log           fixed point  CORDIC           O(nk)    Parallel CORDIC
sqrt          fixed point  CORDIC           O(nk)    Parallel CORDIC
rsqrt         float        Fast Inv Sqrt    O(n²)    -

[1] Compute Caches (HPCA '17); [2] Neural Cache (ISCA '18)
n: data bit length; k: CORDIC iteration count
LZS: Leading Zero Search; ZTS: Zero Tag Search; ZRS: Zero Residue Search
Transcendental Functions Using CORDIC

Express an angle θ (0 ≤ θ < π/2) using add/sub of a fixed series: θ = Σᵢ ±αᵢ.

Plain vector rotation requires multiplications with tan(αᵢ):
[xᵢ, yᵢ]ᵀ = [cos αᵢ, −sin αᵢ; sin αᵢ, cos αᵢ] [xᵢ₋₁, yᵢ₋₁]ᵀ = cos(αᵢ) [1, −tan αᵢ; tan αᵢ, 1] [xᵢ₋₁, yᵢ₋₁]ᵀ

CORDIC performs a pseudo-rotation that chooses tan αᵢ = ±2⁻ⁱ, so each step reduces to shifts and adds:
xᵢ = K (xᵢ₋₁ ∓ yᵢ₋₁ · 2⁻ⁱ), yᵢ = K (yᵢ₋₁ ± xᵢ₋₁ · 2⁻ⁱ), θ = Σᵢ ±arctan 2⁻ⁱ

Benefits:
• No multiplications.
• Can be combined with the bit-serial algorithms.
• Constant static αᵢ series for all possible inputs; no LUT required.
• Highly parallelizable for a large SIMD processor (operation-level parallelism).
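A rotation-mode CORDIC sketch in floating point (my own illustration; the fabric runs it in fixed point, and `cordic_sincos` is an assumed name). Starting from (K, 0) folds the constant gain correction into the initial vector:

```python
import math

# Sketch of rotation-mode CORDIC: each iteration adds or subtracts
# arctan(2^-i) to drive the residual angle z to zero; the vector update
# uses only shifts and adds. K compensates for the cos(alpha_i) gains.

def cordic_sincos(theta, iters=24):
    """Return (sin(theta), cos(theta)) for 0 <= theta < pi/2."""
    K = 1.0
    for i in range(iters):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))   # constant gain correction
    x, y, z = K, 0.0, theta
    for i in range(iters):
        d = 1.0 if z >= 0 else -1.0             # steer residual angle to 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return y, x

s, c = cordic_sincos(math.pi / 6)
print(round(s, 4), round(c, 4))  # 0.5 0.866
```

Because d only selects add vs. subtract, every SIMD lane executes the identical shift-add sequence with per-lane predication.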
Outline
Background
Integer / Logical Operations
Floating Point Operations
Transcendental Functions
Execution Model
Programming Model
Compiler
Methodology / Results
Execution Model

[Figure: DualityCache maps onto the LLC. Each 2.5 MB slice (Ways 1-20, CBOX + TMU) contributes ways; each way consists of banks plus tag and C-Box logic. Within a bank, the cache subarrays hold 256 threads side by side, each providing 8 × 32-bit registers per thread.]
Execution Model (cont.)

[Figure: one thread owns 32 × 32-bit registers in total. With 256 threads per bank and 4-wide VLIW-style instruction issue (e.g. add | sub | load | nop), four banks together form a 1,024-thread, 4-wide VLIW SIMD processor.]
Execution Model: instruction issue

[Figure: instructions carry predicate tags (add //!1, sub //!2, …). A tag compare gates each of the four issue windows; each window holds an FSM and command selector driving the decoder, and all 256 threads of a bank execute the surviving slots in lock-step.]
Execution Model: control flow

[Figure: a program counter (PC+1 with jmp_en) indexes a 2,048-entry instruction store (entries 0-2047) feeding the four issue windows, adding branches on top of the predicated issue path.]
Execution Model: memory path

[Figure: loads and stores leave the bank through the C-Box, which contains the TMU (Transposing Memory Unit) and MSHRs, and reach the LLC / main memory. The TMU converts between the transposed bit-serial layout and the normal data layout.]
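The tag-predicated issue described above can be sketched as a tiny interpreter (my own model; `issue_bundle` and the register/tag structures are assumptions): an instruction annotated with a predicate commits only in threads whose tag matches, which is how divergence executes on a lock-step fabric.

```python
# Sketch of tag-predicated VLIW issue: every thread executes the bundle
# in lock-step, but a slot with a predicate tag becomes a nop in threads
# whose tag compare fails.

def issue_bundle(bundle, regs, tags):
    """bundle: list of (op, dst, src1, src2, pred_tag or None) slots."""
    for op, dst, s1, s2, pred in bundle:
        for t, rf in enumerate(regs):            # every thread, lock-step
            if pred is not None and not tags[t].get(pred, False):
                continue                         # predicated off: nop
            if op == 'add':
                rf[dst] = rf[s1] + rf[s2]
            elif op == 'sub':
                rf[dst] = rf[s1] - rf[s2]
    return regs

regs = [{'r0': 5, 'r1': 2, 'r2': 0}, {'r0': 5, 'r1': 2, 'r2': 0}]
tags = [{1: True}, {1: False}]          # thread 1 fails the tag compare
print(issue_bundle([('add', 'r2', 'r0', 'r1', 1)], regs, tags))
# [{'r0': 5, 'r1': 2, 'r2': 7}, {'r0': 5, 'r1': 2, 'r2': 0}]
```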
Programming Model

[Figure: SIMT hierarchy. A grid is made of independent CTAs (thread blocks); each block is made of warps of threads.]

// kernel
__global__ void vecadd(const float* A, const float* B, float* C, int n_el) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n_el)
        C[tid] = A[tid] + B[tid];
}
// kernel invocation
vecadd(A, B, C, n_el);
Programming Model: OpenACC

A simple and seamless directive-based parallel programming paradigm for heterogeneous architectures: directives manage data movement, mark parallel regions, and optimize loop mapping.
Target architectures: native x86, NVIDIA GPU, OpenCL. Compiler support: PGI, gcc 9.1, RIKEN Omni, OpenUH.
Programming Model: inter-thread communication

[Figure: independent CTAs (thread blocks) map onto banks; communication between threads in different banks or ways travels through the C-Boxes over a crossbar (XBar) connecting the DualityCache slices.]
Resource Comparison

[Figure: Xeon E5 35 MB LLC (DualityCache) vs. NVIDIA GP100 SMs. Register capacity: 4 DC banks ≥ one SM's register files. PU resources: 32 × 32-bit registers per thread (annotated: 1,024 threads, 32 PUs vs. 256 PUs); each SM pairs warp schedulers with 2 × 32-lane core groups plus shared mem / L1. At system scale (2-socket CPU + GPU): ×560 vs. ×28 of these units, giving DualityCache ~150× more PUs and ~10× more threads.]
Compiler

CUDA path: CUDA source → nvcc (NVIDIA CUDA Compiler) → CUDA binary (ELF, PTX, SASS) → CUDA Runtime.
OpenACC path: ompcc (RIKEN Omni OpenACC compiler frontend) lowers OpenACC programs into the same flow.
Duality Cache path: the Duality Cache Compiler (Kernel Analysis → PTX Optimizer → Instruction Scheduler → Resource Allocator) emits a DC binary (ELF, DC-PTX) for the DC Runtime. Built on the GPU Ocelot framework.
Compiler Optimization (1): Register-Pressure-Aware Instruction Scheduling

Instruction scheduling first: ✓ maximizes parallelism by exploiting the abundant shared register resources of the VLIW architecture, ! but results in frequent register spills.
Resource allocation first: ✓ optimizes (small) resource usage and reuse distance, ! but introduces many false dependencies (less parallelism).
⇒ Interactively perform resource allocation and instruction scheduling: a Bottom-Up-Greedy (BUG)-based scheduler cooperating with a linear-scan-based register allocator over the basic-block DFG and CFG.

[Figure: scheduling BB1-BB4 across PU1/PU2. A schedule that ignores register pressure incurs network delay and register spills (large execution time); the register-pressure-aware schedule balances alive-in/alive-out values across PUs.]
Compiler Optimization (1), cont.

Bottom-Up Greedy [Ellis 1986]: collect candidate assignments, then make final assignments, minimizing data-transfer latency by taking both operand and successor locations into consideration, while tracking register pressure (alive-in/alive-out values per PU over time).
Compiler Optimization (2): PTX Optimizations

I. AST balancing: fold associative expressions (a + b + c + d) and regenerate a balanced AST subtree targeting max ILP = 4, so the left-leaning chain ((a + b) + c) + d becomes (a + b) + (c + d).

II. Thread-independent variable isolation:

__device__ kernel {
    int bid = blockIdx.x;
    for (int i = 0; i < ITER_CNT; ++i) {
        // Do something
    }
}

bid and i are thread-independent variables ⇒ loop-unroll / constant-fold them, and do not store them in thread-private registers, reducing register pressure.
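The AST-balancing pass can be sketched in a few lines (my own model; `balance` and `depth` are assumed names): rebuilding a folded chain of associative terms as a balanced tree shortens the critical path, which is what exposes ILP to the 4-wide issue.

```python
# Sketch of AST balancing: a left-associated chain a + b + c + d has
# depth 3; regrouping it as (a + b) + (c + d) has depth 2, so the two
# inner adds can issue in parallel on a multi-wide VLIW slot.

def balance(terms):
    """Rebuild an associative expression chain as a balanced tuple-tree."""
    if len(terms) == 1:
        return terms[0]
    mid = len(terms) // 2
    return ('+', balance(terms[:mid]), balance(terms[mid:]))

def depth(node):
    if not isinstance(node, tuple):
        return 0
    return 1 + max(depth(node[1]), depth(node[2]))

left_assoc = ('+', ('+', ('+', 'a', 'b'), 'c'), 'd')   # ((a+b)+c)+d
balanced = balance(['a', 'b', 'c', 'd'])               # (a+b)+(c+d)
print(depth(left_assoc), depth(balanced))  # 3 2
```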
Outline
Background
Integer / Logical Operations
Floating Point Operations
Transcendental Functions
Execution Model
Programming Model
Compiler
Methodology / Results
Evaluation Methodology

♦ Benchmarks
− Rodinia: backprop, bfs, b+tree, dwt2d, hotspot, hotspot3D, hybridsort, nw, streamcluster, gaussian, heartwall, leukocyte, lud, nn
− PathScale OpenACC benchmark: divergence, gradient, lapgsrb, laplacian, tricubic, tricubic2, uxx1, vecadd, wave13pt, gameoflife, gaussblur, matvec, whispering

                      CPU (2 sockets)          GPU (1 card, PCIe v3)     Duality Cache
Processor             Intel Xeon E5-2597 v3,   NVIDIA GP100, 1.6 GHz,    35 MB Duality Cache,
                      2.6 GHz, 28 cores,       3840 CUDA cores           2.6 GHz
                      56 threads
On-chip memory        78.96 MB                 9.14 MB                   78.96 MB
Off-chip memory       64 GB DDR4               12 GB GDDR5 +             64 GB DDR4
                                               64 GB DDR4 (host)
Total system area     912 mm²                  1383 mm²                  942 mm²
TDP                   290 W                    640 W                     296 W
Perf. profiler/sim.   perf                     NVPROF                    GPU Ocelot + Ramulator
Energy profiler/sim.  Intel RAPL interface     NVIDIA System Mgmt.       Trace-based simulation
                                               Interface
Area / Power Overhead

                  Area (mm²)  Power (W)  Area overhead
CPU               456         145        -
Compute Cache
  Peripheral      3.15        2.96       0.69%
  TMU             5.32        0.06       1.17%
  Controller/FSM  6.16        0.33       1.35%
  MSHR            0.86        0.05       0.19%
  Local crossbar  0.28        0.01       0.06%
Total             471.77      148.4      3.50%

Duality Cache has only 3.5% area overhead.
System Speedup

CPU, DC: DDR4; GPU: GDDR5 + memcpy (DDR4 ↔ PCIe v3 ↔ GDDR5).

[Chart: speedup normalized to CPU for the Rodinia and OpenACC suites, bars for CPU / GPU / Duality Cache; annotated values 20×, 3.6×, 2.4×, 4.0×.]

Key performance factors: reduced memcpy time, massively parallel execution, compute / memory-access overlap, and flexible cache allocation. Duality Cache avoids the data transfer (memcpy) cost over the external bus (PCIe); a small reuse factor or large working set makes that memcpy costly for the GPU.
[Chart: kernel size (#CTAs) per benchmark across the Rodinia and OpenACC suites, with reference lines at 560 CTAs (2 sockets) and 280 CTAs (1 socket).]
Massively Parallel Execution

PU utilization: some kernels have a high degree of parallelism (e.g. backprop, b+tree, nn, gaussian, gaussblur). Duality Cache fits 280 CTAs per socket vs. 60 CTAs per GPU card, based on register capacity.

[Chart: speedup for Rodinia and OpenACC, bars for CPU / GPU / Duality Cache; annotated values 20×, 3.6×, 2.4×, 4.0×.]
Kernel Speedup

DC: GDDR5; GPU: GDDR5, no memcpy. Kernels with a high degree of parallelism (e.g. backprop, b+tree, nn, gaussian, gaussblur) significantly reduce execution time; memory-bound applications (hotspot3D, streamcluster) would benefit from GDDR5.

[Chart: kernel-only speedup for Rodinia and OpenACC, bars for CPU / GPU / Duality Cache; annotated values 20×, 3.6×, 2.4×, 4.0×.]
Operation Latency

[Chart: latency in cycles (0-2000), baseline vs. optimized, for mul, div, fpadd, fpmul, fpdiv.]

Integer multiplication is often used to compute addresses or values derived from induction variables, so its operands contain many leading zeros that our optimization can skip (13× faster). Floating-point addition has a small dynamic range, so the distribution of unique ediffs usually peaks at 1 (6.1× faster).
Conclusion

Contributions: an in-cache computing framework for general-purpose programming.
• Used the SIMT programming model for the programming frontend
• Developed a compiler for in-cache computing
• Enhanced computation primitives (all digital)

Results:
• Performance: 72× over CPU, 4× over GPU
• Energy: 20× over CPU, 5.8× over GPU
• Overall efficiency: 1450× over CPU, 52× over GPU
• Area: Duality Cache needs 15.7 mm² (3.5% over CPU); a GPU is 471 mm²
Backup Slides
DC + CPU Hybrid Execution

[Chart: normalized execution time for lud, Duality Cache vs. Duality Cache + CPU, split into DC compute, DC memory, and CPU time; hybrid execution is 67% better.]

Kernel launch patterns (time vs. #CTAs launched): BFS issues many small kernels, so execute the small kernels on CPU threads; LUD has enough parallelism to keep all PUs busy.
System Energy

Because of the reduced execution time, we achieve 5.34× energy efficiency compared to the GPU. One CPU core stays active during execution to serve instructions, which makes the CPU and DRAM accesses dominant in energy consumption.
Average Power

[Chart: average power (W) per benchmark, 0-140 W range. Mean: 88.8 W; max: 120.1 W.]
Transpose

[Figure: the Transposing Memory Unit (TMU) sits beside the CBOX of each slice. An array of 8-T transpose bit-cells supports regular read/write along one dimension and transposed read/write along the orthogonal one, converting words A0, A1, A2 / B0, B1, B2 / C0, C1, C2 between the normal layout and the bit-serial layout (MSB…LSB of each word laid out along a bitline).]
Duality Cache Operation Modes

• CPU-only mode: all slices (1-18) serve the cores as a normal LLC.
• Duality Cache-only mode: all slices compute.
• Hybrid mode, slice-partitioned: e.g. slices 1-6 serve non-DC threads while slices 7-18 compute.
• Hybrid mode, way-partitioned: e.g. 18 ways compute and 2 ways cache per slice; the way mask is implemented by CAT.

[Figure legend: cores running non-DC threads vs. cores running DC-related threads; LLC used for non-DC threads vs. LLC used for DC-related threads.]