Hardware Transactional Memory for GPU Architectures
Wilson W. L. Fung, Inderpeet Singh, Andrew Brownsword, Tor M. Aamodt
University of British Columbia
In Proc. 2011 ACM/IEEE Int’l Symp. Microarchitecture (MICRO-44)


Hardware TM for GPU Architectures 2

Motivation
Lifetime of GPU Application Development
[Figure: three timelines of functionality and performance over development time, labeled "?", "Fine-Grained Locking", and "Transactional Memory".]
E.g. N-Body with 5M bodies
    CUDA SDK: O(n^2), 1640 s (barrier)
    Barnes Hut: O(n log n), 5.2 s (locks)


Are TM and GPUs Incompatible?

GPUs are different from multi-core CPUs
    1000s of concurrent scalar threads
    Challenges (from a TM perspective)
Our solution: KILO TM
    Hardware TM for GPUs

Hardware TM for GPUs
Challenge #1: SIMD Hardware
On GPUs, scalar threads in a warp/wavefront execute in lockstep.
A warp with 4 scalar threads (T0 T1 T2 T3) executes:
    ...
    TxBegin
    LD r2,[B]
    ADD r2,r2,2
    ST r2,[A]
    TxCommit
    ...
[Figure: at TxCommit some threads of the warp are committed and others are aborted. Branch divergence!]


KILO TM – Solution to Challenge #1: SIMD Hardware
Transaction abort
    Like a loop
    Extend SIMT stack
[Figure label: Abort]
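The "abort is like a loop" idea can be sketched in host-side C++. This is only an illustrative model, not the hardware design: the names `runWarpTransaction`, `RetryResult`, and `abortsLeft` are hypothetical, and the per-lane abort count stands in for real conflict detection. Lanes that abort stay in a retry mask and re-execute the transaction body, while committed lanes wait.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of one warp; the lane count and all names are illustrative.
constexpr int kWarpSize = 4;

struct RetryResult {
    int passes;                   // how many times the warp ran the TX body
    std::vector<int> commitPass;  // pass on which each lane committed
};

// Re-executes the transaction for lanes that aborted, like a loop whose
// active mask shrinks on every pass. abortsLeft[lane] is a stand-in for
// conflict detection: how many times that lane aborts before committing.
RetryResult runWarpTransaction(std::vector<int> abortsLeft) {
    assert(static_cast<int>(abortsLeft.size()) == kWarpSize);
    std::uint32_t retryMask = (1u << kWarpSize) - 1;  // all lanes start active
    RetryResult result{0, std::vector<int>(kWarpSize, 0)};

    while (retryMask != 0) {  // TxBegin for the lanes in retryMask
        ++result.passes;
        std::uint32_t nextMask = 0;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            if (!(retryMask & (1u << lane))) continue;  // already committed
            if (abortsLeft[lane] > 0) {
                --abortsLeft[lane];
                nextMask |= 1u << lane;  // aborted: retry on the next pass
            } else {
                result.commitPass[lane] = result.passes;  // committed
            }
        }
        retryMask = nextMask;  // TxCommit: only aborted lanes loop back
    }
    return result;
}
```

For example, `runWarpTransaction({0, 2, 0, 1})` models T0 and T2 committing on the first pass while T3 and T1 need one and two retries respectively.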


Hardware TM for GPUs
Challenge #2: Transaction Rollback
CPU core: register file with 10s of registers
    Checkpoint register file @ TX entry
    Restore it @ TX abort
GPU core (SM): register file with 32k registers, shared by many warps
    2MB total on-chip storage
    Checkpoint?


KILO TM – Solution to Challenge #2: Transaction Rollback
SW register checkpoint
    Most TX: registers overwritten at first use
    TX in Barnes Hut: checkpoint 2 registers

    TxBegin
    LD r2,[B]
    ADD r2,r2,2
    ST r2,[A]
    TxCommit
[Figure labels: "Abort" on the transaction, "Overwritten" on r2.]
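One way to see why few registers need a checkpoint is a small analysis over straight-line transaction code. The sketch below is an assumption-laden illustration, not the compiler pass used by the authors: `Instr` and `registersToCheckpoint` are hypothetical names, control flow inside the transaction is ignored, and memory operands are not modeled. A register needs saving only if its pre-transaction value is read and the register is later overwritten.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical instruction form: the registers it reads, and at most one
// destination register (-1 means the instruction writes no register).
struct Instr {
    std::vector<int> reads;
    int write;
};

// Returns the registers whose pre-transaction value is read inside the
// transaction and later overwritten. Only these need a checkpoint, because
// a retry must observe their original value again.
std::set<int> registersToCheckpoint(const std::vector<Instr>& tx) {
    std::set<int> written;     // registers already overwritten in the TX
    std::set<int> liveIn;      // registers read before any write in the TX
    std::set<int> checkpoint;
    for (const Instr& in : tx) {
        for (int r : in.reads) {
            if (!written.count(r)) liveIn.insert(r);
        }
        if (in.write >= 0) {
            if (liveIn.count(in.write)) checkpoint.insert(in.write);
            written.insert(in.write);
        }
    }
    return checkpoint;
}
```

For the slide's transaction (`LD r2,[B]`, `ADD r2,r2,2`, `ST r2,[A]`), r2 is written by the load before it is ever read, so the function returns an empty set.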


Hardware TM for GPUs
Challenge #3: Conflict Detection
Existing HTMs use cache coherence protocol
    Not available on GPUs
    No private data cache per thread
Signatures?
    1024-bit / thread
    3.8MB / 30k threads
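The signature storage figure can be checked with a short calculation, assuming "30k" means 30,000 threads and MB is a decimal megabyte (both are assumptions, since the slide does not say):

```latex
\[
\frac{1024\ \text{bits}}{8\ \text{bits/B}} \times 30{,}000\ \text{threads}
  = 128\ \text{B} \times 30{,}000
  = 3{,}840{,}000\ \text{B} \approx 3.8\ \text{MB}
\]
```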

Hardware TM for GPUs
Challenge #4: Write Buffer
[Figure: GPU core (SM) in which many warps share one L1 data cache.]
1024-1536 threads
Fermi’s L1 data cache (48kB) = 384 x 128B lines
Problem: 384 lines / 1536 threads < 1 line per thread!
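The line count and the per-thread share follow directly, assuming 1 kB = 1024 B here:

```latex
\[
\frac{48 \times 1024\ \text{B}}{128\ \text{B/line}} = 384\ \text{lines},
\qquad
\frac{384\ \text{lines}}{1536\ \text{threads}} = 0.25\ \text{lines per thread}
\]
```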


KILO TM: Value-Based Conflict Detection
TX1: atomic {B=A+1}
    TxBegin
    LD r1,[A]
    ADD r1,r1,1
    ST r1,[B]
    TxCommit
TX2: atomic {A=B+2}
    TxBegin
    LD r2,[B]
    ADD r2,r2,2
    ST r2,[A]
    TxCommit
Each transaction keeps a Read-Log and a Write-Log in private memory.
[Figure: global memory starts with A=1, B=0. TX1 logs read A=1 and write B=2; TX2 logs read B=0 and write A=2. After TX1 commits, global memory holds A=1, B=2, which no longer matches TX2's read-log.]
Self-Validation + Abort: only detects the existence of a conflict (not its identity)
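The read-log, write-log, and self-validation steps can be sketched in host-side C++. This is a minimal single-threaded model under stated assumptions, not the KILO TM hardware: `Memory`, `Tx`, `runTx1`, and `runTx2` are hypothetical names, and a `std::map` stands in for global memory.

```cpp
#include <cassert>
#include <map>
#include <string>

using Memory = std::map<std::string, int>;  // stand-in for global memory

// Per-transaction logs kept in private memory (names are illustrative).
struct Tx {
    Memory readLog;   // address -> value observed during execution
    Memory writeLog;  // address -> value to publish at commit

    int load(const Memory& global, const std::string& addr) {
        auto w = writeLog.find(addr);
        if (w != writeLog.end()) return w->second;  // read own buffered write
        int v = global.at(addr);
        readLog.emplace(addr, v);  // keep the first value seen
        return v;
    }
    void store(const std::string& addr, int v) { writeLog[addr] = v; }

    // Self-validation: re-read every logged address and compare values.
    bool validate(const Memory& global) const {
        for (const auto& entry : readLog) {
            if (global.at(entry.first) != entry.second) return false;
        }
        return true;
    }
    // Validate, then publish the write-log. Returns false on abort.
    bool commit(Memory& global) {
        if (!validate(global)) return false;
        for (const auto& entry : writeLog) global[entry.first] = entry.second;
        return true;
    }
};

// The two transactions from the slide: TX1 is B=A+1, TX2 is A=B+2.
Tx runTx1(const Memory& g) { Tx t; t.store("B", t.load(g, "A") + 1); return t; }
Tx runTx2(const Memory& g) { Tx t; t.store("A", t.load(g, "B") + 2); return t; }
```

Starting from A=1, B=0 and running both transactions before either commits, TX1's commit succeeds and TX2's fails validation because B changed. TX2 learns only that some conflict exists, then re-executes and commits A=4.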


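Parallel Validation?
[Figure: TX1 (read-log A=1, write-log B=2) and TX2 (read-log B=0, write-log A=2) validate against global memory (A=1, B=0) at the same time. Both pass and both commit, leaving A=2, B=2.]
Serial outcomes:
    Tx1 then Tx2: A=4, B=2
    OR
    Tx2 then Tx1: A=2, B=3
Data race!?!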
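The race can be replayed deterministically in a few lines of C++. This sketch is illustrative only: the function names are hypothetical, and the interleaving (both validations before either commit) is fixed by hand rather than produced by real threads.

```cpp
#include <cassert>

struct Memory { int A; int B; };

// Unsynchronized validation: both transactions validate before either
// commits, so each read-log still matches global memory.
Memory commitWithParallelValidation(Memory g) {
    int tx1ReadA = g.A, tx1WriteB = tx1ReadA + 1;  // TX1: B = A + 1
    int tx2ReadB = g.B, tx2WriteA = tx2ReadB + 2;  // TX2: A = B + 2
    bool tx1Valid = (g.A == tx1ReadA);
    bool tx2Valid = (g.B == tx2ReadB);
    if (tx1Valid) g.B = tx1WriteB;
    if (tx2Valid) g.A = tx2WriteA;
    return g;
}

// Reference results: the two serial orders.
Memory tx1ThenTx2(Memory g) { g.B = g.A + 1; g.A = g.B + 2; return g; }
Memory tx2ThenTx1(Memory g) { g.A = g.B + 2; g.B = g.A + 1; return g; }
```

From A=1, B=0 the parallel version ends with A=2, B=2, which matches neither serial order (A=4, B=2 or A=2, B=3).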


Serialize Validation?
[Figure: a commit unit sits in front of global memory. Over time, TX1 performs V + C while TX2 stalls, then TX2 performs V + C. V = validation, C = commit.]
Benefit #1: No data race
Benefit #2: No live lock
Drawback: Serializes non-conflicting transactions (“collateral damage”)

Solution: Speculative Validation
Key idea: split conflict detection into two parts
    1. Recently committed TX: in parallel
    2. Concurrently committing TX: in commit order (approximate)
Conflicts are rare, so commit parallelism is good.
[Figure: a commit unit in front of global memory. Over time, TX1, TX2, and TX3 each perform V+C with their read-sets (RS) checked; the V+C phases overlap and one transaction stalls. V = validation, C = commit.]
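The second part of the key idea, ordering only the transactions that actually interact, can be sketched as a hazard check. This is a loose illustration under assumptions: `CommittingTx` and `mustStall` are hypothetical names, and exact set lookups replace whatever approximate structure the hardware uses, which the slide does not detail.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// A transaction waiting at the commit unit (fields are illustrative).
struct CommittingTx {
    std::set<std::string> readSet;
    std::set<std::string> writeSet;
};

// Returns true when the transaction at position idx in commit order reads
// an address that an older, still-committing transaction will write. Only
// then must it wait; otherwise its validation can proceed in parallel.
bool mustStall(const std::vector<CommittingTx>& inCommitOrder, std::size_t idx) {
    assert(idx < inCommitOrder.size());
    for (std::size_t older = 0; older < idx; ++older) {
        for (const std::string& addr : inCommitOrder[idx].readSet) {
            if (inCommitOrder[older].writeSet.count(addr)) return true;
        }
    }
    return false;
}
```

With three transactions where only the third reads an address written by the first, the first two validate and commit in parallel and only the third stalls. Serialized validation would instead have made the second wait as well.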


KILO TM Implementation
[Figure: GPU block diagram. The CPU launches kernels onto SIMT cores, which connect through an interconnection network to memory partitions.
    SIMT core: thread blocks, register file, shared memory, SIMT stacks, constant cache, texture cache, L1 data cache, memory port, TX log unit.
    Memory partition: last-level cache bank, off-chip DRAM channel, atomic op. unit, commit unit.]
Minimal modification to existing GPU architecture.


Evaluation Methodology
GPGPU-Sim 3.0 (BSD license)
    Detailed: IPC correlation of 0.93 vs GT200
    KILO TM (timing-driven memory accesses)
GPU TM applications
    Hash Table (HT-H, HT-L)
    Bank Account (ATM)
    Cloth Physics (CL)
    Barnes Hut (BH)
    CudaCuts (CC)
    Data Mining (AP)


Performance (vs. Serializing TX)
Serializing TX ≈ coarse-grained locks
[Figure: bar chart of speedup over serializing TX (axis 0 to 700, higher is better) for HT-H, HT-L, ATM, CL, BH, CC, AP, and AVG. Series: Ideal TM, FG Lock. Two bars carry the labels 1.14 and 1.04.]

Performance (Exec. Time)
[Figure: bar chart of normalized execution time (axis 0 to 3, lower is better) for HT-H, HT-L, ATM, CL, BH, CC, and AP. Series: Ideal TM, KILO TM, FG Lock.]
Captures 59% of FG Lock performance


Implementation Complexity
Logs in private memory @ L1 data cache
Commit unit
    5kB last writer history unit
    19kB transaction status
    32kB read-set and write-set buffer
CACTI 5.3 @ 40nm
    0.40mm2 x 6 memory partitions
    0.5% of 520mm2
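The area figure can be reproduced from the numbers on the slide, assuming the three storage sizes belong to one commit unit and 0.40mm2 is the area per memory partition (the slide does not state either explicitly):

```latex
\[
5 + 19 + 32 = 56\ \text{kB per commit unit},
\qquad
0.40\ \text{mm}^2 \times 6 = 2.4\ \text{mm}^2,
\qquad
\frac{2.4\ \text{mm}^2}{520\ \text{mm}^2} \approx 0.46\% \approx 0.5\%
\]
```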


Summary
KILO TM: hardware TM for GPUs
    1000s of concurrent scalar TXs
    Handles scalar TX abort
    No cache coherence protocol dependency
    Word-level conflict detection
    Unbounded transactions
59% of fine-grained locking performance
128X faster than serializing TX execution
0.5% area overhead
Questions?


Backup Slides

ABA Problem?
Classic example: linked list based stack
Thread 0 – pop():
    while (true) {
        t = top;
        next = t->Next;
        // thread 2: pop A, pop B, push A
        if (atomicCAS(&top, t, next) == t) break; // succeeds!
    }
[Figure: stack states.
    Initially: top -> A -> B -> C -> Null. Thread 0 reads t = A, next = B.
    Thread 2 pops A: top -> B -> C -> Null.
    Thread 2 pops B: top -> C -> Null.
    Thread 2 pushes A: top -> A -> C -> Null.
    Thread 0's CAS succeeds: top -> B -> C -> Null, although B was already popped.]

ABA Problem?
atomicCAS protects only a single word
    Only part of the data structure
Value-based conflict detection protects all relevant parts of the data structure
[Figure: top -> A -> B -> C -> Null]
    while (true) {
        t = top;
        next = t->Next;
        if (atomicCAS(&top, t, next) == t) break; // succeeds!
    }
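The difference between checking one word and validating every value read can be replayed on a single host thread. This is a hedged sketch, not the slide's CUDA code: `Node` and `popWithInterleaving` are hypothetical names, `std::atomic` stands in for `atomicCAS`, and thread 2's actions are inlined at the point where the slide's comment places them.

```cpp
#include <atomic>
#include <cassert>

struct Node { char name; Node* next; };

// Replays the slide's interleaving on one thread so the outcome is
// deterministic. Returns what top points to after thread 0's attempt.
// When validateAllReads is true, thread 0 also re-checks the value it read
// from t->next, as value-based validation of the whole read-log would.
Node* popWithInterleaving(bool validateAllReads, bool* committed) {
    static Node c{'C', nullptr}, b{'B', &c}, a{'A', &b};
    a.next = &b; b.next = &c;          // reset: top -> A -> B -> C
    std::atomic<Node*> top{&a};

    Node* t = top.load();              // thread 0 reads top (A)
    Node* next = t->next;              // thread 0 reads A->next (B)

    top.store(&c);                     // thread 2: pop A, pop B
    a.next = &c; top.store(&a);        // thread 2: push A (A -> C)

    bool readsStillValid = (top.load() == t) && (t->next == next);
    if (validateAllReads && !readsStillValid) {
        *committed = false;            // conflict detected: abort and retry
        return top.load();
    }
    Node* expected = t;                // single-word check only
    *committed = top.compare_exchange_strong(expected, next);
    return top.load();
}
```

With `validateAllReads` set to false the compare-and-swap succeeds and `top` ends up at the already popped node B. With it set to true the changed `A->next` value is detected, nothing is committed, and `top` still points to A.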
