
Page 1:

Rahul Sharma (Stanford)

Michael Bauer (NVIDIA Research)

Alex Aiken (Stanford)

Verification of Producer-Consumer Synchronization in GPU Programs

June 15, 2015

Page 2:

Outline

GPU background

Motivating examples

Verification algorithm and implementation

Results

Page 3:

GPU background

[Diagram: GPU architecture. Off-chip global memory feeds on-chip streaming multiprocessors (SMs); each SM contains many ALUs and shared memory (up to 48 KB).]

Threadblock (CTA) ~ 100s of threads; warp = 32 threads

Typical kernel pattern:

Load data from global to shared

__syncthreads() barrier

Compute on data in shared

__syncthreads() barrier

Store data from shared to global

Page 4:

Named barriers

Synchronization primitive

Built into hardware

16 named barriers per SM

Two instructions

Sync: blocking

Arrive: non-blocking

Specify participating count

__syncthreads is a special case of named barriers: Sync 0, N

Encode producer-consumer patterns

[Diagram: producer-consumer via named barrier 0. The producer warp executes Arrive 0,64; the consumer warp executes Sync 0,64.]
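The sync/arrive semantics can be modeled in ordinary software. Below is a minimal Python sketch (my own model for illustration, not the hardware implementation): each named barrier has a fixed participating count; arrive registers an arrival and returns immediately, while sync registers an arrival and blocks until the current generation of the barrier completes.

```python
import threading

class NamedBarrier:
    """Toy model of a hardware named barrier with a participating count."""

    def __init__(self, count):
        self.count = count                # participating thread count
        self.arrived = 0                  # arrivals in the current generation
        self.generation = 0               # completed generations so far
        self.cond = threading.Condition()

    def arrive(self):
        """Non-blocking arrival: register and return immediately."""
        with self.cond:
            self._register()

    def sync(self):
        """Blocking arrival: register, then wait for the generation to trip."""
        with self.cond:
            gen = self.generation
            self._register()
            while self.generation == gen:
                self.cond.wait()

    def _register(self):
        self.arrived += 1
        if self.arrived == self.count:    # barrier trips: release all waiters
            self.arrived = 0
            self.generation += 1
            self.cond.notify_all()
```

With a count of 2, a producer thread that writes data and then calls arrive never blocks, while a consumer that calls sync before reading is guaranteed to observe the write, mirroring the warp pattern in the diagram above.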

Page 5:

CudaDMA library (SC 2011)

Simple library to abstract data movement between GPU memories

Global to shared

Shared to global

Specialize warps

Compute warps: do math

DMA warps: move data

Use named barriers to synchronize transfers

Use more named barriers for double buffering

[Diagram: double buffering with two named barriers.]

Compute warps: start_xfer (arrive 0,N); wait_finish (sync 1,N); compute on shared buffer; start_xfer (arrive 0,N); wait_finish (sync 1,N)

DMA warps: wait_start (sync 0,N); load data into shared buffer; finish_xfer (arrive 1,N); wait_start (sync 0,N); load data into shared buffer; finish_xfer (arrive 1,N)

Page 6:

Singe compiler (PPoPP 2014)

DSL compiler for combustion chemistry

Up to 4X speedup

Kernels contain 10K lines

Maps static dataflow graphs onto warps

Use shared memory for communication

Assign synchronization points to named barriers

Analogous to register allocation

Manage passing of data through shared memory

[Diagram: a static dataflow graph of tasks A–J mapped onto warps 0–3; the synchronization points between tasks are assigned to named barriers (0, 1, 2, 3).]

Page 7:

Named barrier challenges

Three challenges:

Named barrier reuse

Must prove that it is safe to recycle named barriers

Need happens-before relationship

Must be self-consistent

Deadlock

Shared memory races

Two accesses to the same location with at least one being a write

[Diagram: the warps/dataflow graph from the previous slide, repeated.]

Deadlock example:

Warp 0: sync 0; arrive 1

Warp 1: sync 1; arrive 0

Each warp blocks in its sync waiting for an arrival the other warp will never issue.
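A deadlock like this can be caught by statically emulating the thread programs. Here is a hedged Python sketch (the command representation and explicit participating counts are my simplifications, not WEFT's actual data structures): advance each program until it blocks in a sync; a barrier trips once its participating count of arrivals has accumulated; if no program can make progress while some program is stuck, the kernel deadlocks.

```python
def find_deadlock(programs, count):
    """programs: per-warp lists of ('sync', b) / ('arrive', b) commands.
    count[b]: participating count of named barrier b.
    Returns True if the emulation gets stuck (deadlock)."""
    n = len(programs)
    pc = [0] * n                          # next command per program
    pending = {b: 0 for b in count}       # arrivals since the barrier last tripped
    waiting = {b: set() for b in count}   # programs blocked in a sync on b
    progress = True
    while progress:
        progress = False
        for i in range(n):
            if any(i in w for w in waiting.values()):
                continue                  # still blocked in some sync
            while pc[i] < len(programs[i]):
                op, b = programs[i][pc[i]]
                pc[i] += 1
                pending[b] += 1           # both sync and arrive count as arrivals
                progress = True
                if op == "sync":
                    waiting[b].add(i)
                if pending[b] == count[b]:
                    pending[b] = 0        # barrier trips: release everyone
                    waiting[b].clear()
                if op == "sync" and i in waiting[b]:
                    break                 # blocked until the barrier trips
    unfinished = any(pc[i] < len(programs[i]) for i in range(n))
    return unfinished or any(waiting.values())
```

On the two-warp example above the emulation stalls immediately; swapping warp 0's commands (arrive 1 before sync 0) lets both barriers trip and the emulation run to completion.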

Page 8:

WEFT architecture

[Diagram: WEFT architecture. A GPU kernel is compiled into thread programs 0, 1, …, n, one per thread in the threadblock (n threads); WEFT builds a happens-before relation over them and reports deadlocks, improper barrier recycling, and shared memory data races.]

Page 9:

Thread programs

Omit statements irrelevant to properties

Straight line programs: sequences of commands

Commands:

sync b [m]

arrive b [m]

read a

write a

Restrictive, but followed by the majority of GPU code

Page 10:

Well synchronization

“Synchronization pattern is deterministic”

Same commands synchronize, no double duty

Obey generations

Subsumes deadlock freedom and safe recycling

Producer: sync 0; write a; arrive 1; sync 0

Consumer: sync 0; sync 1; read a; sync 0

Generation 1 of barrier 0: the first sync 0 in each warp

Generation 1 of barrier 1: the producer's arrive 1 and the consumer's sync 1

Generation 2 of barrier 0: the final sync 0 in each warp

Page 11:

Check well synchronization

Need to know:

Which commands synchronize together

What is the generation of the corresponding barrier

First challenge: how to infer this information?

Generations are invariant over all executions

Statically emulate one execution

Record synchronization

Check that all executions respect the generations
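One way to realize "statically emulate one execution, record synchronization" in code. A Python sketch under my own simplified representation (straight-line programs of sync/arrive commands with explicit participating counts; read/write commands are omitted here): advance the programs round-robin, and each time a barrier collects its participating count of arrivals, record that set of commands as one generation of the barrier.

```python
def record_generations(programs, count):
    """programs: per-warp lists of ('sync', b) / ('arrive', b) commands.
    count[b]: participating count of named barrier b.
    Returns the generations in completion order, as (barrier, participants)
    with participants a tuple of (warp, command_index) positions."""
    pc = [0] * len(programs)
    pending = {b: [] for b in count}     # arrivals since the barrier last tripped
    blocked = [False] * len(programs)
    generations = []
    progress = True
    while progress:
        progress = False
        for i, prog in enumerate(programs):
            if blocked[i]:
                continue
            while pc[i] < len(prog):
                op, b = prog[pc[i]]
                pending[b].append((i, pc[i]))
                pc[i] += 1
                progress = True
                if len(pending[b]) == count[b]:       # one generation completes
                    generations.append((b, tuple(pending[b])))
                    for j, _ in pending[b]:
                        blocked[j] = False            # release its waiters
                    pending[b] = []
                elif op == "sync":
                    blocked[i] = True                 # wait for this generation
                    break
    return generations
```

On the producer/consumer example from the previous slide (barrier commands only) this yields generation 1 of barrier 0, then generation 1 of barrier 1, then generation 2 of barrier 0, matching the annotations on that slide.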

Page 12:

Happens before

HB relation: reachability

A happens before B if there is a path from A to B

The path has at least one black edge

Check successive generations have HB relationship

Main result: HB relation is sound and precise

[Diagram: the producer/consumer example as a graph. Producer commands (sync 0; write a; arrive 1; sync 0) and consumer commands (sync 0; sync 1; read a; sync 0) are linked by program-order edges down each column; synchronization (black) edges connect the commands of each barrier generation: gen 1 of barrier 0 (the first sync 0 pair), gen 1 of barrier 1 (arrive 1 / sync 1), gen 2 of barrier 0 (the final sync 0 pair).]
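The HB-by-reachability check can be sketched concretely. This is my own formulation for illustration, and the edge construction is a plausible reading of the slide rather than code from the paper: program-order edges link consecutive commands of a warp; synchronization ("black") edges link each participant of a barrier generation to that generation's blocking syncs (an arrive does not wait, so nothing is ordered after it through the barrier). A happens before B iff some path from A to B crosses at least one black edge.

```python
from collections import defaultdict, deque

def happens_before(programs, generations):
    """programs: per-warp command lists; positions are (warp, index).
    generations: list of (barrier, participants) recorded from emulating
    one execution. Returns a predicate hb(a, b) over positions."""
    po = defaultdict(set)                 # program-order edges
    sy = defaultdict(set)                 # synchronization (black) edges
    for i, prog in enumerate(programs):
        for k in range(len(prog) - 1):
            po[(i, k)].add((i, k + 1))
    for _, parts in generations:
        for a in parts:
            for c in parts:
                # every participant is ordered into a blocking sync;
                # an arrive never waits, so it gets no incoming black edge
                if a != c and programs[c[0]][c[1]][0] == "sync":
                    sy[a].add(c)
    def hb(a, b):
        # BFS over (position, crossed-a-black-edge) states
        seen, q = set(), deque([(a, False)])
        while q:
            node, black = q.popleft()
            if (node, black) in seen:
                continue
            seen.add((node, black))
            if node == b and black:
                return True
            q.extend((m, black) for m in po.get(node, ()))
            q.extend((m, True) for m in sy.get(node, ()))
        return False
    return hb
```

On the producer/consumer example, the write of a is ordered before the read of a through barrier 1 (write a, then arrive 1, then a black edge into the consumer's sync 1, then read a), while the write is not ordered before the consumer's first sync 0.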

Page 13:

Data races

For every two commands that can race, check an HB relationship

Sound and complete for race detection
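The race check itself is then a pairwise scan. A small Python sketch (the access-list shape and the hb oracle interface are invented for illustration): flag every pair of same-location accesses from different warps, at least one of them a write, that the happens-before relation fails to order in either direction.

```python
def find_races(accesses, hb):
    """accesses: list of (warp, op, location) with op 'read' or 'write'.
    hb(x, y): True iff access x happens before access y (by list index).
    Returns the unordered conflicting pairs, i.e. the data races."""
    races = []
    for x in range(len(accesses)):
        for y in range(x + 1, len(accesses)):
            wx, opx, lx = accesses[x]
            wy, opy, ly = accesses[y]
            # conflicting: different warps, same location, at least one write
            conflict = wx != wy and lx == ly and "write" in (opx, opy)
            if conflict and not hb(x, y) and not hb(y, x):
                races.append((x, y))
    return races
```

In the producer/consumer example the write and read of a are ordered through barrier 1, so the pair is not reported; remove that ordering and the same pair is flagged as a race.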

Page 14:

Implementation

Naïve implementation does not scale

Extensive optimizations

Four orders of magnitude improvement

n: total commands across all thread programs

Memory:

Time:

Page 15:

Evaluation (Singe kernels)

Page 16:

Page 17:

Discovered bugs

Write-after-read

Benign data races

All kernels were well synchronized

Page 18:

Conclusion

GPUs are much more flexible than people realize

Can use GPUs in new ways with named barriers

Use of named barriers can create many complications

Deadlock, improper recycling, data races

Providing good software verification is important

Necessary to make named barriers easy to use

WEFT verifies code with named barriers

Algorithm is both sound and complete

Handles real production code efficiently

https://github.com/lightsighter/Weft