Irregular Parallelism on the GPU: Algorithms and Data Structures
John Owens, Associate Professor, Electrical and Computer Engineering
Institute for Data Analysis and Visualization; SciDAC Institute for Ultrascale Visualization
University of California, Davis


Page 1: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Irregular Parallelism on the GPU: Algorithms and Data Structures

John Owens
Associate Professor, Electrical and Computer Engineering
Institute for Data Analysis and Visualization; SciDAC Institute for Ultrascale Visualization

University of California, Davis

Page 2: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Outline

• Irregular parallelism

• Get a new data structure! (sparse matrix-vector)

• Reuse a good data structure! (composite)

• Get a new algorithm! (hash tables)

• Convert serial to parallel! (task queue)

• Concentrate on communication not computation! (MapReduce)

• Broader thoughts

Page 3: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Sparse Matrix-Vector Multiply

Page 4: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Sparse Matrix-Vector Multiply: What's Hard?

• Dense approach is wasteful

• Unclear how to map work to parallel processors

• Irregular data access

“Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors”, Bell & Garland, SC ’09

Page 5: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Sparse Matrix Formats

Structured: (DIA) Diagonal, (ELL) ELLPACK

Unstructured: (COO) Coordinate, (CSR) Compressed Row, (HYB) Hybrid

Page 6: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Diagonal Matrices

• Diagonals should be mostly populated

• Map one thread per row

• Good parallel efficiency

• Good memory behavior [column-major storage]

[Figure: example matrix with diagonals at offsets -2, 0, 1]

We've recently done work on tridiagonal systems [PPoPP '10].
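To make the thread-per-row mapping concrete, here is a minimal CUDA sketch of a DIA-format SpMV (my illustration in the spirit of Bell & Garland's kernels, not their code); it assumes a column-major num_rows x num_diags value array plus a per-diagonal offset array.

```cuda
// Hypothetical DIA SpMV sketch: y = A*x, one thread per row.
// data is column-major (num_rows * num_diags); offsets[d] is diagonal d's offset.
__global__ void spmv_dia(int num_rows, int num_cols, int num_diags,
                         const int *offsets, const float *data,
                         const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    float sum = 0.0f;
    for (int d = 0; d < num_diags; d++) {
        int col = row + offsets[d];
        if (col >= 0 && col < num_cols) {
            // Column-major layout: threads in a warp read consecutive rows,
            // so this load is coalesced -- the "good memory behavior" above.
            sum += data[d * num_rows + row] * x[col];
        }
    }
    y[row] = sum;
}
```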

Page 8: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Irregular Matrices: ELL

• Assign one thread per row again

• But now:

• Load imbalance hurts parallel efficiency

[Figure: ELL stores a dense num_rows x K column-index array (e.g. rows with indices {0,2}, {0,1,2,3,4,5}, {0,2,3}, {0,1,2}, {1,2,3,4,5}, {5}), with short rows padded to length K]
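For comparison, a minimal ELL SpMV sketch under the same assumptions (hypothetical names, not the paper's code): every thread iterates over the full padded row length K, so a single long row forces padding (wasted work) onto all the short rows.

```cuda
// Hypothetical ELL SpMV sketch: y = A*x, one thread per row.
// indices/data are column-major (num_rows * max_cols_per_row); padded slots
// carry a sentinel column index of -1 and a value of 0.
__global__ void spmv_ell(int num_rows, int max_cols_per_row,
                         const int *indices, const float *data,
                         const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    float sum = 0.0f;
    for (int n = 0; n < max_cols_per_row; n++) {
        int col = indices[n * num_rows + row];   // coalesced, column-major
        if (col >= 0)                            // skip padding entries
            sum += data[n * num_rows + row] * x[col];
    }
    y[row] = sum;
}
```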

Page 9: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Irregular Matrices: COO

• General format; insensitive to sparsity pattern, but ~3x slower than ELL

• Assign one thread per element, combine results from all elements in a row to get output element

• Requires segmented reduction, i.e. communication between threads

[Figure: COO stores an explicit (row, column) index for every nonzero, e.g. entries 00 02, 10 11 12 13 14 15, 20 22 23, 30 31 32, 41 42 43 44 45, 55 — no padding!]
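The deck's COO kernel assigns a thread per element and combines per-row results with a segmented reduction; as a simpler stand-in (not the method above), this sketch combines the per-element products with atomicAdd instead, assuming y is zero-initialized and the GPU supports floating-point atomics.

```cuda
// Hypothetical COO SpMV sketch: one thread per nonzero element.
// rows/cols/vals hold the (row, column, value) triple of each nonzero.
// NOTE: uses atomicAdd instead of the segmented reduction described above;
// y must be zeroed before launch.
__global__ void spmv_coo_atomic(int num_nonzeros,
                                const int *rows, const int *cols,
                                const float *vals, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_nonzeros) return;

    // Each element contributes vals[i] * x[cols[i]] to output row rows[i];
    // elements that share a row must combine their results, hence the atomic.
    atomicAdd(&y[rows[i]], vals[i] * x[cols[i]]);
}
```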

Page 10: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Thread-per-

Page 11: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Irregular Matrices: HYB

• Combine regularity of ELL + flexibility of COO

[Figure: the first few nonzeros of each row go into a regular ELL part ("typical" entries, e.g. column indices {0,2}, {0,1,2}, {0,2,3}, {0,1,2}, {1,2,3}, {5}), and the leftover entries (e.g. 13 14 15, 44 45) go into a COO part ("exceptional" entries)]

Page 12: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

SpMV: Summary

• Ample parallelism for large matrices

• Structured matrices (dense, diagonal): straightforward

• Sparse matrices: Tradeoff between parallel efficiency and load balance

• Take-home message: Use data structure appropriate to your matrix

• Insight: Irregularity is manageable if you regularize the common case

Page 13: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Compositing

Page 14: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

The Programmable Pipeline

Rasterization: Input Assembly → Vertex Shading → Tess. Shading → Geom. Shading → Raster → Frag. Shading → Compose

Reyes: Split → Dice → Shading → Sampling → Composition

Ray tracing: Ray Generation → Ray Traversal → Ray-Primitive Intersection → Shading

Page 15: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Composition

[Figure: Fragment Generation produces fragments (position, depth, color) at subpixel sample locations S1–S4 within a pixel; Composite blends them into subpixels (position, color), and Filter produces the final pixels]

Anjul Patney, Stanley Tzeng, and John D. Owens. Fragment-Parallel Composite and Filter. Computer Graphics Forum (Proceedings of the Eurographics Symposium on Rendering), 29(4):1251–1258, June 2010.

Page 16: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Why?

Page 17: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Pixel-Parallel

[Figure: pixels i and i+1, each with subpixels 0 and 1, processed one pixel per thread]

Page 18: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Sample-Parallel

[Figure: pixels i and i+1 with subpixels 0 and 1; a segmented scan composites the samples and a segmented reduction filters them, with segments delimiting subpixels and pixels]

Page 19: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

1M randomly-distributed

Page 20: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Nested Data Parallelism

[Figure: the sample-parallel composite (segmented scan + segmented reduction over subpixels) shown alongside the sparse matrix-vector multiply example — both are instances of nested data parallelism]

Page 21: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Hash Tables

Page 22: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Hash Tables & Sparsity

Lefebvre and Hoppe, Siggraph 2006

Page 23: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Scalar Hashing

[Figure: a key is hashed (#) to a table slot; collisions are resolved by linear probing, double probing (two hash functions #1, #2), or chaining]

Page 24: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Scalar Hashing: Parallel Problems

• Construction and lookup: variable time/work per entry

• Construction: synchronization / shared access to the data structure

Page 25: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Parallel Hashing: The Problem

• Hash tables are good for sparse data.

• Input: Set of key-value pairs to place in the hash table

• Output: Data structure that allows:

• Determining if key has been placed in hash table

• Given the key, fetching its value

• Could also:

• Sort key-value pairs by key (construction)

• Binary-search sorted list (lookup)

• Recalculate at every change

Page 26: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Parallel Hashing: What We Want

• Fast construction time

• Fast access time

• O(1) for any element, O(n) for n elements in parallel

• Reasonable memory usage

• Algorithms and data structures may sit at different places in this space

• Perfect spatial hashing has good lookup times and memory usage but is very slow to construct

Page 27: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Level 1: Distribute Keys into Buckets

• Hash function h distributes the keys into buckets; each key records its bucket id

• An atomic add per key yields that key's local offset within its bucket and, collectively, the bucket sizes (e.g. 8, 5, 6, 8)

• A prefix sum over the bucket sizes gives each bucket's global offset (e.g. 0, 8, 13, 19)

Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta. Real-Time Parallel Hashing on the GPU. Siggraph Asia 2009 (ACM TOG 28(5)), pp. 154:1–154:9, Dec. 2009.
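A minimal CUDA sketch of the counting step just described (my illustration, not the paper's code, with a placeholder hash function): each thread atomically increments its bucket's counter, and the value returned by the atomic is reused as the key's local offset.

```cuda
// Hypothetical phase-1 sketch: count keys per bucket and record each key's
// local offset within its bucket. bucket_counts must be zero-initialized.
__device__ unsigned int bucket_of(unsigned int key, unsigned int num_buckets)
{
    // Placeholder hash h(key); real constants are chosen per build attempt.
    return (key * 2654435761u) % num_buckets;
}

__global__ void count_buckets(int num_keys, const unsigned int *keys,
                              unsigned int num_buckets,
                              unsigned int *bucket_ids,
                              unsigned int *local_offsets,
                              unsigned int *bucket_counts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_keys) return;

    unsigned int b = bucket_of(keys[i], num_buckets);
    bucket_ids[i] = b;
    // atomicAdd returns the old count, which is this key's local offset.
    local_offsets[i] = atomicAdd(&bucket_counts[b], 1u);
}
// Afterwards, an exclusive scan over bucket_counts (e.g. 8,5,6,8 -> 0,8,13,19)
// gives each bucket's global offset, and key i lands at
// global_offsets[bucket_ids[i]] + local_offsets[i].
```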

Page 33: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Parallel Hashing: Level 1

• FKS (Level 1) is good for a coarse categorization

• Possible performance issue: atomics

• Bad for a fine categorization

• Space requirement for n elements to (probabilistically) guarantee no collisions is O(n²)

Page 34: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Hashing in Parallel

[Figure (build-up): four items hashed into a 4-slot table; hash function ha maps them to slots 1, 3, 2, 0, while hash function hb maps them to slots 1, 3, 2, 1 — so under hb two items collide]

Page 41: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Cuckoo Hashing

01

01

1

0

1

0

h1

1

1

0

0

h2 T1 T2

• Lookup procedure: in parallel, for each element:

• Calculate h1 & look in T1;

• Calculate h2 & look in T2; still O(1) lookup

Page 42: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Cuckoo Hashing

01

01

1

0

1

0

h1

1

1

0

0

h2 T1 T2

• Lookup procedure: in parallel, for each element:

• Calculate h1 & look in T1;

• Calculate h2 & look in T2; still O(1) lookup

Page 43: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Cuckoo Hashing

01

01

1

0

1

0

h1

1

1

0

0

h2 T1 T2

• Lookup procedure: in parallel, for each element:

• Calculate h1 & look in T1;

• Calculate h2 & look in T2; still O(1) lookup

Page 44: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Cuckoo Hashing

01

01

1

0

1

0

h1

1

1

0

0

h2 T1 T2

• Lookup procedure: in parallel, for each element:

• Calculate h1 & look in T1;

• Calculate h2 & look in T2; still O(1) lookup

Page 45: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Cuckoo Hashing

01

01

1

0

1

0

h1

1

1

0

0

h2 T1 T2

• Lookup procedure: in parallel, for each element:

• Calculate h1 & look in T1;

• Calculate h2 & look in T2; still O(1) lookup
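A minimal sketch of that two-probe lookup (my illustration; the real implementation packs key and value into one word and draws hash constants at build time).

```cuda
// Hypothetical cuckoo lookup: key is in T1 at h1(key) or in T2 at h2(key), or absent.
#define NOT_FOUND 0xffffffffu

__device__ unsigned int h1(unsigned int k, unsigned int n) { return (k * 2654435761u) % n; }
__device__ unsigned int h2(unsigned int k, unsigned int n) { return (k * 40503u + 5u) % n; }

__global__ void cuckoo_lookup(int num_queries, const unsigned int *queries,
                              const unsigned int *T1_keys, const unsigned int *T1_vals,
                              const unsigned int *T2_keys, const unsigned int *T2_vals,
                              unsigned int table_size, unsigned int *out_vals)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_queries) return;

    unsigned int k  = queries[i];
    unsigned int s1 = h1(k, table_size);
    if (T1_keys[s1] == k) { out_vals[i] = T1_vals[s1]; return; }   // probe T1
    unsigned int s2 = h2(k, table_size);
    if (T2_keys[s2] == k) { out_vals[i] = T2_vals[s2]; return; }   // probe T2
    out_vals[i] = NOT_FOUND;                                       // at most two probes
}
```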

Page 46: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Cuckoo Construction

• Level 1 created buckets of no more than 512 items

• Average: 409; probability of overflow: < 10⁻⁶

• Level 2: Assign each bucket to a thread block, construct cuckoo hash per bucket entirely within shared memory

• Semantic: Multiple writes to same location must have one and only one winner

• Our implementation uses 3 tables of 192 elements each (load factor: 71%)

• What if it fails? New hash functions & start over.

Page 47: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Timings on random voxel data

Page 48: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Why did we pick this approach?

• Alternative: single stage, cuckoo hash only

• Problem 1: Uncoalesced reads/writes to global memory (and may require atomics)

• Problem 2: Penalty for failure is large

Page 49: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

But it turns out …

• … this simpler approach is faster.

• Single hash table in global memory (1.4x data size), 4 different hash functions into it

• For each thread:

• Atomically exchange my item with the item in the hash table

• If my new item is empty, I'm done

• If my new item used hash function n, use hash function n+1 to reinsert

Thanks to expert reviewer (and current collaborator) Vasily Volkov!
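A hedged CUDA sketch of the insertion loop described above (my reconstruction, not the authors' code): key and value are packed into a 64-bit entry so one atomicExch swaps both, the hash-function constants are placeholders, and a retry limit stands in for "start over with new hash functions".

```cuda
// Hypothetical single-table cuckoo insertion with 4 hash functions.
// Each table entry packs (key << 32) | value; EMPTY_ENTRY is an all-ones sentinel.
#define EMPTY_ENTRY 0xffffffffffffffffull
#define MAX_ITER    100   // give up (host rebuilds with new hashes) after this many evictions

__device__ unsigned int hash_k(int which, unsigned int key, unsigned int table_size)
{
    // Placeholder hash family; real constants are drawn randomly per build attempt.
    const unsigned int a[4] = { 2654435761u, 40503u, 2246822519u, 3266489917u };
    const unsigned int b[4] = { 17u, 101u, 233u, 577u };
    return (a[which] * key + b[which]) % table_size;
}

__global__ void cuckoo_insert(int num_items, const unsigned int *keys,
                              const unsigned int *values,
                              unsigned long long *table, unsigned int table_size,
                              int *failed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_items) return;

    unsigned long long entry = ((unsigned long long)keys[i] << 32) | values[i];
    int which = 0;                             // start with hash function 0

    for (int iter = 0; iter < MAX_ITER; iter++) {
        unsigned int key  = (unsigned int)(entry >> 32);
        unsigned int slot = hash_k(which, key, table_size);

        // Atomically exchange my item with whatever occupies the slot.
        entry = atomicExch(&table[slot], entry);
        if (entry == EMPTY_ENTRY) return;      // my new item is empty: done

        // Evicted someone: find which hash function placed them in this slot,
        // then reinsert them with the next function in the sequence.
        unsigned int evicted_key = (unsigned int)(entry >> 32);
        which = 0;
        while (which < 4 && hash_k(which, evicted_key, table_size) != slot) which++;
        which = (which + 1) % 4;
    }
    *failed = 1;   // too many evictions: host restarts with new hash functions
}
```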

Page 50: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Hashing: Big Ideas

• Classic serial hashing techniques are a poor fit for a GPU.

• Serialization, load balance

• Solving this problem required a different algorithm

• Both hashing algorithms were new to the parallel literature

• Hybrid algorithm was entirely new

Page 51: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Task Queues

Page 52: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Representing Smooth Surfaces

• Polygonal meshes: lack view adaptivity; inefficient storage/transfer

• Parametric surfaces: collections of smooth patches; hard to animate

• Subdivision surfaces: widely popular, easy for modeling; flexible animation

[Boubekeur and Schlick 2007]

Page 53: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Parallel Traversal

Anjul Patney and John D. Owens. Real-Time Reyes-Style Adaptive Surface Subdivision. ACM Transactions on Graphics, 27(5):143:1–143:8, December 2008.


Page 56: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Cracks & T-Junctions

[Figure: uniform refinement vs. adaptive refinement; naive adaptive refinement produces cracks and T-junctions between patches at different subdivision levels]

Previous work handled this as a post-process after all steps; ours is inherent within the subdivision process

Page 60: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Recursive Subdivision is Irregular

[Figure: adjacent patches tagged with different subdivision levels (1s and 2s)]

Patney, Ebeida, and Owens. "Parallel View-Dependent Tessellation of Catmull-Clark Subdivision Surfaces". HPG '09.

Page 61: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)
Page 62: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)
Page 63: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Static Task List

[Figure: each SM consumes work from a static input list and appends newly generated work to an output list through an atomic pointer; the kernel is then restarted with the output list as the next input]

Daniel Cederman and Philippas Tsigas, On Dynamic Load Balancing on Graphics Processors. Graphics Hardware 2008, June 2008.

Page 64: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Private Work Queue

• Allocate private work queue of tasks per core

• Each core can add to or remove work from its local queue

• Cores mark self as idle if {queue exhausts storage, queue is empty}

• Cores periodically check global idle counter

• If global idle counter reaches threshold, rebalance work

gProximity: Fast Hierarchy Operations on GPU Architectures, Lauterbach, Mo, and Manocha, EG '10

Page 65: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Work Stealing & Donating

[Figure: one I/O deque per SM, each guarded by a lock]

• Cederman and Tsigas: Stealing == best performance and scalability (follows Arora CPU-based work)

• We showed how to do this with multiple kernels in an uberkernel and persistent-thread programming style

• We added donating to minimize memory usage

Page 66: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Ingredients for Our Scheme

Implementation questions that we need to address:

• What is the proper granularity for tasks? → Warp-Size Work Granularity

• How many threads to launch? → Persistent Threads

• How to avoid global synchronizations? → Uberkernels

• How to distribute tasks evenly? → Task Donation

Page 70: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Warp-Size Work Granularity

• Problem: We want to emulate task-level parallelism on the GPU without loss in efficiency.

• Solution: we choose block sizes of 32 threads / block.

• Removes messy synchronization barriers.

• Can view each block as a MIMD thread. We call these blocks "processors".

[Figure: each physical processor (SM) hosts several of these 32-thread processors P]

Page 75: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Uberkernel Processor Utilization

• Problem: Want to eliminate global kernel barriers for better processor utilization

• Uberkernels pack multiple execution routes into one kernel.

[Figure: instead of running Pipeline Stage 1 and Pipeline Stage 2 as separate kernels with data flowing between launches, a single uberkernel contains Stage 1 and Stage 2 and routes the data flow internally]
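A minimal sketch of the uberkernel idea (my illustration, not the paper's code): one kernel body contains both stages and branches on a per-task tag, so no host-side kernel barrier separates the stages.

```cuda
// Hypothetical uberkernel: one kernel body containing both pipeline stages.
// Each task carries a tag saying which stage should process it.
enum TaskType { STAGE1 = 0, STAGE2 = 1 };

struct Task {
    int type;    // which stage this task belongs to
    int data;    // payload (placeholder)
};

__device__ void run_stage1(const Task &t) { /* ... stage 1 work ... */ }
__device__ void run_stage2(const Task &t) { /* ... stage 2 work ... */ }

__global__ void uberkernel(const Task *tasks, int num_tasks)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_tasks) return;

    // Branch on the task tag instead of returning to the host between stages.
    switch (tasks[i].type) {
        case STAGE1: run_stage1(tasks[i]); break;
        case STAGE2: run_stage2(tasks[i]); break;
    }
}
```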

Page 80: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Persistent Thread Scheduler Emulation

• Problem: If the input is irregular, how many threads do we launch?

• Launch enough to fill the GPU, and keep them alive so they keep fetching work.

Life of a Thread: Spawn → Fetch Data → Process Data → Write Output → Death

Life of a Persistent Thread: Spawn → (Fetch Data → Process Data → Write Output, repeated) → Death

How do we know when to stop? When there is no more work left
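A hedged sketch of the persistent-threads pattern (my illustration, with a hypothetical global queue and no stealing, donation, or task generation): launch only enough 32-thread blocks to fill the machine and let each one loop, fetching work until none is left.

```cuda
// Hypothetical persistent-thread worker. queue holds num_tasks items; head is a
// global counter advanced with atomicAdd. Launch roughly (#SMs * blocks-per-SM)
// blocks of 32 threads rather than one thread (or block) per task.
struct Task { int data; };

__device__ void process(const Task &t, int lane) { /* ... 32-wide task work ... */ }

__global__ void persistent_worker(const Task *queue, int num_tasks, int *head)
{
    __shared__ int my_task;

    while (true) {
        // One thread per "processor" (32-thread block) grabs the next task index.
        if (threadIdx.x == 0)
            my_task = atomicAdd(head, 1);
        __syncthreads();

        if (my_task >= num_tasks)   // no more work left: the whole block exits
            return;

        process(queue[my_task], threadIdx.x);
        __syncthreads();            // all lanes finish before fetching the next task
    }
}
```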

Page 87: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Task Donation

• When a bin is full, a processor can give work to someone else.

[Figure: processors P1–P4 each with a work bin; the processor whose bin is full donates its work to someone else's bin]

• Smaller memory usage

• More complicated

Page 91: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Queuing Schemes

[Slide reproduces a page of Tzeng, Patney, and Owens, "Task Management for Irregular-Parallel Workloads on the GPU", including Figure 6: per-processor work processed (gray bars) and idle time (black line) for (a) Block Queuing, (b) Distributed Queuing, (c) Task Stealing, and (d) Task Donating]

From the reproduced text: We measure the performance of our scheduling routines by looking at several factors: utilization in terms of processor idle time, load balance in terms of work units processed per processor, and memory usage in terms of memory used versus donation. Processor idle time measures two things: how long a processor waits for a lock, and the number of iterations the processor is idle; the latter is a measure of processor utilization, for the less a processor is idle, the more work it is processing. We measure throughput as the total amount of work units processed in one run over the time it took to process it. To measure the memory-usage benefit of task donation, we examine its performance over varying bin sizes: the smaller the bin size, the more likely a processor will overfill its bin and have to donate, which implies a slower overall runtime.

With a single block queue, all processors process the same amount of work: block queuing has perfect load balancing, but at the cost of an alarming amount of wasted time due to high contention for the lock used to write data back to the head. Distributed queuing also has high idle time, for a different reason: it has no contention but wastes time due to load imbalance; most processors wait for only a few processors to finish, and only a single processor has no idle time. Task stealing and donation show similar performance characteristics, but with more than two orders of magnitude less idle time; stealing and donation are excellent methods to maintain high processor utilization.

As the workload becomes more and more irregular, a single block queue becomes significantly slower than task stealing or task donation, because more work units must be written back to the queue, creating high lock contention. Task donation also permits much lower bin sizes, which is important when queues need to remain in on-chip memory. As bin sizes are reduced, dynamic queue reallocation (implemented with an additional overflow buffer, since resizing queues on the GPU was not supported at the time) suffers significant performance overheads due to imbalance in bin sizes and the extra synchronization required by the overflow buffer; task donation needs no such synchronization, maintains balance in bin sizes, and subdivision time is therefore only slightly affected. The GPU implementation achieves real-time performance for most scenes tested, with interactive framerates for scenes with up to 16–30 jittered samples per pixel.

Figure 6 caption: Synthetic work processed as a bar graph in grey with processor idle time as a line graph in black. The two graphs on top have an idle-time scale a factor of 150 greater than the graphs on the bottom: there is much more idle time in the block queue due to lock contention, and much more idle time in distributed queuing due to all processors waiting for one processor to finish. With task stealing and task donation, there is much less idle time because processors are able to steal work to keep themselves busy.

Gray bars: Load balance
Black line: Idle time

Stanley Tzeng, Anjul Patney, and John D. Owens. Task Management for Irregular-Parallel Workloads on the GPU. In Proceedings of High Performance Graphics 2010, pages 29–37, June 2010.

Page 92: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Queuing Schemes

[Same Figure 6 as the previous slide, with the idle-time scales called out: roughly 25,000 idle iterations for block and distributed queuing vs. roughly 140–160 for task stealing and donating]

Gray bars: Load balance
Black line: Idle time

Page 93: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Smooth Surfaces, High Detail

16 samples per pixel; >15 frames per second on GeForce GTX 280

Page 94: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

MapReduce

Page 95: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

MapReduce

http://m.blog.hu/dw/dwbi/image/2009/Q4/mapreduce_small.png

Page 96: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Why MapReduce?

• Simple programming model

• Parallel programming model

• Scalable

• Previous GPU work: neither multi-GPU nor out-of-core

Page 97: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Block Diagram

[Figure (text partially garbled in this transcription): per-device pipelines for "GPU 0 and CPU 0", "GPU 1 and CPU 1", ... with stages Map + Partial Reduce, Partition, Bin, Sort, and Reduce, coordinated by Schedulers; a legend marks which stages run on the GPU and which run on the CPU]

Jeff A. Stuart and John D. Owens. Multi-GPU MapReduce on GPU Clusters. In Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium, May 2011.

Page 98: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Keys to Performance

[Same GPMR block diagram as the previous slide]

• Process data in chunks

• More efficient transmission & computation

• Also allows out of core

• Overlap computation and communication

• Accumulate

• Partial Reduce

Page 99: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Partial Reduce

[Same GPMR block diagram as before]

• User-specified function combines pairs with the same key

• Each map chunk spawns a partial reduction on that chunk

• Only good when cost of reduction is less than cost of transmission

• Good for ~larger numbers of keys

• Reduces GPU→CPU bandwidth

Page 100: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Accumulate

[Same GPMR block diagram as before]

• Mapper has explicit knowledge about the nature of its key-value pairs

• Each map chunk accumulates its key-value outputs with GPU-resident key-value accumulated outputs

• Good for a small number of keys

• Also reduces GPU→CPU bandwidth
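As a much-simplified illustration of accumulation (this is not GPMR's API): a word-occurrence-style mapper with a small, fixed key space can fold each chunk's output into a GPU-resident count array with atomicAdd, so only the compact accumulated result ever needs to cross the PCI-Express bus.

```cuda
// Hypothetical accumulate-style mapper: instead of emitting one (key, 1) pair per
// input element, each map chunk folds its output into a GPU-resident accumulator.
// counts[] persists on the GPU across chunks; only the final histogram is copied back.
__global__ void map_accumulate(const int *chunk, int chunk_size,
                               unsigned int *counts, int num_keys)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= chunk_size) return;

    int key = chunk[i] % num_keys;     // stand-in for the real "map" logic
    atomicAdd(&counts[key], 1u);       // accumulate into the resident output
}
```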

Page 101: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Benchmarks—Which

• Matrix Multiplication (MM)

• Word Occurrence (WO)

• Sparse-Integer Occurrence (SIO)

• Linear Regression (LR)

• K-Means Clustering (KMC)

• (Volume Renderer — presented @ MapReduce '10)

Jeff A. Stuart, Cheng-Kai Chen, Kwan-Liu Ma, and John D. Owens. Multi-GPU Volume Rendering using MapReduce. In HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing / MAPREDUCE '10: The First International Workshop on MapReduce and its Applications, pages 841–848, June 2010

Page 102: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Benchmarks—Why

• Needed to stress aspects of GPMR

• Unbalanced work (WO)

• Multiple emits/Non-uniform number of emits (LR, KMC, WO)

• Sparsity of keys (SIO)

• Accumulation (WO, LR, KMC)

• Many key-value pairs (SIO)

• Compute Bound Scalability (MM)

Page 103: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Benchmarks—Results

Page 104: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Benchmarks—Results

[Slide reproduces a page of the GPMR paper (Tables 1–4, Figure 2, and surrounding text); reconstructed below]

Table 1: Dataset sizes for all four benchmarks. We tested Phoenix against the first input set for SIO, KMC, LR, and the second set for WO. We test GPMR against all available input sets. [The table body is garbled in this transcription; input element sizes are — (MM), 4 bytes (SIO), 1 byte (WO), 16 bytes (KMC), 8 bytes (LR), with element counts in the millions.]

Table 2: Speedup for GPMR over Phoenix on our large (second-biggest) input data from our first set. The exception is MM, for which we use our small input set (Phoenix required almost twenty seconds to multiply two 1024x1024 matrices).

               MM       KMC     LR     SIO    WO
1-GPU Speedup  162.712  2.991   1.296  1.450  11.080
4-GPU Speedup  559.209  11.726  4.085  2.322  18.441

Table 3: Speedup for GPMR over Mars on 4096x4096 Matrix Multiplication, an 8M-point K-Means Clustering, and a 512 MB Word Occurrence. These sizes represent the largest problems that can meet the in-core memory requirements of Mars.

               MM      KMC      WO
1-GPU Speedup  2.695   37.344   3.098
4-GPU Speedup  10.760  129.425  11.709

Table 4: Lines of source code for three common benchmarks written in Phoenix, Mars, and GPMR. Setup code is excluded from all counts; for GPMR the counts include boilerplate class headers and C++ wrapper functions that invoke CUDA kernels. WO is large in GPMR because of the hashing its implementation requires.

          MM   KMC  WO
Phoenix   317  345  231
Mars      235  152  140
GPMR      214  129  397

From the surrounding text: GPMR, even in the one-GPU configuration, is faster on all benchmarks than either Phoenix or Mars, and shows good scalability to four GPUs. One significant benefit of MapReduce is its high level of abstraction: code sizes are small and development time is reduced, since the developer focuses on the algorithm rather than the low-level details of communication and scheduling. As a frame of reference, the lead author implemented and tested MM in GPMR in three hours, SIO in half an hour, KMC in two hours, LR in two hours, and WO in four hours; KMC, LR, and WO were later modified in about half an hour each to add Accumulation. Figure 2 (GPMR runtime breakdowns on the largest datasets) shows that each application exhibits different runtime characteristics, and that those characteristics change as the number of GPUs increases. While it is unrealistic to expect perfect scalability from all but the most compute-bound tasks, GPMR's minimal overhead and transfer costs position it well in comparison to other MapReduce implementations; GPMR also allows flexible mappings between threads and keys and customization of the MapReduce pipeline with additional communication-reducing stages while still providing sensible default implementations.

vs. CPU (Phoenix, Table 2)
vs. GPU (Mars, Table 3)

Page 105: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Benchmarks—Results

[Results charts (figures not included in this transcription); the 'Good' label marks the favorable direction]

Page 108: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

MapReduce Summary

• Time is right to explore diverse programming models on GPUs

• Few threads -> many, heavy threads -> light

• Lack of {bandwidth, GPU primitives for communication, access to NIC} are challenges

• What happens if GPUs get a lightweight serial processor?

• Future hybrid hardware is exciting!

Page 109: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Broader Thoughts

Page 110: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Nested Data Parallelism

[Figure: the sample-parallel composite (segmented scan + segmented reduction over subpixels) alongside the sparse matrix-vector multiply example — both nested data parallelism]

How do we generalize this? How do we abstract it? What are the design alternatives?

Page 111: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Hashing

Tradeoffs between construction time, access time, and space? Incremental data structures?

Page 112: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Work Stealing & Donating

[Figure: one I/O deque per SM, each guarded by a lock]

• Cederman and Tsigas: Stealing == best performance and scalability

• We showed how to do this with an uberkernel and persistent thread programming style

• We added donating to minimize memory usage

Generalize and abstract task queues? [structural pattern?]

Persistent threads / other sticky CUDA constructs?

Whole pipelines? Priorities / real-time?

Page 113: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Horizontal vs. Vertical

CPU stack: Applications · Higher-Level Algorithm/Data Structure Libraries · Programming System Primitives · Hardware

GPU stack (usual): App 1, App 2, App 3 built directly on Programming System Primitives · Hardware — little code reuse!

GPU stack (our goal): App 1, App 2, App 3 built on Primitive Libraries (Domain-specific, Algorithm, Data Structure, etc.) [e.g. the CUDPP library] on Programming System Primitives · Hardware

Page 114: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Energy Cost of Operations

Operation                           Energy (pJ)
64b Floating FMA (2 ops)            100
64b Integer Add                     1
Write 64b DFF                       0.5
Read 64b Register (64 x 32 bank)    3.5
Read 64b RAM (64 x 2K)              25
Read tags (24 x 2K)                 8
Move 64b 1mm                        6
Move 64b 20mm                       120
Move 64b off chip                   256
Read 64b from DRAM                  2000

From Bill Dally talk, September 2009

Page 115: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

13 Dwarfs

• 1. Dense Linear Algebra

• 2. Sparse Linear Algebra

• 3. Spectral Methods

• 4. N-Body Methods

• 5. Structured Grids

• 6. Unstructured Grids

• 7. MapReduce

• 8. Combinational Logic

• 9. Graph Traversal

• 10. Dynamic Programming

• 11. Backtrack and Branch-and-Bound

• 12. Graphical Models

• 13. Finite State Machines

Page 116: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)


Page 117: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

The Programmable Pipeline

[Pipeline stages as before: rasterization, Reyes, and ray tracing pipelines]

Bricks & mortar: how do we allow programmers to build stages without worrying about assembling them together?

Page 118: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

The Programmable Pipeline

[Pipeline stages as before]

Heterogeneity. [Irregular] Parallelism.

What does it look like when you build applications with GPUs as the primary processor?

Page 119: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Owens Group Interests

• Irregular parallelism

• Fundamental data structures and algorithms

• Computational patterns for parallelism

• Optimization: self-tuning and tools

• Abstract models of GPU computation

• Multi-GPU computing: programming models and abstractions

Page 120: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Thanks to ...

• Nathan Bell, Michael Garland, David Luebke, Sumit Gupta, Dan Alcantara, Anjul Patney, and Stanley Tzeng for helpful comments and slide material.

• Funding agencies: Department of Energy (SciDAC Institute for Ultrascale Visualization, Early Career Principal Investigator Award), NSF, Intel Science and Technology Center for Visual Computing, LANL, BMW, NVIDIA, HP, UC MICRO, Microsoft, ChevronTexaco, Rambus

Page 121: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

“If you were plowing a field, which would you rather use?

Two strong oxen or 1024 chickens?”

—Seymour Cray

Page 122: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)
Page 123: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Parallel Prefix Sum (Scan)

• Given an array A = [a0, a1, …, a_(n-1)] and a binary associative operator ⊕ with identity I,

• scan(A) = [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ a_(n-2))]

• Example: if ⊕ is addition, then scan on the set

• [3 1 7 0 4 1 6 3]

• returns the set

• [0 3 4 11 11 15 16 22]
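A minimal single-block CUDA sketch of exclusive scan (my illustration, using the simple Hillis-Steele schedule rather than the work-efficient algorithm used by real libraries such as CUDPP); it reproduces the example above.

```cuda
// Minimal sketch: exclusive (+) scan of up to blockDim.x elements in one block.
#include <cstdio>

__global__ void exclusive_scan_block(const int *in, int *out, int n)
{
    extern __shared__ int temp[];          // one element per thread
    int tid = threadIdx.x;

    // Load the input shifted right by one, with the identity (0 for +) in slot 0,
    // so an inclusive scan of temp equals an exclusive scan of the input.
    temp[tid] = (tid > 0 && tid < n) ? in[tid - 1] : 0;
    __syncthreads();

    // Hillis-Steele: each pass adds the value 'offset' slots to the left.
    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        int val = (tid >= offset) ? temp[tid - offset] : 0;
        __syncthreads();                   // all reads finish before any writes
        temp[tid] += val;
        __syncthreads();
    }
    if (tid < n) out[tid] = temp[tid];
}

int main()
{
    const int n = 8;
    int h_in[n] = {3, 1, 7, 0, 4, 1, 6, 3}, h_out[n];
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    exclusive_scan_block<<<1, n, n * sizeof(int)>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%d ", h_out[i]);  // 0 3 4 11 11 15 16 22
    printf("\n");
    return 0;
}
```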

Page 124: [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Segmented Scan

• Example: if ⊕ is addition, then scan on the set

• [3 1 7 | 0 4 1 | 6 3]

• returns the set

• [0 3 4 | 0 0 4 | 0 6]

• Same computational complexity as scan, but additionally have to keep track of segments (we use head flags to mark which elements are segment heads)

• Useful for nested data parallelism (quicksort)
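A host-side reference implementation of segmented exclusive scan with head flags (my illustration of the definition above; a GPU version would use a segmented-scan primitive such as the one in CUDPP).

```cuda
// Host-side reference for segmented exclusive (+) scan with head flags.
#include <cstdio>

void segmented_exclusive_scan(const int *in, const int *flags, int *out, int n)
{
    int running = 0;
    for (int i = 0; i < n; i++) {
        if (flags[i]) running = 0;   // head flag: restart the sum at the identity
        out[i] = running;            // exclusive: sum of prior elements in the segment
        running += in[i];
    }
}

int main()
{
    //                 [3 1 7 | 0 4 1 | 6 3]
    int in[8]    = {3, 1, 7, 0, 4, 1, 6, 3};
    int flags[8] = {1, 0, 0, 1, 0, 0, 1, 0};   // 1 marks the first element of a segment
    int out[8];
    segmented_exclusive_scan(in, flags, out, 8);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);  // 0 3 4 | 0 0 4 | 0 6
    printf("\n");
    return 0;
}
```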