Parallel Algorithms Patrick Cozzi University of Pennsylvania CIS 565 - Fall 2013


Announcements

Project 1
Due Thursday 09/19

Reminders
Commit often
Make a great README.md

Philly Transit Hackathon this weekend http://www.meetup.com/Code-for-America-Philly/events/136363492/


Review

SP, SM Kernel, thread, warp, block, grid


Agenda

Parallel Algorithms
Parallel Reduction
Scan
Stream Compaction
Summed Area Tables
Radix Sort

Parallel Reduction

Given an array of numbers, design a parallel algorithm to find the sum.

Consider: Arithmetic intensity: compute to memory access ratio


Parallel Reduction

Given an array of numbers, design a parallel algorithm to find:
The sum
The maximum value
The product of values
The average value

How different are these algorithms?


Parallel Reduction

Reduction: An operation that computes a single result from a set of data

Examples:
Minimum/maximum value
Average, sum, product, etc.

Parallel Reduction: Do it in parallel. Obviously.

Parallel Reduction

Example. Find the sum:

0 1 2 3 4 5 6 7    input
1 5 9 13           after pass 1
6 22               after pass 2
28                 after pass 3

Parallel Reduction

Similar to brackets for a basketball tournament
log2(n) passes for n elements
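The pairwise passes of the reduction example can be sketched as a sequential Python simulation. This is a minimal illustration, not a GPU implementation: the function name `parallel_reduce` and its `op` parameter are hypothetical, the input is assumed to be non-empty, and each list comprehension stands in for one pass in which every pair would be handled by its own thread.

```python
def parallel_reduce(values, op):
    """Tree reduction: each pass combines adjacent pairs, roughly halving the data."""
    data = list(values)
    passes = 0
    while len(data) > 1:
        # On a GPU, every pair in this pass would be combined by a separate thread.
        nxt = [op(data[i], data[i + 1]) for i in range(0, len(data) - 1, 2)]
        if len(data) % 2 == 1:
            nxt.append(data[-1])  # an unpaired last element carries over unchanged
        data = nxt
        passes += 1
    return data[0], passes


total, passes = parallel_reduce([0, 1, 2, 3, 4, 5, 6, 7], lambda a, b: a + b)
print(total, passes)  # 28 3
```

Sum, maximum, and product differ only in the operator passed as `op`; an average can be taken as the sum divided by the element count.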

All-Prefix-Sums

Input:
Array of n elements: $[a_0, a_1, \ldots, a_{n-1}]$
Binary associative operator: $\oplus$
Identity: $I$

Outputs the array: $[I, a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-2})]$

Images from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

All-Prefix-Sums

Example: if $\oplus$ is addition, the array

[3 1 7 0 4 1 6 3]

is transformed to

[0 3 4 11 11 15 16 22]

Seems sequential, but there is an efficient parallel solution

Scan

Scan: all-prefix-sums operation on an array of data

Exclusive Scan (Prescan): Element j of the result does not include element j of the input:

In: [3 1 7 0 4 1 6 3]
Out: [0 3 4 11 11 15 16 22]

Inclusive Scan: All elements up to and including j are summed:

In: [3 1 7 0 4 1 6 3]
Out: [3 4 11 11 15 16 22 25]

Scan

How do you generate an exclusive scan from an inclusive scan?

Input: [3 1 7 0 4 1 6 3]
Inclusive: [3 4 11 11 15 16 22 25]
Exclusive: [0 3 4 11 11 15 16 22]

// Shift right, insert identity

How do you go in the opposite direction?
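Both directions of the conversion can be sketched in a few lines of Python. The function names are hypothetical, and addition with identity 0 is assumed as the default operator. Going from exclusive to inclusive here combines each exclusive result with the matching input element, which is one possible answer to the question on the slide.

```python
def exclusive_from_inclusive(inclusive, identity=0):
    # Shift right by one position and insert the identity at the front.
    return [identity] + list(inclusive[:-1])


def inclusive_from_exclusive(exclusive, values, op=lambda a, b: a + b):
    # Combine each exclusive result with the input element at the same index.
    return [op(e, v) for e, v in zip(exclusive, values)]


values = [3, 1, 7, 0, 4, 1, 6, 3]
inclusive = [3, 4, 11, 11, 15, 16, 22, 25]
exclusive = exclusive_from_inclusive(inclusive)
print(exclusive)                                    # [0, 3, 4, 11, 11, 15, 16, 22]
print(inclusive_from_exclusive(exclusive, values))  # [3, 4, 11, 11, 15, 16, 22, 25]
```

Each output element depends on only one or two input elements, so both conversions are element-wise and can run with one thread per element.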

Scan

Use cases
Stream compaction
Summed-area tables for variable width image processing
Radix sort
…

Scan

Used to convert certain sequential computations into equivalent parallel computations

Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

Scan

Design a parallel algorithm for exclusive scan
In: [3 1 7 0 4 1 6 3]
Out: [0 3 4 11 11 15 16 22]

Consider: Total number of additions

Scan

Sequential Scan: single thread, trivial

n adds for an array of length n
How many adds will our parallel version have?

Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
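The code shown on the slide is an image that did not survive extraction. As a stand-in, here is a minimal Python sketch of a single-threaded exclusive scan, assuming addition with identity 0; the function name is hypothetical.

```python
def sequential_exclusive_scan(values, identity=0):
    out = []
    running = identity
    for v in values:
        out.append(running)    # result j excludes element j of the input
        running = running + v  # one add per input element
    return out


print(sequential_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```

Each iteration depends on the running total from the previous one, which is why the loop looks inherently sequential.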

Scan

Naive Parallel Scan

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k - 2^(d-1)] + x[k];

Is this exclusive or inclusive?
Each thread
Writes one sum
Reads two values

Image from http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf

Scan

Naive Parallel Scan: walkthrough. Recall, each pass runs in parallel!

0 1 2 3 4 5 6 7        input
0 1 3 5 7 9 11 13      after d = 1, 2^(d-1) = 1
0 1 3 6 10 14 18 22    after d = 2, 2^(d-1) = 2
0 1 3 6 10 15 21 28    after d = 3, 2^(d-1) = 4 (final)

Consider only k = 7:
d = 2: x[7] = x[5] + x[7] = 9 + 13 = 22
d = 3: x[7] = x[3] + x[7] = 6 + 22 = 28

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k - 2^(d-1)] + x[k];
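The naive scan passes can be simulated sequentially in Python to check the walkthrough and to count additions. This is a sketch under stated assumptions: the name `naive_scan` is hypothetical, and each pass reads from one buffer and writes to a fresh one so that every read sees the values from before the pass. That double buffering is an assumption about what "for all k in parallel" means and is not spelled out in the pseudocode.

```python
def naive_scan(values):
    x = list(values)
    n = len(x)
    adds = 0
    offset = 1  # offset plays the role of 2^(d-1)
    while offset < n:
        nxt = list(x)               # write buffer; x stays read-only during the pass
        for k in range(offset, n):  # stands in for "for all k in parallel"
            nxt[k] = x[k - offset] + x[k]
            adds += 1
        x = nxt
        offset *= 2
    return x, adds


result, adds = naive_scan([0, 1, 2, 3, 4, 5, 6, 7])
print(result)  # [0, 1, 3, 6, 10, 15, 21, 28]
print(adds)    # 17
```

For 8 elements the three passes perform 7 + 6 + 4 = 17 additions, more than a sequential scan needs, and the result includes element k in position k.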

Stream Compaction

Given an array of elements
Create a new array with elements that meet a certain criterion, e.g. non null
Preserve order

a b c d e f g h    input
a c d g            output

Stream Compaction

Used in path tracing, collision detection, sparse matrix compression, etc.
Can reduce bandwidth from GPU to CPU

Stream Compaction

Step 1: Compute temporary array containing
1 if corresponding element meets criteria
0 if element does not meet criteria

a b c d e f g h    input
1 0 1 1 0 0 1 0    temporary array

It runs in parallel!

Stream Compaction

Step 2: Run exclusive scan on temporary array

a b c d e f g h    input
1 0 1 1 0 0 1 0    temporary array
0 1 1 2 3 3 3 4    scan result

Scan runs in parallel
What can we do with the results?

Stream Compaction

Step 3: Scatter
Result of scan is index into final array
Only write an element if temporary array has a 1

a b c d e f g h    input
1 0 1 1 0 0 1 0    temporary array
0 1 1 2 3 3 3 4    scan result
a c d g            final array (indices 0 1 2 3)

Scatter runs in parallel!
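The three compaction steps (flag, exclusive scan, scatter) can be sketched together in Python. The function name `compact` and the `keep` predicate are hypothetical, and the loops are sequential stand-ins for steps that would each run in parallel on a GPU.

```python
def compact(values, keep):
    # Step 1: temporary array of 1/0 flags.
    flags = [1 if keep(v) else 0 for v in values]

    # Step 2: exclusive scan of the flags gives each kept element its output index.
    indices = []
    running = 0
    for f in flags:
        indices.append(running)
        running += f

    # Step 3: scatter. Only elements whose flag is 1 are written.
    out = [None] * running  # running now holds the number of kept elements
    for v, f, i in zip(values, flags, indices):
        if f:
            out[i] = v
    return flags, indices, out


flags, indices, out = compact(list("abcdefgh"), lambda v: v in "acdg")
print(flags)    # [1, 0, 1, 1, 0, 0, 1, 0]
print(indices)  # [0, 1, 1, 2, 3, 3, 3, 4]
print(out)      # ['a', 'c', 'd', 'g']
```

Because every kept element has a distinct index, no two writes in the scatter step collide, which is what allows it to run in parallel.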

Summed Area Table

Summed Area Table (SAT): 2D table where each element stores the sum of all elements in an input image between the lower left corner and the entry location.

Summed Area Table

Input image:

1 1 0 2
1 2 1 0
0 1 2 0
2 1 0 0

SAT:

1 2 2 4
2 5 6 8
2 6 9 11
4 9 12 14

Example: (1 + 1 + 0) + (1 + 2 + 1) + (0 + 1 + 2) = 9

Summed Area Table

Benefit
Used to perform different width filters at every pixel in the image in constant time per pixel
Just sample four pixels in SAT:

Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
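The four-sample formula on the slide is an image that did not survive extraction. As a hedged stand-in, the Python sketch below uses the standard inclusion-exclusion over the four corners of a rectangle. The function name `box_sum`, the inclusive `(r0, c0)`..`(r1, c1)` indexing, and the convention that the first listed row of the table sits at the SAT origin are all assumptions made for illustration.

```python
def box_sum(sat, r0, c0, r1, c1):
    """Sum of the input over rows r0..r1 and columns c0..c1 (inclusive)."""
    def at(r, c):
        # Samples outside the table (before the origin) contribute zero.
        return sat[r][c] if r >= 0 and c >= 0 else 0

    return at(r1, c1) - at(r0 - 1, c1) - at(r1, c0 - 1) + at(r0 - 1, c0 - 1)


sat = [
    [1, 2, 2, 4],
    [2, 5, 6, 8],
    [2, 6, 9, 11],
    [4, 9, 12, 14],
]
total = box_sum(sat, 1, 1, 2, 2)
print(total)      # 6
print(total / 4)  # 1.5, the box-filter average over the 2x2 region
```

The cost is four lookups regardless of the rectangle size, which is why the filter width can vary per pixel at constant cost.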

Summed Area Table

Uses
Approximate depth of field
Glossy environment reflections and refractions

Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

Summed Area Table

Starting from the input image, the SAT entries are filled in one at a time, row by row (1, 2, 2, 4, then 2, 5, 6, 8, ...), until the SAT is complete.

Summed Area Table

How would you implement this on the GPU?

How would you compute a SAT on the GPU using inclusive scan?

Summed Area Table

Step 1 of 2: One inclusive scan for each row

Input image:

1 1 0 2
1 2 1 0
0 1 2 0
2 1 0 0

Partial SAT:

1 2 2 4
1 3 4 4
0 1 3 3
2 3 3 3

Step 2 of 2: One inclusive scan for each column, bottom to top

Final SAT:

1 2 2 4
2 5 6 8
2 6 9 11
4 9 12 14
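The two-step construction can be sketched in Python with `itertools.accumulate` standing in for an inclusive scan. The function name is hypothetical, and the column scans start from the first listed row, which assumes that row is the one at the SAT origin (the bottom row of the image on the slide).

```python
from itertools import accumulate


def summed_area_table(image):
    # Step 1 of 2: one inclusive scan per row (rows are independent of each other).
    partial = [list(accumulate(row)) for row in image]
    # Step 2 of 2: one inclusive scan per column of the partial result.
    columns = [list(accumulate(col)) for col in zip(*partial)]
    final = [list(row) for row in zip(*columns)]  # transpose back to rows
    return partial, final


image = [
    [1, 1, 0, 2],
    [1, 2, 1, 0],
    [0, 1, 2, 0],
    [2, 1, 0, 0],
]
partial, final = summed_area_table(image)
print(partial)  # [[1, 2, 2, 4], [1, 3, 4, 4], [0, 1, 3, 3], [2, 3, 3, 3]]
print(final)    # [[1, 2, 2, 4], [2, 5, 6, 8], [2, 6, 9, 11], [4, 9, 12, 14]]
```

All row scans in step 1 are independent, as are all column scans in step 2, so each step can run many scans at once.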

Radix Sort

Efficient for small sort keys
k-bit keys require k passes

Radix Sort

Each radix sort pass partitions its input based on one bit

First pass starts with the least significant bit (LSB). Subsequent passes move towards the most significant bit (MSB)

[Figure: the key 010 with its MSB and LSB labeled]

Radix Sort

Example from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

100 111 010 110 011 101 001 000    example input
100 010 110 000 111 011 101 001    first pass: partition based on LSB (LSB == 0, then LSB == 1)
100 000 101 001 010 110 111 011    second pass: partition based on middle bit (bit == 0, then bit == 1)
000 001 010 011 100 101 110 111    final pass: partition based on MSB (MSB == 0, then MSB == 1)

Completed, in decimal:

4 7 2 6 3 5 1 0
4 2 6 0 7 3 5 1
4 0 5 1 2 6 7 3
0 1 2 3 4 5 6 7

Parallel Radix Sort

Where is the parallelism?

Parallel Radix Sort

1. Break input arrays into tiles
   Each tile fits into shared memory for an SM
2. Sort tiles in parallel with radix sort
3. Merge pairs of tiles using a parallel bitonic merge until all tiles are merged.

Our focus is on Step 2

Parallel Radix Sort

Where is the parallelism?
Each tile is sorted in parallel
Where is the parallelism within a tile?
Each pass is done in sequence after the previous pass. No parallelism
Can we parallelize an individual pass? How?
Merge also has parallelism

Parallel Radix Sort

Implement split. Given:

Array, i, at pass n:
100 111 010 110 011 101 001 000

Array, b, which is true/false for bit n:
0 1 0 0 1 1 1 0

Output array with false keys before true keys:
100 010 110 000 111 011 101 001

Parallel Radix Sort

Step 1: Compute e array (1 where b is false, 0 where b is true)

100 111 010 110 011 101 001 000    i array
0 1 0 0 1 1 1 0                    b array
1 0 1 1 0 0 0 1                    e array

Step 2: Exclusive Scan e

0 1 1 2 3 3 3 3                    f array

Step 3: Compute totalFalses

totalFalses = e[n - 1] + f[n - 1]
totalFalses = 1 + 3
totalFalses = 4

Step 4: Compute t array

t[i] = i - f[i] + totalFalses

t[0] = 0 - f[0] + totalFalses = 0 - 0 + 4 = 4
t[1] = 1 - f[1] + totalFalses = 1 - 1 + 4 = 4
t[2] = 2 - f[2] + totalFalses = 2 - 1 + 4 = 5

4 4 5 5 5 6 7 8                    t array

Step 5: Scatter based on address d

d[i] = b[i] ? t[i] : f[i]

0 4 1 2 5 6 7 3                    d
100 010 110 000 111 011 101 001    output
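The five split steps can be sketched in Python, keeping the array names b, e, f, t, and d from the slides. The function name `split_pass`, the `bit` parameter, and the dictionary of intermediates are hypothetical; keys are assumed to be non-negative integers in a non-empty list, and the loops are sequential stand-ins for per-element parallel work.

```python
def split_pass(keys, bit):
    n = len(keys)
    b = [(key >> bit) & 1 for key in keys]  # 1 where the chosen bit is set
    e = [1 - x for x in b]                  # Step 1: 1 where b is false

    f = []                                  # Step 2: exclusive scan of e
    running = 0
    for x in e:
        f.append(running)
        running += x

    total_falses = e[n - 1] + f[n - 1]      # Step 3
    t = [i - f[i] + total_falses for i in range(n)]   # Step 4
    d = [t[i] if b[i] else f[i] for i in range(n)]    # Step 5: destination address

    out = [None] * n
    for i in range(n):
        out[d[i]] = keys[i]                 # scatter
    return out, {"b": b, "e": e, "f": f, "t": t, "d": d}


keys = [0b100, 0b111, 0b010, 0b110, 0b011, 0b101, 0b001, 0b000]
out, arrays = split_pass(keys, 0)
print(arrays["d"])                       # [0, 4, 1, 2, 5, 6, 7, 3]
print([format(k, "03b") for k in out])
# ['100', '010', '110', '000', '111', '011', '101', '001']
```

Keys with the bit clear keep their relative order, and so do keys with the bit set, so the split is stable.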

Parallel Radix Sort

Given k-bit keys, how do we sort using our new split function?

Once each tile is sorted, how do we merge tiles to provide the final sorted array?
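One possible answer to the first question is to apply split once per bit, starting at the LSB. The Python sketch below is a minimal illustration for a single tile: the names `split` and `radix_sort` and the parameter `k` (number of key bits) are hypothetical, the tile is assumed to be non-empty, and `itertools.accumulate` stands in for the scan. Merging tiles is not covered here.

```python
from itertools import accumulate


def split(keys, bit):
    b = [(key >> bit) & 1 for key in keys]
    e = [1 - x for x in b]
    f = [0] + list(accumulate(e))[:-1]  # exclusive scan of e
    total_falses = e[-1] + f[-1]
    out = [None] * len(keys)
    for i, key in enumerate(keys):
        # Destination: t[i] = i - f[i] + totalFalses if the bit is set, else f[i].
        dest = i - f[i] + total_falses if b[i] else f[i]
        out[dest] = key
    return out


def radix_sort(keys, k):
    for bit in range(k):  # k passes, LSB first
        keys = split(keys, bit)
    return keys


print(radix_sort([4, 7, 2, 6, 3, 5, 1, 0], 3))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The passes themselves still run one after another; the parallelism is inside each call to `split`.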


Summary

Parallel reduction, scan, and sort are building blocks for many algorithms

An understanding of parallel programming and GPU architecture yields efficient GPU implementations
