
Parallel Processing

Chapter 9

• Problem:
  – Branches, cache misses, and dependencies limit the available instruction-level parallelism (ILP)

• Solution:
  – Divide the program into parts
  – Run each part on a separate CPU of a larger machine

Motivations

• Desktops are incredibly cheap
  – Instead of building a custom high-performance uniprocessor, hook up 100 desktops

• Squeezing out more ILP is difficult
  – More complexity and power are required each time
  – Would require a change in cooling technology

Challenges

• Parallelizing code is not easy
  – Languages, software engineering, and software verification issues – beyond the scope of this class

• Communication can be costly
  – Performance analysis here ignores caches; in reality these costs are much higher

• Requires hardware support
  – Multiple processes modifying the same data cause race conditions, and out-of-order processors arbitrarily reorder operations

Performance - Speedup

• Amdahl's Law
• 70% of the program is parallelizable
• What is the highest speedup possible?
  – 1 / (0.30 + 0.70/∞) = 1 / 0.30 = 3.33
• What is the speedup with 100 processors?
  – 1 / (0.30 + 0.70/100) = 1 / 0.307 = 3.26
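Not from the slides: a small C sketch of the same arithmetic, where amdahl_speedup is a hypothetical helper taking the parallel fraction p and the processor count n.

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - p) + p / n) */
double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    printf("limit (huge n): %.2f\n", amdahl_speedup(0.70, 1e9)); /* ~3.33 */
    printf("n = 100:        %.2f\n", amdahl_speedup(0.70, 100)); /* ~3.26 */
    return 0;
}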

Taxonomy

• SISD – single instruction, single data
  – uniprocessor

• SIMD – single instruction, multiple data
  – vector machines, MMX extensions, graphics cards

• MISD – multiple instruction, single data
  – never built – pipeline architectures? streaming apps?

• MIMD – multiple instruction, multiple data
  – most multiprocessors
  – cheap, flexible

SIMD

[Diagram: a single controller broadcasts instructions to an array of processing elements (P), each paired with its own data (D).]

• The controller fetches instructions
• All processors execute the same instruction
• Conditional instructions are the only way to get variation
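As a concrete illustration (not from the slides), here is a minimal SIMD sketch in C using SSE intrinsics; it assumes an x86 compiler with <immintrin.h>, and add_arrays is a made-up helper. A single _mm_add_ps instruction operates on four floats at once – one instruction, multiple data.

#include <immintrin.h>   /* SSE intrinsics */
#include <stdio.h>

void add_arrays(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);           /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);           /* load 4 floats from b */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));  /* 4 adds in one instruction */
    }
    for (; i < n; i++)                             /* scalar cleanup for the tail */
        c[i] = a[i] + b[i];
}

int main(void)
{
    float a[8] = {1,2,3,4,5,6,7,8}, b[8] = {8,7,6,5,4,3,2,1}, c[8];
    add_arrays(a, b, c, 8);
    printf("%.0f %.0f\n", c[0], c[7]);             /* prints 9 9 */
    return 0;
}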


Example

• Sum the elements in A[] and place the result in sum

int sum = 0;
int i;
for (i = 0; i < n; i++)
    sum = sum + A[i];

Parallel version – Shared Memory

int A[NUM];
int numProcs;
int sum;
int sumArray[numProcs];            /* pseudocode: one partial-sum slot per processor */

myFunction( /* input arguments */ )
{
    int myNum = ...;               /* this processor's number (rank) */
    int mySum = 0;
    int i;
    for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
        mySum += A[i];
    sumArray[myNum] = mySum;
    barrier();                     /* wait until every processor has written its slot */
    if (myNum == 0) {
        for (i = 0; i < numProcs; i++)
            sum += sumArray[i];
    }
}
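A runnable version of the same idea, offered as a sketch rather than the slides' code: it assumes POSIX threads with barrier support, and the constants NUM and NUMPROCS plus the sample data in main are made up for illustration.

#include <pthread.h>
#include <stdio.h>

#define NUM      1000
#define NUMPROCS 4

int A[NUM];
int sumArray[NUMPROCS];
int sum = 0;
pthread_barrier_t bar;

void *myFunction(void *arg)
{
    int myNum = *(int *)arg;               /* this thread's rank */
    int mySum = 0;
    for (int i = (NUM/NUMPROCS)*myNum; i < (NUM/NUMPROCS)*(myNum+1); i++)
        mySum += A[i];
    sumArray[myNum] = mySum;               /* each thread writes only its own slot */
    pthread_barrier_wait(&bar);            /* wait until every partial sum is ready */
    if (myNum == 0)
        for (int i = 0; i < NUMPROCS; i++)
            sum += sumArray[i];            /* rank 0 combines the partial sums */
    return NULL;
}

int main(void)
{
    pthread_t tid[NUMPROCS];
    int ids[NUMPROCS];

    for (int i = 0; i < NUM; i++)
        A[i] = 1;                          /* sample data: the sum should be NUM */
    pthread_barrier_init(&bar, NULL, NUMPROCS);
    for (int i = 0; i < NUMPROCS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, myFunction, &ids[i]);
    }
    for (int i = 0; i < NUMPROCS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&bar);
    printf("sum = %d\n", sum);             /* prints 1000 */
    return 0;
}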

Why Synchronization?

• Why can't you figure out when processor x will finish its work?
  – Cache misses
  – Different control flow
  – Context switches

Supporting Parallel Programs

• Synchronization
• Cache Coherence
• False Sharing

Synchronization

• Sum += A[i];
• Two processors: one with i = 0, the other with i = 50
• Before the action:
  – Sum = 5
  – A[0] = 10
  – A[50] = 33
• What is the proper result?

Synchronization

• Sum = Sum + A[i];

• Assembly for this statement, assuming:
  – A[i] is already in $t0
  – &Sum is already in $s0

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

Synchronization – Ordering #1

Given: P1 has $t0 = 10, P2 has $t0 = 33, Sum = 5

P1 instruction        Effect       P2 instruction        Effect
lw  $t1, 0($s0)       $t1 = 5
                                   lw  $t1, 0($s0)       $t1 = 5
add $t1, $t1, $t0     $t1 = 15     add $t1, $t1, $t0     $t1 = 38
sw  $t1, 0($s0)       Sum = 15
                                   sw  $t1, 0($s0)       Sum = 38

Synchronization – Ordering #2

Given: P1 has $t0 = 10, P2 has $t0 = 33, Sum = 5

P1 instruction        Effect       P2 instruction        Effect
                                   lw  $t1, 0($s0)       $t1 = 5
lw  $t1, 0($s0)       $t1 = 5
add $t1, $t1, $t0     $t1 = 15     add $t1, $t1, $t0     $t1 = 38
                                   sw  $t1, 0($s0)       Sum = 38
sw  $t1, 0($s0)       Sum = 15

Either way, one processor's update is lost: the proper result is 5 + 10 + 33 = 48, but these interleavings leave Sum = 38 or Sum = 15.

Synchronization Problem

• Reading and writing memory is a non-atomic operation
  – You cannot read and then write a memory location in a single operation
• We need hardware primitives that allow us to read and write without interruption

Solution

• Software
  – "lock" – a function that allows one processor to leave (and enter the critical section) while all others loop
  – "unlock" – releases the next looping processor (or resets the lock so the next arriving processor can leave)
• Hardware
  – Provides primitives that read and write atomically, used to implement lock and unlock

Software: Using lock and unlock

lock(&balancelock)
Sum += A[i]
unlock(&balancelock)

Hardware: Implementing lock and unlock

• swap $1, 100($2)
  – Atomically swaps the contents of $1 and M[$2 + 100]

Hardware: Implementing lock and unlock with swap

lock:   li   $t0, 1
loop:   swap $t0, 0($a0)
        bne  $t0, $0, loop

unlock: sw   $0, 0($a0)

• If the lock holds 0, it is free
• If the lock holds 1, it is held
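The same spin-lock idea in C, as a minimal sketch rather than anything from the slides; it assumes a C11 compiler, and atomic_exchange plays the role of the swap instruction.

#include <stdatomic.h>

typedef atomic_int lock_t;   /* 0 = free, 1 = held */

void lock(lock_t *l)
{
    /* keep swapping in 1 until the old value comes back 0 (lock was free) */
    while (atomic_exchange(l, 1) != 0)
        ;   /* spin */
}

void unlock(lock_t *l)
{
    atomic_store(l, 0);      /* like sw $0, 0($a0): mark the lock free again */
}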

Outline

• Synchronization
• Cache Coherence
• False Sharing

Cache Coherence

P1 and P2 have write-back caches. Initially only DRAM holds a (value 7); * means "not present".

Current value of a in:    P1$   P2$   DRAM
(initially)                *     *     7
1. P2: Rd a                *     7     7
2. P2: Wr a, 5             *     5     7
3. P1: Rd a                5     5     5
4. P2: Wr a, 3             5     3     5
5. P1: Rd a                ?     ?     ?

(At step 3, P2 supplies its dirty value, which is also written back to DRAM, so all copies become 5.)

Inconsistency!
What will P1 receive from its load? 5 (it hits on its stale cached copy)
What should P1 receive from its load? 3 (the value P2 wrote at step 4)

Whatever are we to do?

• Write-Invalidate
  – Invalidate that block in all other caches (set their valid bits to 0)
• Write-Update
  – Update the value in all other caches

Write Invalidate

P1 and P2 have write-back caches.

Current value of a in:    P1$   P2$   DRAM
(initially)                *     *     7
1. P2: Rd a                *     7     7
2. P2: Wr a, 5             *     5     7
3. P1: Rd a                5     5     5
4. P2: Wr a, 3             *     3     5
5. P1: Rd a                3     3     3

P2's write at step 4 invalidates P1's copy, so P1's read at step 5 misses and fetches the up-to-date value 3.

Write Update

P1 and P2 have write-back caches.

Current value of a in:    P1$   P2$   DRAM
(initially)                *     *     7
1. P2: Rd a                *     7     7
2. P2: Wr a, 5             *     5     7
3. P1: Rd a                5     5     5
4. P2: Wr a, 3             3     3     3
5. P1: Rd a                3     3     3

P2's write at step 4 pushes the new value into P1's cache (and to DRAM), so P1's read at step 5 hits and returns 3.

Outline

• Synchronization
• Cache Coherence
• False Sharing

Cache Coherence: False Sharing with Invalidate

P1 and P2 have a cache line size of 4 words.

Look closely at the example

• P1 and P2 do not access the same element
• A[0] and A[1] are in the same cache block, so any cache that holds one of them holds the other too

Cache Coherence: False Sharing with Invalidate

Current contents of:      P1$        P2$
(initially)                *          *
1. P2: Rd A[0]             *        A[0-3]
2. P1: Rd A[1]           A[0-3]     A[0-3]
3. P2: Wr A[0], 5          *        A[0-3]
4. P1: Wr A[1], 3        A[0-3]       *

Each write invalidates the other processor's copy of the whole 4-word block, even though the two processors never touch the same element.

False Sharing

• Different processors access different items in the same cache block
• Leads to coherence misses

Cache Performance

// Pn       = my processor number (rank)
// NumProcs = total active processors
// N        = total number of elements
// NElem    = N / NumProcs

for (i = 0; i < NElem; i++)                 /* strided: each proc touches every NumProcs-th element */
    A[NumProcs*i + Pn] = f(i);

vs.

for (i = Pn*NElem; i < (Pn+1)*NElem; i++)   /* blocked: each proc gets one contiguous chunk */
    A[i] = f(i);

Which is better?

• Both access the same number of elements
• No two processors access the same elements as each other

Why is the second better?

• Better spatial locality
• Less false sharing (one common fix is sketched below)
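Not from the slides: one common way to reduce false sharing is to pad per-processor data so that each slot occupies its own cache line. A minimal C sketch, assuming 64-byte cache lines and reusing the earlier sumArray idea (the names here are made up).

#define CACHELINE 64
#define NUMPROCS  4

struct padded_sum {
    int  value;
    char pad[CACHELINE - sizeof(int)];   /* keep neighbouring slots in separate cache lines */
};

/* one cache line per thread's slot: writes by different threads no longer
   invalidate each other's copies of the same block */
struct padded_sum sumArray[NUMPROCS];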
