Parallel Processing
Chapter 9
• Problem:
  – Branches, cache misses, and dependencies limit the Instruction-Level Parallelism (ILP) available
• Solution:
  – Divide the program into parts
  – Run each part on a separate CPU of a larger machine
Motivations
• Desktops are incredibly cheap
  – Rather than building a custom high-performance uniprocessor, hook up 100 desktops
• Squeezing out more ILP is difficult
  – More complexity/power is required each time
  – Would require a change in cooling technology
Challenges
• Parallelizing code is not easy
  – Languages, software engineering, and software verification issues – beyond the scope of this class
• Communication can be costly
  – Our performance analysis ignores caches – these costs are much higher
• Requires HW support
  – Multiple processes modifying the same data cause race conditions, and out-of-order processors arbitrarily reorder things
Speedup
• Amdahl's Law!
• 70% of the program is parallelizable
• What is the highest speedup possible?
  – 1 / (.30 + .70/∞) = 1 / .30 = 3.33
• What is the speedup with 100 processors?
  – 1 / (.30 + .70/100) = 1 / .307 = 3.26
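As a quick check of this arithmetic, here is a minimal C sketch of Amdahl's Law (the speedup() helper and the printed values are just this example's numbers, not anything from the original slides):

  #include <stdio.h>

  /* Amdahl's Law: speedup = 1 / ((1 - p) + p/n), where p is the
   * parallelizable fraction and n is the number of processors. */
  double speedup(double p, double n)
  {
      return 1.0 / ((1.0 - p) + p / n);
  }

  int main(void)
  {
      printf("100 processors: %.2f\n", speedup(0.70, 100.0)); /* 3.26 */
      printf("upper bound:    %.2f\n", 1.0 / (1.0 - 0.70));   /* 3.33 */
      return 0;
  }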
Taxonomy
• SISD – single instruction, single data
  – uniprocessor
• SIMD – single instruction, multiple data
  – vector, MMX extensions, graphics cards
• MISD – multiple instruction, single data
  – never built – pipelined architectures? streaming apps?
• MIMD – multiple instruction, multiple data
  – most multiprocessors
  – cheap, flexible
SIMD
[Diagram: a controller drives an array of processor (P) / data (D) pairs]
• Controller fetches instructions
• All processors execute the same instruction
• Conditional instructions are the only way for variation
Example
• Sum the elements in A[] and place the result in sum

  int sum = 0;
  int i;
  for (i = 0; i < n; i++)
      sum = sum + A[i];
Parallel Version: Shared Memory

  int A[NUM];
  int numProcs;
  int sum;
  int sumArray[numProcs];

  myFunction( /* input arguments */ )
  {
      int myNum = ...;   /* this processor's rank */
      int mySum = 0;
      for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
          mySum += A[i];
      sumArray[myNum] = mySum;
      barrier();
      if (myNum == 0) {
          for (i = 0; i < numProcs; i++)
              sum += sumArray[i];
      }
  }
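The slide leaves the rank assignment (myNum = ...) and barrier() abstract. A minimal fleshed-out sketch using POSIX threads follows; the pthread_barrier_t, the fixed NUMPROCS, and the spawning code in main() are my assumptions, not part of the original:

  #include <pthread.h>
  #include <stdio.h>

  #define NUM      1000
  #define NUMPROCS 4

  int A[NUM];
  int sum = 0;
  int sumArray[NUMPROCS];
  pthread_barrier_t barrier;          /* stands in for the slide's barrier() */

  void *myFunction(void *arg)
  {
      int myNum = *(int *)arg;        /* this thread's rank */
      int mySum = 0;
      for (int i = (NUM/NUMPROCS)*myNum; i < (NUM/NUMPROCS)*(myNum+1); i++)
          mySum += A[i];
      sumArray[myNum] = mySum;        /* private slot per thread: no race */
      pthread_barrier_wait(&barrier); /* wait until all partial sums exist */
      if (myNum == 0)                 /* one thread does the final reduction */
          for (int i = 0; i < NUMPROCS; i++)
              sum += sumArray[i];
      return NULL;
  }

  int main(void)
  {
      pthread_t tid[NUMPROCS];
      int rank[NUMPROCS];
      for (int i = 0; i < NUM; i++) A[i] = 1;   /* so sum should equal NUM */
      pthread_barrier_init(&barrier, NULL, NUMPROCS);
      for (int i = 0; i < NUMPROCS; i++) {
          rank[i] = i;
          pthread_create(&tid[i], NULL, myFunction, &rank[i]);
      }
      for (int i = 0; i < NUMPROCS; i++) pthread_join(tid[i], NULL);
      printf("sum = %d\n", sum);                /* expect 1000 */
      return 0;
  }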
Why Synchronization?
• Why can't you figure out when proc x will finish its work?
  – Cache misses
  – Different control flow
  – Context switches
Supporting Parallel Programs
• Synchronization
• Cache Coherence
• False Sharing
Synchronization
• Sum += A[i];
• Two processors: i = 0 and i = 50
• Before the action:
  – Sum = 5
  – A[0] = 10
  – A[50] = 33
• What is the proper result?
Synchronization
• Sum = Sum + A[i];
• Assembly for this statement, assuming:
  – A[i] is already in $t0
  – &Sum is already in $s0

  lw   $t1, 0($s0)
  add  $t1, $t1, $t0
  sw   $t1, 0($s0)
Synchronization: Ordering #1
(Sum starts at 5; P1 has $t0 = 10, P2 has $t0 = 33)

  P1 inst           Effect        P2 inst           Effect
  lw  $t1, 0($s0)   $t1 = 5
                                  lw  $t1, 0($s0)   $t1 = 5
  add $t1,$t1,$t0   $t1 = 15      add $t1,$t1,$t0   $t1 = 38
  sw  $t1, 0($s0)   Sum = 15
                                  sw  $t1, 0($s0)   Sum = 38

Final Sum = 38

Synchronization: Ordering #2
(same interleaving, but the two stores complete in the opposite order)

  P1 inst           Effect        P2 inst           Effect
  lw  $t1, 0($s0)   $t1 = 5
                                  lw  $t1, 0($s0)   $t1 = 5
  add $t1,$t1,$t0   $t1 = 15      add $t1,$t1,$t0   $t1 = 38
                                  sw  $t1, 0($s0)   Sum = 38
  sw  $t1, 0($s0)   Sum = 15

Final Sum = 15

Neither ordering gives the proper result of 48 (5 + 10 + 33): one processor's update is always lost.
Synchronization Problem
• A read-modify-write of memory is not atomic
  – You cannot read and write a memory location in a single operation
• We need hardware primitives that allow us to read and write without interruption
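To see the non-atomicity in practice, here is a small illustration of my own (not from the slides): two threads each run the load/add/store sequence a million times without a lock. Compiled without optimization (so each Sum = Sum + 1 really is a load, an add, and a store), the final value usually falls short of 2,000,000 because interleaved updates are lost:

  #include <pthread.h>
  #include <stdio.h>

  long Sum = 0;

  void *worker(void *arg)
  {
      for (int i = 0; i < 1000000; i++)
          Sum = Sum + 1;   /* lw / add / sw: three separate operations */
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker, NULL);
      pthread_create(&t2, NULL, worker, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("Sum = %ld (expected 2000000)\n", Sum);
      return 0;
  }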
Solution
• Software
  – lock() – function that allows one processor to proceed while all others loop
  – unlock() – releases the next looping processor (or resets the lock so the next arriving proc may proceed)
• Hardware
  – Provide primitives that read & write memory in order to implement lock and unlock

Software: Using lock and unlock

  lock(&balancelock);
  Sum += A[i];
  unlock(&balancelock);
Hardware: Implementing lock & unlock with swap
• swap $1, 100($2)
  – Atomically swaps the contents of $1 and M[$2+100]
• If the lock holds 0, it is free; if it holds 1, it is held

  Lock:   li    $t0, 1
  Loop:   swap  $t0, 0($a0)
          bne   $t0, $0, Loop

  Unlock: sw    $0, 0($a0)
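The same spin-lock idea can be sketched in C. This version assumes GCC/Clang's __atomic_exchange_n builtin as a stand-in for the MIPS swap instruction; it is one possible rendering, not the slide's exact mechanism:

  /* lock word: 0 = free, 1 = held */
  typedef volatile int spinlock_t;

  void lock(spinlock_t *l)
  {
      /* atomically swap 1 into *l; keep spinning while the old value
       * was 1, i.e. while someone else already held the lock */
      while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE) != 0)
          ;  /* spin */
  }

  void unlock(spinlock_t *l)
  {
      /* like the MIPS "sw $0, 0($a0)": store 0 to release */
      __atomic_store_n(l, 0, __ATOMIC_RELEASE);
  }

With these two functions, the earlier usage works as written: lock(&balancelock); Sum += A[i]; unlock(&balancelock);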
Outline
• Synchronization
• Cache Coherence
• False Sharing
Cache Coherence
(P1 and P2 have write-back caches; * = not present)

                      a's value in:
                      P1$   P2$   DRAM
  (start)             *     *     7
  1. P2: Rd a         *     7     7
  2. P2: Wr a, 5      *     5     7
  3. P1: Rd a         5     5     5
  4. P2: Wr a, 3      5     3     5
  5. P1: Rd a         5     3     5

At step 3, P2 must write its dirty copy back, so P1 and DRAM both see 5.

AAAAAAAAAAAAAAAAAAAAAH! Inconsistency!
What will P1 receive from its load? 5
What should P1 receive from its load? 3
Whatever are we to do?
• Write-Invalidate
  – Invalidate that value in all other caches
  – Set the valid bit to 0
• Write-Update
  – Update the value in all other caches
Write Invalidate
(P1 and P2 have write-back caches; * = not present/invalid)

                      P1$   P2$   DRAM
  (start)             *     *     7
  1. P2: Rd a         *     7     7
  2. P2: Wr a, 5      *     5     7
  3. P1: Rd a         5     5     5
  4. P2: Wr a, 3      *     3     5    (P1's copy is invalidated)
  5. P1: Rd a         3     3     3    (P1 misses; P2 supplies the new value)
Write Update
(P1 and P2 have write-back caches; * = not present)

                      P1$   P2$   DRAM
  (start)             *     *     7
  1. P2: Rd a         *     7     7
  2. P2: Wr a, 5      *     5     7
  3. P1: Rd a         5     5     5
  4. P2: Wr a, 3      3     3     3    (every copy is updated)
  5. P1: Rd a         3     3     3
Outline
• Synchronization
• Cache Coherence
• False Sharing
Cache Coherence: False Sharing w/ Invalidate
(P1, P2 cache-line size: 4 words; * = not present/invalid)

                        P1$       P2$
  (start)               *         *
  1. P2: Rd A[0]        *         A[0-3]
  2. P1: Rd A[1]        A[0-3]    A[0-3]
  3. P2: Wr A[0], 5     *         A[0-3]
  4. P1: Wr A[1], 3     A[0-3]    *
Look closely at the example
• P1 and P2 never access the same element
• A[0] and A[1] live in the same cache block, so a cache that holds one holds the other
False Sharing
• Different processors access different items in the same cache block
• Leads to coherence cache misses
Cache Performance

  // Pn       = my processor number (rank)
  // NumProcs = total active processors
  // N        = total number of elements
  // NElem    = N / NumProcs

  for (i = 0; i < NElem; i++)
      A[NumProcs*i + Pn] = f(i);      // strided: elements interleaved across procs

vs.

  for (i = Pn*NElem; i < (Pn+1)*NElem; i++)
      A[i] = f(i);                    // contiguous: one block per proc
Which is better?
• Both access the same number of elements
• No two processors access the same elements

Why is the second better?
• Better spatial locality
• Less false sharing (see the sketch below)
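One way to see the false-sharing cost directly is a sketch like the following (my illustration, not from the slides; the 64-byte line size is an assumption). Each thread increments only its own counter. With the pad field, each counter sits in its own cache line; delete the pad and all counters share one line, so the threads invalidate each other constantly and the program runs noticeably slower:

  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 4
  #define ITERS    10000000L
  #define LINE     64          /* assumed cache-line size in bytes */

  struct counter {
      long value;
      char pad[LINE - sizeof(long)];  /* remove to reintroduce false sharing */
  };

  struct counter counts[NTHREADS];

  void *worker(void *arg)
  {
      int id = *(int *)arg;
      for (long i = 0; i < ITERS; i++)
          counts[id].value++;   /* each thread touches only its own slot */
      return NULL;
  }

  int main(void)
  {
      pthread_t tid[NTHREADS];
      int rank[NTHREADS];
      for (int i = 0; i < NTHREADS; i++) {
          rank[i] = i;
          pthread_create(&tid[i], NULL, worker, &rank[i]);
      }
      for (int i = 0; i < NTHREADS; i++)
          pthread_join(tid[i], NULL);
      printf("count[0] = %ld\n", counts[0].value);  /* expect ITERS */
      return 0;
  }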