
Parallel Processing

Chapter 9

• Problem:
  – Branches, cache misses, and dependencies limit the available instruction-level parallelism (ILP)

• Solution:
  – Divide the program into parts
  – Run each part on a separate CPU of a larger machine

Motivations

• Desktops are incredibly cheap
  – Instead of building a custom high-performance uniprocessor, hook up 100 desktops

• Squeezing out more ILP is difficult
  – More complexity and power are required each time
  – Would require a change in cooling technology

Challenges

• Parallelizing code is not easy
  – Languages, software engineering, and software verification issues – beyond the scope of this class

• Communication can be costly
  – Performance analysis here ignores caches; in reality these costs are much higher

• Requires hardware support
  – Multiple processes modifying the same data cause race conditions, and out-of-order processors arbitrarily reorder operations

Performance - Speedup

• Amdahl's Law
• 70% of the program is parallelizable
• What is the highest speedup possible?
  – 1 / (0.30 + 0.70/∞) = 1 / 0.30 = 3.33
• What is the speedup with 100 processors?
  – 1 / (0.30 + 0.70/100) = 1 / 0.307 = 3.26
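Not from the slides: a small C sketch of the same arithmetic, where amdahl_speedup is a hypothetical helper taking the parallel fraction p and the processor count n.

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - p) + p / n) */
double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    printf("limit (huge n): %.2f\n", amdahl_speedup(0.70, 1e9)); /* ~3.33 */
    printf("n = 100:        %.2f\n", amdahl_speedup(0.70, 100)); /* ~3.26 */
    return 0;
}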

Taxonomy

• SISD – single instruction, single data
  – uniprocessor

• SIMD – single instruction, multiple data
  – vector machines, MMX extensions, graphics cards

• MISD – multiple instruction, single data
  – never built – pipeline architectures? streaming apps?

• MIMD – multiple instruction, multiple data
  – most multiprocessors
  – cheap, flexible

SIMD

[Diagram: a single controller broadcasts instructions to an array of processing elements (P), each paired with its own data (D).]

• The controller fetches instructions
• All processors execute the same instruction
• Conditional instructions are the only way to get variation
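As a concrete illustration (not from the slides), here is a minimal SIMD sketch in C using SSE intrinsics; it assumes an x86 compiler with <immintrin.h>, and add_arrays is a made-up helper. A single _mm_add_ps instruction operates on four floats at once – one instruction, multiple data.

#include <immintrin.h>   /* SSE intrinsics */
#include <stdio.h>

void add_arrays(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);           /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);           /* load 4 floats from b */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));  /* 4 adds in one instruction */
    }
    for (; i < n; i++)                             /* scalar cleanup for the tail */
        c[i] = a[i] + b[i];
}

int main(void)
{
    float a[8] = {1,2,3,4,5,6,7,8}, b[8] = {8,7,6,5,4,3,2,1}, c[8];
    add_arrays(a, b, c, 8);
    printf("%.0f %.0f\n", c[0], c[7]);             /* prints 9 9 */
    return 0;
}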


Example

• Sum the elements in A[] and place the result in sum

int sum = 0;
int i;
for (i = 0; i < n; i++)
    sum = sum + A[i];

Parallel version – Shared Memory

int A[NUM];
int numProcs;
int sum;
int sumArray[numProcs];            /* pseudocode: one partial-sum slot per processor */

myFunction( /* input arguments */ )
{
    int myNum = ...;               /* this processor's number (rank) */
    int mySum = 0;
    int i;
    for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
        mySum += A[i];
    sumArray[myNum] = mySum;
    barrier();                     /* wait until every processor has written its slot */
    if (myNum == 0) {
        for (i = 0; i < numProcs; i++)
            sum += sumArray[i];
    }
}
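A runnable version of the same idea, offered as a sketch rather than the slides' code: it assumes POSIX threads with barrier support, and the constants NUM and NUMPROCS plus the sample data in main are made up for illustration.

#include <pthread.h>
#include <stdio.h>

#define NUM      1000
#define NUMPROCS 4

int A[NUM];
int sumArray[NUMPROCS];
int sum = 0;
pthread_barrier_t bar;

void *myFunction(void *arg)
{
    int myNum = *(int *)arg;               /* this thread's rank */
    int mySum = 0;
    for (int i = (NUM/NUMPROCS)*myNum; i < (NUM/NUMPROCS)*(myNum+1); i++)
        mySum += A[i];
    sumArray[myNum] = mySum;               /* each thread writes only its own slot */
    pthread_barrier_wait(&bar);            /* wait until every partial sum is ready */
    if (myNum == 0)
        for (int i = 0; i < NUMPROCS; i++)
            sum += sumArray[i];            /* rank 0 combines the partial sums */
    return NULL;
}

int main(void)
{
    pthread_t tid[NUMPROCS];
    int ids[NUMPROCS];

    for (int i = 0; i < NUM; i++)
        A[i] = 1;                          /* sample data: the sum should be NUM */
    pthread_barrier_init(&bar, NULL, NUMPROCS);
    for (int i = 0; i < NUMPROCS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, myFunction, &ids[i]);
    }
    for (int i = 0; i < NUMPROCS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&bar);
    printf("sum = %d\n", sum);             /* prints 1000 */
    return 0;
}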

Why Synchronization?

• Why can't you figure out when processor x will finish its work?
  – Cache misses
  – Different control flow
  – Context switches

Supporting Parallel Programs

• Synchronization
• Cache Coherence
• False Sharing

Synchronization

• Sum += A[i];
• Two processors: one with i = 0, the other with i = 50
• Before the action:
  – Sum = 5
  – A[0] = 10
  – A[50] = 33
• What is the proper result?

Synchronization

• Sum = Sum + A[i];

• Assembly for this statement, assuming:
  – A[i] is already in $t0
  – &Sum is already in $s0

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

Synchronization – Ordering #1

Given: P1 has $t0 = 10, P2 has $t0 = 33, Sum = 5

P1 instruction        Effect       P2 instruction        Effect
lw  $t1, 0($s0)       $t1 = 5
                                   lw  $t1, 0($s0)       $t1 = 5
add $t1, $t1, $t0     $t1 = 15     add $t1, $t1, $t0     $t1 = 38
sw  $t1, 0($s0)       Sum = 15
                                   sw  $t1, 0($s0)       Sum = 38

Synchronization – Ordering #2

Given: P1 has $t0 = 10, P2 has $t0 = 33, Sum = 5

P1 instruction        Effect       P2 instruction        Effect
                                   lw  $t1, 0($s0)       $t1 = 5
lw  $t1, 0($s0)       $t1 = 5
add $t1, $t1, $t0     $t1 = 15     add $t1, $t1, $t0     $t1 = 38
                                   sw  $t1, 0($s0)       Sum = 38
sw  $t1, 0($s0)       Sum = 15

Either way, one processor's update is lost: the proper result is 5 + 10 + 33 = 48, but these interleavings leave Sum = 38 or Sum = 15.

Synchronization Problem

• Reading and writing memory is a non-atomic operation
  – You cannot read and then write a memory location in a single operation
• We need hardware primitives that allow us to read and write without interruption

Solution

• Software
  – "lock" – a function that allows one processor to leave (and enter the critical section) while all others loop
  – "unlock" – releases the next looping processor (or resets the lock so the next arriving processor can leave)
• Hardware
  – Provides primitives that read and write atomically, used to implement lock and unlock

Software: Using lock and unlock

lock(&balancelock)
Sum += A[i]
unlock(&balancelock)

Hardware: Implementing lock and unlock

• swap $1, 100($2)
  – Atomically swaps the contents of $1 and M[$2 + 100]

Hardware: Implementing lock and unlock with swap

lock:   li   $t0, 1
loop:   swap $t0, 0($a0)
        bne  $t0, $0, loop

unlock: sw   $0, 0($a0)

• If the lock holds 0, it is free
• If the lock holds 1, it is held
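The same spin-lock idea in C, as a minimal sketch rather than anything from the slides; it assumes a C11 compiler, and atomic_exchange plays the role of the swap instruction.

#include <stdatomic.h>

typedef atomic_int lock_t;   /* 0 = free, 1 = held */

void lock(lock_t *l)
{
    /* keep swapping in 1 until the old value comes back 0 (lock was free) */
    while (atomic_exchange(l, 1) != 0)
        ;   /* spin */
}

void unlock(lock_t *l)
{
    atomic_store(l, 0);      /* like sw $0, 0($a0): mark the lock free again */
}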

Outline

• Synchronization
• Cache Coherence
• False Sharing

Cache Coherence

P1 and P2 have write-back caches. Initially only DRAM holds a (value 7); * means "not present".

Current value of a in:    P1$   P2$   DRAM
(initially)                *     *     7
1. P2: Rd a                *     7     7
2. P2: Wr a, 5             *     5     7
3. P1: Rd a                5     5     5
4. P2: Wr a, 3             5     3     5
5. P1: Rd a                ?     ?     ?

(At step 3, P2 supplies its dirty value, which is also written back to DRAM, so all copies become 5.)

Inconsistency!
What will P1 receive from its load? 5 (it hits on its stale cached copy)
What should P1 receive from its load? 3 (the value P2 wrote at step 4)

Whatever are we to do?

• Write-Invalidate
  – Invalidate that block in all other caches (set their valid bits to 0)
• Write-Update
  – Update the value in all other caches

Write Invalidate

P1 and P2 have write-back caches.

Current value of a in:    P1$   P2$   DRAM
(initially)                *     *     7
1. P2: Rd a                *     7     7
2. P2: Wr a, 5             *     5     7
3. P1: Rd a                5     5     5
4. P2: Wr a, 3             *     3     5
5. P1: Rd a                3     3     3

P2's write at step 4 invalidates P1's copy, so P1's read at step 5 misses and fetches the up-to-date value 3.

Write Update

P1 and P2 have write-back caches.

Current value of a in:    P1$   P2$   DRAM
(initially)                *     *     7
1. P2: Rd a                *     7     7
2. P2: Wr a, 5             *     5     7
3. P1: Rd a                5     5     5
4. P2: Wr a, 3             3     3     3
5. P1: Rd a                3     3     3

P2's write at step 4 pushes the new value into P1's cache (and to DRAM), so P1's read at step 5 hits and returns 3.

Outline

• Synchronization
• Cache Coherence
• False Sharing

Cache Coherence: False Sharing with Invalidate

P1 and P2 have a cache line size of 4 words.

Look closely at the example

• P1 and P2 do not access the same element
• A[0] and A[1] are in the same cache block, so any cache that holds one of them holds the other too

Cache Coherence: False Sharing with Invalidate

Current contents of:      P1$        P2$
(initially)                *          *
1. P2: Rd A[0]             *        A[0-3]
2. P1: Rd A[1]           A[0-3]     A[0-3]
3. P2: Wr A[0], 5          *        A[0-3]
4. P1: Wr A[1], 3        A[0-3]       *

Each write invalidates the other processor's copy of the whole 4-word block, even though the two processors never touch the same element.

False Sharing

• Different processors access different items in the same cache block
• Leads to coherence misses

Cache Performance

// Pn       = my processor number (rank)
// NumProcs = total active processors
// N        = total number of elements
// NElem    = N / NumProcs

for (i = 0; i < NElem; i++)                 /* strided: each proc touches every NumProcs-th element */
    A[NumProcs*i + Pn] = f(i);

vs.

for (i = Pn*NElem; i < (Pn+1)*NElem; i++)   /* blocked: each proc gets one contiguous chunk */
    A[i] = f(i);

Which is better?

• Both access the same number of elements
• No two processors access the same elements as each other

Why is the second better?

• Better spatial locality
• Less false sharing (one common fix is sketched below)
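Not from the slides: one common way to reduce false sharing is to pad per-processor data so that each slot occupies its own cache line. A minimal C sketch, assuming 64-byte cache lines and reusing the earlier sumArray idea (the names here are made up).

#define CACHELINE 64
#define NUMPROCS  4

struct padded_sum {
    int  value;
    char pad[CACHELINE - sizeof(int)];   /* keep neighbouring slots in separate cache lines */
};

/* one cache line per thread's slot: writes by different threads no longer
   invalidate each other's copies of the same block */
struct padded_sum sumArray[NUMPROCS];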
