High Performance Computing – CISC 811
Dr Rob Thacker
Dept of Physics (308A)
thacker@astro
LINPACK numbers
Without optimization a reasonable figure for the 1000×1000 problem is 200-300 Mflops (CPU = Athlon 64, 3200+; Core 2 CPUs should do even better)
Turning on –O2 should close to double this number: Athlon 64, 3200+ gives around 510 Mflops
Optimizations beyond –O2 don't add that much more improvement
Unrolling loops helps (5% improvement) on some platforms but not others
Variable size LINPACK versus HINT
HINT's performance profile can be seen in other programs.
[Plot: performance versus problem size, with distinct regimes for L1, L2 and main memory]
Assignment 1 solutions available
Top 500
Doubling time for the number of CPUs in a system really is ~1.5 years
Example: 2000 SHARCNET places 310 on top 500 with 128 processors
2004(.5) SHARCNET wants to have a higher position; RFP asks for 1,500 processors (a 1.5 year doubling time suggests 1024 CPUs would get them back to 310)
They actually placed at 116 with 1536 processors
'e-waste' anyone?
Quick note on Assignment Q2
Grid overlays particles, weights from all particles accumulate
[Figure: in 2d, a particle contributes weights 1-4 to its nearest grid points]
Today's Lecture: Shared Memory Parallelism I
Part 1: Shared memory programming concepts
Part 2: Shared memory architectures
Part 3: OpenMP I

Part 1: Shared Memory Programming Concepts
Comparison of shared memory versus distributed memory programming paradigms
Administration of threads
Data dependencies
Race conditions
Shared Address Model
Each processor can access every physical memory location in the machine
Each process is aware of all data it shares with other processes
Data communication between processes is implicit: memory locations are updated
Processes are allowed to have local variables that are not visible to other processes
Comparison: MPI vs OpenMP

Feature                                 OpenMP           MPI
Apply parallelism in steps              YES              NO
Scale to large number of processors     MAYBE            YES
Code complexity                         small increase   major increase
Code length increase                    2-80%            30-500%
Runtime environment*                    $$ compilers     FREE
Cost of hardware                        $$$$$$$!         CHEAP

*gcc (& gfortran) now supports OpenMP
Distinctions: Process vs thread
A process is an OS-level task
Operates independently of other processes
Has its own process id, memory area, program counter, registers…
The OS has to create a process control block containing information about the process
Expensive to create
Heavyweight process = task
Sometimes called a heavyweight thread
Thread
A thread (lightweight process) is a subunit of a process
Shares data, code and OS resources with other threads
Controlled by one process, but one process may control more than one thread
Has its own program counter, register set, and stack space
Diagrammatically
[Figure: a single process owns its program counter (PC), code, heap, files, and stack]
Threads
[Figure: threads within one process share the code, heap, and files, but each thread has its own PC and stack]
Kernel versus User level threads
Threads can be implemented via two main mechanisms (or combinations of both)
Kernel level threads: natively supported by the OS
Changing between different threads still requires action from the OS, which adds a significant amount of time
Still not as bad as changing between tasks though
"Middleweight threads"
User-level threads: implemented via a library (e.g. POSIX threads); a minimal sketch is given below
All issues of control are handled by the task rather than the OS
"Lightweight threads" (can be switched very quickly)
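As a concrete illustration, here is a minimal, hedged sketch of using the POSIX threads library: two threads fill disjoint halves of a shared array and are then joined (the file and function names are illustrative, not from the lecture).

/* Minimal POSIX threads sketch: create two threads, each fills half of a
   shared array, then join them.  Compile with something like: cc demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define N 8
static double y[N];              /* shared data: visible to all threads */

static void *fill_half(void *arg) {
    int start = *(int *)arg;     /* each thread gets its own start index */
    for (int i = start; i < start + N / 2; ++i)
        y[i] = 2.0 * i;          /* disjoint ranges, so no race on y[] */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int s1 = 0, s2 = N / 2;
    pthread_create(&t1, NULL, fill_half, &s1);
    pthread_create(&t2, NULL, fill_half, &s2);
    pthread_join(t1, NULL);      /* wait for both threads to finish */
    pthread_join(t2, NULL);
    printf("y[N-1] = %g\n", y[N - 1]);
    return 0;
}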
Thread Hierarchy
[Figure: hierarchy of process (task), kernel-level thread, and user-level thread, with approximate times to change between contexts: user-level thread ~10 us, kernel-level thread ~100 us, process ~1 ms]
Interprocess communication (IPC)
Despite tasks having their own data space, there are methods for communicating between them
"System V shared memory segments" is the most well known
Shared regions are created & attached to by issuing specific commands (shmget, shmat); a minimal sketch is given below
Still need to use a mechanism to create processes that share this region
Used in SMP version of GAUSSIAN
See section 8.4 of Wilkinson & Allen, Parallel Programming
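A hedged, self-contained C sketch of the shmget/shmat calls (error checking omitted; not code from the lecture): the parent creates a segment, fork()s, and both processes see the same memory.

/* System V shared memory sketch: parent and child share one int. */
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    /* create a private segment big enough for one int */
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    int *shared = (int *)shmat(shmid, NULL, 0);   /* attach it */
    *shared = 0;

    if (fork() == 0) {             /* child: update the shared location */
        *shared = 42;
        shmdt(shared);
        _exit(0);
    }
    wait(NULL);                    /* parent: wait, then read the child's update */
    printf("child wrote %d\n", *shared);
    shmdt(shared);
    shmctl(shmid, IPC_RMID, NULL); /* remove the segment */
    return 0;
}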
Threads based execution
Serial execution, interspersed with parallel
[Figure: a master thread executes the serial sections; additional threads join it for each parallel section]
In practice many compilers block execution of the extra threads during serial sections; this saves the overhead of the `fork-join' operation
Segue: UNIX fork()
The UNIX system call fork() creates a child process that is an exact copy of the calling process, with a unique process ID (wait() joins)
All of the parent's variables are copied into the child's data space

pid = fork();
  .
  code to be run by pair
  .
if (pid==0) exit(0); else wait(0);

fork() returns 0 in the child, so pid==0 selects the child's branch
Used to create processes for SysV shared memory segments
Not limited to work replication:

pid = fork();
if (pid==0) {
  .
  code to be run by servant
  .
} else {
  .
  code to be run by master
  .
}
if (pid==0) exit(0); else wait(0);

Remember: all the variables in the original process are duplicated (a self-contained version is given below)
MPMD (multiple program, multiple data)
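A self-contained C version of the same master/servant pattern (the printed messages are purely illustrative):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    pid_t pid = fork();            /* returns 0 in the child, child PID in the parent */
    if (pid == 0) {
        printf("servant: pid %d\n", (int)getpid());
    } else {
        printf("master: created child %d\n", (int)pid);
    }
    if (pid == 0) _exit(0);        /* child exits ...            */
    wait(NULL);                    /* ... and the parent joins it */
    return 0;
}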
OpenMP Programs
OpenMP programs use a simple API to create threads based code
At their simplest they simply divide up iterations of data parallel loops among threads
What is actually happening in the threads? 4 threads, n=400
Thread 1: i=1,100; updates Y(1:100), reads X(1:100)
Thread 2: i=101,200; updates Y(101:200), reads X(101:200)
Thread 3: i=201,300; updates Y(201:300), reads X(201:300)
Thread 4: i=301,400; updates Y(301:400), reads X(301:400)
(A sketch of a loop of this kind is given below.)
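For concreteness, a C/OpenMP sketch of the kind of loop this decomposition corresponds to (the loop body is hypothetical; indexing here is 0-based, whereas the slide's Fortran ranges run 1-400):

#define N 400
double X[N], Y[N];

void update(void) {
    int i;
    /* With 4 threads and the usual static schedule, iterations 0-99, 100-199,
       200-299 and 300-399 go to threads 1-4 respectively; each thread reads
       its own slice of X and updates its own slice of Y. */
    #pragma omp parallel for private(i) shared(X, Y)
    for (i = 0; i < N; ++i)
        Y[i] = Y[i] + 2.0 * X[i];   /* hypothetical update */
}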
Race Conditions
A common operation is to resolve a spatial position into an array index: consider the following loop
Looks innocent enough – but suppose two particles have the same positions…

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(r,A)
      do i=1,n
        j=int(r(i))
        A(j)=A(j)+1.
      end do

r(): array of particle positions
A(): array that is modified using information from r()
Race Conditions: A concurrency problem
Two different threads of execution can concurrently attempt to update the same memory location
Start: A(j)=1.
Thread 1: gets A(j)=1., adds 1., puts A(j)=2.
Thread 2: gets A(j)=1., adds 1., puts A(j)=2.
End state: A(j)=2. – INCORRECT (two increments of 1. should have produced 3.)
Dealing with Race Conditions
Need a mechanism to ensure updates to single variables occur within a critical section
Any thread entering a critical section blocks all others
Critical sections can be established by using:
Lock variables (single bit variables)
Semaphores (Dijkstra 1968)
Monitor procedures (Hoare 1974, used in Java)

Simple (spin) lock usage (schematic; see the note below):

do while (lock.eq.1)
  spin
end do
lock=1
  .
Critical section
  .
lock=0
Access is serialized
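One caveat worth adding: as written, the spin-lock pseudocode is only schematic, because two threads could both observe lock==0 and enter the critical section together. A real spin lock needs an atomic test-and-set; a minimal C11 sketch (illustrative, not the lecture's code):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void enter_critical(void) {
    /* atomically set the flag; spin while it was already set */
    while (atomic_flag_test_and_set(&lock))
        ;                          /* spin */
}

void leave_critical(void) {
    atomic_flag_clear(&lock);      /* release the lock */
}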
Deadlocks: The pitfall of locking
Must ensure a situation is not created where requests in possession create a deadlock
Nested locks are a classic example of this
Can also create the problem with multiple processes – the `deadly embrace'
[Figure: Process 1 holds Resource 1 and requests Resource 2, while Process 2 holds Resource 2 and requests Resource 1 – neither can proceed]
Data Dependencies
Suppose you try to parallelize the following loop
It won't work as written, since iteration i depends upon iteration i-1 and thus we can't start anything in parallel

c=0.0
do i=1,n
  c=c+1.0
  Y(i)=c
end do
Simple solution
This loop can easily be re-written in a way that can be parallelized:
There is no longer any dependence on the previous operation
Private variables: i; Shared variables: Y(), c, n

c=0.0
do i=1,n
  Y(i)=c+i
end do
c=c+n
Types of Data Dependencies
Suppose we have operations O1, O2
True dependence: O2 has a true dependence on O1 if O2 reads a value written by O1
Anti dependence: O2 has an anti-dependence on O1 if O2 writes a value read by O1
Output dependence: O2 has an output dependence on O1 if O2 writes a variable written by O1
Examples
True dependence:
A1=A2+A3
B1=A1+B2

Anti-dependence:
B1=A1+B2
A1=C2

Output dependence:
B1=5
B1=2
Bernstein's Conditions
Set of conditions that are sufficient to determine whether two threads can be executed simultaneously
Ii: set of memory locations read by thread Pi
Oj: set of memory locations altered by thread Pj
For threads 1 & 2 to be concurrent:
I1∩O2=Ø (input of 1 cannot intersect output of 2)
I2∩O1=Ø (input of 2 cannot intersect output of 1)
O1∩O2=Ø (outputs cannot intersect)
Bernstein, 1966, IEEE Trans. Elec. Comp., Vol E-15, pp. 746
Example
Consider two threads: a=x+y, b=x+z
Inputs: I1=(x,y), I2=(x,z)
Outputs: O1=a, O2=b
All conditions are satisfied:
I1∩O2=Ø
I2∩O1=Ø
O1∩O2=Ø
Forms the basis for auto-parallelizing compilers: the difficult part is determining access at compile time
Dealing with Data Dependencies
Any loop where iterations depend upon the previous one has a potential problem
Any result which depends upon the order of the iterations will be a problem
Good first test of whether something can be parallelized: reverse the loop iteration order
Not all data dependencies can be eliminated
Accumulations of variables (e.g. sum of elements in an array) can be dealt with easily
Summary Part 1
The concurrent nature of shared memory programming entails dealing with two key issues:
Race conditions
Data dependencies
Race conditions can be (partially) solved by locking
Dealing with data dependencies frequently involves algorithmic changes
Part 2: Shared Memory Architectures
Consistency models
Cache coherence
Snoopy versus directory based coherence
Shared Memory design issues
Race conditions cannot be avoided on shared memory architectures; they must be dealt with at the programming level
However, updates to shared variables must be propagated to other threads
Variable values must be propagated through the machine via hardware
For multiple CPUs with caches this is non-trivial – let's look at a single CPU first
Caching Architectures
In cache-based architectures we will often have two copies of a variable: one in main memory and one in cache
How do we deal with memory updates if there are two copies?
Two options: either try and keep the main memory in sync with the cache (hard) or wait to update (i.e. send bursts/packets of updates)
Write-through Cache
When a cache update occurs the result is immediately written back up to main memory
Advantage: consistent picture of memory
Disadvantage: uses up a lot of memory bandwidth
Disadvantage: writes to main memory are even slower than reads!
Write-back Cache
Wait until a certain number of updates have been made and then write to main memory
Mark a new cache line as being `clean'
When modified an entry becomes `dirty'
Clean values can be thrown away; dirty values must be written back to memory
Disadvantage: memory is not consistent
Advantage: overcomes the problem of waiting for main memory
Inclusive Caching
Modern systems often have multiple layers of cache
However, each cache has only one parent (but may have more than one child)
Only the parent and children communicate directly
Inclusion: everything held in a child cache is also held in its parent (e.g. the contents of L1 are a subset of L2)
[Figure: four CPUs, each with an L1 cache; pairs of L1 caches share a Level 2 cache, and the two Level 2 caches share a Level 3 cache]
Note not all systems are built this way – exclusive caching (Opteron) does not have this relationship
Common Architecture
Very common to have the L1 cache as write-through to the L2 cache
L2 cache is then write-back to main memory
For this inclusive architecture increasing the size of the L2 can have significant performance gains
Multiprocessor design issue: Cache (line) Coherency
[Figure: 2 processors, each with an individual write-back cache, above a shared main memory]
CPU 1 changes a value: its cache is now dirty but the change is not written to main memory
Processor 2 requests the same location from main memory and receives the stale value
The caches do not agree: coherency is lost
End state differences
Write-back cache: both memory and the second cache hold stale values (until dirty cache entries are written)
Write-through cache: only the second cache is stale
Cannot avoid the issue of memory being read before the update occurs
Need an additional mechanism to preserve some kind of cache coherency
Segue: Consistency models
Strictly Consistent model
Intuitive idea of what memory should be
Any read to a memory location X returns the value stored by the most recent write operation to X

P1: W(x)1
P2:        R(x)1    R(x)1

Gharachorloo, Lenoski, Laudon, Gibbons, Gupta, Hennessy, 1990, in Proceedings 17th International Symposium on Computer Architecture
Couple more examples

Allowed (the read of 0 happens before the write):
P1:          W(x)1
P2: R(x)0           R(x)1

Not Allowed (the read of 0 happens after the write):
P1: W(x)1
P2:        R(x)0    R(x)1
Sequential Consistency
Lamport 79:
"A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
Note the order of execution is not specified under this model
Best explained via example
Results under SC
Two threads, starting from X=Y=0:
Thread 1: R1=X, then Y=1
Thread 2: R2=Y, then X=1
Is it possible to have R1=R2=1 under sequential consistency?
Instruction order allowed under SC
R1=X(=0), R2=Y(=0), Y=1, X=1  →  R1=0, R2=0
R1=X(=0), R2=Y(=0), X=1, Y=1  →  R1=0, R2=0
R1=X(=0), Y=1, R2=Y(=1), X=1  →  R1=0, R2=1
R2=Y(=0), R1=X(=0), X=1, Y=1  →  R1=0, R2=0
R2=Y(=0), R1=X(=0), Y=1, X=1  →  R1=0, R2=0
R2=Y(=0), X=1, R1=X(=1), Y=1  →  R1=1, R2=0
NO! – no SC-ordered execution can produce R1=R2=1
SC doesn't save you from data races
This is an important point!
Even though you have SC, the operations from the threads can still be ordered differently in the global execution
However, SC does help with some situations:
Start: X=Y=0
Thread 1: R1=X; if (R1>0) Y=1
Thread 2: R2=Y; if (R2>0) X=1
Race free
Other memory models exist with weaker requirements
Causality is lost!
P1: W(x)1
P2:        W(x)2
P3:               R(x)2   R(x)1
P4:               R(x)2   R(x)1
Allowed result under SC, because we can reorder the operations into the single sequence W(x)2, R(x)2, R(x)2, W(x)1, R(x)1, R(x)1
SC is still restrictive
P1: W(x)1
P2:        W(x)2
P3:               R(x)2   R(x)1
P4:               R(x)1   R(x)2
Not allowed under SC – too many changes of x would be needed to be possible
Remember: operations on a given variable must be interleaved to give a total ordering
(Weaker) Consistency Models
Causal Consistency
Operations that are causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different threads.
"If one event (B) is caused by another (A) then all threads must observe A to happen before B"
Examples

Not causally consistent:
P1: W(x)1
P2: R(x)1   W(x)2
P3:         R(x)2   R(x)1
P4:         R(x)1   R(x)2

This is causally consistent:
P1: W(x)1
P2: W(x)2
P3: R(x)2   R(x)1
P4: R(x)1   R(x)2
FIFO Consistency
Writes performed by a single thread are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes

Example
P1: W(x)1
P2: R(x)1   W(x)2   W(x)3
P3:         R(x)2   R(x)1   R(x)3
P4:         R(x)1   R(x)2   R(x)3
Under FIFO consistency this is a valid sequence of events
Implementing Sequential Consistency
Sequential consistency is the most strict practical model (strict consistency requires a global clock)
To implement it requires that the CPU caches all follow a cache coherency protocol
Requires a mechanism to keep track of cache lines across all CPUs
Two main mechanisms:
Snoopy protocol
Directory-based
Cache Coherency Protocol
Need to ensure that before a memory location is written all other copies are invalidated
Allows multiple copies of memory to exist when being read
Forces only one copy to exist when being written
Cache Coherency: Drawbacks
Coherency only occurs at the cache-line level
Access to different words within a cache-line will be serialized (`false sharing' – see later)
Cache coherency says nothing about variables promoted to registers (although one hopes these wouldn't be shared)
You have to be aware of these issues when programming
BUT! Most importantly
Cache coherency does not save you from all race conditions!!!!
Mechanisms to achieve CC
Broadcast (Snoopy)
Processors send results everywhere
Requires a shared bus for low latency
Point-to-point (Directory)
Directory keeps track of "interest" in each cache line (adds a bit of latency)
Addresses are sent only to the necessary processors
Can combine the two: e.g. Sun E15k
Symmetric Multi Processor
[Figure: several CPUs and memory connected by a SHARED BUS, e.g. SunFire 6800]
Remember: uniform memory architecture
Snooping Protocol
Bus-based scheme
Processors actively check for bus activity (`snoop' – imagine a nosey neighbour!)
Usually each cache line will have a set of tags to enable efficient snooping
Can't scale to a very large number of processors
Restricted by broadcast bandwidth requirements
Many possible protocols – we'll consider MESI
MESI Protocol
4 states associated with the bits in the cache line tag:
M = modified (dirty): no other cache owns this line; incoherent with memory
E = exclusive: line is coherent with memory, and held in only one cache
S = shared: line is coherent with memory, but may be held in many caches
I = invalid: line is not cached
Goodman 1983
Examples of state changes (sketched in code below)
Read with intent to modify:
If the address matches an S or E line then change to I (you are going to modify, so all other copies are invalidated)
If the address matches an M line then the modified line must be written back to memory before proceeding (or maybe passed directly)
Read:
If the address matches an S line, no change. If the line is E, it must change to S. If the line is M it must be written back to memory and changed to S (or again it may be passed directly, changing its own state to S)
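A toy C sketch of the snooping-side transitions just described, for a single cache line observing bus traffic from another CPU (illustrative only; a real controller also handles its own CPU's loads/stores and the data transfers themselves):

typedef enum { M, E, S, I } mesi_state;

/* Another CPU issues a "read with intent to modify" for this line */
mesi_state on_remote_rwitm(mesi_state s) {
    if (s == M) {
        /* write the dirty line back (or pass it directly), then invalidate */
    }
    return I;                      /* S, E and M all end up invalid */
}

/* Another CPU issues a plain read for this line */
mesi_state on_remote_read(mesi_state s) {
    switch (s) {
    case M: /* write back (or pass directly), then share */ return S;
    case E: return S;              /* no longer exclusive */
    case S: return S;              /* no change */
    default: return I;             /* we don't hold the line; nothing to do */
    }
}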
Other snoopy protocols
Illinois, Berkeley, Synapse, Firefly, "Write once"
See Patterson's Berkeley CS lectures (online)
Machine Example
Sun Enterprise 10000 (old machine now)
Up to 64 processors
Need to interleave 4 buses to achieve the necessary bandwidth
Snoop broadcast every other cycle
Non-Uniform Shared Memory
[Figure: network + directory system that keeps track of where memory is]
Also known as NUMA for Non-Uniform Memory Access (e.g. Altix)
Directory-based schemes
Significantly reduces bandwidth requirements by using point-to-point messages
Directories distributed with memory regions
Combined with memory controller
Total storage requirement: number of memory blocks * Ncpu (a rough worked example is given below)
Can require up to 15% of total memory on large systems
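As a rough illustration (assuming the requirement above means one presence bit per CPU per cache-line-sized memory block, an assumption rather than a figure from the lecture): with 64 CPUs and 64-byte blocks the directory needs 64 bits = 8 bytes per 64-byte block, i.e. about 12.5% of total memory, the same order of magnitude as the ~15% quoted above.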
Directories in action
Modify shared data:
Establish exclusive ownership via the directory
Controller must notify all owners of this cache line
When complete, the modification may be performed
Reading modified data:
Again send request to the directory
True owner is determined and the value is routed to the requesting node
Machine Example: SGI Altix
Developed from the Origin systems, which were developed from the Stanford DASH prototype
Ring and fat-tree topologies
Summary Part 2
Cache coherency is key in the design of shared memory systems
Cache coherency usually implements a model of sequential consistency
Snoopy protocols are simple, elegant, but bandwidth limited
Directory based protocols scale better
Part 3: OpenMP I
Origins, what does it specify?
Parallel do loops
Pragmas
What is OpenMP?
OpenMP is a pragma based API that provides a simple extension to C/C++ and FORTRAN
It is exclusively designed for shared memory programming
However, some vendors (Intel) are developing virtual shared memory compilers that will support OpenMP
Ultimately, OpenMP is a very simple interface to threads based programming
Components of OpenMP
Directives (pragmas)
Runtime library routines
Environment variables
OpenMP: Where did it come from?
Prior to 1997, vendors all had their own proprietary shared memory programming commands
Programs were not portable from one SMP to another
Researchers were calling for some kind of portability
The ANSI X3H5 (1994) proposal tried to formalize a shared memory standard – but ultimately failed
OpenMP (1997) worked because the vendors got behind it and there was new growth in the shared memory arena
OpenMP Architecture Review Board
HP, Intel, IBM, SGI, Sun, Fujitsu, Portland Compiler Group, US DoE ASCI Program
OpenMP is heavily steered by vendors
http://www.openmp.org
Bottom line
For OpenMP one only has to worry about parallelism of work
The global address space enabled by shared memory is the reason for this
In MPI one has to worry both about parallelism of the work and also the placement of data
Data movement is what makes MPI codes so much longer – it can be highly non-trivial
A few numbers

Interconnect technology   Latency (µs)   Price ($/port)
Gigabit Ethernet          30-50          50
10Gb Ethernet             10             5000 (!)
Myrinet                   6              1500
Infiniband                6-7            1500
Quadrics                  3              2500
NUMAflex                  0.5-2          10000?

(Gigabit Ethernet through Quadrics are used in distributed memory systems; NUMAflex in shared memory systems.)
No surprise GigE is the dominant interconnect – it is by far the cheapest!!!
Loop Level Parallelism
Consider the single precision vector add-multiply operation Y=aX+Y ("SAXPY")

FORTRAN:
do i=1,n
  Y(i)=a*X(i)+Y(i)
end do

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

C/C++:
for (i=1;i<=n;++i) {
  Y[i]+=a*X[i];
}

#pragma omp parallel for \
  private(i) shared(X,Y,n,a)
for (i=1;i<=n;++i) {
  Y[i]+=a*X[i];
}
In more detail

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

C$OMP PARALLEL DO denotes this is a region of code for parallel execution
DEFAULT(NONE) is good programming practice: you must declare the nature of all variables
Thread PRIVATE variables: each thread must have its own copy of this variable (in this case i is the only private variable)
Thread SHARED variables: all threads can access these variables, but must not update individual memory locations simultaneously
These are comment pragmas for FORTRAN – the ampersand is necessary for continuation
A quick note
To be fully lexically correct you may want to include a C$OMP END PARALLEL DO
In f90 programs use !$OMP as the sentinel
First Steps
Loop level parallelism is the simplest and easiest way to use OpenMP
It allows you to slowly build up parallelism within your application
However, not all loops are immediately parallelizable due to data dependencies or race conditions
Accumulations
Consider the following loop:

a=0.0
do i=1,n
  a=a+X(i)
end do

It apparently has a data dependency – however each thread can sum values of a independently
OpenMP provides an explicit interface for this kind of operation ("REDUCTION"); a sketch is given below
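For reference, a hedged C/OpenMP sketch of the same accumulation using the REDUCTION clause (the Fortran syntax is covered later in this part):

/* Sum the n elements of x in parallel: each thread accumulates a private
   partial sum, and the partial sums are combined automatically at the end. */
double sum_array(const double *x, int n) {
    double a = 0.0;
    int i;
    #pragma omp parallel for private(i) shared(x, n) reduction(+:a)
    for (i = 0; i < n; ++i)
        a += x[i];
    return a;
}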
Another example of a non-parallelizable loop

do i=1,n
  a(i)=b(i)
  c(i)=a(i+1)+d(i)
end do

For this example, running the iterations from n to 1 will produce a different answer
Break it into two parallelizable loops (`loop fission') – one way of doing the split is sketched below
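A hedged sketch of the split, written in C-style 0-indexed notation (the lecture does not give the fissioned code explicitly): the serial semantics, where c(i) uses the old value of a(i+1), are preserved by doing all the reads of a before any of the writes to a, and each loop is then independently parallelizable.

/* Loop fission sketch: a must have at least n+1 elements, as in the original. */
void fissioned(double *a, const double *b, double *c, const double *d, int n) {
    int i;
    for (i = 0; i < n; ++i)        /* reads old a, writes only c */
        c[i] = a[i + 1] + d[i];
    for (i = 0; i < n; ++i)        /* writes a, reads only b */
        a[i] = b[i];
}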
Watch for strides

dependence:
do i=2,5
  a(i)=c*a(i-1)
end do

no dependence:
do i=2,5,2
  a(i)=c*a(i-1)
end do

Strides may remove dependencies – same concept in vectorization
Solutions
Need to ensure Bernstein's conditions: memory reads/writes occur without any overlap
Alternatively, if the access occurs to a single variable, we can use a critical section:

call omp_init_lock(lckx)
do i=1,n
  **work**
  call omp_set_lock(lckx)
  a=a+1.
  call omp_unset_lock(lckx)
end do

do i=1,n
  **work**
C$OMP CRITICAL(lckx)
  a=a+1.
C$OMP END CRITICAL(lckx)
end do
ATOMIC
If all you want to do is ensure the correct update of one variable you can use the atomic update facility:

C$OMP PARALLEL DO
      do i=1,n
        **work**
C$OMP ATOMIC
        a=a+1.
      end do

Exactly the same as a critical section around one single update point

Can be inefficient
If other threads are waiting to enter the critical section then the program may even degenerate to a serial code!
Make sure there is much more work outside the locked region than inside it!
[Figure: a parallel section where each thread waits for the lock before being able to proceed – a complete disaster; most of the time is spent waiting for the lock rather than doing work]
Requirements for parallel loops
To divide up the work the compiler needs to know the number of iterations to be executed – the trip count must be computable
DO WHILE is not parallelizable
The loop can only have one exit point – therefore BREAKs or GOTOs are not allowed
The Parallel Do Pragmas
So far we've considered a small subset of functionality
Besides PRIVATE and SHARED variables there are a number of other clauses that can be applied to parallel do loops

Loop Level Parallelism in more detail
For each parallel do (for) pragma, the following clauses are possible:
FORTRAN: PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, ORDERED, SCHEDULE, COPYIN, DEFAULT
C/C++: private, shared, firstprivate, lastprivate, reduction, ordered, schedule, copyin
SHARED and PRIVATE
Most commonly used directives, which are necessary to ensure correct execution
PRIVATE: any variable declared as private will be local only to a given thread and is inaccessible to others (it is also uninitialized)
SHARED: any variable declared as shared will be accessible by all other threads of execution
Example
The SHARED and PRIVATE specifications can be long:

C$OMP& PRIVATE(icb,icol,izt,iyt,icell,iz_off,iy_off,ibz,
C$OMP& iby,ibx,i,rxadd,ryadd,rzadd,inx,iny,inz,nb,nebs,ibrf,
C$OMP& nbz,nby,nbx,nbrf,nbref,jnbox,jnboxnhc,idt,mdt,iboxd,
C$OMP& dedge,idir,redge,is,ie,twoh,dosph,rmind,in,ixyz,
C$OMP& redaughter,Ustmp,ngpp,hpp,vpp,apps,epp,hppi,hpp2,
C$OMP& rh2,hpp2i,hpp3i,hpp5i,dpp,divpp,dcvpp,nspp,rnspp,
C$OMP& rad2torbin,de1,dosphflag,dosphnb,nbzlow,nbzhigh,nbylow,
C$OMP& nbyhigh,nbxlow,nbxhigh,nbzadd,nbyadd,r3i,r2i,r1i,
C$OMP& dosphnbnb,dogravnb,js,je,j,rad2,rmj,grc,igrc,gfrac,
C$OMP& Gr,hppj,jlist,dx,rdv,rcv,v2,radii2,rbin,ibin,fbin,
C$OMP& wl1,dwl1,drnspp,hppa,hppji,hppj2i,hppj3i,hppj5i,
C$OMP& wl2,dwl2,w,dw,df,dppi,divppr,dcvpp2,dcvppm,divppm,csi,
C$OMP& fi,prhoi2,ispp,frcij,rdotv,hpa,rmuij,rhoij,cij,qij,
C$OMP& frc3,frc4,hcalc,rath,av,frc2,dr1,dr2,dr3,dr12,dr22,dr32,
C$OMP& appg1,appg2,appg3,gdiff,ddiff,d2diff,dv1,dv2,dv3,rpp,
C$OMP& Gro)
Default behaviour
You can actually omit the SHARED and PRIVATE statements – what is the expected behaviour?
Scalars are private by default
Arrays are shared by default
Bad practice in my opinion – specify the types for everything
DEFAULT
I recommend using DEFAULT(NONE) at all times
Forces specification of all variable types
Alternatively, can use DEFAULT(SHARED) or DEFAULT(PRIVATE) to specify that un-scoped variables will default to the particular type chosen
e.g. choosing DEFAULT(PRIVATE) will ensure any un-scoped variable is private
Reduction
This clause deals with parallel versions of the following loops:

do i=1,N
  a=max(a,b(i))
end do

do i=1,N
  a=min(a,b(i))
end do

do i=1,n
  a=a+b(i)
end do

The outcome is determined by a `reduction' over all the values for each thread
e.g. the max over a set is equivalent to the max over the maxes of its subsets: if A = U An then Max(A) = Max(U Max(An))
Examples
Syntax: REDUCTION(OP:variable) where OP = max, min, +, -, * (& logic ops)

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(max:a)
      do i=1,N
        a=max(a,b(i))
      end do

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(min:a)
      do i=1,N
        a=min(a,b(i))
      end do
What is REDUCTION actually doing?
Saving you from writing more code
The reduction clause generates an array of the reduction variables, and each thread is responsible for a certain element in the array
The final reduction over all the array elements (when the loop is finished) is performed transparently to the user
(A manual sketch of this is given below.)
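A hedged C sketch of roughly what the runtime does for REDUCTION(+:a): each thread accumulates into its own slot of an array, and the slots are combined at the end (padding against false sharing and other details are omitted; the function name and the 256-thread cap are illustrative assumptions).

#include <omp.h>

double manual_reduction(const double *x, int n) {
    int nthreads = omp_get_max_threads();
    double partial[256] = {0.0};          /* one slot per thread; assumes <= 256 threads */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();   /* this thread's private slot */
        #pragma omp for
        for (int i = 0; i < n; ++i)
            partial[tid] += x[i];
    }
    double a = 0.0;
    for (int t = 0; t < nthreads; ++t)    /* final reduction, done serially */
        a += partial[t];
    return a;
}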
Initialization
Reduction variables are initialized as follows (from the standard):

Operator   Initialization
+          0
*          1
-          0
MAX        Smallest rep. number
MIN        Largest rep. number
Note
While you can do reductions over arrays (as part of the OpenMP 2.0 standard) this isn't always a great idea
Was brought in to help with support of f90 array syntax
Summary Part 3
OpenMP is strongly driven by vendors and will be around for a long time yet
Evolving API, with useful functionality
Parallelism can be easily exposed via loop level parallelism, but other modes are supported within the language
Most significant part of programming: dealing with race conditions and data dependencies
Next lecture
Shared memory parallelism II
More on OpenMP programming
Iteration scheduling for load balance
Using memory effectively