High Performance Computing – CISC 811
Dr Rob Thacker
Dept of Physics (308A)
thacker@astro
LINPACK numbers
Without optimization a reasonable figure for the 1000×1000 problem is 200-300 Mflops (CPU = Athlon 64, 3200+; Core 2 CPUs should do even better)
Turning on –O2 should close to double this number: Athlon 64, 3200+ gives around 510 Mflops
Optimizations beyond –O2 don't add that much more improvement
Unrolling loops helps (5% improvement) on some platforms but not others
Variable size LINPACK versus HINT
HINT's performance profile can be seen in other programs.
[Plot: performance versus problem size, with distinct regimes for L1, L2 and main memory]
Assignment 1 solutions available
Top 500
Doubling time for the number of CPUs in a system really is ~1.5 years
Example: 2000 SHARCNET places 310 on top 500 with 128 processors
2004(.5) SHARCNET wants to have a higher position; RFP asks for 1,500 processors (a 1.5 year doubling time suggests 1024 CPUs would get them back to 310)
They actually placed at 116 with 1536 processors
'e-waste' anyone?
Quick note on Assignment Q2
Grid overlays particles, weights from all particles accumulate
[Figure: in 2d, a particle contributes weights 1-4 to its nearest grid points]
Today's Lecture: Shared Memory Parallelism I
Part 1: Shared memory programming concepts
Part 2: Shared memory architectures
Part 3: OpenMP I

Part 1: Shared Memory Programming Concepts
Comparison of shared memory versus distributed memory programming paradigms
Administration of threads
Data dependencies
Race conditions
Shared Address Model
Each processor can access every physical memory location in the machine
Each process is aware of all data it shares with other processes
Data communication between processes is implicit: memory locations are updated
Processes are allowed to have local variables that are not visible to other processes
Comparison: MPI vs OpenMP

Feature                                 OpenMP           MPI
Apply parallelism in steps              YES              NO
Scale to large number of processors     MAYBE            YES
Code complexity                         small increase   major increase
Code length increase                    2-80%            30-500%
Runtime environment*                    $$ compilers     FREE
Cost of hardware                        $$$$$$$!         CHEAP

*gcc (& gfortran) now supports OpenMP
Distinctions: Process vs thread
A process is an OS-level task
Operates independently of other processes
Has its own process id, memory area, program counter, registers…
The OS has to create a process control block containing information about the process
Expensive to create
Heavyweight process = task
Sometimes called a heavyweight thread
Thread
A thread (lightweight process) is a subunit of a process
Shares data, code and OS resources with other threads
Controlled by one process, but one process may control more than one thread
Has its own program counter, register set, and stack space
Diagrammatically
[Figure: a single process owns its program counter (PC), code, heap, files, and stack]
Threads
[Figure: threads within one process share the code, heap, and files, but each thread has its own PC and stack]
Kernel versus User level threads
Threads can be implemented via two main mechanisms (or combinations of both)
Kernel level threads: natively supported by the OS
Changing between different threads still requires action from the OS, which adds a significant amount of time
Still not as bad as changing between tasks though
"Middleweight threads"
User-level threads: implemented via a library (e.g. POSIX threads); a minimal sketch is given below
All issues of control are handled by the task rather than the OS
"Lightweight threads" (can be switched very quickly)
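As a concrete illustration, here is a minimal, hedged sketch of using the POSIX threads library: two threads fill disjoint halves of a shared array and are then joined (the file and function names are illustrative, not from the lecture).

/* Minimal POSIX threads sketch: create two threads, each fills half of a
   shared array, then join them.  Compile with something like: cc demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define N 8
static double y[N];              /* shared data: visible to all threads */

static void *fill_half(void *arg) {
    int start = *(int *)arg;     /* each thread gets its own start index */
    for (int i = start; i < start + N / 2; ++i)
        y[i] = 2.0 * i;          /* disjoint ranges, so no race on y[] */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int s1 = 0, s2 = N / 2;
    pthread_create(&t1, NULL, fill_half, &s1);
    pthread_create(&t2, NULL, fill_half, &s2);
    pthread_join(t1, NULL);      /* wait for both threads to finish */
    pthread_join(t2, NULL);
    printf("y[N-1] = %g\n", y[N - 1]);
    return 0;
}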
Thread Hierarchy
[Figure: hierarchy of process (task), kernel-level thread, and user-level thread, with approximate times to change between contexts: user-level thread ~10 us, kernel-level thread ~100 us, process ~1 ms]
Interprocess communication (IPC)
Despite tasks having their own data space, there are methods for communicating between them
"System V shared memory segments" is the most well known
Shared regions are created & attached to by issuing specific commands (shmget, shmat); a minimal sketch is given below
Still need to use a mechanism to create processes that share this region
Used in SMP version of GAUSSIAN
See section 8.4 of Wilkinson & Allen, Parallel Programming
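A hedged, self-contained C sketch of the shmget/shmat calls (error checking omitted; not code from the lecture): the parent creates a segment, fork()s, and both processes see the same memory.

/* System V shared memory sketch: parent and child share one int. */
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    /* create a private segment big enough for one int */
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    int *shared = (int *)shmat(shmid, NULL, 0);   /* attach it */
    *shared = 0;

    if (fork() == 0) {             /* child: update the shared location */
        *shared = 42;
        shmdt(shared);
        _exit(0);
    }
    wait(NULL);                    /* parent: wait, then read the child's update */
    printf("child wrote %d\n", *shared);
    shmdt(shared);
    shmctl(shmid, IPC_RMID, NULL); /* remove the segment */
    return 0;
}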
Threads based execution
Serial execution, interspersed with parallel
[Figure: a master thread executes the serial sections; additional threads join it for each parallel section]
In practice many compilers block execution of the extra threads during serial sections; this saves the overhead of the `fork-join' operation
Segue: UNIX fork()
The UNIX system call fork() creates a child process that is an exact copy of the calling process, with a unique process ID (wait() joins)
All of the parent's variables are copied into the child's data space

pid = fork();
  .
  code to be run by pair
  .
if (pid==0) exit(0); else wait(0);

fork() returns 0 in the child, so pid==0 selects the child's branch
Used to create processes for SysV shared memory segments
Not limited to work replication:

pid = fork();
if (pid==0) {
  .
  code to be run by servant
  .
} else {
  .
  code to be run by master
  .
}
if (pid==0) exit(0); else wait(0);

Remember: all the variables in the original process are duplicated (a self-contained version is given below)
MPMD (multiple program, multiple data)
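A self-contained C version of the same master/servant pattern (the printed messages are purely illustrative):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    pid_t pid = fork();            /* returns 0 in the child, child PID in the parent */
    if (pid == 0) {
        printf("servant: pid %d\n", (int)getpid());
    } else {
        printf("master: created child %d\n", (int)pid);
    }
    if (pid == 0) _exit(0);        /* child exits ...            */
    wait(NULL);                    /* ... and the parent joins it */
    return 0;
}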
OpenMP Programs
OpenMP programs use a simple API to create threads based code
At their simplest they simply divide up iterations of data parallel loops among threads
What is actually happening in the threads? 4 threads, n=400
Thread 1: i=1,100; updates Y(1:100), reads X(1:100)
Thread 2: i=101,200; updates Y(101:200), reads X(101:200)
Thread 3: i=201,300; updates Y(201:300), reads X(201:300)
Thread 4: i=301,400; updates Y(301:400), reads X(301:400)
(A sketch of a loop of this kind is given below.)
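For concreteness, a C/OpenMP sketch of the kind of loop this decomposition corresponds to (the loop body is hypothetical; indexing here is 0-based, whereas the slide's Fortran ranges run 1-400):

#define N 400
double X[N], Y[N];

void update(void) {
    int i;
    /* With 4 threads and the usual static schedule, iterations 0-99, 100-199,
       200-299 and 300-399 go to threads 1-4 respectively; each thread reads
       its own slice of X and updates its own slice of Y. */
    #pragma omp parallel for private(i) shared(X, Y)
    for (i = 0; i < N; ++i)
        Y[i] = Y[i] + 2.0 * X[i];   /* hypothetical update */
}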
Race Conditions
A common operation is to resolve a spatial position into an array index: consider the following loop
Looks innocent enough – but suppose two particles have the same positions…

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(r,A)
      do i=1,n
        j=int(r(i))
        A(j)=A(j)+1.
      end do

r(): array of particle positions
A(): array that is modified using information from r()
Race Conditions: A concurrency problem
Two different threads of execution can concurrently attempt to update the same memory location
Start: A(j)=1.
Thread 1: gets A(j)=1., adds 1., puts A(j)=2.
Thread 2: gets A(j)=1., adds 1., puts A(j)=2.
End state: A(j)=2. – INCORRECT (two increments of 1. should have produced 3.)
Dealing with Race Conditions
Need a mechanism to ensure updates to single variables occur within a critical section
Any thread entering a critical section blocks all others
Critical sections can be established by using:
Lock variables (single bit variables)
Semaphores (Dijkstra 1968)
Monitor procedures (Hoare 1974, used in Java)

Simple (spin) lock usage (schematic; see the note below):

do while (lock.eq.1)
  spin
end do
lock=1
  .
Critical section
  .
lock=0
Access is serialized
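One caveat worth adding: as written, the spin-lock pseudocode is only schematic, because two threads could both observe lock==0 and enter the critical section together. A real spin lock needs an atomic test-and-set; a minimal C11 sketch (illustrative, not the lecture's code):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void enter_critical(void) {
    /* atomically set the flag; spin while it was already set */
    while (atomic_flag_test_and_set(&lock))
        ;                          /* spin */
}

void leave_critical(void) {
    atomic_flag_clear(&lock);      /* release the lock */
}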
Deadlocks: The pitfall of locking
Must ensure a situation is not created where requests in possession create a deadlock
Nested locks are a classic example of this
Can also create the problem with multiple processes – the `deadly embrace'
[Figure: Process 1 holds Resource 1 and requests Resource 2, while Process 2 holds Resource 2 and requests Resource 1 – neither can proceed]
Data Dependencies
Suppose you try to parallelize the following loop
It won't work as written, since iteration i depends upon iteration i-1 and thus we can't start anything in parallel

c=0.0
do i=1,n
  c=c+1.0
  Y(i)=c
end do
Simple solution
This loop can easily be re-written in a way that can be parallelized:
There is no longer any dependence on the previous operation
Private variables: i; Shared variables: Y(), c, n

c=0.0
do i=1,n
  Y(i)=c+i
end do
c=c+n
Types of Data Dependencies
Suppose we have operations O1, O2
True dependence: O2 has a true dependence on O1 if O2 reads a value written by O1
Anti dependence: O2 has an anti-dependence on O1 if O2 writes a value read by O1
Output dependence: O2 has an output dependence on O1 if O2 writes a variable written by O1
Examples
True dependence:
A1=A2+A3
B1=A1+B2

Anti-dependence:
B1=A1+B2
A1=C2

Output dependence:
B1=5
B1=2
Bernstein's Conditions
Set of conditions that are sufficient to determine whether two threads can be executed simultaneously
Ii: set of memory locations read by thread Pi
Oj: set of memory locations altered by thread Pj
For threads 1 & 2 to be concurrent:
I1∩O2=Ø (input of 1 cannot intersect output of 2)
I2∩O1=Ø (input of 2 cannot intersect output of 1)
O1∩O2=Ø (outputs cannot intersect)
Bernstein, 1966, IEEE Trans. Elec. Comp., Vol E-15, pp. 746
Example
Consider two threads: a=x+y, b=x+z
Inputs: I1=(x,y), I2=(x,z)
Outputs: O1=a, O2=b
All conditions are satisfied:
I1∩O2=Ø
I2∩O1=Ø
O1∩O2=Ø
Forms the basis for auto-parallelizing compilers: the difficult part is determining access at compile time
Dealing with Data Dependencies
Any loop where iterations depend upon the previous one has a potential problem
Any result which depends upon the order of the iterations will be a problem
Good first test of whether something can be parallelized: reverse the loop iteration order
Not all data dependencies can be eliminated
Accumulations of variables (e.g. sum of elements in an array) can be dealt with easily
Summary Part 1
The concurrent nature of shared memory programming entails dealing with two key issues:
Race conditions
Data dependencies
Race conditions can be (partially) solved by locking
Dealing with data dependencies frequently involves algorithmic changes
Part 2: Shared Memory Architectures
Consistency models
Cache coherence
Snoopy versus directory based coherence
Shared Memory design issues
Race conditions cannot be avoided on shared memory architectures; they must be dealt with at the programming level
However, updates to shared variables must be propagated to other threads
Variable values must be propagated through the machine via hardware
For multiple CPUs with caches this is non-trivial – let's look at a single CPU first
Caching Architectures
In cache-based architectures we will often have two copies of a variable: one in main memory and one in cache
How do we deal with memory updates if there are two copies?
Two options: either try and keep the main memory in sync with the cache (hard) or wait to update (i.e. send bursts/packets of updates)
Write-through Cache
When a cache update occurs the result is immediately written back up to main memory
Advantage: consistent picture of memory
Disadvantage: uses up a lot of memory bandwidth
Disadvantage: writes to main memory are even slower than reads!
Write-back Cache
Wait until a certain number of updates have been made and then write to main memory
Mark a new cache line as being `clean'
When modified an entry becomes `dirty'
Clean values can be thrown away; dirty values must be written back to memory
Disadvantage: memory is not consistent
Advantage: overcomes the problem of waiting for main memory
Inclusive Caching
Modern systems often have multiple layers of cache
However, each cache has only one parent (but may have more than one child)
Only the parent and children communicate directly
Inclusion: everything held in a child cache is also held in its parent (e.g. the contents of L1 are a subset of L2)
[Figure: four CPUs, each with an L1 cache; pairs of L1 caches share a Level 2 cache, and the two Level 2 caches share a Level 3 cache]
Note not all systems are built this way – exclusive caching (Opteron) does not have this relationship
Common Architecture
Very common to have the L1 cache as write-through to the L2 cache
L2 cache is then write-back to main memory
For this inclusive architecture increasing the size of the L2 can have significant performance gains
Multiprocessor design issue: Cache (line) Coherency
[Figure: 2 processors, each with an individual write-back cache, above a shared main memory]
CPU 1 changes a value: its cache is now dirty but the change is not written to main memory
Processor 2 requests the same location from main memory and receives the stale value
The caches do not agree: coherency is lost
End state differences
Write-back cache: both memory and the second cache hold stale values (until dirty cache entries are written)
Write-through cache: only the second cache is stale
Cannot avoid the issue of memory being read before the update occurs
Need an additional mechanism to preserve some kind of cache coherency
Segue: Consistency models
Strictly Consistent model
Intuitive idea of what memory should be
Any read to a memory location X returns the value stored by the most recent write operation to X

P1: W(x)1
P2:        R(x)1    R(x)1

Gharachorloo, Lenoski, Laudon, Gibbons, Gupta, Hennessy, 1990, in Proceedings 17th International Symposium on Computer Architecture
Couple more examples

Allowed (the read of 0 happens before the write):
P1:          W(x)1
P2: R(x)0           R(x)1

Not Allowed (the read of 0 happens after the write):
P1: W(x)1
P2:        R(x)0    R(x)1
Sequential Consistency
Lamport 79:
"A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
Note the order of execution is not specified under this model
Best explained via example
Results under SC
Two threads, starting from X=Y=0:
Thread 1: R1=X, then Y=1
Thread 2: R2=Y, then X=1
Is it possible to have R1=R2=1 under sequential consistency?
Instruction order allowed under SC
R1=X(=0), R2=Y(=0), Y=1, X=1  →  R1=0, R2=0
R1=X(=0), R2=Y(=0), X=1, Y=1  →  R1=0, R2=0
R1=X(=0), Y=1, R2=Y(=1), X=1  →  R1=0, R2=1
R2=Y(=0), R1=X(=0), X=1, Y=1  →  R1=0, R2=0
R2=Y(=0), R1=X(=0), Y=1, X=1  →  R1=0, R2=0
R2=Y(=0), X=1, R1=X(=1), Y=1  →  R1=1, R2=0
NO! – no SC-ordered execution can produce R1=R2=1
SC doesn't save you from data races
This is an important point!
Even though you have SC, the operations from the threads can still be ordered differently in the global execution
However, SC does help with some situations:
Start: X=Y=0
Thread 1: R1=X; if (R1>0) Y=1
Thread 2: R2=Y; if (R2>0) X=1
Race free
Other memory models exist with weaker requirements
Causality is lost!
P1: W(x)1
P2:        W(x)2
P3:               R(x)2   R(x)1
P4:               R(x)2   R(x)1
Allowed result under SC, because we can reorder the operations into the single sequence W(x)2, R(x)2, R(x)2, W(x)1, R(x)1, R(x)1
SC is still restrictive
P1: W(x)1
P2:        W(x)2
P3:               R(x)2   R(x)1
P4:               R(x)1   R(x)2
Not allowed under SC – too many changes of x would be needed to be possible
Remember: operations on a given variable must be interleaved to give a total ordering
(Weaker) Consistency Models
Causal Consistency
Operations that are causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different threads.
"If one event (B) is caused by another (A) then all threads must observe A to happen before B"
Examples

Not causally consistent:
P1: W(x)1
P2: R(x)1   W(x)2
P3:         R(x)2   R(x)1
P4:         R(x)1   R(x)2

This is causally consistent:
P1: W(x)1
P2: W(x)2
P3: R(x)2   R(x)1
P4: R(x)1   R(x)2
FIFO Consistency
Writes performed by a single thread are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes

Example
P1: W(x)1
P2: R(x)1   W(x)2   W(x)3
P3:         R(x)2   R(x)1   R(x)3
P4:         R(x)1   R(x)2   R(x)3
Under FIFO consistency this is a valid sequence of events
Implementing Sequential Consistency
Sequential consistency is the most strict practical model (strict consistency requires a global clock)
To implement it requires that the CPU caches all follow a cache coherency protocol
Requires a mechanism to keep track of cache lines across all CPUs
Two main mechanisms:
Snoopy protocol
Directory-based
Cache Coherency Protocol
Need to ensure that before a memory location is written all other copies are invalidated
Allows multiple copies of memory to exist when being read
Forces only one copy to exist when being written
Cache Coherency: Drawbacks
Coherency only occurs at the cache-line level
Access to different words within a cache-line will be serialized (`false sharing' – see later)
Cache coherency says nothing about variables promoted to registers (although one hopes these wouldn't be shared)
You have to be aware of these issues when programming
BUT! Most importantly
Cache coherency does not save you from all race conditions!!!!
Mechanisms to achieve CC
Broadcast (Snoopy)
Processors send results everywhere
Requires a shared bus for low latency
Point-to-point (Directory)
Directory keeps track of "interest" in each cache line (adds a bit of latency)
Addresses are sent only to the necessary processors
Can combine the two: e.g. Sun E15k
Symmetric Multi Processor
[Figure: several CPUs and memory connected by a SHARED BUS, e.g. SunFire 6800]
Remember: uniform memory architecture
Snooping Protocol
Bus-based scheme
Processors actively check for bus activity (`snoop' – imagine a nosey neighbour!)
Usually each cache line will have a set of tags to enable efficient snooping
Can't scale to a very large number of processors
Restricted by broadcast bandwidth requirements
Many possible protocols – we'll consider MESI
MESI Protocol
4 states associated with the bits in the cache line tag:
M = modified (dirty): no other cache owns this line; incoherent with memory
E = exclusive: line is coherent with memory, and held in only one cache
S = shared: line is coherent with memory, but may be held in many caches
I = invalid: line is not cached
Goodman 1983
Examples of state changes (sketched in code below)
Read with intent to modify:
If the address matches an S or E line then change to I (you are going to modify, so all other copies are invalidated)
If the address matches an M line then the modified line must be written back to memory before proceeding (or maybe passed directly)
Read:
If the address matches an S line, no change. If the line is E, it must change to S. If the line is M it must be written back to memory and changed to S (or again it may be passed directly, changing its own state to S)
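A toy C sketch of the snooping-side transitions just described, for a single cache line observing bus traffic from another CPU (illustrative only; a real controller also handles its own CPU's loads/stores and the data transfers themselves):

typedef enum { M, E, S, I } mesi_state;

/* Another CPU issues a "read with intent to modify" for this line */
mesi_state on_remote_rwitm(mesi_state s) {
    if (s == M) {
        /* write the dirty line back (or pass it directly), then invalidate */
    }
    return I;                      /* S, E and M all end up invalid */
}

/* Another CPU issues a plain read for this line */
mesi_state on_remote_read(mesi_state s) {
    switch (s) {
    case M: /* write back (or pass directly), then share */ return S;
    case E: return S;              /* no longer exclusive */
    case S: return S;              /* no change */
    default: return I;             /* we don't hold the line; nothing to do */
    }
}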
Other snoopy protocols
Illinois, Berkeley, Synapse, Firefly, "Write once"
See Patterson's Berkeley CS lectures (online)
Machine Example
Sun Enterprise 10000 (old machine now)
Up to 64 processors
Need to interleave 4 buses to achieve the necessary bandwidth
Snoop broadcast every other cycle
Non-Uniform Shared Memory
[Figure: network + directory system that keeps track of where memory is]
Also known as NUMA for Non-Uniform Memory Access (e.g. Altix)
Directory-based schemes
Significantly reduces bandwidth requirements by using point-to-point messages
Directories distributed with memory regions
Combined with memory controller
Total storage requirement: number of memory blocks * Ncpu (a rough worked example is given below)
Can require up to 15% of total memory on large systems
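As a rough illustration (assuming the requirement above means one presence bit per CPU per cache-line-sized memory block, an assumption rather than a figure from the lecture): with 64 CPUs and 64-byte blocks the directory needs 64 bits = 8 bytes per 64-byte block, i.e. about 12.5% of total memory, the same order of magnitude as the ~15% quoted above.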
Directories in action
Modify shared data:
Establish exclusive ownership via the directory
Controller must notify all owners of this cache line
When complete, the modification may be performed
Reading modified data:
Again send request to the directory
True owner is determined and the value is routed to the requesting node
Machine Example: SGI Altix
Developed from the Origin systems, which were developed from the Stanford DASH prototype
Ring and fat-tree topologies
Summary Part 2
Cache coherency is key in the design of shared memory systems
Cache coherency usually implements a model of sequential consistency
Snoopy protocols are simple, elegant, but bandwidth limited
Directory based protocols scale better
Part 3: OpenMP I
Origins, what does it specify?
Parallel do loops
Pragmas
What is OpenMP?
OpenMP is a pragma based API that provides a simple extension to C/C++ and FORTRAN
It is exclusively designed for shared memory programming
However, some vendors (Intel) are developing virtual shared memory compilers that will support OpenMP
Ultimately, OpenMP is a very simple interface to threads based programming
Components of OpenMP
Directives (pragmas)
Runtime library routines
Environment variables
OpenMP: Where did it come from?
Prior to 1997, vendors all had their own proprietary shared memory programming commands
Programs were not portable from one SMP to another
Researchers were calling for some kind of portability
The ANSI X3H5 (1994) proposal tried to formalize a shared memory standard – but ultimately failed
OpenMP (1997) worked because the vendors got behind it and there was new growth in the shared memory arena
OpenMP Architecture Review Board
HP, Intel, IBM, SGI, Sun, Fujitsu, Portland Compiler Group, US DoE ASCI Program
OpenMP is heavily steered by vendors
http://www.openmp.org
Bottom line
For OpenMP one only has to worry about parallelism of work
The global address space enabled by shared memory is the reason for this
In MPI one has to worry both about parallelism of the work and also the placement of data
Data movement is what makes MPI codes so much longer – it can be highly non-trivial
A few numbers

Interconnect technology   Latency (µs)   Price ($/port)
Gigabit Ethernet          30-50          50
10Gb Ethernet             10             5000 (!)
Myrinet                   6              1500
Infiniband                6-7            1500
Quadrics                  3              2500
NUMAflex                  0.5-2          10000?

(Gigabit Ethernet through Quadrics are used in distributed memory systems; NUMAflex in shared memory systems.)
No surprise GigE is the dominant interconnect – it is by far the cheapest!!!
Loop Level Parallelism
Consider the single precision vector add-multiply operation Y=aX+Y ("SAXPY")

FORTRAN:
do i=1,n
  Y(i)=a*X(i)+Y(i)
end do

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

C/C++:
for (i=1;i<=n;++i) {
  Y[i]+=a*X[i];
}

#pragma omp parallel for \
  private(i) shared(X,Y,n,a)
for (i=1;i<=n;++i) {
  Y[i]+=a*X[i];
}
In more detail

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

C$OMP PARALLEL DO denotes this is a region of code for parallel execution
DEFAULT(NONE) is good programming practice: you must declare the nature of all variables
Thread PRIVATE variables: each thread must have its own copy of this variable (in this case i is the only private variable)
Thread SHARED variables: all threads can access these variables, but must not update individual memory locations simultaneously
These are comment pragmas for FORTRAN – the ampersand is necessary for continuation
A quick note
To be fully lexically correct you may want to include a C$OMP END PARALLEL DO
In f90 programs use !$OMP as the sentinel
First Steps
Loop level parallelism is the simplest and easiest way to use OpenMP
It allows you to slowly build up parallelism within your application
However, not all loops are immediately parallelizable due to data dependencies or race conditions
Accumulations
Consider the following loop:

a=0.0
do i=1,n
  a=a+X(i)
end do

It apparently has a data dependency – however each thread can sum values of a independently
OpenMP provides an explicit interface for this kind of operation ("REDUCTION"); a sketch is given below
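For reference, a hedged C/OpenMP sketch of the same accumulation using the REDUCTION clause (the Fortran syntax is covered later in this part):

/* Sum the n elements of x in parallel: each thread accumulates a private
   partial sum, and the partial sums are combined automatically at the end. */
double sum_array(const double *x, int n) {
    double a = 0.0;
    int i;
    #pragma omp parallel for private(i) shared(x, n) reduction(+:a)
    for (i = 0; i < n; ++i)
        a += x[i];
    return a;
}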
Another example of a non-parallelizable loop

do i=1,n
  a(i)=b(i)
  c(i)=a(i+1)+d(i)
end do

For this example, running the iterations from n to 1 will produce a different answer
Break it into two parallelizable loops (`loop fission') – one way of doing the split is sketched below
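A hedged sketch of the split, written in C-style 0-indexed notation (the lecture does not give the fissioned code explicitly): the serial semantics, where c(i) uses the old value of a(i+1), are preserved by doing all the reads of a before any of the writes to a, and each loop is then independently parallelizable.

/* Loop fission sketch: a must have at least n+1 elements, as in the original. */
void fissioned(double *a, const double *b, double *c, const double *d, int n) {
    int i;
    for (i = 0; i < n; ++i)        /* reads old a, writes only c */
        c[i] = a[i + 1] + d[i];
    for (i = 0; i < n; ++i)        /* writes a, reads only b */
        a[i] = b[i];
}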
Watch for strides

dependence:
do i=2,5
  a(i)=c*a(i-1)
end do

no dependence:
do i=2,5,2
  a(i)=c*a(i-1)
end do

Strides may remove dependencies – same concept in vectorization
Solutions
Need to ensure Bernstein's conditions: memory reads/writes occur without any overlap
Alternatively, if the access occurs to a single variable, we can use a critical section:

call omp_init_lock(lckx)
do i=1,n
  **work**
  call omp_set_lock(lckx)
  a=a+1.
  call omp_unset_lock(lckx)
end do

do i=1,n
  **work**
C$OMP CRITICAL(lckx)
  a=a+1.
C$OMP END CRITICAL(lckx)
end do
ATOMIC
If all you want to do is ensure the correct update of one variable you can use the atomic update facility:

C$OMP PARALLEL DO
      do i=1,n
        **work**
C$OMP ATOMIC
        a=a+1.
      end do

Exactly the same as a critical section around one single update point

Can be inefficient
If other threads are waiting to enter the critical section then the program may even degenerate to a serial code!
Make sure there is much more work outside the locked region than inside it!
[Figure: a parallel section where each thread waits for the lock before being able to proceed – a complete disaster; most of the time is spent waiting for the lock rather than doing work]
Requirements for parallel loops
To divide up the work the compiler needs to know the number of iterations to be executed – the trip count must be computable
DO WHILE is not parallelizable
The loop can only have one exit point – therefore BREAKs or GOTOs are not allowed
The Parallel Do Pragmas
So far we've considered a small subset of functionality
Besides PRIVATE and SHARED variables there are a number of other clauses that can be applied to parallel do loops

Loop Level Parallelism in more detail
For each parallel do (for) pragma, the following clauses are possible:
FORTRAN: PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, ORDERED, SCHEDULE, COPYIN, DEFAULT
C/C++: private, shared, firstprivate, lastprivate, reduction, ordered, schedule, copyin
SHARED and PRIVATE
Most commonly used directives, which are necessary to ensure correct execution
PRIVATE: any variable declared as private will be local only to a given thread and is inaccessible to others (it is also uninitialized)
SHARED: any variable declared as shared will be accessible by all other threads of execution
Example
The SHARED and PRIVATE specifications can be long:

C$OMP& PRIVATE(icb,icol,izt,iyt,icell,iz_off,iy_off,ibz,
C$OMP& iby,ibx,i,rxadd,ryadd,rzadd,inx,iny,inz,nb,nebs,ibrf,
C$OMP& nbz,nby,nbx,nbrf,nbref,jnbox,jnboxnhc,idt,mdt,iboxd,
C$OMP& dedge,idir,redge,is,ie,twoh,dosph,rmind,in,ixyz,
C$OMP& redaughter,Ustmp,ngpp,hpp,vpp,apps,epp,hppi,hpp2,
C$OMP& rh2,hpp2i,hpp3i,hpp5i,dpp,divpp,dcvpp,nspp,rnspp,
C$OMP& rad2torbin,de1,dosphflag,dosphnb,nbzlow,nbzhigh,nbylow,
C$OMP& nbyhigh,nbxlow,nbxhigh,nbzadd,nbyadd,r3i,r2i,r1i,
C$OMP& dosphnbnb,dogravnb,js,je,j,rad2,rmj,grc,igrc,gfrac,
C$OMP& Gr,hppj,jlist,dx,rdv,rcv,v2,radii2,rbin,ibin,fbin,
C$OMP& wl1,dwl1,drnspp,hppa,hppji,hppj2i,hppj3i,hppj5i,
C$OMP& wl2,dwl2,w,dw,df,dppi,divppr,dcvpp2,dcvppm,divppm,csi,
C$OMP& fi,prhoi2,ispp,frcij,rdotv,hpa,rmuij,rhoij,cij,qij,
C$OMP& frc3,frc4,hcalc,rath,av,frc2,dr1,dr2,dr3,dr12,dr22,dr32,
C$OMP& appg1,appg2,appg3,gdiff,ddiff,d2diff,dv1,dv2,dv3,rpp,
C$OMP& Gro)
Default behaviour
You can actually omit the SHARED and PRIVATE statements – what is the expected behaviour?
Scalars are private by default
Arrays are shared by default
Bad practice in my opinion – specify the types for everything
DEFAULT
I recommend using DEFAULT(NONE) at all times
Forces specification of all variable types
Alternatively, can use DEFAULT(SHARED) or DEFAULT(PRIVATE) to specify that un-scoped variables will default to the particular type chosen
e.g. choosing DEFAULT(PRIVATE) will ensure any un-scoped variable is private
Reduction
This clause deals with parallel versions of the following loops:

do i=1,N
  a=max(a,b(i))
end do

do i=1,N
  a=min(a,b(i))
end do

do i=1,n
  a=a+b(i)
end do

The outcome is determined by a `reduction' over all the values for each thread
e.g. the max over a set is equivalent to the max over the maxes of its subsets: if A = U An then Max(A) = Max(U Max(An))
Examples
Syntax: REDUCTION(OP:variable) where OP = max, min, +, -, * (& logic ops)

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(max:a)
      do i=1,N
        a=max(a,b(i))
      end do

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(min:a)
      do i=1,N
        a=min(a,b(i))
      end do
What is REDUCTION actually doing?
Saving you from writing more code
The reduction clause generates an array of the reduction variables, and each thread is responsible for a certain element in the array
The final reduction over all the array elements (when the loop is finished) is performed transparently to the user
(A manual sketch of this is given below.)
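A hedged C sketch of roughly what the runtime does for REDUCTION(+:a): each thread accumulates into its own slot of an array, and the slots are combined at the end (padding against false sharing and other details are omitted; the function name and the 256-thread cap are illustrative assumptions).

#include <omp.h>

double manual_reduction(const double *x, int n) {
    int nthreads = omp_get_max_threads();
    double partial[256] = {0.0};          /* one slot per thread; assumes <= 256 threads */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();   /* this thread's private slot */
        #pragma omp for
        for (int i = 0; i < n; ++i)
            partial[tid] += x[i];
    }
    double a = 0.0;
    for (int t = 0; t < nthreads; ++t)    /* final reduction, done serially */
        a += partial[t];
    return a;
}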
Initialization
Reduction variables are initialized as follows (from the standard):

Operator   Initialization
+          0
*          1
-          0
MAX        Smallest rep. number
MIN        Largest rep. number
Note
While you can do reductions over arrays (as part of the OpenMP 2.0 standard) this isn't always a great idea
Was brought in to help with support of f90 array syntax
Summary Part 3
OpenMP is strongly driven by vendors and will be around for a long time yet
Evolving API, with useful functionality
Parallelism can be easily exposed via loop level parallelism, but other modes are supported within the language
Most significant part of programming: dealing with race conditions and data dependencies
Next lecture
Shared memory parallelism II
More on OpenMP programming
Iteration scheduling for load balance
Using memory effectively