High Performance Computing – CISC 811
Dr Rob Thacker
Dept of Physics (308A)
thacker@physics
Today's Lecture
Distributed Memory Computing III
- Part 1: MPI-2: RMA
- Part 2: MPI-2: Parallel I/O
- Part 3: Odds and ends
Part 1: MPI-2 RMA
- One-sided communication is a significant step forward in functionality for MPI
- The ability to retrieve remote data without cooperative message passing allows more time to be spent in computation
- One-sided comms are sensitive to OS/machine optimizations, though
Standard message passing
HOST A HOST B
NIC NIC
Memory/Buffer CPU CPU
Memory/Buffer
MPI_Send MPI_Recv
Packet transmission is directly mediated by the CPUs on both machines; multiple buffer copies may be necessary
Traditional message passing
- Both sender and receiver must cooperate
  - Sender needs to address the buffer to be sent
  - Sender specifies destination and tag
  - Recv needs to specify its own buffer
  - Recv must specify origin and tag
- In blocking mode this is a very expensive operation: both sender and receiver must stop any computation they may be doing
Sequence of operations to `get' data
- Suppose process A wants to retrieve a section of an array from process B (process B is unaware of what is required):
  - Process A executes MPI_Send to B with details of what it requires
  - Process B executes MPI_Recv from A and determines the data required by A
  - Process B executes MPI_Send to A with the required data
  - Process A executes MPI_Recv from B...
- 4 MPI-1 commands
- Additionally, process B has to be aware of the incoming message – this requires frequent polling for messages, which is potentially highly wasteful
Process A            Process B
MPI_SEND      ->     MPI_RECV
MPI_RECV      <-     MPI_SEND
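The four-message round trip above can be mimicked in a toy single-process sketch, with queues standing in for the network (the names and the array contents are ours, purely for illustration):

```python
# Toy single-process model of the four-message MPI-1 "get" above
# (queues stand in for the network; names are ours, not MPI's).
from queue import Queue

to_b, to_a = Queue(), Queue()
array_on_b = [10, 20, 30, 40, 50]

# Process A: send the request describing the slice of B's array it needs
to_b.put(("request", 1, 3))                 # MPI_Send to B

# Process B: must stop computing, receive, decode, and reply
tag, lo, hi = to_b.get()                    # MPI_Recv from A
to_a.put(array_on_b[lo:hi])                 # MPI_Send to A

# Process A: finally receives the data
data = to_a.get()                           # MPI_Recv from B
print(data)  # [20, 30]
```

Note that B must actively participate at every step; with one-sided RMA the whole exchange collapses to a single call on A.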
Even worse example
- Suppose you need to read a remote list to figure out what data you need – the sequence of ops is then:

Process A                    Process B
MPI_Send (get list)          MPI_Recv (list request)
MPI_Recv (list returned)     MPI_Send (list info)
MPI_Send (get data)          MPI_Recv (data request)
MPI_Recv (data returned)     MPI_Send (data info)
Process A            Process B
MPI_SEND      ->     MPI_RECV
MPI_RECV      <-     MPI_SEND      (GET LIST)
MPI_SEND      ->     MPI_RECV
MPI_RECV      <-     MPI_SEND      (GET DATA)
Coarse versus fine graining
- The expense of message passing implicitly suggests MPI-1 programs should be coarse grained
- The unit of messaging in NUMA systems is the cache line
- What about an API for (fast network) distributed memory systems that is optimized for smaller messages? e.g. ARMCI http://www.emsl.pnl.gov/docs/parsoft/armci
- This would enable distributed memory systems to have moderately high performance fine-grained parallelism
- A number of applications are suited to this style of parallelism (especially irregular data structures)
- The T3E and T3D were both capable of performing fine-grained calculations – well balanced machines
- APIs supporting fine-grained parallelism have one-sided communication for efficiency – no handshaking to take processes away from computation
Puts and Gets in MPI-2
- In one-sided communication the number of operations is reduced by at least a factor of 2
  - For our earlier example, 4 MPI operations can be replaced with a single MPI_Get
- Circumvents the need to forward information directly to the remote CPU specifying what data is required
- MPI_Send+MPI_Recv pairs are replaced by three possibilities:
  - MPI_Get: retrieve a section of a remote array
  - MPI_Put: place a section of a local array into remote memory
  - MPI_Accumulate: remote update over an operator and local data
- However, the programmer must be aware of the possibility of remote processes changing local arrays!
RMA illustrated
HOST A HOST B
NIC (with RDMA engine)        NIC (with RDMA engine)
Memory/Buffer CPU CPU
Memory/Buffer
Benefits of one-sided communication
- No matching operation required for the remote process
- All parameters of the operation are specified by the origin process
  - Allows very flexible communication patterns
- Communication and synchronization are separated
  - Synchronization is now implied by the access epoch
- Removes the need to poll for incoming messages
- Significantly improves performance of applications with irregular and unpredictable data movement
Windows: the fundamental construction for one-sided comms
- One-sided comms may only write into memory regions ("windows") set aside for communication
- Access to the windows must be within a specific access epoch
- All processes may agree on an access epoch, or just a pair of processes may cooperate

Origin   Target
One-sided Put
Memory Window
Creating a window
MPI_Win_create(base, size, disp_unit, info, comm, win, ierr)
- base: address of the window
- size: size of the window in BYTES
- disp_unit: local unit size for displacements (BYTES, e.g. 4)
- info: argument describing the types of operations that may occur on the window
- win: window object returned by the call
- Should also free the window using MPI_Win_free(win, ierr)
- Window performance is always better when the base aligns on a word boundary
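The role of disp_unit can be shown with a small arithmetic sketch (the function name is ours, not part of MPI): a target displacement selects the byte offset target_disp * disp_unit from the window base, so with disp_unit equal to the element size, displacements count elements rather than bytes.

```python
# Toy model of how an RMA displacement is resolved on the target
# (illustrative only; the function name is ours, not part of the MPI API).
def resolve_target_address(base: int, disp_unit: int, target_disp: int) -> int:
    """A window covers [base, base+size); displacement target_disp
    selects the byte offset target_disp * disp_unit into the window."""
    return base + target_disp * disp_unit

# A window over a 4-byte-real array whose base is byte address 4096,
# created with disp_unit=4, so displacements count 4-byte elements:
addr = resolve_target_address(4096, 4, 10)   # element 10
print(addr)  # 4136
```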
Options to info
- Vendors are allowed to include options that improve window performance under certain circumstances
- MPI_INFO_NULL is always valid
- If win_lock is not going to be used then this information can be passed as an info argument:

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "no_locks", "true");
MPI_Win_create(..., info, ...);
MPI_Info_free(&info);
Access epochs
- Although communication is mediated by GETs and PUTs, they do not guarantee message completion
- All communication must occur within an access epoch
- Communication is only guaranteed to have completed when the epoch is finished
  - This is to optimize messaging – no need to worry about completion until the access epoch has ended
- Two ways of coordinating access:
  - Active target: remote process governs completion
  - Passive target: origin process governs completion
Access epochs: active target
- Active target communication is usually expressed in a collective operation
- All processes agree on the beginning of the access epoch on the window
- Communication occurs
- Communication is then guaranteed to have completed when the second WIN_FENCE is called
Origin Target
WIN_FENCE
WIN_FENCE
! Processes agree on fence
Call MPI_Win_fence
! Put remote data
Call MPI_PUT(..)
! Collective fence
Call MPI_Win_fence
! Message is guaranteed to complete
! after win_fence on the remote process completes
All other processes
Access epochs: passive target
- For passive target communication, the origin process controls all aspects of the communication
- The target process is oblivious to the communication epoch
- MPI_Win_(un)lock facilitates the communication
! Lock remote process window
Call MPI_Win_lock
! Put remote data
Call MPI_PUT(..)
! Unlock remote process window
Call MPI_win_unlock
! Message is guaranteed to
! complete after win_unlock
Origin Target
WIN_LOCK
WIN_UNLOCK
Non-collective active target
- Win_fence is collective over the comm of the window
- A similar construct over groups is available
- See Using MPI-2 for more details
Origin Target
WIN_FENCE
WIN_FENCE
All other processes
! Origin starts the access epoch
Call MPI_Win_start(group,..)
! Target posts its exposure epoch
Call MPI_Win_post(group,..)
! Put remote data
Call MPI_PUT(..)
Call MPI_Win_complete(win)
Call MPI_Win_wait(win)
! Message is guaranteed to
! complete after waits finish
Rules for memory areas assigned to windows
- Memory regions for windows involved in active target synchronization may be statically declared
- Memory regions for windows involved in passive target access epochs may have to be dynamically allocated – depends on the implementation
  - For Fortran this requires definition of Cray-like pointers to arrays
- MPI_Alloc_mem(size, MPI_INFO_NULL, pointer, ierr)
- Must be associated with a freeing call: MPI_Free_mem(array, ierr)
double precision u
pointer (p,u(0:50,0:20))
integer (kind=MPI_ADDRESS_KIND) size
integer sizeofdouble, ierr
call MPI_Sizeof(u,sizeofdouble,ierr)
size=51*21*sizeofdouble
call MPI_Alloc_mem(size,MPI_INFO_NULL,p,ierr)
! ... can now refer to u ...
call MPI_Free_mem(u,ierr)
More on passive target access
- Closest idea to shared-memory operation on a distributed system
- Very flexible communication model
- Multiple origin processes must negotiate access to locks
Timeline across processes A, B and C (time runs downward):
  lock A  -> locked put to A's window -> unlock A
  lock B  -> locked put to B's window -> unlock B
  lock A  -> locked put to A's window -> unlock A
Cray SHMEM – origin of many one-sided communication concepts
- On the T3E a number of variable types were guaranteed to occupy the same point in memory on different nodes:
  - Global variables/variables in common blocks
  - Local static variables
  - Fortran variables specified via the !DIR$ SYMMETRIC directive
  - C variables specified by the #pragma symmetric directive
  - Variables that are stack allocated, or dynamically allocated on the heap, are not guaranteed to occupy the same address on different processors
- These variables could be rapidly retrieved/replaced via shmem_get/shmem_put – one-sided operations
- Because these memory locations are shared among processors the library is dubbed "SHared MEMory" – SHMEM
  - It does not have a global address space (although you could implement one around this idea)
  - Similar idea to global arrays
- A lot of functionality from SHMEM is available in the MPI-2 one-sided library (and was central in its design)
Shmem example

C Taken from Cray MPP Fortran Reference Manual
C Added CACHE_ALIGN directive to show how it should be done
C Ken Steube - 3/11/96
C
C Each PE initializes array source() with the PE number,
C mype, and then gets the values of source from PE number
C mype-1. It checks to make sure it got the right values
C after receiving the data.
C
C This code calls shmem_get() to accomplish the task.
C Be aware that shmem_put() is significantly faster than
C shmem_get(), and so it should be used when possible.
C
      program ring_of_PEs
      parameter (N=10000)
      common /xxx/ target,source
      real target(N)
      real source(N)
CDIR$ CACHE_ALIGN target source
      integer previous
      integer shmem_get
      intrinsic my_pe
      data iflag /1/
C
      mype = my_pe()
C
      previous = mod(mype - 1 + N$PES, N$PES)
C
      do i = 1, N        ! Assign unique values on each PE
        source(i) = mype
      enddo
C
      call barrier()     ! All PEs initialize source
                         ! before doing the get
C
      iget = shmem_get(target, source, N, previous)
      do i = 1, N
        if (target(i) .ne. previous) then
          iflag = 0
          print*,'PE #',mype,': target(',i,')=',
     $      target(i),', should be ',previous
        endif
      enddo
C
      if (iflag .eq. 0) then
        print*,'Test failed on PE ',mype
      else
        print*,'Test passed on PE ',mype
      endif
C
      end
MPI_Get/Put/Accumulate
- Non-blocking operations
- MPI_Get(origin address, count, datatype, target, target displ, target count, target datatype, win, ierr)
  - Must specify information about both origin and remote datatypes – more arguments
  - No need to specify a communicator – it is contained in the window
  - Target displ is the displacement from the beginning of the target window
  - Note the remote datatype cannot resolve to overlapping entries
- MPI_Put has the same interface
- MPI_Accumulate requires that the reduction operator also be specified (argument before the window)
  - Same operators as MPI_REDUCE, but user-defined functions cannot be used
  - Note MPI_Accumulate is really "MPI_Put_accumulate"; there is no get functionality (must do by hand)
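The semantics of the three calls can be mimicked by a toy, single-process model (the class and method names are ours; real MPI operates on a window exposed by a remote process):

```python
# Toy, single-process model of the three RMA operations (illustrative
# only; the class and its names are ours, not part of MPI).
import operator

class ToyWindow:
    def __init__(self, data):
        self.data = list(data)          # the exposed memory region

    def put(self, origin, target_disp):
        # MPI_Put: overwrite a slice of the target window
        self.data[target_disp:target_disp + len(origin)] = origin

    def get(self, target_disp, count):
        # MPI_Get: read a slice of the target window
        return self.data[target_disp:target_disp + count]

    def accumulate(self, origin, target_disp, op=operator.add):
        # MPI_Accumulate: combine incoming data into the window ("put+op")
        for i, v in enumerate(origin):
            self.data[target_disp + i] = op(self.data[target_disp + i], v)

win = ToyWindow([0, 0, 0, 0])
win.put([5, 6], 0)
win.accumulate([1, 1], 0)          # MPI_SUM-style update
print(win.get(0, 4))               # [6, 7, 0, 0]
```

Note there is no "get_accumulate" here either, mirroring the limitation named above.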
Don't forget datatypes
- In one-sided comms datatypes play an extremely important role
- They specify explicitly the unpacking on the remote node
- The origin node must know precisely what the required remote datatype is
Contiguous origin datatype
Sparse target datatype
MPI_Accumulate
- Extremely powerful operation: "put+op"
- Question marks for implementations though:
  - Who actually implements the "op" side of things?
  - If on the remote node then there must be an extra thread to do this operation
  - If on the local node, then accumulate becomes a get, followed by the operation, followed by a put
- Many computations involve summing values into fields
  - MPI_Accumulate provides the perfect command for this
- For scientific computation it is frequently more useful than MPI_Put
Use PUTs rather than GETs
- Although both PUTs and GETs are non-blocking, it is desirable to use PUTs whenever possible
- GETs imply an inherent wait for data arrival and only complete when the receiving side has fully decoded the incoming message
- PUTs can be thought of as "fire and forget"
MPI_Win_fence
MPI_Win_fence(info, win, ierr)
- info allows the user to specify constants that may improve performance (default of 0):
  - MPI_MODE_NOSTORE: no local stores
  - MPI_MODE_NOPUT: no puts will occur within the window (don't have to watch for remote updates)
  - MPI_MODE_NOPRECEDE: no earlier epochs of communication (optimize assumptions about window variables)
  - MPI_MODE_NOSUCCEED: no epochs of communication will follow this fence
  - NOPRECEDE and NOSUCCEED must be called collectively
- Multiple messages sent to the same target between fences may be concatenated to improve performance
Active target sync.
MPI_Win_(un)lock
MPI_Win_lock(lock_type, target, info, win, ierr)
- Lock types:
  - MPI_LOCK_SHARED – use only for concurrent reads
  - MPI_LOCK_EXCLUSIVE – use when updates are necessary
- Although called a lock, it actually isn't (very poor naming convention)
  - Think "MPI_begin/end_passive_target_epoch"
  - Only on the local process does MPI_Win_lock act as a lock
  - Otherwise non-blocking
- Provides a mechanism to ensure that the communication epoch is completed
- Says nothing about the order in which other competing message updates will occur on the target (the consistency model is not specified)
Passive target sync.
Subtleties of nonblocking 'locking' and messaging
- Suppose we wanted to implement a fetch-and-add:
- The code below is erroneous for two reasons:
  1. You cannot read and update the same memory location in the same access epoch
  2. Even if you could, the communication is nonblocking and can complete in any order
int one = 1;
MPI_Win_create(..., &win);
...
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
MPI_Get(&value, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
MPI_Accumulate(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_SUM, win);
MPI_Win_unlock(0, win);
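Reason 2 can be made concrete with a toy model: when the Get and the Accumulate are both outstanding in the same epoch, either completion order is legal, so the fetched value is ambiguous (all names here are ours, not MPI's):

```python
# Why the one-epoch fetch-and-add is broken: within an epoch the Get and
# the Accumulate may complete in either order. Toy single-process model
# (names are ours, not MPI's).
import itertools

def run_epoch(ops, window):
    """Apply the epoch's nonblocking ops in some completion order."""
    results = {}
    for op in ops:
        op(window, results)
    return results

def make_get(key):
    def get(window, results):
        results[key] = window[0]    # MPI_Get of the counter
    return get

def accumulate_one(window, results):
    window[0] += 1                  # MPI_Accumulate with MPI_SUM

outcomes = set()
for order in itertools.permutations([make_get("fetched"), accumulate_one]):
    window = [41]                   # counter starts at 41 each trial
    res = run_epoch(order, window)
    outcomes.add(res["fetched"])

print(sorted(outcomes))   # [41, 42] - the fetched value is ambiguous
```

A correct fetch-and-add needs the read and the update separated into distinct epochs (or an atomic primitive the implementation supplies).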
Simple example
      subroutine exchng2( a, sx, ex, sy, ey, win,
     *     left_nbr, right_nbr, top_nbr, bot_nbr,
     *     right_ghost_disp, left_ghost_disp,
     *     top_ghost_disp, coltype, right_coltype, left_coltype )
      include 'mpif.h'
      integer sx, ex, sy, ey, win, ierr
      integer left_nbr, right_nbr, top_nbr, bot_nbr
      integer coltype, right_coltype, left_coltype
      double precision a(sx-1:ex+1,sy-1:ey+1)
C This assumes that an address fits in a Fortran integer.
C Change this to integer*8 if you need 8-byte addresses
      integer (kind=MPI_ADDRESS_KIND) right_ghost_disp,
     *     left_ghost_disp, top_ghost_disp, bot_ghost_disp
      integer nx
      nx = ex - sx + 1
      call MPI_WIN_FENCE( 0, win, ierr )
C Put bottom edge into bottom neighbor's top ghost cells
      call MPI_PUT( a(sx,sy), nx, MPI_DOUBLE_PRECISION, bot_nbr,
     *     top_ghost_disp, nx, MPI_DOUBLE_PRECISION,
     *     win, ierr )
C Put top edge into top neighbor's bottom ghost cells
      bot_ghost_disp = 1
      call MPI_PUT( a(sx,ey), nx, MPI_DOUBLE_PRECISION, top_nbr,
     *     bot_ghost_disp, nx, MPI_DOUBLE_PRECISION,
     *     win, ierr )
C Put right edge into right neighbor's left ghost cells
      call MPI_PUT( a(ex,sy), 1, coltype,
     *     right_nbr, left_ghost_disp, 1, right_coltype,
     *     win, ierr )
C Put left edge into the left neighbor's right ghost cells
      call MPI_PUT( a(sx,sy), 1, coltype,
     *     left_nbr, right_ghost_disp, 1, left_coltype,
     *     win, ierr )
      call MPI_WIN_FENCE( 0, win, ierr )
      return
      end
exchng2 for the 2D Poisson problem. No gets are required – you just put your own data into the other processes' memory windows.
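The put-style halo exchange can be sketched in a single-process toy model, here for a 1D ring rather than the 2D grid of exchng2 (the layout and names are ours, for illustration only):

```python
# Toy model of a put-based ghost-cell exchange on a 1D ring of "ranks"
# (a single-process stand-in for the MPI code; names are ours).
def exchange_ghosts(domains):
    """Each rank owns [ghost_left, interior..., ghost_right].
    Put-style exchange: each rank writes its own edge values into its
    neighbours' ghost cells, instead of getting from them."""
    n = len(domains)
    for rank, d in enumerate(domains):
        left, right = (rank - 1) % n, (rank + 1) % n
        domains[right][0] = d[-2]   # my right edge -> right nbr's left ghost
        domains[left][-1] = d[1]    # my left edge  -> left nbr's right ghost

# Three ranks; interior cells hold the rank number, ghosts start at 0.
doms = [[0, r, r, 0] for r in range(3)]
exchange_ghosts(doms)
print(doms)  # [[2, 0, 0, 1], [0, 1, 1, 2], [1, 2, 2, 0]]
```

As in exchng2, only puts are needed: each rank knows the displacement of the ghost cells it owns in its neighbours' windows.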
Problems with passive target access
- Window creation must be collective over the communicator
  - Expensive and time consuming
- MPI_Alloc_mem may be required
- Race conditions on a single window location under concurrent get/put must be handled by the user
  - See section 6.4 in Using MPI-2
- Local and remote operations on a remote window cannot occur concurrently, even if different parts of the window are being accessed at the same time
  - Local processes must execute MPI_Win_lock as well
- Multiple windows may overlap, but you must ensure that concurrent operations on different windows do not lead to race conditions on the overlap
- Cannot access (via MPI_Get, for example) and update (via a put back) the same location in the same access epoch (either between fences or lock/unlock)
Drawbacks of one-sided comms in general
- No evidence for advantage except on:
  - SMP machines
  - Cray distributed memory systems (and Quadrics)
  - Although the advantage on these machines is significant – on the T3E MPI latency is 16 µs, SHMEM latency is 2 µs
- Slow acceptance
  - Just beginning to reach maturity
- Unclear how many applications actually benefit from this model
  - Not entirely clear whether nonblocking normal send/recvs can achieve similar speed for some applications
Hardware – reasons to be optimistic
- Newer network technologies (e.g. InfiniBand, Quadrics) have a built-in RDMA engine
  - An RMA framework can be built on top of the NIC library ("verbs")
- 10-gigabit Ethernet will almost certainly come with an RDMA engine
- Myrinet now has a one-sided comms library through HP MPI
- SCI?
- Still in its infancy – a number of software issues to work out
  - Support for non-contiguous datatypes is proving difficult – need an efficient way to deal with the gather/scatter step
  - Many RDMA engines are designed for movement of contiguous regions – a comparatively rare operation in many situations
  - See http://nowlab.cis.ohio-state.edu/projects/mpi-iba/
Case Study: Matrix transpose
- See Sun documentation
- Need to transpose elements across processor space
- Could do one element at a time (bad idea!)
- Aggregate as much local data as possible and send large messages (requires a lot of local data movement)
- Send medium-sized contiguous packets of elements (there is some contiguity in the data layout)
Parallel Issues

Matrix:               Transpose:
 1  2  3  4            1  5  9 13
 5  6  7  8            2  6 10 14
 9 10 11 12            3  7 11 15
13 14 15 16            4  8 12 16

1d storage order on 2 processors:
P0: 1 2 3 4 5 6 7 8         P1: 9 10 11 12 13 14 15 16
after transpose:
P0: 1 5 9 13 2 6 10 14      P1: 3 7 11 15 4 8 12 16
Possible parallel algorithm

P0: 1 2 3 4 5 6 7 8         P1: 9 10 11 12 13 14 15 16
      |  local permutation
P0: 1 2 5 6 3 4 7 8         P1: 9 10 13 14 11 12 15 16
      |  send data
P0: 1 2 5 6 9 10 13 14      P1: 3 4 7 8 11 12 15 16
      |  local permutation
P0: 1 5 9 13 2 6 10 14      P1: 3 7 11 15 4 8 12 16
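The three stages can be checked with a single-process sketch of the 4x4, 2-processor example above (plain lists stand in for the distributed arrays; the function names are ours):

```python
# The three transpose stages above, emulated in one process with plain
# lists (illustrative stand-in for the MPI version; 4x4 matrix,
# 2 "ranks", 2x2 blocks).
def local_block_permute(rows, nb):
    """Stage 1: regroup nb-wide column blocks of the local rows so each
    destination rank's block is contiguous."""
    out = []
    for start in range(0, len(rows[0]), nb):
        for row in rows:
            out.extend(row[start:start + nb])
    return out

def alltoall(blocks):
    """Stage 2: rank i sends its j-th block to rank j (emulated)."""
    np_ = len(blocks)
    return [[blocks[j][i] for j in range(np_)] for i in range(np_)]

def local_transpose(flat, ncols):
    """Stage 3: transpose the received data, viewed as rows of ncols."""
    rows = [flat[i:i + ncols] for i in range(0, len(flat), ncols)]
    return [rows[r][c] for c in range(ncols) for r in range(len(rows))]

# rank 0 holds rows 1..2, rank 1 holds rows 3..4 of the 4x4 matrix
ranks = [[[1, 2, 3, 4], [5, 6, 7, 8]], [[9, 10, 11, 12], [13, 14, 15, 16]]]
stage1 = [local_block_permute(r, 2) for r in ranks]      # [1,2,5,6,3,4,7,8], ...
chunks = [[s[i:i + 4] for i in range(0, 8, 4)] for s in stage1]
stage2 = [sum(c, []) for c in alltoall(chunks)]          # exchange blocks
stage3 = [local_transpose(s, 2) for s in stage2]
print(stage3)  # [[1, 5, 9, 13, 2, 6, 10, 14], [3, 7, 11, 15, 4, 8, 12, 16]]
```

The final layout matches the last line of the diagram, confirming the three-stage decomposition.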
Program 1

include "mpif.h"
real(8), allocatable, dimension(:) :: a, b, c, d
real(8) t0, t1, t2, t3
! initialize parameters
call init(me,np,n,nb)
! allocate matrices
allocate(a(nb*np*nb))
allocate(b(nb*nb*np))
allocate(c(nb*nb*np))
allocate(d(nb*np*nb))
! initialize matrix
call initialize_matrix(me,np,nb,a)
! timing
do itime = 1, 10
  call MPI_Barrier(MPI_COMM_WORLD,ier)
  t0 = MPI_Wtime()
  ! first local transpose
  do k = 1, nb
    do j = 0, np - 1
      ioffa = nb * ( j + np * (k-1) )
      ioffb = nb * ( (k-1) + nb * j )
      do i = 1, nb
        b(i+ioffb) = a(i+ioffa)
      enddo
    enddo
  enddo
  t1 = MPI_Wtime()
  ! global all-to-all
  call MPI_Alltoall(b, nb*nb, MPI_REAL8, &
                    c, nb*nb, MPI_REAL8, MPI_COMM_WORLD, ier)
  t2 = MPI_Wtime()
  ! second local transpose
  call dtrans('o', 1.d0, c, nb, nb*np, d)
  call MPI_Barrier(MPI_COMM_WORLD,ier)
  t3 = MPI_Wtime()
  if ( me .eq. 0 ) &
    write(6,'(f8.3," seconds; breakdown on proc 0 = ",3f10.3)') &
      t3 - t0, t1 - t0, t2 - t1, t3 - t2
enddo
! check
call check_matrix(me,np,nb,d)
deallocate(a)
deallocate(b)
deallocate(c)
deallocate(d)
call MPI_Finalize(ier)
end

This code aggregates data locally and uses the two-sided Alltoall collective operation. Data is then rearranged using a subroutine called DTRANS().
Version 2 – one sidedVersion 2 – one sidedinclude "mpif.h"integer(kind=MPI_ADDRESS_KIND) nbytesinteger winreal(8) c(*)pointer (cptr,c)real(8), allocatable, dimension(:) :: a, b, dreal(8) t0, t1, t2, t3! initialize parameterscall init(me,np,n,nb)! allocate matricesallocate(a(nb*np*nb))allocate(b(nb*nb*np))allocate(d(nb*np*nb))nbytes = 8 * nb * nb * npcall MPI_Alloc_mem(nbytes, MPI_INFO_NULL, cptr, ier)if ( ier .eq. MPI_ERR_NO_MEM ) stop! create windowcall MPI_Win_create(c, nbytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, win, ier)! initialize matrixcall initialize_matrix(me,np,nb,a)! timing
do itime = 1, 10 call MPI_Barrier(MPI_COMM_WORLD,ier) t0 = MPI_Wtime() t1 = t0 ! combined local transpose with global all-to-all call MPI_Win_fence(0, win, ier) do ip = 0, np - 1 do ib = 0, nb - 1 nbytes = 8 * nb * ( ib + nb * me ) call MPI_Put(a(1+nb*ip+nb*np*ib), nb, MPI_REAL8, ip, nbytes, & nb, MPI_REAL8, win, ier) enddo enddo call MPI_Win_fence(0, win, ier) t2 = MPI_Wtime() ! second local transpose call dtrans(`o', 1.d0, c, nb, nb*np, d) call MPI_Barrier(MPI_COMM_WORLD,ier) t3 = MPI_Wtime() if ( me .eq. 0 ) & write(6,'(f8.3," seconds; breakdown on proc 0 = ",3f10.3)') & t3 - t0, t1 - t0, t2 - t1, t3 - t2enddo! checkcall check_matrix(me,np,nb,d)! deallocate matrices and stuffcall MPI_Win_free(win, ier)deallocate(a)deallocate(b)deallocate(d)call MPI_Free_mem(c, ier)call MPI_Finalize(ier)end
No local aggregation is used, and communication is mediated via MPI_Put calls. The data is then rearranged using a subroutine called DTRANS().
Performance comparison (times in seconds)

Version   Total   Local aggregation   Communication   Dtrans call
1         2.109   0.585               0.852           0.673
2         1.177   0.0                 0.43            0.747
One sided version is twice as fast on this machine (Sun 6000 SMP)
Net data movement is slightly over 1.1 Gbyte/s, which is about ½ the net bus bandwidth (2.6 Gbyte/s)
The big performance boost comes from getting rid of the aggregation step and from the fast messaging of the shorter one sided messages
Summary Part 1

One sided comms can reduce synchronization and thereby increase performance
They indirectly reduce local data movement
The reduction in messaging overhead can simplify programming
Part 2: Parallel I/O

Motivations
Review of different I/O strategies
Far too many issues to put into one lecture – plenty of web resources provide more details if you need them
Draws heavily from Using MPI-2 – see the excellent discussion presented therein for further details
Non-parallel I/O

Simplest way to do I/O – a number of factors may contribute to this kind of model:
May have to implement this way because only one process is capable of I/O
May have to use a serial I/O library
I/O may be enhanced for large writes
Easiest file handling – only one file to keep track of
Arguments against:
Strongly limiting in terms of throughput if the underlying file system does permit parallel I/O
[Diagram: Memory / Processes / File – all data is funnelled through one process to a single file]
Additional argument for a parallel I/O standard

Standard UNIX I/O is not portable
Endianness becomes a serious problem across different machines
Writing wrappers to perform byte conversion is tedious
Simple parallel I/O

Multiple independent files are written
Same as the parallel I/O allowed under OpenMP
May still be able to use sequential I/O libraries
Significant increases in throughput are possible
Drawbacks are potentially serious:
Now have multiple files to keep track of (concatenation may not be an option)
May be non-portable if the application reading these files must have the same number of processes
[Diagram: Memory / Processes / Files – each process writes its own memory to its own file]
Simple parallel I/O (no MPI calls)

I/O operations are completely independent across the processes
Append a rank to each file name to specify the different files
Need to be very careful about performance – individual files may be too small for good throughput
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    FILE *myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    sprintf(filename, "testfile.%d", myrank);
    myfile = fopen(filename, "w");
    fwrite(buf, sizeof(int), BUFSIZE, myfile);
    fclose(myfile);
    MPI_Finalize();
    return 0;
}
Simple parallel I/O (using MPI calls)

Rework the previous example to use MPI calls
Note the file pointer has been replaced by a variable of type MPI_File
Under MPI I/O, open, write (read) and close statements are provided
Note MPI_COMM_SELF denotes a communicator over the local process
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    MPI_File myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    sprintf(filename, "testfile.%d", myrank);
    MPI_File_open(MPI_COMM_SELF, filename,
                  MPI_MODE_WRONLY | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &myfile);
    MPI_File_write(myfile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&myfile);
    MPI_Finalize();
    return 0;
}
MPI_File_open arguments

MPI_File_open(comm, filename, accessmode, info, filehandle, ierr)
comm – choose immediately whether file access is collective or local: MPI_COMM_WORLD or MPI_COMM_SELF
Access mode is specified by MPI_MODE_CREATE and MPI_MODE_WRONLY or'd together (same as Unix open)
The file handle is passed back from this call to be used later in MPI_File_write
MPI_File_write

MPI_File_write(filehandle, buff, count, datatype, status, ierr)
Very similar to the message passing interface: imagine the file handle as providing the destination in an "address, count, datatype" interface
Specification of non-contiguous writes would be done via a user-defined datatype
MPI_STATUS_IGNORE can be passed in the status field
Informs the system not to fill the field since the user will ignore it
May slightly improve I/O performance when the status is not needed
ROMIO ("romeo")

Publicly available implementation of the MPI-I/O instruction set
http://www-unix.mcs.anl.gov/romio
Runs on multiple platforms
Optimized for non-contiguous access patterns (common for parallel applications)
Optimized for collective operations
True parallel MPI I/O

Processes must all now agree on opening a single file
Each process must have its own pointer within the file
The file clearly must reside on a single file system
Can read the file with a different number of processes as compared to what it was written with
[Diagram: Memory / Processes / File – all processes write their memory to a single shared file]
Parallel I/O to a single file using MPI calls

Rework the previous example to write to a single file
Write is now collective: MPI_COMM_SELF has been replaced by MPI_COMM_WORLD
All processes agree on a collective name for the file, "testfile"
Access to given parts of the file is specifically controlled: MPI_File_set_view shifts the displacement according to local rank
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    MPI_File thefile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_WRONLY | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &thefile);
    MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&thefile);
    MPI_Finalize();
    return 0;
}
File view

A file view defines which portion of a file is "visible" to a given process
On first opening a file the entire file is visible
Data is described at the byte level initially
MPI_File_set_view provides information to enable reading of the data (datatypes), and specifies which parts of the file should be skipped
MPI_File_set_view

MPI_File_set_view(filehandle, displ, etype, filedatatype, datarep, info, ierr)
displ controls the BYTE offset from the beginning of the file
The displacement is of type MPI_Offset, larger than a normal MPI_INT, to allow for 64 bit addressing (byte offsets can easily exceed the 32 bit limit)
etype is the datatype of the buffer
filedatatype is the corresponding datatype in the file, which must be either etype or derived from it (must use MPI_Type_create etc.)
datarep:
native: data is stored in the file as in memory
internal: implementation specific format that may provide a level of portability
external32: 32 bit big endian IEEE format (defined for all MPI implementations); only use if portability is required (conversion may be necessary)
Portion of file visible to a given process
etype and filetype in action

Suppose we have a buffer with an etype of MPI_INT
The filetype is defined to be 2 MPI_INTs followed by an offset of 4 MPI_INTs (extent = 6)

MPI_File_set_view(fh, 5*sizeof(int), etype, filetype, "native", MPI_INFO_NULL)
MPI_File_write(fh, buf, 1000, MPI_INT, MPI_STATUS_IGNORE)
[Diagram: a displacement of 5 MPI_INTs, then the filetype tiled repeatedly along the file – each tile of extent 6 holds 2 data ints followed by a 4-int gap; etype = MPI_INT]
Fortran issues

Two levels of support:
basic – just include 'mpif.h' (designed for f77 backwards compatibility)
extended – need to include the f90 module (use mpi)
Use the extended set whenever possible
Note that the MPI_FILE type is an integer in Fortran
PROGRAM main
   include 'mpif.h'   ! should really use "use mpi"
   integer ierr, i, myrank, BUFSIZE, thefile
   parameter (BUFSIZE=100)
   integer buf(BUFSIZE)
   integer(kind=MPI_OFFSET_KIND) disp

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
   do i = 1, BUFSIZE
      buf(i) = myrank * BUFSIZE + i
   end do
   call MPI_File_open(MPI_COMM_WORLD, 'testfile', &
                      MPI_MODE_WRONLY + MPI_MODE_CREATE, &
                      MPI_INFO_NULL, thefile, ierr)
   disp = myrank * BUFSIZE * 4
   call MPI_File_set_view(thefile, disp, MPI_INTEGER, &
                          MPI_INTEGER, 'native', &
                          MPI_INFO_NULL, ierr)
   call MPI_File_write(thefile, buf, BUFSIZE, &
                       MPI_INTEGER, MPI_STATUS_IGNORE, ierr)
   call MPI_File_close(thefile, ierr)
   call MPI_Finalize(ierr)
END PROGRAM main
Reading a single file with an unknown number of processors

New function: MPI_File_get_size
Returns the size of the open file; need to use a 64 bit int (MPI_Offset)
Check how much data has been read by using the status handler
Pass it to MPI_Get_count: determines the number of datatypes that were read
If the number of read items is less than expected then (hopefully) EOF is reached
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int myrank, numprocs, bufsize, *buf, count;
    MPI_File thefile;
    MPI_Status status;
    MPI_Offset filesize;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &thefile);
    MPI_File_get_size(thefile, &filesize);
    filesize = filesize / sizeof(int);
    bufsize = filesize / numprocs + 1;
    buf = (int *) malloc(bufsize * sizeof(int));
    MPI_File_set_view(thefile, myrank * bufsize * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_read(thefile, buf, bufsize, MPI_INT, &status);
    MPI_Get_count(&status, MPI_INT, &count);
    printf("process %d read %d ints\n", myrank, count);
    MPI_File_close(&thefile);
    MPI_Finalize();
    return 0;
}
Using individual file pointers by hand

We assume a known filesize in this case
Pointers must be explicitly moved: MPI_File_seek
The offset of the position to move the pointer to is the second argument
The second argument is of type MPI_Offset: good practice to resolve the calculation into a variable of this type and then call using that variable
#include "mpi.h"
#include <stdlib.h>
#define FILESIZE (1024*1024)

int main(int argc, char *argv[])
{
    int *buf, rank, nprocs, nints, bufsize;
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bufsize = FILESIZE / nprocs;
    buf = (int *) malloc(bufsize);
    nints = bufsize / sizeof(int);
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_seek(fh, rank * bufsize, MPI_SEEK_SET);
    MPI_File_read(fh, buf, nints, MPI_INT, &status);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
Using explicit offsets

MPI_File_read/write are individual-file-pointer functions
Both use the current location of the pointer to determine where to read/write
Must perform a seek to arrive at the correct region of data
Explicit offset functions don't use an individual file pointer
The file offset is passed directly as an argument to the function; a seek is not required
MPI_File_read/write_at
Must use this version in a multithreaded environment
Revised code for explicit offsets

No need to apply seeks
Specific movement of the file pointer is not applied; instead the offset is passed as an argument
Remember offsets must be of kind MPI_OFFSET_KIND
Same issue here with precalculating the offset and resolving it into a variable with the appropriate typing
PROGRAM main
   include 'mpif.h'
   integer FILESIZE, MAX_BUFSIZE, INTSIZE
   parameter (FILESIZE=1048576, MAX_BUFSIZE=1048576)
   parameter (INTSIZE=4)
   integer buf(MAX_BUFSIZE), rank, ierr, fh, nprocs, nints
   integer status(MPI_STATUS_SIZE), count
   integer(kind=MPI_OFFSET_KIND) offset

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
   call MPI_File_open(MPI_COMM_WORLD, 'testfile', &
                      MPI_MODE_RDONLY, &
                      MPI_INFO_NULL, fh, ierr)
   nints = FILESIZE / (nprocs*INTSIZE)
   offset = rank * nints * INTSIZE
   call MPI_File_read_at(fh, offset, buf, nints, &
                         MPI_INTEGER, status, ierr)
   call MPI_Get_count(status, MPI_INTEGER, count, ierr)
   print *, 'process ', rank, ' read ', count, ' ints'
   call MPI_File_close(fh, ierr)
   call MPI_Finalize(ierr)
END PROGRAM main
Dealing with multidimensional arrays

Storage formats differ for C versus Fortran: row major versus column major
MPI_Type_create_darray and MPI_Type_create_subarray are used to specify derived datatypes
These datatypes can then be used to resolve local regions within a linearized global array
Specifically they deal with the noncontiguous nature of the domain decomposition
See the Using MPI-2 book for more details
Summary Part 2

MPI I/O provides a straightforward interface for performing parallel I/O
Single file output is supported via multiple file pointers into the same file
Multiple output files may also be written
Can use explicit pointer based schemes, or alternatively use file views to specify local access
Part 3: Odds and ends

Dynamic process management
Connecting different MPI processes
Thread safety
Dynamic process management

Provided in MPI-2 as an API with some backward compatibility to PVM
Processes are created by MPI_Comm_spawn
Collective operation over the spawning processes (parents)
Returns an intercommunicator: local group = parents, remote group = children
New processes have their own MPI_COMM_WORLD and execute their own MPI_Init
A communicator is provided for children to message the parents: the MPI_Comm_get_parent function
Parent/Children communicators

[Diagram: MPI_Comm_spawn launches a child group with its own MPI_COMM_WORLD; intercommunicators are returned to the parent group (the original MPI_COMM_WORLD), and the children retrieve theirs via MPI_Comm_get_parent]
Master example

/* manager */
#include "mpi.h"

int main(int argc, char *argv[])
{
    int world_size, universe_size, *universe_sizep, flag;
    MPI_Comm everyone;              /* intercommunicator */
    char worker_program[100];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    if (world_size != 1) error("Top heavy with management");

    MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
                 &universe_sizep, &flag);
    if (!flag) {
        printf("This MPI does not support UNIVERSE_SIZE. How many\n\
processes total?");
        scanf("%d", &universe_size);
    } else
        universe_size = *universe_sizep;
    if (universe_size == 1) error("No room to start workers");

    /* Now spawn the workers. Note that there is a run-time
       determination of what type of worker to spawn, and presumably
       this calculation must be done at run time and cannot be
       calculated before starting the program. If everything is known
       when the application is first started, it is generally better
       to start them all at once in a single MPI_COMM_WORLD. */
    choose_worker_program(worker_program);
    MPI_Comm_spawn(worker_program, MPI_ARGV_NULL, universe_size-1,
                   MPI_INFO_NULL, 0, MPI_COMM_SELF, &everyone,
                   MPI_ERRCODES_IGNORE);

    /* Parallel code here. The communicator "everyone" can be used to
       communicate with the spawned processes, which have ranks
       0,..,MPI_UNIVERSE_SIZE-1 in the remote group of the
       intercommunicator "everyone". */

    MPI_Finalize();
    return 0;
}
MPI_UNIVERSE_SIZE

Attribute of MPI_COMM_WORLD
Best guess from the system of how many processes could exist
Is not required to be defined by the MPI implementation
Main problem – need to relate to the scheduler via a library of some sort
May be set by an environment variable on some systems
Best way to do things is to have the value set via a call to the start-up program, e.g. mpiexec –n 1 –universe_size 10 my_prog
Worker code

/* worker */
#include "mpi.h"

int main(int argc, char *argv[])
{
    int size;
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) error("No parent!");
    MPI_Comm_remote_size(parent, &size);
    if (size != 1) error("Something's wrong with the parent");

    /* Parallel code here. The manager is represented as the process
       with rank 0 in (the remote group of) the parent
       intercommunicator. If the workers need to communicate among
       themselves, they can use MPI_COMM_WORLD. */

    MPI_Finalize();
    return 0;
}
Connecting different MPI Applications

Climate models are a good example of two separate applications that need to share data
The atmosphere needs input from the ocean model (e.g. the effect of large currents on warming)
The ocean model needs input from the atmosphere (atmospheric temperature affects evaporation rates)
Secondary example – visualization of an active application
May want to pass information to a visualization engine that is implemented in a separate program
We'll assume that such programs cannot be spawned within the MPI environment
Mediating communication

Necessary to modify one program to accept a connection from another
Must open a port:
MPI_Open_port(MPI_INFO_NULL, port_name, ierr)
The connecting application still has to be allowed to connect to the communicator:
MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, client, ierr)
Collective over the comm in the 4th argument
client returns a new comm (intercommunicator) to the connecting application
Can now send to the remote application
Example: 2d Poisson model

character*(MPI_MAX_PORT_NAME) port_name
integer client
…
if (myid .eq. 0) then
   call MPI_Open_port(MPI_INFO_NULL, port_name, ierr)
   write(*,*) port_name
endif
call MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, &
                     MPI_COMM_WORLD, client, ierr)
! Must send information to the connecting program
call MPI_Gather(mesh_size, 1, MPI_INTEGER, &
                MPI_BOTTOM, 0, MPI_DATATYPE_NULL, &
                0, client, ierr)
…
! After each iteration process 0 sends its info;
! these messages are matched in the connecting application
if (myid .eq. 0) then
   call MPI_Send(it, 1, MPI_INTEGER, 0, 0, client, ierr)
end if
! The connecting application must also execute this gather
call MPI_Gatherv(mesh, mesh_size, MPI_DOUBLE_PRECISION, &
                 MPI_BOTTOM, 0, 0, &
                 MPI_DATATYPE_NULL, 0, client, ierr)
…
if (myid .eq. 0) then
   call MPI_Close_port(port_name, ierr)
end if
call MPI_Comm_disconnect(client, ierr)
call MPI_Finalize(ierr)
Accessing the port

To connect to the front end application the client must be given the port name
Port names are usually character strings
Then connect to the port by calling:
MPI_Comm_connect(port_name, info, root, comm, newcomm, ierr)
Returns a new communicator
Can determine the remote communicator size via MPI_Comm_remote_size(newcomm, procs, ierr)
Communication must concur with that executed in the server program
Finally, disconnect from the port:
MPI_Comm_disconnect(newcomm, ierr)
Alternative methods of publishing the port name

MPI_Publish_name(service, info, port, ierr)
Allows a given program (specified by service) to post the port name to all MPI infrastructure
Call MPI_Unpublish_name to remove the data from the system before finalizing
The client then calls MPI_Lookup_name(service, info, port, ierr), which returns the desired port name
Possible problem: two programmers may choose the same service name (unlikely)
MPI-2 & threads

MPI processes are usually conceived as being single threaded
Threads carry their own pc, register set and stack
Individual threads are not considered to be visible outside the process
Threads have the potential to improve performance in codes where polling is required
A polling thread operates at lower overhead than stopping the entire process to poll (equivalent to a non-blocking receive)
However, to work effectively the MPI implementation must be thread safe:
Multiple threads can execute message passing calls without causing problems
MPI-1 is not guaranteed to be thread safe (some implementations are)
Determining what type of threading is allowed

MPI_Init_thread(required, provided, ierr)
("required" allows you to choose among different thread-support levels of a given MPI lib)
A non-thread-safe library will return MPI_THREAD_SINGLE
Several user threads supported but only one may make messaging calls (standard interpretation): MPI_THREAD_FUNNELED
All threads make calls, but only one at a time: MPI_THREAD_SERIALIZED
Finally, all threads may execute messaging at any time ("thread compliant"): MPI_THREAD_MULTIPLE
This function can be used to start the program instead of MPI_Init
C binding: MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
Note only the thread which called the initialization may perform the finalize
Ensuring all processes agree on the number of threads

MPI does not require that environment variables be propagated to all processes
Problematic for OpenMP, where the number of threads is usually specified by an environment variable
Most often rank 0 gets the environment but the others do not
The code below may be used to propagate the thread count to all other processes:

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    nthreads_str = getenv("OMP_NUM_THREADS");
    if (nthreads_str)
        /* convert string to integer */
        nthreads = atoi(nthreads_str);
    else
        nthreads = 1;
}
MPI_Bcast(&nthreads, 1, MPI_INT, 0, MPI_COMM_WORLD);
omp_set_num_threads(nthreads);
Summary Part 3

Dynamic process management is supported in MPI-2 – MPI_Comm_spawn
Different MPI applications may connect to one another
Important for a class of applications where different solvers require parts of the same dataset
Communication is mediated by ports
The level of thread safety provided can be determined via MPI_Init_thread
Next lecture

Grids – motivations, applications
Computational grids
Data grids
Globus