
High Performance Computing – CISC 811


Page 1: High Performance Computing – CISC 811

High Performance Computing – CISC 811

Dr Rob Thacker
Dept of Physics (308A)
thacker@physics

Page 2: High Performance Computing – CISC 811

Today's Lecture

Distributed Memory Computing III

Part 1: MPI-2: RMA
Part 2: MPI-2: Parallel I/O
Part 3: Odds and ends

Page 3: High Performance Computing – CISC 811

Part 1: MPI-2 RMA

One-sided communication is a significant step forward in functionality for MPI
The ability to retrieve remote data without cooperative message passing allows more time to be spent in computation
One-sided comms are sensitive to OS/machine optimizations, though

Page 4: High Performance Computing – CISC 811

Standard message passing

[Diagram: HOST A and HOST B, each with a CPU, memory/buffer and NIC; MPI_Send on host A, MPI_Recv on host B]

Packet transmission is directly mediated by the CPUs on both machines; multiple buffer copies may be necessary

Page 5: High Performance Computing – CISC 811

Traditional message passing

Both sender and receiver must cooperate:
- Send needs to address the buffer to be sent
- Sender specifies destination and tag
- Recv needs to specify its own buffer
- Recv must specify origin and tag

In blocking mode this is a very expensive operation: both sender and receiver must cooperate and stop any computation they may be doing

Page 6: High Performance Computing – CISC 811

Sequence of operations to `get' data

Suppose process A wants to retrieve a section of an array from process B (process B is unaware of what is required):
- Process A executes MPI_Send to B with details of what it requires
- Process B executes MPI_Recv from A and determines the data required by A
- Process B executes MPI_Send to A with the required data
- Process A executes MPI_Recv from B…

That is 4 MPI-1 commands. Additionally, process B has to be aware of the incoming message, which requires frequent polling for messages – potentially highly wasteful

[Diagram: timeline for process A (MPI_SEND, then MPI_RECV) and process B (MPI_RECV, then MPI_SEND)]

Page 7: High Performance Computing – CISC 811

Even worse example

Suppose you need to read a remote list to figure out what data you need – the sequence of operations is then:

Process A                    Process B
MPI_Send (get list)          MPI_Recv (list request)
MPI_Recv (list returned)     MPI_Send (list info)
MPI_Send (get data)          MPI_Recv (data request)
MPI_Recv (data returned)     MPI_Send (data info)

[Diagram: timeline of the GET LIST and GET DATA exchanges between process A and process B]

Page 8: High Performance Computing – CISC 811

Coarse versus fine graining

The expense of message passing implicitly suggests MPI-1 programs should be coarse grained
The unit of messaging in NUMA systems is the cache line
What about an API for (fast network) distributed memory systems that is optimized for smaller messages?
- e.g. ARMCI http://www.emsl.pnl.gov/docs/parsoft/armci
- Would enable distributed memory systems to have moderately high performance fine-grained parallelism
- A number of applications are suited to this style of parallelism (especially irregular data structures)
The T3E and T3D were both capable of performing fine-grained calculations – well balanced machines
APIs supporting fine-grained parallelism have one-sided communication for efficiency – no handshaking to take processes away from computation

Page 9: High Performance Computing – CISC 811

Puts and Gets in MPI-2

In one-sided communication the number of operations is reduced by at least a factor of 2
- For our earlier example, 4 MPI operations can be replaced with a single MPI_Get
- Circumvents the need to forward information directly to the remote CPU specifying what data is required
MPI_Sends + MPI_Recvs are replaced by three possibilities:
- MPI_Get: retrieve a section of a remote array
- MPI_Put: place a section of a local array into remote memory
- MPI_Accumulate: remote update over an operator and local data
However, the programmer must be aware of the possibility of remote processes changing local arrays!
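
To make the contrast with the four-message MPI-1 exchange concrete, here is a minimal C sketch (not from the original slides) of process A reading a slice of an array that process B has exposed in a window; win, remote_rank, remote_disp and nelem are assumed to be set up elsewhere.

#include "mpi.h"

/* Fetch nelem doubles from remote_rank's window, active-target style. */
void get_remote_slice(double *local, int nelem, int remote_rank,
                      MPI_Aint remote_disp, MPI_Win win)
{
    MPI_Win_fence(0, win);              /* open the access epoch (collective) */
    MPI_Get(local, nelem, MPI_DOUBLE,   /* origin buffer */
            remote_rank, remote_disp,   /* where to read on the target */
            nelem, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);              /* data in 'local' is valid only now */
}

Process B only has to participate in the two fences; it never needs to know which elements were requested.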

Page 10: High Performance Computing – CISC 811

RMA illustrated

[Diagram: HOST A and HOST B, each with memory/buffer, CPU, and a NIC with an RDMA engine; the transfer is handled by the RDMA engines rather than the CPUs]

Page 11: High Performance Computing – CISC 811

Benefits of one-sided communication

- No matching operation required on the remote process
- All parameters of the operation are specified by the origin process
- Allows very flexible communication patterns
- Communication and synchronization are separated: synchronization is now implied by the access epoch
- Removes the need to poll for incoming messages
- Significantly improves performance of applications with irregular and unpredictable data movement

Page 12: High Performance Computing – CISC 811

Windows: The fundamental construction for one-sided comms

One-sided comms may only write into memory regions ("windows") set aside for communication
Access to the windows must be within a specific access epoch
All processes may agree on an access epoch, or just a pair of processes may cooperate

[Diagram: a one-sided Put from the origin process into the target process's memory window]

Page 13: High Performance Computing – CISC 811

Creating a window

MPI_Win_create(base, size, disp_unit, info, comm, win, ierr)
- base: address of the window
- size: size of the window in BYTES
- disp_unit: local unit size for displacements (BYTES, e.g. 4)
- info: argument about the type of operations that may occur on the window
- win: window object returned by the call
Should also free the window using MPI_Win_free(win, ierr)
Window performance is always better when the base aligns on a word boundary
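
As a minimal sketch (assuming an array a of n doubles already allocated), window creation and destruction look like this in C; the element-sized disp_unit lets remote displacements be given in array elements rather than bytes.

#include "mpi.h"

MPI_Win expose_array(double *a, int n)
{
    MPI_Win win;
    MPI_Win_create(a, (MPI_Aint)(n * sizeof(double)),  /* base and size in bytes */
                   sizeof(double),                     /* displacement unit */
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    return win;    /* later, collectively: MPI_Win_free(&win); */
}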

Page 14: High Performance Computing – CISC 811

Options to info

Vendors are allowed to include options that improve window performance under certain circumstances
MPI_INFO_NULL is always valid
If win_lock is not going to be used then this information can be passed as an info argument:

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "no_locks", "true");
MPI_Win_create(..., info, ...);
MPI_Info_free(&info);

Page 15: High Performance Computing – CISC 811

Access epochs

Although communication is mediated by GETs and PUTs, they do not guarantee message completion
All communication must occur within an access epoch
Communication is only guaranteed to have completed when the epoch is finished
- This is to optimize messaging – you do not have to worry about completion until the access epoch is ended
Two ways of coordinating access:
- Active target: the remote process governs completion
- Passive target: the origin process governs completion

Page 16: High Performance Computing – CISC 811

Access epochs: Active target

Active target communication is usually expressed in a collective operation
All processes agree on the beginning of the window (a fence)
Communication occurs
Communication is then guaranteed to have completed when the second WIN_FENCE is called

! Processes agree on fence
call MPI_Win_fence
! Put remote data
call MPI_PUT(..)
! Collective fence
call MPI_Win_fence
! Message is guaranteed to complete after win_fence
! on the remote process completes

[Diagram: origin and target (and all other processes) call WIN_FENCE, the put occurs, then all call WIN_FENCE again]

Page 17: High Performance Computing – CISC 811

Access epochs: Passive target

For passive target communication, the origin process controls all aspects of the communication
The target process is oblivious to the communication epoch
MPI_Win_(un)lock facilitates the communication

! Lock remote process window
call MPI_Win_lock
! Put remote data
call MPI_PUT(..)
! Unlock remote process window
call MPI_Win_unlock
! Message is guaranteed to complete after win_unlock

[Diagram: origin calls WIN_LOCK on the target's window, puts data, then calls WIN_UNLOCK]

Page 18: High Performance Computing – CISC 811

Non-collective active target

Win_fence is collective over the comm of the window
A similar construct over groups is available
See Using MPI-2 for more details

! Processes agree on the epoch
call MPI_Win_start(group,..)   ! on the origin
call MPI_Win_post(group,..)    ! on the target
! Put remote data
call MPI_PUT(..)
! Complete the epoch
call MPI_Win_complete(win)     ! on the origin
call MPI_Win_wait(win)         ! on the target
! Message is guaranteed to complete after the waits finish

[Diagram: origin and target synchronize as above while all other processes continue]

Page 19: High Performance Computing – CISC 811

Rules for memory areas assigned to windows

Memory regions for windows involved in active target synchronization may be statically declared
Memory regions for windows involved in passive target access epochs may have to be dynamically allocated – depends on the implementation
- For Fortran this requires the definition of Cray-like pointers to arrays
MPI_Alloc_mem(size, MPI_INFO_NULL, pointer, ierr)
Must be associated with a freeing call: MPI_Free_mem(array, ierr)

      double precision u
      pointer (p, u(0:50,0:20))
      integer (kind=MPI_ADDRESS_KIND) size
      integer sizeofdouble, ierr

      call MPI_Sizeof(u, sizeofdouble, ierr)
      size = 51*21*sizeofdouble
      call MPI_Alloc_mem(size, MPI_INFO_NULL, p, ierr)
C     ... can now refer to u ...
      call MPI_Free_mem(u, ierr)
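
In C no Cray-style pointer is needed; a hedged sketch of the equivalent allocation for the 51x21 array above would be:

#include "mpi.h"

double *alloc_window_memory(void)
{
    double *u;
    /* memory obtained this way is safe to use for passive-target windows
       on implementations that require MPI_Alloc_mem */
    MPI_Alloc_mem((MPI_Aint)(51 * 21 * sizeof(double)), MPI_INFO_NULL, &u);
    return u;    /* release later with MPI_Free_mem(u) */
}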

Page 20: High Performance Computing – CISC 811

More on passive target access

Closest idea to shared memory operation on a distributed system
Very flexible communication model
Multiple origin processes must negotiate on access to locks

[Diagram: timeline of processes A, B and C in which origin processes lock a target's window, put data and unlock in turn: lock A, locked put to A's window, unlock A; lock B, locked put to B's window, unlock B; lock A, locked put to A's window, unlock A]

Page 21: High Performance Computing – CISC 811

Cray SHMEM – origin of many one-sided communication concepts

On the T3E a number of variable types were guaranteed to occupy the same point in memory on different nodes:
- Global variables/variables in common blocks
- Local static variables
- Fortran variables specified via the !DIR$ SYMMETRIC directive
- C variables specified by the #pragma symmetric directive
- Variables allocated on the stack, or dynamically on the heap, are not guaranteed to occupy the same address on different processors
These variables could be rapidly retrieved/replaced via shmem_get/put – one-sided operations
Because these memory locations are shared among processors the library is dubbed "SHared MEMory" – SHMEM
- It does not have a global address space (although you could implement one around this idea)
- Similar idea to global arrays
A lot of functionality from SHMEM is available in the MPI-2 one-sided library (and was central in its design)

Page 22: High Performance Computing – CISC 811

Shmem example

C Taken from the Cray MPP Fortran Reference Manual
C Added CACHE_ALIGN directive to show how it should be done
C Ken Steube - 3/11/96
C
C Each PE initializes array source() with the PE number,
C mype, and then gets the values of source from PE number
C mype-1. It checks to make sure it got the right values
C after receiving the data.
C
C This code calls shmem_get() to accomplish the task.
C Be aware that shmem_put() is significantly faster than
C shmem_get(), and so it should be used when possible.
C
      program ring_of_PEs
      parameter (N=10 000)
      common /xxx/ target,source
      real target(N)
      real source(N)
CDIR$ CACHE_ALIGN target source
      integer previous
      integer shmem_get
      intrinsic my_pe
      data iflag /1/
C
      mype = my_pe()
C
      previous = mod(mype - 1 + N$PES, N$PES)
C
      do i = 1, N               ! Assign unique values on each PE
        source(i) = mype
      enddo
C
      call barrier()            ! All PEs initialize source
                                ! before doing the get
C
      iget = shmem_get(target, source, N,
     $                 previous)
      do i = 1, N
        if (target(i) .ne. previous) then
          iflag = 0
          print*,'PE #',mype,': target(',i,')=',
     $      target(i),', should be ',previous
        endif
      enddo
C
      if (iflag .eq. 0) then
        print*,'Test failed on PE ',mype
      else
        print*,'Test passed on PE ',mype
      endif
C
      end

Page 23: High Performance Computing – CISC 811

MPI_Get/Put/Accumulate

Non-blocking operations
MPI_Get(origin address, count, datatype, target, target displ, target count, target datatype, win, ierr)
- Must specify information about both origin and remote datatypes – more arguments
- No need to specify a communicator – it is contained in the window
- Target displ is the displacement from the beginning of the target window
- Note the remote datatype cannot resolve to overlapping entries
MPI_Put has the same interface
MPI_Accumulate requires the reduction operator also be specified (argument before the window)
- Same operators as MPI_REDUCE, but user-defined functions cannot be used
- Note MPI_Accumulate is really MPI_Put_accumulate; there is no get functionality (must do by hand)

Page 24: High Performance Computing – CISC 811

Don't forget datatypes

In one-sided comms datatypes play an extremely important role
- They specify explicitly the unpacking on the remote node
- The origin node must know precisely what the required remote datatype is

[Diagram: a contiguous origin datatype mapped onto a sparse target datatype]

Page 25: High Performance Computing – CISC 811

MPI_Accumulate

Extremely powerful operation: "put + op"
Question marks for implementations though:
- Who actually implements the "op" side of things?
- If on the remote node then there must be an extra thread to do this operation
- If on the local node, then accumulate becomes a get followed by the operation followed by a put
Many computations involve summing values into fields
- MPI_Accumulate provides the perfect command for this
For scientific computation it is frequently more useful than MPI_Put
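
A minimal C sketch of the "summing into a field" use case (names such as contrib, target and disp are illustrative, not from the slides):

#include "mpi.h"

/* Add n local contributions into the field stored at element offset
   'disp' of rank 'target', inside an active-target epoch. */
void add_to_remote_field(double *contrib, int n, int target,
                         MPI_Aint disp, MPI_Win win)
{
    MPI_Win_fence(0, win);
    MPI_Accumulate(contrib, n, MPI_DOUBLE,
                   target, disp, n, MPI_DOUBLE,
                   MPI_SUM, win);
    MPI_Win_fence(0, win);
}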

Page 26: High Performance Computing – CISC 811

Use PUTs rather than GETs

Although both PUTs and GETs are non-blocking, it is desirable to use PUTs whenever possible
- GETs imply an inherent wait for data arrival and only complete when the receiving side has fully decoded the incoming message
- PUTs can be thought of as "fire and forget"

Page 27: High Performance Computing – CISC 811

MPI_Win_fence (active target sync.)

MPI_Win_fence(info, win, ierr)
The info argument allows the user to specify a constant that may improve performance (default of 0):
- MPI_MODE_NOSTORE: no local stores
- MPI_MODE_NOPUT: no puts will occur within the window (don't have to watch for remote updates)
- MPI_MODE_NOPRECEDE: no earlier epochs of communication (optimize assumptions about window variables)
- MPI_MODE_NOSUCCEED: no epochs of communication will follow this fence
NOPRECEDE and NOSUCCEED must be given collectively
Multiple messages sent to the same target between fences may be concatenated to improve performance
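
As a sketch of how these assert values are used (assuming this put is the only communication epoch on the window; buf, n and nbr are illustrative):

#include "mpi.h"

void one_shot_exchange(double *buf, int n, int nbr, MPI_Win win)
{
    MPI_Win_fence(MPI_MODE_NOPRECEDE, win);  /* no epoch precedes this fence */
    MPI_Put(buf, n, MPI_DOUBLE, nbr, 0, n, MPI_DOUBLE, win);
    MPI_Win_fence(MPI_MODE_NOSUCCEED, win);  /* no epoch will follow */
}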

Page 28: High Performance Computing – CISC 811

MPI_Win_(un)lock (passive target sync.)

MPI_Win_lock(lock_type, target, info, win, ierr)
Lock types:
- MPI_LOCK_SHARED – use only for concurrent reads
- MPI_LOCK_EXCLUSIVE – use when updates are necessary
Although called a lock, it actually isn't (a very poor naming convention) – think "MPI_begin/end_passive_target_epoch"
- Only on the local process does MPI_Win_lock act as a lock; otherwise it is non-blocking
Provides a mechanism to ensure that the communication epoch is completed
Says nothing about the order in which other competing message updates will occur on the target (the consistency model is not specified)

Page 29: High Performance Computing – CISC 811

Subtleties of nonblocking 'locking' and messaging

Suppose we wanted to implement a fetch-and-add:

int one = 1;
MPI_Win_create(..., &win);
...
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
MPI_Get(&value, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
MPI_Accumulate(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_SUM, win);
MPI_Win_unlock(0, win);

This code is erroneous for two reasons:
1. You cannot read and update the same memory location in the same access epoch
2. Even if you could, the communication is nonblocking and can complete in any order
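
One way to avoid both errors, sketched below, is to split the read and the update into two separate passive-target epochs. Note that the result is no longer an atomic fetch-and-add, since another process may accumulate between the two epochs; a genuinely atomic version needs additional machinery (see Using MPI-2).

#include "mpi.h"

void nonatomic_fetch_then_add(int *value, MPI_Win win)
{
    int one = 1;
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    MPI_Get(value, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
    MPI_Win_unlock(0, win);                    /* the get completes here */

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
    MPI_Accumulate(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_SUM, win);
    MPI_Win_unlock(0, win);                    /* the accumulate completes here */
}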

Page 30: High Performance Computing – CISC 811

Simple example

      subroutine exchng2( a, sx, ex, sy, ey, win,
     *                    left_nbr, right_nbr, top_nbr, bot_nbr,
     *                    right_ghost_disp, left_ghost_disp,
     *                    top_ghost_disp, coltype, right_coltype,
     *                    left_coltype )
      include 'mpif.h'
      integer sx, ex, sy, ey, win, ierr
      integer left_nbr, right_nbr, top_nbr, bot_nbr
      integer coltype, right_coltype, left_coltype
      double precision a(sx-1:ex+1,sy-1:ey+1)
C This assumes that an address fits in a Fortran integer.
C Change this to integer*8 if you need 8-byte addresses
      integer (kind=MPI_ADDRESS_KIND) right_ghost_disp,
     *        left_ghost_disp, top_ghost_disp, bot_ghost_disp
      integer nx

      nx = ex - sx + 1
      call MPI_WIN_FENCE( 0, win, ierr )
C Put bottom edge into bottom neighbor's top ghost cells
      call MPI_PUT( a(sx,sy), nx, MPI_DOUBLE_PRECISION, bot_nbr,
     *              top_ghost_disp, nx, MPI_DOUBLE_PRECISION,
     *              win, ierr )
C Put top edge into top neighbor's bottom ghost cells
      bot_ghost_disp = 1
      call MPI_PUT( a(sx,ey), nx, MPI_DOUBLE_PRECISION, top_nbr,
     *              bot_ghost_disp, nx, MPI_DOUBLE_PRECISION,
     *              win, ierr )
C Put right edge into right neighbor's left ghost cells
      call MPI_PUT( a(ex,sy), 1, coltype,
     *              right_nbr, left_ghost_disp, 1, right_coltype,
     *              win, ierr )
C Put left edge into the left neighbor's right ghost cells
      call MPI_PUT( a(sx,sy), 1, coltype,
     *              left_nbr, right_ghost_disp, 1, left_coltype,
     *              win, ierr )
      call MPI_WIN_FENCE( 0, win, ierr )
      return
      end

exchng2 for the 2-D Poisson problem. No gets are required – just put your own data into the other processes' memory windows.

Page 31: High Performance Computing – CISC 811

Problems with passive target access

Window creation must be collective over the communicator
- Expensive and time consuming
MPI_Alloc_mem may be required
Race conditions on a single window location under concurrent get/put must be handled by the user
- See section 6.4 in Using MPI-2
Local and remote operations on a remote window cannot occur concurrently, even if different parts of the window are being accessed at the same time
- Local processes must execute MPI_Win_lock as well
Multiple windows may overlap, but you must ensure concurrent operations on the different windows do not lead to race conditions on the overlap
Cannot access (via MPI_Get, for example) and update (via a put back) the same location in the same access epoch (either between fences or lock/unlock)

Page 32: High Performance Computing – CISC 811

Drawbacks of one-sided comms in general

No evidence for advantage except on:
- SMP machines
- Cray distributed memory systems (and Quadrics)
- Although the advantage on these machines is significant – on the T3E MPI latency is 16 µs, SHMEM latency is 2 µs
Slow acceptance – just beginning to reach maturity
Unclear how many applications actually benefit from this model
- Not entirely clear whether nonblocking normal send/recvs can achieve similar speed for some applications

Page 33: High Performance Computing – CISC 811

Hardware – reasons to be optimistic

Newer network technologies (e.g. Infiniband, Quadrics) have a built-in RDMA engine
- An RMA framework can be built on top of the NIC library ("verbs")
10 gigabit ethernet will almost certainly come with an RDMA engine
Myrinet now has a one-sided comms library through HP MPI
SCI?
Still in its infancy – a number of software issues to work out
- Support for non-contiguous datatypes is proving difficult – need an efficient way to deal with the gather/scatter step
- Many RDMA engines are designed for movement of contiguous regions – a comparatively rare operation in many situations
See http://nowlab.cis.ohio-state.edu/projects/mpi-iba/

Page 34: High Performance Computing – CISC 811

Case Study: Matrix transpose

See the Sun documentation
Need to transpose elements across processor space
- Could do one element at a time (bad idea!)
- Aggregate as much local data as possible and send large messages (requires a lot of local data movement)
- Send medium-sized contiguous packets of elements (there is some contiguity in the data layout)

Page 35: High Performance Computing – CISC 811

Parallel Issues

[Figure: a 4×4 matrix (elements 1–16) and its transpose, stored across 2 processors (P0 and P1); the 1-D storage order on each processor is shown before and after the transpose]

Page 36: High Performance Computing – CISC 811

Possible parallel algorithm

[Figure: the transpose proceeds in three steps on the 1-D storage of the 4×4 example: local permutation, send data (all-to-all), local permutation]

Page 37: High Performance Computing – CISC 811

Program 1

include "mpif.h"
real(8), allocatable, dimension(:) :: a, b, c, d
real(8) t0, t1, t2, t3
! initialize parameters
call init(me,np,n,nb)
! allocate matrices
allocate(a(nb*np*nb))
allocate(b(nb*nb*np))
allocate(c(nb*nb*np))
allocate(d(nb*np*nb))
! initialize matrix
call initialize_matrix(me,np,nb,a)
! timing
do itime = 1, 10
  call MPI_Barrier(MPI_COMM_WORLD,ier)
  t0 = MPI_Wtime()
  ! first local transpose
  do k = 1, nb
    do j = 0, np - 1
      ioffa = nb * ( j + np * (k-1) )
      ioffb = nb * ( (k-1) + nb * j )
      do i = 1, nb
        b(i+ioffb) = a(i+ioffa)
      enddo
    enddo
  enddo
  t1 = MPI_Wtime()
  ! global all-to-all
  call MPI_Alltoall(b, nb*nb, MPI_REAL8, &
                    c, nb*nb, MPI_REAL8, MPI_COMM_WORLD, ier)
  t2 = MPI_Wtime()
  ! second local transpose
  call dtrans('o', 1.d0, c, nb, nb*np, d)
  call MPI_Barrier(MPI_COMM_WORLD,ier)
  t3 = MPI_Wtime()
  if ( me .eq. 0 ) &
    write(6,'(f8.3," seconds; breakdown on proc 0 = ",3f10.3)') &
      t3 - t0, t1 - t0, t2 - t1, t3 - t2
enddo
! check
call check_matrix(me,np,nb,d)
deallocate(a)
deallocate(b)
deallocate(c)
deallocate(d)
call MPI_Finalize(ier)
end

This code aggregates data locally and uses the two-sided MPI_Alltoall collective operation. Data is then rearranged using a subroutine called DTRANS()

Page 38: High Performance Computing – CISC 811

Version 2 – one sided

include "mpif.h"
integer(kind=MPI_ADDRESS_KIND) nbytes
integer win
real(8) c(*)
pointer (cptr,c)
real(8), allocatable, dimension(:) :: a, b, d
real(8) t0, t1, t2, t3
! initialize parameters
call init(me,np,n,nb)
! allocate matrices
allocate(a(nb*np*nb))
allocate(b(nb*nb*np))
allocate(d(nb*np*nb))
nbytes = 8 * nb * nb * np
call MPI_Alloc_mem(nbytes, MPI_INFO_NULL, cptr, ier)
if ( ier .eq. MPI_ERR_NO_MEM ) stop
! create window
call MPI_Win_create(c, nbytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, win, ier)
! initialize matrix
call initialize_matrix(me,np,nb,a)
! timing
do itime = 1, 10
  call MPI_Barrier(MPI_COMM_WORLD,ier)
  t0 = MPI_Wtime()
  t1 = t0
  ! combined local transpose with global all-to-all
  call MPI_Win_fence(0, win, ier)
  do ip = 0, np - 1
    do ib = 0, nb - 1
      nbytes = 8 * nb * ( ib + nb * me )
      call MPI_Put(a(1+nb*ip+nb*np*ib), nb, MPI_REAL8, ip, nbytes, &
                   nb, MPI_REAL8, win, ier)
    enddo
  enddo
  call MPI_Win_fence(0, win, ier)
  t2 = MPI_Wtime()
  ! second local transpose
  call dtrans('o', 1.d0, c, nb, nb*np, d)
  call MPI_Barrier(MPI_COMM_WORLD,ier)
  t3 = MPI_Wtime()
  if ( me .eq. 0 ) &
    write(6,'(f8.3," seconds; breakdown on proc 0 = ",3f10.3)') &
      t3 - t0, t1 - t0, t2 - t1, t3 - t2
enddo
! check
call check_matrix(me,np,nb,d)
! deallocate matrices and free the window
call MPI_Win_free(win, ier)
deallocate(a)
deallocate(b)
deallocate(d)
call MPI_Free_mem(c, ier)
call MPI_Finalize(ier)
end

No local aggregation is used, and communication is mediated via MPI_Put. Data is then rearranged using a subroutine called DTRANS()

Page 39: High Performance Computing – CISC 811

Performance comparison

Version   Total    Local aggregation   Communication   Dtrans call
1         2.109    0.585               0.852           0.673
2         1.177    0.0                 0.43            0.747

The one-sided version is twice as fast on this machine (Sun 6000 SMP)
Net data movement is slightly over 1.1 Gbyte/s, which is about ½ the net bus bandwidth (2.6 Gbyte/s)
Big performance boost from getting rid of the aggregation and from the fast messaging using shorter one-sided messages

Page 40: High Performance Computing – CISC 811

Summary Part 1

One-sided comms can reduce synchronization and thereby increase performance
They indirectly reduce local data movement
The reduction in messaging overhead can simplify programming

Page 41: High Performance Computing – CISC 811

Part 2: Parallel I/O

Motivations
Review of different I/O strategies
Far too many issues to put into one lecture – plenty of web resources provide more details if you need them
Draws heavily from Using MPI-2 – see the excellent discussion presented therein for further details

Page 42: High Performance Computing – CISC 811

Non-parallel I/O

Simplest way to do I/O – a number of factors may contribute to this kind of model:
- May have to implement this way because only one process is capable of I/O
- May have to use a serial I/O library
- I/O may be enhanced for large writes
- Easiest file handling – only one file to keep track of
Arguments against:
- Strongly limiting in terms of throughput if the underlying file system does permit parallel I/O

[Diagram: all processes send their data to one process, which writes a single file]

Page 43: High Performance Computing – CISC 811

Additional argument for a parallel I/O standard

Standard UNIX I/O is not portable
- Endianness becomes a serious problem across different machines
- Writing wrappers to perform byte conversion is tedious

Page 44: High Performance Computing – CISC 811

Simple parallel I/O

Multiple independent files are written
- Same as the parallel I/O allowed under OpenMP
- May still be able to use sequential I/O libraries
- Significant increases in throughput are possible
Drawbacks are potentially serious:
- Now have multiple files to keep track of (concatenation may not be an option)
- May be non-portable if the application reading these files must have the same number of processes

[Diagram: each process writes its own file]

Page 45: High Performance Computing – CISC 811

Simple parallel I/O (no MPI calls)

I/O operations are completely independent across the processes
Append the rank to each file name to specify the different files
Need to be very careful about performance – individual files may be too small for good throughput

#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    FILE *myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    sprintf(filename, "testfile.%d", myrank);
    myfile = fopen(filename, "w");
    fwrite(buf, sizeof(int), BUFSIZE, myfile);
    fclose(myfile);
    MPI_Finalize();
    return 0;
}

Page 46: High Performance Computing – CISC 811

Simple parallel I/O (using MPI calls)

Rework the previous example to use MPI calls
Note the file pointer has been replaced by a variable of type MPI_File
Under MPI I/O, open, write (read) and close statements are provided
Note MPI_COMM_SELF denotes a communicator over the local process

#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    MPI_File myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    sprintf(filename, "testfile.%d", myrank);
    MPI_File_open(MPI_COMM_SELF, filename,
                  MPI_MODE_WRONLY | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &myfile);
    MPI_File_write(myfile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&myfile);
    MPI_Finalize();
    return 0;
}

Page 47: High Performance Computing – CISC 811

MPI_File_open arguments

MPI_File_open(comm, filename, accessmode, info, filehandle, ierr)
- comm: choose immediately whether file access is collective or local – MPI_COMM_WORLD or MPI_COMM_SELF
- Access mode is specified by MPI_MODE_CREATE and MPI_MODE_WRONLY or'd together (same as Unix open)
- The file handle is passed back from this call to be used later in MPI_File_write

Page 48: High Performance Computing – CISC 811

MPI_File_write

MPI_File_write(filehandle, buff, count, datatype, status, ierr)
- Very similar to the message passing interface: imagine the file handle as providing the destination of an "address, count, datatype" interface
- Specification of non-contiguous writes would be done via a user-defined datatype
- MPI_STATUS_IGNORE can be passed in the status field
  - Informs the system not to fill the field since the user will ignore it
  - May slightly improve I/O performance when not needed

Page 49: High Performance Computing – CISC 811

ROMIO ("romeo")

Publicly available implementation of the MPI I/O instruction set
- http://www-unix.mcs.anl.gov/romio
Runs on multiple platforms
Optimized for non-contiguous access patterns
- Common for parallel applications
Optimized for collective operations

Page 50: High Performance Computing – CISC 811

True parallel MPI I/O

Processes must all now agree on opening a single file
- Each process must have its own pointer within the file
- The file clearly must reside on a single file system
- Can read the file with a different number of processes than it was written with

[Diagram: all processes write to different parts of a single shared file]

Page 51: High Performance Computing – CISC 811

Parallel I/O to a single file using MPI calls

Rework the previous example to write to a single file
The write is now collective:
- MPI_COMM_SELF has been replaced by MPI_COMM_WORLD
- All processes agree on a collective name for the file, "testfile"
Access to given parts of the file is specifically controlled
- MPI_File_set_view
- Shift displacements according to local rank

#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    MPI_File thefile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_WRONLY | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &thefile);
    MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&thefile);
    MPI_Finalize();
    return 0;
}

Page 52: High Performance Computing – CISC 811

File view

A file view defines which portion of a file is "visible" to a given process
- On first opening a file the entire file is visible
- Data is described at the byte level initially
MPI_File_set_view provides information to enable reading of the data (datatypes)
- Specifies which parts of the file should be skipped

Page 53: High Performance Computing – CISC 811

MPI_File_set_view

MPI_File_set_view(filehandle, displ, etype, filedatatype, datarep, info, ierr)
- displ controls the BYTE offset from the beginning of the file
  - The displacement is of type MPI_Offset, larger than a normal MPI_INT, to allow for 64-bit addressing (byte offsets can easily exceed the 32-bit limit)
- etype is the datatype of the buffer
- filedatatype is the corresponding datatype in the file, which must be either etype or derived from it (must use MPI_Type_create etc.)
- datarep:
  - native: data is stored in the file as in memory
  - internal: implementation-specific format that may provide a level of portability
  - external32: 32-bit big-endian IEEE format (defined for all MPI implementations); only use if portability is required (conversion may be necessary)
The view is the portion of the file visible to a given process

Page 54: High Performance Computing – CISC 811

etype and filetype in action

Suppose we have a buffer with an etype of MPI_INT
The filetype is defined to be 2 MPI_INTs followed by an offset of 4 MPI_INTs (extent = 6)

MPI_File_set_view(fh, 5*sizeof(int), etype, filetype, "native", MPI_INFO_NULL)
MPI_File_write(fh, buf, 1000, MPI_INT, MPI_STATUS_IGNORE)

[Figure: the file as a sequence of MPI_INTs; after a displacement of 5 MPI_INTs the filetype tiles the file, each tile holding 2 MPI_INTs followed by a gap of 4]
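
One hedged sketch of how a filetype like the one in the figure could be built (2 contiguous MPI_INTs resized to an extent of 6 ints); the function name is illustrative:

#include "mpi.h"

void make_filetype(MPI_Datatype *filetype)
{
    MPI_Datatype contig;
    MPI_Type_contiguous(2, MPI_INT, &contig);
    /* stretch the extent to 6 ints so the type tiles the file with a gap of 4 */
    MPI_Type_create_resized(contig, 0, (MPI_Aint)(6 * sizeof(int)), filetype);
    MPI_Type_commit(filetype);
    MPI_Type_free(&contig);
}

The resulting type can then be passed as the filedatatype in the MPI_File_set_view call shown above.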

Page 55: High Performance Computing – CISC 811

Fortran issues

Two levels of support:
- basic – just include 'mpif.h' (designed for f77 backwards compatibility)
- extended – need to include the f90 module (use mpi)
Use the extended set whenever possible
Note that the MPI_FILE type is an integer in Fortran

PROGRAM main
  include 'mpif.h'   ! Should really use "use mpi"
  integer ierr, i, myrank, BUFSIZE, thefile
  parameter (BUFSIZE=100)
  integer buf(BUFSIZE)
  integer(kind=MPI_OFFSET_KIND) disp

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  do i = 1, BUFSIZE
    buf(i) = myrank * BUFSIZE + i
  end do
  call MPI_File_open(MPI_COMM_WORLD, 'testfile',       &
                     MPI_MODE_WRONLY + MPI_MODE_CREATE, &
                     MPI_INFO_NULL, thefile, ierr)
  disp = myrank*BUFSIZE*4
  call MPI_File_set_view(thefile, disp, MPI_INTEGER,    &
                         MPI_INTEGER, 'native',         &
                         MPI_INFO_NULL, ierr)
  call MPI_File_write(thefile, buf, BUFSIZE,            &
                      MPI_INTEGER, MPI_STATUS_IGNORE, ierr)
  call MPI_File_close(thefile, ierr)
  call MPI_Finalize(ierr)
END PROGRAM main

Page 56: High Performance Computing – CISC 811

Reading a single file with an unknown number of processors

New function: MPI_File_get_size
- Returns the size of the open file; need to use a 64-bit int (MPI_Offset)
Check how much data has been read by using the status handle
- Pass it to MPI_Get_count, which determines the number of datatypes that were read
- If the number of items read is less than expected then (hopefully) EOF has been reached

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int myrank, numprocs, bufsize, *buf, count;
    MPI_File thefile;
    MPI_Status status;
    MPI_Offset filesize;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &thefile);
    MPI_File_get_size(thefile, &filesize);
    filesize = filesize / sizeof(int);
    bufsize = filesize / (numprocs + 1);
    buf = (int *) malloc(bufsize * sizeof(int));
    MPI_File_set_view(thefile, myrank * bufsize * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_read(thefile, buf, bufsize, MPI_INT, &status);
    MPI_Get_count(&status, MPI_INT, &count);
    printf("process %d read %d ints\n", myrank, count);
    MPI_File_close(&thefile);
    MPI_Finalize();
    return 0;
}

Page 57: High Performance Computing – CISC 811

Using individual file pointers by hand

We assume a known filesize in this case
Pointers must be explicitly moved with MPI_File_seek
- The offset of the position to move the pointer to is the second argument
- The second argument is of type MPI_Offset; good practice to resolve the calculation into such a variable and then call using it

#include "mpi.h"
#include <stdlib.h>
#define FILESIZE (1024*1024)

int main(int argc, char *argv[])
{
    int *buf, rank, nprocs, nints, bufsize;
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bufsize = FILESIZE / nprocs;
    buf = (int *) malloc(bufsize);
    nints = bufsize / sizeof(int);
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_seek(fh, rank * bufsize, MPI_SEEK_SET);
    MPI_File_read(fh, buf, nints, MPI_INT, &status);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

Page 58: High Performance Computing – CISC 811

Using explicit offsets

MPI_File_read/write are individual-file-pointer functions
- Both use the current location of the pointer to determine where to read/write
- Must perform a seek to arrive at the correct region of data
Explicit offset functions don't use an individual file pointer
- The file offset is passed directly as an argument to the function; a seek is not required
- MPI_File_read/write_at
- Must use this version in a multithreaded environment

Page 59: High Performance Computing – CISC 811

Revised code for explicit offsets

No need to apply seeks
- A specific movement of the file pointer is not applied; instead the offset is passed as an argument
Remember offsets must be of kind MPI_OFFSET_KIND
- Same issue here with precalculating the offset and resolving it into a variable with the appropriate type

PROGRAM main
  include 'mpif.h'
  integer FILESIZE, MAX_BUFSIZE, INTSIZE
  parameter (FILESIZE=1048576, MAX_BUFSIZE=1048576)
  parameter (INTSIZE=4)
  integer buf(MAX_BUFSIZE), rank, ierr, fh, nprocs, nints
  integer status(MPI_STATUS_SIZE), count
  integer(kind=MPI_OFFSET_KIND) offset

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  call MPI_File_open(MPI_COMM_WORLD, 'testfile', &
                     MPI_MODE_RDONLY,            &
                     MPI_INFO_NULL, fh, ierr)
  nints = FILESIZE/(nprocs*INTSIZE)
  offset = rank*nints*INTSIZE
  call MPI_File_read_at(fh, offset, buf, nints,  &
                        MPI_INTEGER, status, ierr)
  call MPI_Get_count(status, MPI_INTEGER, count, ierr)
  print*, 'process ', rank, ' read ', count, ' ints'
  call MPI_File_close(fh, ierr)
  call MPI_Finalize(ierr)
END PROGRAM main

Page 60: High Performance Computing – CISC 811

Dealing with multidimensional arrays

Storage formats differ for C versus Fortran: row major versus column major
MPI_Type_create_darray and MPI_Type_create_subarray are used to specify derived datatypes
- These datatypes can then be used to resolve local regions within a linearized global array
- Specifically, they deal with the noncontiguous nature of the domain decomposition
See the Using MPI-2 book for more details (a sketch using MPI_Type_create_subarray follows)
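
A brief sketch of the subarray route (dimensions, starts and the MPI_INT etype are illustrative assumptions):

#include "mpi.h"

/* Describe this process's nx-by-ny block of a gx-by-gy C-order global
   array, for use as the filetype in MPI_File_set_view. */
void make_subarray_type(int gx, int gy, int nx, int ny,
                        int startx, int starty, MPI_Datatype *filetype)
{
    int gsizes[2] = { gx, gy };
    int lsizes[2] = { nx, ny };
    int starts[2] = { startx, starty };

    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, filetype);
    MPI_Type_commit(filetype);
}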

Page 61: High Performance Computing – CISC 811

Summary Part 2

MPI I/O provides a straightforward interface for performing parallel I/O
- Single-file output is supported via multiple file pointers into the same file
- Multiple output files may also be written
Can use explicit pointer-based schemes, or alternatively use file views to specify local access

Page 62: High Performance Computing – CISC 811

Part 3: Odds and ends

Dynamic process management
Connecting different MPI processes
Thread safety

Page 63: High Performance Computing – CISC 811

Dynamic process management

Provided in MPI-2 to give an API with some backward compatibility to PVM
Processes are created by MPI_Comm_spawn
- Collective operation over the spawning processes (parents)
- Returns an intercommunicator
  - Local group = parents
  - Remote group = children
New processes have their own MPI_COMM_WORLD and execute their own MPI_Init
A communicator is provided for children to message parents
- The MPI_Comm_get_parent function

Page 64: High Performance Computing – CISC 811

Parent/Children communicators

[Diagram: the parent group (the original MPI_COMM_WORLD) calls MPI_Comm_spawn; the child group, with its own MPI_COMM_WORLD, calls MPI_Comm_get_parent; intercommunicators are returned on both sides.]

Page 65: High Performance Computing – CISC 811

Master example

/* manager */
#include "mpi.h"

int main(int argc, char *argv[])
{
    int world_size, universe_size, *universe_sizep, flag;
    MPI_Comm everyone;              /* intercommunicator */
    char worker_program[100];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    if (world_size != 1) error("Top heavy with management");

    MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
                 &universe_sizep, &flag);
    if (!flag) {
        printf("This MPI does not support UNIVERSE_SIZE. How many\n\
processes total?");
        scanf("%d", &universe_size);
    } else
        universe_size = *universe_sizep;
    if (universe_size == 1) error("No room to start workers");

    /*
     * Now spawn the workers. Note that there is a run-time determination
     * of what type of worker to spawn, and presumably this calculation
     * must be done at run time and cannot be calculated before starting
     * the program. If everything is known when the application is first
     * started, it is generally better to start them all at once in a
     * single MPI_COMM_WORLD.
     */
    choose_worker_program(worker_program);
    MPI_Comm_spawn(worker_program, MPI_ARGV_NULL, universe_size-1,
                   MPI_INFO_NULL, 0, MPI_COMM_SELF, &everyone,
                   MPI_ERRCODES_IGNORE);
    /*
     * Parallel code here. The communicator "everyone" can be used to
     * communicate with the spawned processes, which have ranks
     * 0,..,MPI_UNIVERSE_SIZE-1 in the remote group of the
     * intercommunicator "everyone".
     */

    MPI_Finalize();
    return 0;
}

Page 66: High Performance Computing – CISC 811

MPI_UNIVERSE_SIZE

An attribute of MPI_COMM_WORLD – the system's best guess of how many processes could exist. It is not required to be defined by an MPI implementation.

Main problem – the value needs to relate to the scheduler via a library of some sort. It may be set by an environment variable on some systems; the best way is to have the value set via a call to the start-up program, e.g. mpiexec –n 1 –universe_size 10 my_prog (a query sketch follows below).
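A minimal sketch of querying the attribute, using MPI_Comm_get_attr (the MPI-2 name for the MPI_Attr_get call in the manager example) and falling back to the current world size when the attribute is not supported; the fallback is an assumption, not part of the lecture.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int *usize_ptr, flag, universe_size, world_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* attribute_val returns a pointer to the stored integer */
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
                      &usize_ptr, &flag);
    /* fall back to the world size if the attribute is undefined */
    universe_size = flag ? *usize_ptr : world_size;

    printf("universe size = %d (attribute defined: %d)\n",
           universe_size, flag);

    MPI_Finalize();
    return 0;
}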

Page 67: High Performance Computing – CISC 811

Worker code

/* worker */
#include "mpi.h"

int main(int argc, char *argv[])
{
    int size;
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) error("No parent!");
    MPI_Comm_remote_size(parent, &size);
    if (size != 1) error("Something's wrong with the parent");

    /*
     * Parallel code here. The manager is represented as the process
     * with rank 0 in (the remote group of) the parent intercommunicator.
     * If the workers need to communicate among themselves, they can
     * use MPI_COMM_WORLD.
     */

    MPI_Finalize();
    return 0;
}

Page 68: High Performance Computing – CISC 811

Connecting different MPI Applications

Climate models are a good example of two separate applications that need to share data: the atmosphere model needs input from the ocean model (e.g. the effect of large currents on warming), and the ocean model needs input from the atmosphere (atmospheric temperature affects evaporation rates).

A secondary example is visualization of an active application – you may want to pass information to a visualization engine that is implemented in a separate program.

We'll assume that such programs cannot be spawned within the MPI environment.

Page 69: High Performance Computing – CISC 811

Mediating communication

It is necessary to modify one program to accept a connection from another. It must open a port:

  MPI_Open_port(MPI_INFO_NULL, port_name, ierr)

The connecting application still has to be allowed to connect to the communicator:

  MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, client, ierr)

MPI_Comm_accept is collective over the communicator in the 4th argument; client returns a new communicator (an intercommunicator) to the connecting application. The program can now send to the remote application.

Page 70: High Performance Computing – CISC 811

Example: 2d Poisson model

character*(MPI_MAX_PORT_NAME) port_name
integer client
...
if (myid .eq. 0) then
   call MPI_Open_port(MPI_INFO_NULL, port_name, ierr)
   write(*,*) port_name
endif
call MPI_Comm_accept(port_name, MPI_INFO_NULL, 0,         &
                     MPI_COMM_WORLD, client, ierr)
! Must send information to the connecting program
call MPI_Gather(mesh_size, 1, MPI_INTEGER,                &
                MPI_BOTTOM, 0, MPI_DATATYPE_NULL,         &
                0, client, ierr)
...
! After each iteration process 0 sends its info
if (myid .eq. 0) then
   call MPI_Send(it, 1, MPI_INTEGER, 0, 0, client, ierr)
end if
call MPI_Gatherv(mesh, mesh_size, MPI_DOUBLE_PRECISION,   &
                 MPI_BOTTOM, 0, 0,                        &
                 MPI_DATATYPE_NULL, 0, client, ierr)
...
if (myid .eq. 0) then
   call MPI_Close_port(port_name, ierr)
end if
call MPI_Comm_disconnect(client, ierr)
call MPI_Finalize(ierr)

These messages are matched in the connecting application; the connecting application must also execute the gather calls.

Page 71: High Performance Computing – CISC 811

Accessing the port

To connect to the front-end application the client must be given the port name. Port names are usually character strings. The client then connects to the port by calling

  MPI_Comm_connect(port_name, info, root, comm, newcomm, ierr)

which returns a new communicator. The remote communicator size can be determined via MPI_Comm_remote_size(newcomm, procs, ierr). Communication must match that executed in the server program. Finally, disconnect from the port:

  MPI_Comm_disconnect(newcomm, ierr)

A client-side sketch follows below.
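A minimal client-side sketch in C, under the assumption that the port name is passed on the command line and that the server sends a single integer with MPI_Send on the intercommunicator; it is not the lecture's connecting program.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm server;
    int procs, it;

    MPI_Init(&argc, &argv);
    if (argc < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    /* argv[1] holds the port name printed by the server's MPI_Open_port */
    MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);
    MPI_Comm_remote_size(server, &procs);
    printf("connected to %d server processes\n", procs);

    /* communication must mirror the server's calls; here we assume the
       server sends one integer with tag 0 from its rank 0 */
    MPI_Recv(&it, 1, MPI_INT, 0, 0, server, MPI_STATUS_IGNORE);

    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}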

Page 72: High Performance Computing – CISC 811

Alternative methods of publishing the port name

MPI_Publish_name(service, info, port, ierr) allows a given program (specified by service) to post the port name to the MPI infrastructure. Call MPI_Unpublish_name to remove the data from the system before finalizing.

The client then calls MPI_Lookup_name(service, info, port, ierr), which returns the desired port name.

Possible problem: two programmers may choose the same service name (unlikely). A sketch of publish/lookup is shown below.
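A minimal C sketch of both sides of name publishing; the service name "poisson_viz" and the helper function names are illustrative assumptions.

#include <mpi.h>

/* server side: publish a port name obtained from MPI_Open_port */
void server_publish(char *port_name)
{
    MPI_Publish_name("poisson_viz", MPI_INFO_NULL, port_name);
    /* ... MPI_Comm_accept, do work ... */
    MPI_Unpublish_name("poisson_viz", MPI_INFO_NULL, port_name);
}

/* client side: look up the port under the same service name and connect */
void client_lookup(MPI_Comm *server)
{
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Lookup_name("poisson_viz", MPI_INFO_NULL, port_name);
    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, server);
}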

Page 73: High Performance Computing – CISC 811

MPI-2 & threads

MPI processes are usually conceived as being single threaded. Threads carry their own program counter, register set and stack; individual threads are not considered to be visible outside the process.

Threads have the potential to improve performance in codes where polling is required: a polling thread operates at lower overhead than stopping the entire process to poll (equivalent to a non-blocking receive).

However, to work effectively the MPI implementation must be thread safe – multiple threads can execute message-passing calls without causing problems. MPI-1 is not guaranteed to be thread safe (some implementations are).

Page 74: High Performance Computing – CISC 811

Determining what type of threading is allowed

MPI_Init_thread(required, provided, ierr)

A non-thread-safe library will return MPI_THREAD_SINGLE.
Several user threads supported, but only one may make messaging calls (the standard interpretation): MPI_THREAD_FUNNELED.
All threads may make calls, but only one at a time: MPI_THREAD_SERIALIZED.
Finally, all threads may execute messaging at any time ("thread compliant"): MPI_THREAD_MULTIPLE.

This function can be used to start the program instead of MPI_Init. The C binding is MPI_Init_thread(int *argc, char ***argv, int required, int *provided). Note that only the thread which called the initialization may perform the finalize.

"required" allows you to choose between different versions (thread-support levels) of a given MPI library (a sketch follows below).
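A minimal C sketch, not the lecture's code, requesting full thread support and checking what the library actually provides before relying on it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    /* ask for MPI_THREAD_MULTIPLE; the library may return less */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        printf("Warning: only thread level %d provided\n", provided);

    /* ... threaded messaging only if provided permits it ... */

    MPI_Finalize();   /* must be called by the initializing thread */
    return 0;
}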

Page 75: High Performance Computing – CISC 811

Ensuring all processes agree on the number of threads

MPI does not require that environment variables be propagated to all processes. This is problematic for OpenMP, where the number of threads is usually specified by an environment variable: most often rank 0 gets the environment but the others do not. The code below may be used to propagate the thread count to all other processes.

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    nthreads_str = getenv("OMP_NUM_THREADS");
    if (nthreads_str)
        /* convert string to integer */
        nthreads = atoi(nthreads_str);
    else
        nthreads = 1;
}
MPI_Bcast(&nthreads, 1, MPI_INT, 0, MPI_COMM_WORLD);
omp_set_num_threads(nthreads);

Page 76: High Performance Computing – CISC 811

Summary Part 3

Dynamic process management is supported in MPI-2 via MPI_Comm_spawn.

Different MPI applications may connect to one another – important for a class of applications where different solvers require parts of the same dataset. Communication is mediated by ports.

The level of thread support can be determined at initialization time via MPI_Init_thread.

Page 77: High Performance Computing – CISC 811

Next lecture

Grids – motivations, applications
Computational grids
Data grids
Globus