ECMWF
Porting MPI Programs to the IBM Cluster 1600
Peter Towers
March 2004
Topics
The current hardware switch
Parallel Environment (PE)
Issues with Standard Sends/Receives
Use of non blocking communications
Debugging MPI programs
MPI tracing
Profiling MPI programs
Tasks per Node
Communications Optimisation
The new hardware switch
Third Practical
The current hardware switch
Designed for a previous generation of IBM hardware
Referred to as the Colony switch
2 switch adaptors per logical node
- 8 processors share 2 adaptors
- called a dual plane switch
Adaptors are multiplexed
- software stripes large messages across both adaptors
Minimum latency 21 microseconds
Maximum bandwidth approx 350 MBytes/s
- about 45 MB/s per task when all going off node together
Parallel Environment (PE)
MPI programs are managed by the IBM PE
IBM documentation refers to PE and POE
- POE stands for Parallel Operating Environment
- many environment variables to tune the parallel environment
- talks about launching parallel jobs interactively
ECMWF uses Loadleveler for batch jobs
- PE usage becomes almost transparent
Issues with Standard Sends/Receives
The MPI standard can be implemented in different ways
Programs may not be fully portable across platforms
Standard Sends and Receives can cause problems
- Potential for deadlocks
- need to understand Blocking v Non Blocking communications
- need to understand Eager versus Rendezvous protocols
IFS had to be modified to run on IBM
Blocking Communications
MPI_Send is a blocking routine
It returns when it is safe to re-use the buffer being sent
- the send buffer can then be overwritten
The MPI layer may have copied the data elsewhere
- using internal buffer/mailbox space
- the message is then in transit but not yet received
- this is called an “eager” protocol
- good for short messages
The MPI layer may have waited for the receiver
- the data is copied from send to receive buffer directly
- lower overhead transfer
- this is called a “rendezvous” protocol
- good for large messages
MPI_Send on the IBM
Uses the “Eager” protocol for short messages
- by default short means up to 4096 bytes
- the higher the task count, the lower the value
Uses the “Rendezvous” protocol for long messages
Potential for send/send deadlocks
- tasks block in mpi_send
  if (me .eq. 0) then
    him=1
  else
    him=0
  endif
  call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
  call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
Solutions to Send/Send deadlocks
Pair up sends and receives
use MPI_SENDRECV
use a buffered send
- MPI_BSEND
use asynchronous sends/receives
- MPI_ISEND/MPI_IRECV
Paired Sends and Receives
More complex code
Requires close synchronisation
  if (me .eq. 0) then
    him=1
    call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
    call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
  else
    him=0
    call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
    call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
  endif
MPI_SENDRECV
Easier to code
Still implies close synchronisation
call mpi_sendrecv(sbuff,n,MPI_REAL8,him,1, &
rbuff,n,MPI_REAL8,him,1, &
MPI_COMM_WORLD,stat,ierror)
MPI_BSEND
This performs a send using an additional buffer
- the buffer is allocated by the program via MPI_BUFFER_ATTACH
- done once as part of the program initialisation
Typically quick to implement
- add the mpi_buffer_attach call
how big to make the buffer?
- change MPI_SEND to MPI_BSEND everywhere
But introduces additional memory copy
- extra overhead
- not recommended for production codes
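The buffer size question above can be approached with MPI_PACK_SIZE and MPI_BSEND_OVERHEAD, both part of the MPI standard. The following is a minimal sketch for a two-task exchange, assuming one outstanding buffered message per task; the program name, the array length n and the variable attach_buf are illustrative and not taken from the slides.

```fortran
program bsend_sketch
  use mpi            ! older codes may use: include 'mpif.h'
  implicit none
  integer, parameter :: n = 1000               ! illustrative message length
  real(kind=8) :: sbuff(n), rbuff(n)
  character, allocatable :: attach_buf(:)      ! hypothetical name for the attached buffer
  integer :: me, ntasks, him, tag, ierror, packsize, bufsize
  integer :: stat(MPI_STATUS_SIZE)

  call mpi_init(ierror)
  call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
  call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierror)
  if (ntasks /= 2) call mpi_abort(MPI_COMM_WORLD, 1, ierror)   ! sketch assumes 2 tasks

  ! Space needed for one message of n REAL8 values, plus the per-message overhead.
  ! With several buffered sends in flight, sum this quantity over all of them.
  call mpi_pack_size(n, MPI_REAL8, MPI_COMM_WORLD, packsize, ierror)
  bufsize = packsize + MPI_BSEND_OVERHEAD
  allocate(attach_buf(bufsize))
  call mpi_buffer_attach(attach_buf, bufsize, ierror)          ! done once, at initialisation

  him = 1 - me
  tag = 1
  sbuff = real(me, kind=8)

  ! Both tasks send first: mpi_bsend copies into attach_buf and returns,
  ! so the send/send pattern no longer deadlocks.
  call mpi_bsend(sbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, ierror)
  call mpi_recv (rbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, stat, ierror)

  ! Detach blocks until all buffered messages have been delivered.
  call mpi_buffer_detach(attach_buf, bufsize, ierror)
  deallocate(attach_buf)
  call mpi_finalize(ierror)
end program bsend_sketch
```

The extra copy into attach_buf is the overhead the slide warns about, which is why this approach suits quick ports rather than production codes.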
MPI_IRECV / MPI_ISEND
Uses Non Blocking Communications
Routines return without completing the operation
- the operations run asynchronously
- Must NOT reuse the buffer until safe to do so
Later test that the operation completed
- via an integer identification handle passed to MPI_WAIT
I stands for immediate
- the call returns immediately
  call mpi_irecv(rbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,request,ierror)
  call mpi_send (sbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,ierror)
  call mpi_wait(request,stat,ierror)
Alternatively could have used MPI_ISEND and MPI_RECV
Non Blocking Communications
Routines include
- MPI_ISEND
- MPI_IRECV
- MPI_WAIT
- MPI_WAITALL
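As an illustration of how these four routines fit together, here is a minimal sketch of a two-task exchange in which both the send and the receive are non blocking and are completed together with MPI_WAITALL. The program name, the message length and the array names requests and statuses are assumptions made for the example.

```fortran
program nonblocking_sketch
  use mpi            ! older codes may use: include 'mpif.h'
  implicit none
  integer, parameter :: n = 100000             ! illustrative message length
  real(kind=8) :: sbuff(n), rbuff(n)
  integer :: me, ntasks, him, ierror
  integer :: requests(2)                       ! one handle per outstanding operation
  integer :: statuses(MPI_STATUS_SIZE, 2)

  call mpi_init(ierror)
  call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
  call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierror)
  if (ntasks /= 2) call mpi_abort(MPI_COMM_WORLD, 1, ierror)   ! sketch assumes 2 tasks

  him = 1 - me
  sbuff = real(me, kind=8)

  ! Post the receive first so the incoming data has somewhere to land,
  ! then post the send. Both calls return immediately.
  call mpi_irecv(rbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, requests(1), ierror)
  call mpi_isend(sbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, requests(2), ierror)

  ! Independent computation could go here, provided it does not
  ! read rbuff or modify sbuff.

  ! Block until both operations have completed; the buffers are then safe to reuse.
  call mpi_waitall(2, requests, statuses, ierror)

  call mpi_finalize(ierror)
end program nonblocking_sketch
```

Because neither task blocks in a send, this pattern does not depend on the eager limit and works for messages of any size.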
Debugging MPI Programs
The Universal Debug Tool
and
Totalview
The Universal Debug Tool
The Print/Write Statement
Recommend the use of call flush(unit_number)
- ensures output is not left in runtime buffers
Recommend the use of separate output files eg:
  unit_number=100+mytask
  write(unit_number,*) ......
  call flush(unit_number)
Or set the Environment variable MP_LABELIO=yes
Do not output too much
Use as few processors as possible
Think carefully.....
Discuss the problem with a colleague
Totalview
Assumes you can launch X-Windows remotely
Run totalview as part of a loadleveler job
  export DISPLAY=.......
  poe totalview -a a.out <arguments>
But you have to wait for the job to run.....
Use only a few processors
- minimises the queuing time
- minimises the waste of resource while thinking....
MPI Trace Tools
Identify message passing hot spots
Just link with
/usr/local/lib/trace/libmpiprof.a
low overhead timer for all mpi routine calls
Produces output files named mpi_profile.N
- where N is the task number
Examples of the output follow
Profiling MPI programs
The same as for serial codes
Use the -pg flag at compile and/or link time
Produces multiple gmon.out.N files
- N is the task number
gprof a.out gmon.out.*
The routine .kickpipes often appears high up the profile
- an internal mpi library routine
- where the mpi library spins waiting for something
eg a message to be sent
or in a barrier
Tasks Per Node ( 1 of 2 )
Try both 7 and 8 tasks per node for multi node jobs
- 7 tasks may run faster than 8
- depends on the frequency of communications
7 tasks leaves a processor spare
- used by the OS and background daemons such as for GPFS
- mpi tasks run with minimal scheduling interference
8 tasks are subject to scheduling interference
- by default mpi tasks cpu spin in kickpipes
- they may spin waiting for a task that has been scheduled out
- the OS has to schedule cpu time for background processes
- random interference across nodes is cumulative
Tasks Per Node ( 2 of 2 )
Also try 8 tasks per node and MP_WAIT_MODE=sleep
- export MP_WAIT_MODE=sleep
- tasks give up the cpu instead of spinning
- increases latency but reduces interference
- effect varies from application to application
Mixed mode MPI/OpenMP works well
- master OpenMP thread does the message passing
- while slave OpenMP threads go to sleep
- cpu cycles are freed up for background processes
- used by the IFS to good effect
2 tasks each of 4 threads per node
suspect success depends on the parallel granularity
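The mixed mode pattern described above can be sketched as follows: the computation runs inside an OpenMP parallel region, and the MPI call is made outside it, so only the master thread communicates. This is a minimal illustration, assuming an MPI library that supports MPI_THREAD_FUNNELED and a compiler with OpenMP enabled; the program and variable names are invented for the example and are not taken from the IFS. Whether idle worker threads actually sleep or spin depends on the OpenMP runtime settings.

```fortran
program hybrid_sketch
  use mpi            ! older codes may use: include 'mpif.h'
  implicit none
  integer, parameter :: n = 100000             ! illustrative problem size per task
  real(kind=8) :: work(n), local_sum, global_sum
  integer :: provided, ierror, i

  ! Request the level where only the master thread makes MPI calls.
  call mpi_init_thread(MPI_THREAD_FUNNELED, provided, ierror)
  if (provided < MPI_THREAD_FUNNELED) call mpi_abort(MPI_COMM_WORLD, 1, ierror)

  work = 1.0d0
  local_sum = 0.0d0

  ! All OpenMP threads of this task share the computation.
  !$omp parallel do reduction(+:local_sum)
  do i = 1, n
     local_sum = local_sum + work(i)
  end do
  !$omp end parallel do

  ! Outside the parallel region only the master thread is running,
  ! so it alone does the message passing.
  call mpi_allreduce(local_sum, global_sum, 1, MPI_REAL8, MPI_SUM, &
                     MPI_COMM_WORLD, ierror)

  call mpi_finalize(ierror)
end program hybrid_sketch
```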
Communications Optimisation
Communications costs often impact parallel speedup
Concatenate messages
- fewer larger messages are better
- reduces the effect of latency
Increase MP_EAGER_LIMIT
- export MP_EAGER_LIMIT=65536
- maximum size for messages sent with the “eager” protocol
Use collective routines
Use ISEND/IRECV
Remove barriers
Experiment with tasks per node
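As an illustration of the "Concatenate messages" advice above, the following minimal sketch copies two arrays into one buffer and sends them as a single message, so the latency cost is paid once instead of twice. The array names a, b and packed, and the length n, are assumptions made for the example.

```fortran
program concat_sketch
  use mpi            ! older codes may use: include 'mpif.h'
  implicit none
  integer, parameter :: n = 1000               ! illustrative array length
  real(kind=8) :: a(n), b(n), packed(2*n)
  integer :: me, ntasks, ierror
  integer :: stat(MPI_STATUS_SIZE)

  call mpi_init(ierror)
  call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
  call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierror)
  if (ntasks < 2) call mpi_abort(MPI_COMM_WORLD, 1, ierror)    ! needs a sender and a receiver

  if (me == 0) then
     a = 1.0d0
     b = 2.0d0
     ! Concatenate both arrays into one send buffer ...
     packed(1:n)     = a
     packed(n+1:2*n) = b
     ! ... and send one message of 2*n values instead of two of n.
     call mpi_send(packed, 2*n, MPI_REAL8, 1, 1, MPI_COMM_WORLD, ierror)
  else if (me == 1) then
     call mpi_recv(packed, 2*n, MPI_REAL8, 0, 1, MPI_COMM_WORLD, stat, ierror)
     ! Unpack on the receiving side.
     a = packed(1:n)
     b = packed(n+1:2*n)
  end if

  call mpi_finalize(ierror)
end program concat_sketch
```

The copy into packed costs memory bandwidth, so the gain is largest when the individual messages are small and latency dominates the transfer time.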
The new hardware switch
Designed for the Cluster 1600
Referred to as the Federation switch
2 switch adaptors per physical node
- 2 links each 2GB/s per adaptor
- 32 processors share 4 links
Adaptors/links are NOT multiplexed
Minimum latency 10 microseconds
Maximum bandwidth approx 2000 MByte/s
- about 250 MB/s per task when all going off node together
Up to 5 times better performance
32 processor nodes
- will affect how we schedule and run jobs
Third Practical
Contained in the directory
/home/ectrain/trx/mpi/exercise3 on hpca
Parallelising the computation of PI
See the README for details