ECMWF

Porting MPI Programs to the IBM Cluster 1600

Peter Towers

March 2004

Topics

The current hardware switch

Parallel Environment (PE)

Issues with Standard Sends/Receives

Use of non blocking communications

Debugging MPI programs

MPI tracing

Profiling MPI programs

Tasks per Node

Communications Optimisation

The new hardware switch

Third Practical

The current hardware switch

Designed for a previous generation of IBM hardware

Referred to as the Colony switch

2 switch adaptors per logical node

- 8 processors share 2 adaptors

- called a dual plane switch

Adaptors are multiplexed

- software stripes large messages across both adaptors

Minimum latency 21 microseconds

Maximum bandwidth approx 350 MBytes/s

- about 45 MB/s per task when all going off node together

Parallel Environment (PE)

MPI programs are managed by the IBM PE

IBM documentation refers to PE and POE

- POE stands for Parallel Operating Environment

- many environment variables to tune the parallel environment

- talks about launching parallel jobs interactively

ECMWF uses Loadleveler for batch jobs

- PE usage becomes almost transparent

Issues with Standard Sends/Receives

The MPI standard can be implemented in different ways

Programs may not be fully portable across platforms

Standard Sends and Receives can cause problems

- Potential for deadlocks

- need to understand Blocking versus Non Blocking communications

- need to understand Eager versus Rendezvous protocols

IFS had to be modified to run on IBM

Blocking Communications

MPI_Send is a blocking routine

It returns when it is safe to re-use the buffer being sent

- the send buffer can then be overwritten

The MPI layer may have copied the data elsewhere

- using internal buffer/mailbox space

- the message is then in transit but not yet received

- this is called an “eager” protocol

- good for short messages

The MPI layer may have waited for the receiver

- the data is copied from send to receive buffer directly

- lower overhead transfer

- this is called a “rendezvous” protocol

- good for large messages

MPI_Send on the IBM

Uses the “Eager” protocol for short messages

- by default short means up to 4096 bytes

- the higher the task count, the lower the value

Uses the “Rendezvous” protocol for long messages

Potential for send/send deadlocks

- tasks block in mpi_send

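    if (me .eq. 0) then
       him = 1
    else
       him = 0
    endif

    ! both tasks send first: each blocks in mpi_send once the message
    ! is too long for the eager protocol
    call mpi_send(sbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, ierror)
    call mpi_recv(rbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, stat, ierror)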

Solutions to Send/Send deadlocks

Pair up sends and receives

use MPI_SENDRECV

use a buffered send

- MPI_BSEND

use asynchronous sends/receives

- MPI_ISEND/MPI_IRECV

Paired Sends and Receives

More complex code

Requires close synchronisation

    if (me .eq. 0) then
       him = 1
       ! task 0 sends first, then receives
       call mpi_send(sbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, ierror)
       call mpi_recv(rbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, stat, ierror)
    else
       him = 0
       ! task 1 receives first, then sends
       call mpi_recv(rbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, stat, ierror)
       call mpi_send(sbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, ierror)
    endif

MPI_SENDRECV

Easier to code

Still implies close synchronisation

    call mpi_sendrecv(sbuff, n, MPI_REAL8, him, 1, &
                      rbuff, n, MPI_REAL8, him, 1, &
                      MPI_COMM_WORLD, stat, ierror)

MPI_BSEND

This performs a send using an additional buffer

- the buffer is allocated by the program via MPI_BUFFER_ATTACH

- done once as part of the program initialisation

Typically quick to implement

- add the mpi_buffer_attach call

how big to make the buffer?

- change MPI_SEND to MPI_BSEND everywhere

But introduces additional memory copy

- extra overhead

- not recommended for production codes
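One common answer to "how big to make the buffer?" is to ask MPI for the packed size of the largest message and add MPI_BSEND_OVERHEAD for each buffered send that can be outstanding at once. The sketch below assumes exactly two tasks, a single outstanding send, and illustrative names (bsend_sketch, attach_buf, max_pending) and message length that are not taken from the slides.

```fortran
program bsend_sketch
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 100000        ! message length (illustrative)
  integer, parameter :: max_pending = 1   ! buffered sends outstanding at once (assumption)
  real(kind=8) :: sbuff(n), rbuff(n)
  character, allocatable :: attach_buf(:)
  integer :: me, him, ntasks, tag, ierror, msg_bytes, buf_bytes
  integer :: stat(MPI_STATUS_SIZE)

  call mpi_init(ierror)
  call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
  call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierror)
  if (ntasks /= 2) then
     if (me == 0) print *, 'this sketch expects exactly 2 tasks'
     call mpi_abort(MPI_COMM_WORLD, 1, ierror)
  endif

  ! buffer space for one message = packed size + MPI_BSEND_OVERHEAD
  call mpi_pack_size(n, MPI_REAL8, MPI_COMM_WORLD, msg_bytes, ierror)
  buf_bytes = max_pending * (msg_bytes + MPI_BSEND_OVERHEAD)
  allocate(attach_buf(buf_bytes))

  ! done once, as part of program initialisation
  call mpi_buffer_attach(attach_buf, buf_bytes, ierror)

  him = 1 - me
  tag = 1
  sbuff = dble(me)

  ! both tasks send first; mpi_bsend copies into attach_buf and returns,
  ! so the send/send pattern no longer deadlocks
  call mpi_bsend(sbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, ierror)
  call mpi_recv(rbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, stat, ierror)

  ! detach waits until all buffered messages have been delivered
  call mpi_buffer_detach(attach_buf, buf_bytes, ierror)
  deallocate(attach_buf)
  call mpi_finalize(ierror)
end program bsend_sketch
```

If several buffered sends can be in flight before the matching receives are posted, max_pending has to grow accordingly, which is one reason the buffer size is hard to choose in a large code.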

MPI_IRECV / MPI_ISEND

Uses Non Blocking Communications

Routines return without completing the operation

- the operations run asynchronously

- Must NOT reuse the buffer until safe to do so

Later test that the operation completed

- via an integer identification handle passed to MPI_WAIT

I stands for immediate

- the call returns immediately

    ! post the receive first, then send, then wait for the receive to complete
    call mpi_irecv(rbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, request, ierror)
    call mpi_send(sbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, ierror)
    call mpi_wait(request, stat, ierror)

Alternatively could have used MPI_ISEND and MPI_RECV

Non Blocking Communications

Routines include

- MPI_ISEND

- MPI_IRECV

- MPI_WAIT

- MPI_WAITALL
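A minimal sketch of how these four routines fit together, assuming a ring exchange in which each task sends to its right-hand neighbour and receives from its left-hand one. The program name, the neighbour arithmetic and the message length are illustrative assumptions, not taken from the slides.

```fortran
program nonblocking_sketch
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 100000        ! message length (illustrative)
  real(kind=8) :: sbuff(n), rbuff(n)
  integer :: me, ntasks, left, right, tag, ierror
  integer :: requests(2)
  integer :: stats(MPI_STATUS_SIZE, 2)

  call mpi_init(ierror)
  call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
  call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierror)

  ! neighbours in a ring of tasks
  right = mod(me + 1, ntasks)
  left  = mod(me - 1 + ntasks, ntasks)
  tag = 1
  sbuff = dble(me)

  ! both calls return immediately; the transfers proceed asynchronously
  call mpi_irecv(rbuff, n, MPI_REAL8, left,  tag, MPI_COMM_WORLD, requests(1), ierror)
  call mpi_isend(sbuff, n, MPI_REAL8, right, tag, MPI_COMM_WORLD, requests(2), ierror)

  ! ... independent work that touches neither sbuff nor rbuff ...

  ! one call completes both operations; the buffers are safe to use afterwards
  call mpi_waitall(2, requests, stats, ierror)

  if (rbuff(1) /= dble(left)) print *, 'task', me, ': unexpected data received'

  call mpi_finalize(ierror)
end program nonblocking_sketch
```

Because no task blocks before both operations are posted, the pattern does not depend on whether the library picks the eager or the rendezvous protocol.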

Debugging MPI Programs

The Universal Debug Tool

Totalview

The Universal Debug Tool

The Print/Write Statement

Recommend the use of call flush(unit_number)

- ensures output is not left in runtime buffers

Recommend the use of separate output files, eg:

    unit_number = 100 + mytask
    write(unit_number,*) ......
    call flush(unit_number)

Or set the Environment variable MP_LABELIO=yes

Do not output too much

Use as few processors as possible

Think carefully.....

Discuss the problem with a colleague

Totalview

Assumes you can launch X-Windows remotely

Run totalview as part of a loadleveler job

    export DISPLAY=.......
    poe totalview -a a.out <arguments>

But you have to wait for the job to run.....

Use only a few processors

- minimises the queuing time

- minimises the waste of resource while thinking....

MPI Trace Tools

Identify message passing hot spots

Just link with

/usr/local/lib/trace/libmpiprof.a

low overhead timer for all mpi routine calls

Produces output files named mpi_profile.N

- where N is the task number

Examples of the output follow

Profiling MPI programs

The same as for serial codes

Use the -pg flag at compile and/or link time

Produces multiple gmon.out.N files

- N is the task number

gprof a.out gmon.out.*

The routine .kickpipes often appears high up the profile

- an internal mpi library routine

- where the mpi library spins waiting for something

eg a message to be sent

or in a barrier

Tasks Per Node ( 1 of 2 )

Try both 7 and 8 tasks per node for multi node jobs

- 7 tasks may run faster than 8

- depends on the frequency of communications

7 tasks leaves a processor spare

- used by the OS and background daemons such as for GPFS

- mpi tasks run with minimal scheduling interference

8 tasks are subject to scheduling interference

- by default mpi tasks cpu spin in kickpipes

- they may spin waiting for a task that has been scheduled out

- the OS has to schedule cpu time for background processes

- random interference across nodes is cumulative

Tasks Per Node ( 2 of 2 )

Also try 8 tasks per node and MP_WAIT_MODE=sleep

- export MP_WAIT_MODE=sleep

- tasks give up the cpu instead of spinning

- increases latency but reduces interference

- effect varies from application to application

Mixed mode MPI/OpenMP works well

- master OpenMP thread does the message passing

- while slave OpenMP threads go to sleep

- cpu cycles are freed up for background processes

- used by the IFS to good effect

2 tasks each of 4 threads per node

suspect success depends on the parallel granularity
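The mixed mode pattern can be sketched as follows: OpenMP threads share the computation, and only the master thread calls MPI once the parallel region has ended. This is an illustrative sketch, not the IFS implementation; it assumes an MPI library that provides MPI_INIT_THREAD and MPI_THREAD_FUNNELED, and the program name and loop are hypothetical.

```fortran
program hybrid_sketch
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 1000000       ! local work size (illustrative)
  integer :: me, i, provided, ierror
  real(kind=8) :: partial, total

  ! request the level where only the master thread makes MPI calls
  call mpi_init_thread(MPI_THREAD_FUNNELED, provided, ierror)
  call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
  if (provided < MPI_THREAD_FUNNELED .and. me == 0) then
     print *, 'warning: requested thread support level not provided'
  endif

  ! all OpenMP threads share the computation
  partial = 0.0d0
  !$omp parallel do reduction(+:partial)
  do i = 1, n
     partial = partial + dble(i + me)
  enddo
  !$omp end parallel do

  ! outside the parallel region only the master thread is running,
  ! so it alone does the message passing
  call mpi_allreduce(partial, total, 1, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, ierror)

  if (me == 0) print *, 'global sum =', total

  call mpi_finalize(ierror)
end program hybrid_sketch
```

With 2 tasks of 4 threads per node, as quoted above, the thread count would be set outside the program, for example through OMP_NUM_THREADS.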

Communications Optimisation

Communications costs often impact parallel speedup

Concatenate messages

- fewer larger messages are better

- reduces the effect of latency

Increase MP_EAGER_LIMIT

- export MP_EAGER_LIMIT=65536

- maximum size for messages sent with the “eager” protocol

Use collective routines

Use ISEND/IRECV

Remove barriers

Experiment with tasks per node
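As an illustration of message concatenation, the sketch below copies two arrays into one contiguous buffer so that a single message, and so a single latency cost, replaces two. It assumes at least two tasks; the array names and sizes are hypothetical.

```fortran
program concat_sketch
  implicit none
  include 'mpif.h'
  integer, parameter :: na = 1000, nb = 500   ! array lengths (illustrative)
  real(kind=8) :: a(na), b(nb), packed(na + nb)
  integer :: me, ntasks, tag, ierror
  integer :: stat(MPI_STATUS_SIZE)

  call mpi_init(ierror)
  call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
  call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierror)
  if (ntasks < 2) then
     print *, 'this sketch expects at least 2 tasks'
     call mpi_abort(MPI_COMM_WORLD, 1, ierror)
  endif
  tag = 1

  if (me == 0) then
     a = 1.0d0
     b = 2.0d0
     ! concatenate both arrays into one buffer: one message instead of two
     packed(1:na) = a
     packed(na+1:na+nb) = b
     call mpi_send(packed, na + nb, MPI_REAL8, 1, tag, MPI_COMM_WORLD, ierror)
  else if (me == 1) then
     call mpi_recv(packed, na + nb, MPI_REAL8, 0, tag, MPI_COMM_WORLD, stat, ierror)
     ! unpack into the separate arrays on the receiving side
     a = packed(1:na)
     b = packed(na+1:na+nb)
  endif

  call mpi_finalize(ierror)
end program concat_sketch
```

The extra local copies cost memory bandwidth, so the gain is largest when the individual messages are short and latency dominates the transfer time.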

The new hardware switch

Designed for the Cluster 1600

Referred to as the Federation switch

2 switch adaptors per physical node

- 2 links each 2 GB/s per adaptor

- 32 processors share 4 links

Adaptors/links are NOT multiplexed

Minimum latency 10 microseconds

Maximum bandwidth approx 2000 MBytes/s

- about 250 MB/s per task when all going off node together

Up to 5 times better performance

32 processor nodes

- will affect how we schedule and run jobs

Third Practical

Contained in the directory

/home/ectrain/trx/mpi/exercise3 on hpca

Parallelising the computation of PI

See the README for details
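For orientation only, one common way to parallelise the computation of PI is to integrate 4/(1+x*x) over [0,1] with the midpoint rule, give each task every ntasks-th interval, and combine the partial sums with MPI_REDUCE. The README may specify a different method; the program name and interval count below are assumptions.

```fortran
program pi_sketch
  implicit none
  include 'mpif.h'
  integer, parameter :: nintervals = 1000000   ! illustrative resolution
  integer :: me, ntasks, i, ierror
  real(kind=8) :: h, x, partial, pi_estimate

  call mpi_init(ierror)
  call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
  call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierror)

  ! midpoint rule for the integral of 4/(1+x*x) over [0,1];
  ! task "me" handles intervals me+1, me+1+ntasks, ...
  h = 1.0d0 / dble(nintervals)
  partial = 0.0d0
  do i = me + 1, nintervals, ntasks
     x = h * (dble(i) - 0.5d0)
     partial = partial + 4.0d0 / (1.0d0 + x*x)
  enddo
  partial = partial * h

  ! sum the partial results onto task 0
  call mpi_reduce(partial, pi_estimate, 1, MPI_REAL8, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierror)

  if (me == 0) print *, 'pi is approximately', pi_estimate

  call mpi_finalize(ierror)
end program pi_sketch
```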
