Presented by
Dealing with the Scale Problem
Innovative Computing Laboratory
MPI Team
2 Dongarra_KOJAK_SC072 Bosilca_OpenMPI_SC07
Runtime scalability
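Undirected graph $G := (V, E)$, $|V| = n$ (any size).
Node $i \in \{0, 1, 2, \ldots, n-1\}$ has links to a set of nodes $U$:
$U = \{ i \pm 1, i \pm 2, \ldots, i \pm 2^k \mid 2^k \le n \}$ in a circular space, that is,
$U = \{ (i+1) \bmod n, (i+2) \bmod n, \ldots, (i+2^k) \bmod n \mid 2^k \le n \}$ and $\{ (n+i-1) \bmod n, (n+i-2) \bmod n, \ldots, (n+i-2^k) \bmod n \mid 2^k \le n \}$.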
Merging all links creates a binomial graph from each node of the graph.
Broadcast from any node in log2(n) steps.
Binomial graph
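A minimal Python sketch of the construction described on this slide, assuming the link offsets are the powers of two not larger than n, taken modulo n. The names `bmg_neighbors` and `broadcast_steps` are illustrative and are not part of Open MPI. The broadcast model is also an assumption: every node that already holds the data sends one message per step, over its +2^s link at step s.

```python
def bmg_neighbors(i, n):
    """Neighbors of node i: (i +/- 2^k) mod n for every 2^k <= n."""
    neighbors = set()
    step = 1
    while step <= n:
        neighbors.add((i + step) % n)
        neighbors.add((i - step) % n)
        step *= 2
    neighbors.discard(i)  # an offset equal to n would point back at i
    return sorted(neighbors)


def broadcast_steps(root, n):
    """Count steps until all n nodes hold the data (one send per holder per step)."""
    have = {root}
    steps = 0
    while len(have) < n:
        hop = 1 << steps  # 2^steps, always smaller than n inside this loop
        have |= {(i + hop) % n for i in have}
        steps += 1
    return steps
```

With n = 12, `bmg_neighbors(0, 12)` gives `[1, 2, 4, 8, 10, 11]`, and `broadcast_steps(r, 12)` is 4 for every root r, which is log2(n) rounded up.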
Degree = number of neighbors = number of connections = resource consumption
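Node-connectivity ($\kappa$)
Link-connectivity ($\lambda$)
Optimally connected: $\kappa = \lambda = \delta_{\min}$
Diameter: $O\left(\left\lceil \frac{\lceil \log_2(n) \rceil}{2} \right\rceil\right)$
Average distance $\approx \frac{\log_2(n)}{3}$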
Binomial graph properties
Runtime scalability
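The properties listed on this slide can be checked numerically on small graphs. The sketch below is one way to do that, assuming the same power-of-two link offsets as the graph definition; it measures degree, diameter, and average distance by breadth-first search. The function names are illustrative, and n is assumed to be at least 2.

```python
from collections import deque


def bmg_offsets(n):
    """Distinct non-zero link offsets (+/- 2^k mod n) for every 2^k <= n."""
    offsets = set()
    step = 1
    while step <= n:
        offsets.update({step % n, -step % n})
        step *= 2
    offsets.discard(0)
    return sorted(offsets)


def distances_from(src, n):
    """Hop count from src to every node, by breadth-first search."""
    offsets = bmg_offsets(n)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for off in offsets:
            v = (u + off) % n
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def bmg_properties(n):
    """Return (degree, diameter, average distance over all ordered pairs)."""
    degree = len(bmg_offsets(n))
    hops = [d for s in range(n)
            for t, d in distances_from(s, n).items() if t != s]
    return degree, max(hops), sum(hops) / len(hops)
```

For n = 12 this reports a degree of 6, a diameter of 2, and an average distance of 16/11, which can be compared by hand with the expressions on the slide.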
Broadcast: optimal in number of steps log2(n) (binomial tree from each node)
Allgather: Bruck algorithm, log2(n) steps. At step $s$:
Node $i$ sends data to node $i - 2^s$.
Node $i$ receives data from node $i + 2^s$.
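The exchange pattern above can be simulated in a few lines. This is a sketch of the data movement only, not MPI code. It assumes each node starts with one block and, at the step with distance 2^s, receives the first min(2^s, n - 2^s) blocks held by node i + 2^s. The name `bruck_allgather` is illustrative.

```python
def bruck_allgather(n):
    """Simulate the Bruck allgather; return (blocks per node in rank order, steps)."""
    blocks = [[i] for i in range(n)]  # node i starts with its own block
    dist, steps = 1, 0
    while dist < n:
        count = min(dist, n - dist)  # blocks still missing, capped at dist
        # Node i receives from node i + dist (which sends to its own i - dist).
        blocks = [blocks[i] + blocks[(i + dist) % n][:count] for i in range(n)]
        dist *= 2
        steps += 1
    # Node i now holds blocks i, i+1, ..., i+n-1 (mod n); rotate into rank order.
    ordered = [b[n - i:] + b[:n - i] for i, b in enumerate(blocks)]
    return ordered, steps
```

For n = 5 every node ends with blocks 0 to 4 after 3 steps, that is, log2(n) rounded up.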
Routing cost
Runtime scalability
Because of the counter-clockwise links the routing problem is NP-complete.
Good approximations exist, with a max overhead of 1 hop.
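To illustrate what a routing heuristic on this topology can look like, here is a simple greedy sketch: at each hop, move to the neighbor closest to the destination on the ring. This is an assumption made for illustration, not the approximation the slide refers to, and no claim is made that it meets the 1-hop overhead bound. The names `circular_distance` and `greedy_route` are hypothetical.

```python
def circular_distance(a, b, n):
    """Shortest distance between a and b around a ring of n nodes."""
    d = (a - b) % n
    return min(d, n - d)


def greedy_route(src, dst, n):
    """Follow, at each hop, the +/- 2^k link that lands closest to dst."""
    path = [src]
    cur = src
    while cur != dst:
        candidates = []
        step = 1
        while step <= n:
            candidates.append((cur + step) % n)
            candidates.append((cur - step) % n)
            step *= 2
        # The +/- 1 link always reduces the ring distance, so this terminates.
        cur = min(candidates, key=lambda v: circular_distance(v, dst, n))
        path.append(cur)
    return path
```

For n = 12, `greedy_route(0, 7, 12)` follows the +8 link and then the -1 link, giving the path `[0, 8, 7]`.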
Self-Healing BMG
[Figure axes: # of nodes, # of added nodes, # of new connections]
Runtime scalability
Dynamic environments
Optimization process
Run-time selection process
We use performance models, graphical encoding, and statistical learning techniques to build platform-specific, efficient, and fast runtime decision functions.
Collective communications
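As a rough illustration of what a runtime decision function does, the sketch below builds one from a table of timings and picks the fastest algorithm at the nearest measured message size at or below the request. A plain lookup table is swapped in here for the performance models and statistical learning techniques named on the slide. The function names, algorithm names, and timings are all made up for illustration.

```python
import bisect


def build_decision_function(measurements):
    """measurements: {algorithm_name: {message_size_bytes: seconds}}."""
    sizes = sorted({size for timings in measurements.values() for size in timings})
    best = []
    for size in sizes:
        timed = {alg: t[size] for alg, t in measurements.items() if size in t}
        best.append(min(timed, key=timed.get))  # fastest algorithm at this size

    def decide(message_size):
        # Use the largest measured size not above message_size (or the smallest).
        idx = bisect.bisect_right(sizes, message_size) - 1
        return best[max(idx, 0)]

    return decide


# Hypothetical timings, NOT measurements from the slides.
SAMPLE = {
    "binomial_tree": {1024: 1.0e-4, 1048576: 9.0e-2},
    "pipeline": {1024: 3.0e-4, 1048576: 2.0e-2},
}
decide_bcast = build_decision_function(SAMPLE)
```

With these sample numbers, `decide_bcast(4096)` returns `"binomial_tree"` and `decide_bcast(2_000_000)` returns `"pipeline"`.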
Collective communications
Model prediction
[Figure: panels (A), (B), (C), (D)]
Model prediction
Collective communications
[Figure: panels (A), (B), (C), (D)]
64 Opterons with 1 Gb/s TCP
Tuning broadcast
Collective communications
Tuning broadcast
Collective communications
Application tuning
Parallel Ocean Program (POP) on a Cray XT4. Dominated by MPI_Allreduce of 3 doubles.
By default, Open MPI selects recursive doubling, similar to Cray MPI (based on MPICH).
Cray MPI has better latency.
As a result, POP using Open MPI is 10% slower on 256 processes.
Profile the system for this specific collective and determine that “reduce + bcast” is faster. Replace the decision function.
New POP performance is about 5% faster than Cray MPI.
Collective communications
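The two allreduce strategies compared on this slide can be simulated to show that they produce the same result with different communication patterns. The sketch below assumes a power-of-two number of ranks and element-wise summation. It models data movement only, so the step counts it returns say nothing about which strategy is faster on a given machine; that conclusion comes from profiling, as stated on the slide. The function names are illustrative.

```python
def allreduce_recursive_doubling(values):
    """Each rank exchanges with rank XOR 2^s at step s and sums the two vectors."""
    p = len(values)  # assumed to be a power of two
    data = [list(v) for v in values]
    dist, steps = 1, 0
    while dist < p:
        data = [[a + b for a, b in zip(data[r], data[r ^ dist])] for r in range(p)]
        dist *= 2
        steps += 1
    return data, steps


def allreduce_reduce_bcast(values):
    """Binomial-tree reduce to rank 0, then binomial-tree broadcast from rank 0."""
    p = len(values)  # assumed to be a power of two
    data = [list(v) for v in values]
    dist, steps = 1, 0
    while dist < p:  # reduce phase
        for r in range(0, p, 2 * dist):
            data[r] = [a + b for a, b in zip(data[r], data[r + dist])]
        dist *= 2
        steps += 1
    dist //= 2
    while dist >= 1:  # broadcast phase
        for r in range(0, p, 2 * dist):
            data[r + dist] = list(data[r])
        dist //= 2
        steps += 1
    return data, steps
```

On 8 ranks holding 3 doubles each, both functions leave the same sum on every rank; the first takes 3 steps and the second 6.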
Fault tolerance
P1 P2 P3 P4: 4 available processors.
P1 P2 P3 P4 Pc: add a fifth and perform a checkpoint (Allreduce): P1 + P2 + P3 + P4 = Pc.
P1 P2 P3 P4 Pc: ready to continue.
....
P1 P2 P3 P4 Pc: failure (P2 is lost).
P1 P3 P4 Pc: ready for recovery.
P1 P3 P4 Pc: recover the processor/data: Pc - P1 - P3 - P4 = P2.
Diskless checkpointing
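The checkpoint and recovery steps on this slide amount to an element-wise sum and a subtraction. A minimal sketch, assuming each processor holds a small vector of floats and that exactly one processor is lost; the function names and the sample values are made up for illustration.

```python
def checkpoint(blocks):
    """Element-wise sum of all per-process blocks (what a sum Allreduce yields)."""
    return [sum(column) for column in zip(*blocks)]


def recover(checksum, survivors):
    """Rebuild the single missing block: checksum minus every surviving block."""
    lost = list(checksum)
    for block in survivors:
        lost = [a - b for a, b in zip(lost, block)]
    return lost


# Illustrative data; values chosen to be exactly representable in binary.
data = {"P1": [1.0, 2.0], "P2": [3.5, -1.0], "P3": [0.25, 4.0], "P4": [2.0, 2.0]}
pc = checkpoint(list(data.values()))
survivors = [block for name, block in data.items() if name != "P2"]
restored = recover(pc, survivors)  # equals data["P2"] for these values
```

With general floating-point data the recovered block can differ from the original by rounding error, since floating-point addition is not exactly invertible.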
Diskless checkpointing
How to checkpoint?
Either floating-point arithmetic or binary arithmetic will work.
If checkpoints are performed in floating‐point arithmetic then we can exploit the linearity of the mathematical relations on the object to maintain the checksums.
How to support multiple failures?
Reed-Solomon algorithm.
Supporting p failures requires p additional processors (resources).
Fault tolerance
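The linearity remark on this slide can be made concrete. If every process applies the same linear update to its local vectors, the checksum of the results equals that update applied to the checksums, so the checksum can be maintained without a new global sum. The sketch below assumes an axpy-style update (y = alpha * x + y) as the example operation; the function names and values are illustrative.

```python
def axpy(alpha, x, y):
    """Linear update: alpha * x + y, element-wise."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def checksum(vectors):
    """Element-wise sum over all processes."""
    return [sum(column) for column in zip(*vectors)]


# Illustrative per-process data (exactly representable values).
xs = [[1.0, 2.0], [0.5, 4.0], [3.0, -1.0]]
ys = [[2.0, 0.0], [1.0, 1.0], [0.25, 8.0]]
alpha = 2.0

cx, cy = checksum(xs), checksum(ys)

# Every process updates its own y; the checkpoint holder updates only the checksums.
new_ys = [axpy(alpha, x, y) for x, y in zip(xs, ys)]
new_cy = axpy(alpha, cx, cy)  # matches checksum(new_ys) for these values
```

In exact arithmetic the two always agree; in floating point they agree up to rounding error.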
Performance of PCG with different MPI libraries
For ckpt, we generate one checkpoint every 2000 iterations.
Fault tolerance
PCG: fault-tolerant CG, 64x2 AMD64 connected using GigE.
Checkpoint overhead in seconds
PCG
Fault tolerance
Fault tolerance
Detailing event types to avoid intrusiveness
Fault tolerance
Interposition in Open MPI
We want to avoid tainting the base code with #ifdef FT_BLALA.
Vampire PML loads a new class of MCA components.
Vprotocols provide the entire FT protocol (only pessimistic for now).
You can use the ability to define subframeworks in your components!
Keep using the optimized low level and zero-copy devices (BTL) for communication.
Unchanged message scheduling logic.
[Figure: Performance overhead of improved pessimistic message logging, NetPIPE on Myrinet 2000. X axis: message size (bytes). Y axis: % performance of Open MPI, 75% to 105%. Series: Open MPI-V no Sender-Based, Open MPI-V with Sender-Based.]
Fault tolerance
Performance overhead
Myrinet 2G (mx 1.1.5), Opteron 146x2, 2GB RAM, Linux 2.6.18, gcc/gfortran 4.1.2, NPB3.2, NetPIPE.
Only two application kernels show non-deterministic events (MG, LU).
[Figure labels: Design and code, Testing, Interactive Debugging, Release]
Fault tolerance
Debugging applications
The usual scenario involves:
Programmer designs a testing suite.
Testing suite shows a bug.
Programmer runs the application in a debugger (such as gdb) to understand the bug and fix it.
Programmer runs the testing suite again.
This is cyclic debugging.
Fault tolerance
Interposition in Open MPI
Events are stored (asynchronously on disk) during the initial run.
Keep using the optimized low level and zero-copy devices (BTL) for communication.
Unchanged message scheduling logic.
We expect low impact on application behavior.
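The idea of storing events during an initial run and reusing them later can be sketched with a toy model. The example below is purely conceptual and is not how Open MPI-V is implemented. It assumes the only nondeterministic event is which source an any-source receive matches, records that choice in memory during the first run, and forces the same choices during replay. The class `EventLog` and the helper `run` are hypothetical.

```python
import random


class EventLog:
    """Record the outcome of nondeterministic receives, or replay a recorded log."""

    def __init__(self, recorded=None):
        self.replaying = recorded is not None
        self.events = list(recorded) if recorded is not None else []
        self._next = 0

    def recv_any(self, queues):
        """queues: {source_rank: [pending messages]}; returns (source, message)."""
        ready = sorted(src for src, pending in queues.items() if pending)
        if self.replaying:
            src = self.events[self._next]  # force the recorded match
            self._next += 1
        else:
            src = random.choice(ready)  # stand-in for arrival-order nondeterminism
            self.events.append(src)
        return src, queues[src].pop(0)


def run(log, n_receives=3):
    """Toy application: three any-source receives over two senders."""
    queues = {1: ["a", "b"], 2: ["c"]}
    return [log.recv_any(queues) for _ in range(n_receives)]
```

Running `run(EventLog())` once and then `run(EventLog(first.events))` with the recorded events delivers the messages in the same order both times, which is what makes cyclic debugging reproducible.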
Fault tolerance
Performance overhead
Myrinet 2G (mx 1.0.3), Opteron 146x2, 2GB RAM, Linux 2.6.18, gcc/gfortran 4.1.2, NPB3.2, NetPIPE.
2% overhead on bare latency, no overhead on bandwidth.
Only two application kernels show non-deterministic events.
Receiver-based has more overhead; moreover, it incurs a large amount of logged data: 350 MB on simple ping-pong (this is beneficial; it is enabled on-demand).
Log size per process on NAS Parallel Benchmarks (kB)
Fault tolerance
Log size on NAS parallel benchmarks
Among the 7 NAS kernels, only 2 generate nondeterministic events.
Log size does not correlate with number of processes.
The more scalable the application, the more scalable the log mechanism.
Only 287KB of log/process for LU.C.1024 (200MB memory footprint/process).
Contact
Dr. George Bosilca
Innovative Computing Laboratory
Electrical Engineering and Computer Science Department
University of [email protected]