Slide 1
Implementation and Evaluation of MPI Nonblocking Collective I/O
Sangmin Seo, Robert Latham, Junchao Zhang, Pavan Balaji
Argonne National Laboratory
{sseo, robl, jczhang, balaji}@anl.gov
May 4, 2015
PPMM 2015
Slide 2
File I/O in HPC
File I/O is becoming more important as many HPC applications deal with larger datasets.
The well-known gap between relative CPU speeds and storage bandwidth creates the need for new strategies for managing I/O demands.
[Figure: CPU performance vs. HDD performance over time, illustrating the growing CPU-storage gap; data source: http://kk.org/thetechnium/2009/07/was-moores-law/]
Slide 3
MPI I/O
Supports parallel I/O operations and has been included in the MPI standard since MPI 2.0.
Many I/O optimizations have been proposed to improve I/O performance and to help application developers optimize their I/O use cases.
MPI provides blocking individual I/O, nonblocking individual I/O, collective I/O, and restrictive nonblocking collective I/O.
Missing part? General nonblocking collective (NBC) I/O, proposed for the upcoming MPI 3.1 standard.
This paper presents our initial work on the implementation of the MPI NBC I/O operations.
Slide 4
Outline
Background and motivation
Nonblocking collective (NBC) I/O operations
Implementation of NBC I/O operations: collective I/O in ROMIO, state machine-based implementation
Evaluation
Conclusions and future work
Slide 5
Split Collective I/O
The current MPI standard provides split collective I/O routines to support NBC I/O.
A single collective operation is divided into two parts: a begin routine and an end routine.
For example, MPI_File_read_all = MPI_File_read_all_begin + MPI_File_read_all_end.
At most one active split collective operation is possible on each file handle at any time,
so the user has to wait until the preceding operation is completed.
[Figure: timeline of two split collective reads on the same file handle; the second MPI_File_read_all_begin must wait until the first operation's MPI_File_read_all_end completes.]
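A minimal usage sketch of the split collective pattern, assuming an already opened file handle and an integer buffer (both illustrative):

    #include <mpi.h>

    /* Sketch only: the begin call starts the collective read and the end
       call completes it; other work could in principle go in between. */
    void read_with_split_collective(MPI_File fh, int *buf, int count)
    {
        MPI_Status status;

        MPI_File_read_all_begin(fh, buf, count, MPI_INT);
        /* At most one split collective operation may be active on fh, so any
           other collective I/O on this handle must wait for the end call. */
        MPI_File_read_all_end(fh, buf, &status);
    }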
Slide 6
Another Limitation of Split Collective I/O
MPI_Request is not used in the split collective I/O routines, so MPI_Test cannot be used.
Split collective I/O may also be difficult to implement efficiently if collective I/O algorithms require more than two steps.
Example: ROMIO, a widely used MPI I/O implementation,
does not provide a true immediate-return implementation of the split collective I/O routines:
it performs all I/O in the begin step and only a small amount of bookkeeping in the end step.
As a result, it cannot overlap computation and split collective I/O operations.
[Figure: timeline with computation placed between MPI_File_read_all_begin and MPI_File_read_all_end; can the I/O and the computation actually overlap?]
Slide 7
NBC I/O Proposal for the MPI 3.1 Standard
The upcoming MPI 3.1 standard will include immediate nonblocking versions of the collective I/O operations for individual file pointers and explicit offsets:
MPI_File_iread_all(..., MPI_Request *req)
MPI_File_iwrite_all(..., MPI_Request *req)
MPI_File_iread_at_all(..., MPI_Request *req)
MPI_File_iwrite_at_all(..., MPI_Request *req)
These will replace the current split collective I/O routines.
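For reference, a sketch of the corresponding prototypes; they mirror their blocking counterparts with an added request argument (shown as we understand the MPI 3.1 proposal, so argument details may differ slightly):

    int MPI_File_iread_all(MPI_File fh, void *buf, int count,
                           MPI_Datatype datatype, MPI_Request *request);
    int MPI_File_iwrite_all(MPI_File fh, const void *buf, int count,
                            MPI_Datatype datatype, MPI_Request *request);
    int MPI_File_iread_at_all(MPI_File fh, MPI_Offset offset, void *buf,
                              int count, MPI_Datatype datatype,
                              MPI_Request *request);
    int MPI_File_iwrite_at_all(MPI_File fh, MPI_Offset offset, const void *buf,
                               int count, MPI_Datatype datatype,
                               MPI_Request *request);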
Slide 8
Implications for Applications
NBC I/O provides the benefits of both collective I/O operations and nonblocking operations.
It enables different collective I/O operations to be overlapped: post several MPI_File_iread_all calls and complete them all with MPI_Waitall.
It also enables I/O and computation to be overlapped: post an MPI_File_iread_all and interleave computation with MPI_Test, or finish with MPI_Wait.
Both patterns retain the optimized performance of collective I/O.
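A small sketch of the "multiple posts, wait all" pattern; the two file handles, buffers, and count are illustrative (MPICH releases of this period expose these routines under the MPIX_ prefix):

    #include <mpi.h>

    void read_two_datasets(MPI_File fh1, MPI_File fh2,
                           int *buf1, int *buf2, int count)
    {
        MPI_Request reqs[2];

        /* Post both collective reads without waiting in between... */
        MPI_File_iread_all(fh1, buf1, count, MPI_INT, &reqs[0]);
        MPI_File_iread_all(fh2, buf2, count, MPI_INT, &reqs[1]);

        /* ...then complete them together, letting the two operations overlap. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }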
Slide 9
Collective I/O in ROMIO
Implemented using a generalized version of the extended two-phase method, so any noncontiguous I/O request can be handled.
The two-phase I/O method splits a collective I/O operation into two phases.
Example of the write operation:
In the first phase, each process sends its noncontiguous data to other processes so that each process ends up holding the data for one large contiguous region of the file.
In the second phase, each process writes its large contiguous region of the file with the collected data.
This combines a large number of noncontiguous requests into a small number of contiguous I/O operations, which can improve performance.
Slide 10
Example: Collective File Write in ROMIO
P0 wants to write A, B, C to file blocks 1, 4, 7; P1 wants to write D, E, F to blocks 2, 5, 8; P2 wants to write G, H, I to blocks 3, 6, 9.
If the requests of all processes are handled independently, each process needs three individual write operations to noncontiguous file blocks.
[Figure: the data of P0, P1, and P2 mapped to interleaved block positions 1-9 in the file.]
Slide 11
Example: Collective File Write in ROMIO (contd)
Communication phase: P0 sends B to P1 and C to P2 and receives D from P1 and G from P2; P1 sends D to P0 and F to P2 and receives B from P0 and H from P2; P2 sends G to P0 and H to P1 and receives C from P0 and F from P1.
After the exchange, P0 holds A, D, G (blocks 1-3), P1 holds B, E, H (blocks 4-6), and P2 holds C, F, I (blocks 7-9).
Write phase: each process can write its buffer of three blocks to a contiguous region of the file.
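A highly simplified sketch of this two-phase write structure (this is not ROMIO's actual code; the counts, displacements, and offsets are assumed to have been computed from the processes' file views):

    #include <mpi.h>

    void two_phase_write(MPI_File fh, const char *src, char *collect_buf,
                         const int *sendcounts, const int *sdispls,
                         const int *recvcounts, const int *rdispls,
                         MPI_Offset my_offset, int my_bytes, MPI_Comm comm)
    {
        /* Phase 1: redistribute the noncontiguous pieces so that each
           process ends up holding one contiguous file region. */
        MPI_Alltoallv(src, sendcounts, sdispls, MPI_BYTE,
                      collect_buf, recvcounts, rdispls, MPI_BYTE, comm);

        /* Phase 2: each process issues a single contiguous write. */
        MPI_File_write_at(fh, my_offset, collect_buf, my_bytes, MPI_BYTE,
                          MPI_STATUS_IGNORE);
    }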
Slide 12
Implementation of NBC I/O Operations
Use the same general algorithm as the blocking collective I/O operations in ROMIO.
Replace all blocking communication or I/O operations with their nonblocking counterparts, and use request handles to make progress or keep track of progress.
Divide the original routine into separate routines wherever a blocking operation is changed.
Manage the progress of NBC I/O operations using the extended generalized request and a state machine.
Slide 13
Extended Generalized Request
Standard generalized requests allow users to add new nonblocking operations to MPI while still using many pieces of MPI infrastructure, such as request objects and the progress notification routines.
However, they are unable to make progress within the test or wait routines: their progress must occur completely outside the underlying MPI implementation (typically via pthreads or signal handlers).
Extended generalized requests add poll and wait routines,
enabling users to utilize the test and wait routines of MPI to check or make progress on user-defined nonblocking operations.
Slide 14
Using the Extended Generalized Request
Exploit the extended generalized request to manage the progress of NBC I/O operations.

Blocking version:
    MPI_File_write_all(..., MPI_Status *status)
    {
        ...
        MPI_Alltoall(...);
        ...
        ADIO_WriteStrided(...);
        ...
    }

Nonblocking version using the extended generalized request:
    MPI_File_iwrite_all(..., MPI_Request *req)
    {
        MPI_Request nio_req;
        MPIX_Grequest_class_create(..., iwrite_all_poll_fn, ..., &greq_class);
        MPIX_Grequest_class_allocate(greq_class, nio_status, &nio_req);
        memcpy(req, &nio_req, sizeof(MPI_Request));
        ...
        MPI_Alltoall(...);
        ...
        ADIO_WriteStrided(...);
        ...
        MPI_Grequest_complete(nio_req);
    }
Slide 15
State Machine-Based Implementation (1/3)
Use the same general algorithm as the blocking collective I/O operations in ROMIO.
Replace all blocking communication or I/O operations with nonblocking counterparts, and use request handles to make progress or keep track of progress.

Starting point (blocking internals, as on the previous slide):
    MPI_File_iwrite_all(..., MPI_Request *req)
    {
        ...
        MPI_Alltoall(...);
        ...
        ADIO_WriteStrided(...);
        ...
        MPI_Grequest_complete(nio_req);
    }

With nonblocking counterparts:
    MPI_File_iwrite_all(..., MPI_Request *req)
    {
        MPI_Request cur_req;
        ...
        MPI_Ialltoall(..., &cur_req);
        MPI_Wait(&cur_req, &status);
        ...
        ADIO_IwriteStrided(..., &cur_req);
        MPI_Wait(&cur_req, &status);
        ...
        MPI_Grequest_complete(nio_req);
    }
Slide 16
State Machine-Based Implementation (2/3)
Divide the original routine into separate routines wherever a blocking operation is changed.

Before (single routine with internal waits):
    MPI_File_iwrite_all(..., MPI_Request *req)
    {
        MPI_Request cur_req;
        ...
        MPI_Ialltoall(..., &cur_req);
        MPI_Wait(&cur_req, &status);
        ...
        ADIO_IwriteStrided(..., &cur_req);
        MPI_Wait(&cur_req, &status);
        ...
        MPI_Grequest_complete(nio_req);
    }

After (divided into separate routines):
    MPI_File_iwrite_all(..., MPI_Request *req)
    {
        ...
        MPI_Ialltoall(..., &cur_req);
    }

    iwrite_all_fileop(...)
    {
        ...
        ADIO_IwriteStrided(..., &cur_req);
    }

    iwrite_all_fini(...)
    {
        ...
        MPI_Grequest_complete(nio_req);
    }
Slide 17
State Machine-Based Implementation (3/3)
Manage the progress of NBC I/O operations by using a state machine.

    MPI_File_iwrite_all(..., MPI_Request *req)
    {
        ...
        MPI_Ialltoall(..., &cur_req);
        state = IWRITE_ALL_STATE_COMM;
    }

    iwrite_all_fileop(...)
    {
        ...
        ADIO_IwriteStrided(..., &cur_req);
        state = IWRITE_ALL_STATE_FILEOP;
    }

    iwrite_all_fini(...)
    {
        ...
        state = IWRITE_ALL_STATE_COMPLETE;
        MPI_Grequest_complete(nio_req);
    }

[Figure: state diagram IWRITE_ALL_STATE_COMM -> IWRITE_ALL_STATE_FILEOP -> IWRITE_ALL_STATE_COMPLETE; each MPI_Test either finds the current step not yet complete or advances to the next state.]
The transitions are implemented in the poll function of the extended generalized request; a sketch follows.
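A sketch of how the poll callback might drive these transitions; the structure, field, and routine names here are illustrative rather than ROMIO's exact code:

    #include <mpi.h>

    /* Illustrative state constants and per-operation bookkeeping. */
    enum { IWRITE_ALL_STATE_COMM, IWRITE_ALL_STATE_FILEOP,
           IWRITE_ALL_STATE_COMPLETE };

    struct iwrite_all_vars {
        int state;            /* which step the operation is currently in */
        MPI_Request cur_req;  /* request of the outstanding nonblocking step */
        /* ... buffers, offsets, and other per-operation data ... */
    };

    /* Per-step routines, as on the previous slides. */
    void iwrite_all_fileop(struct iwrite_all_vars *vars);
    void iwrite_all_fini(struct iwrite_all_vars *vars);

    /* Poll callback registered through the extended generalized request;
       the user's MPI_Test/MPI_Wait calls end up invoking it. */
    static int iwrite_all_poll_fn(void *extra_state, MPI_Status *status)
    {
        struct iwrite_all_vars *vars = extra_state;
        int flag = 0;

        (void) status;  /* the final status is filled in elsewhere */

        /* Check whether the currently outstanding nonblocking step finished. */
        MPI_Test(&vars->cur_req, &flag, MPI_STATUS_IGNORE);
        if (!flag)
            return MPI_SUCCESS;          /* current step not finished yet */

        if (vars->state == IWRITE_ALL_STATE_COMM)
            iwrite_all_fileop(vars);     /* communication done: start file I/O */
        else if (vars->state == IWRITE_ALL_STATE_FILEOP)
            iwrite_all_fini(vars);       /* file I/O done: complete the request */

        return MPI_SUCCESS;
    }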
Slide 18
Progress of NBC I/O Operations
The NBC I/O routine initiates the I/O operation and returns a request handle, which must be passed to a completion call.
[MPI standard] All nonblocking calls are local and return immediately, irrespective of the status of other processes.
How do NBC I/O operations make progress? Progress can be implicit or explicit, depending on the implementation.
Our implementation currently requires explicit progress: the user has to call MPI_Test or MPI_Wait.
This is currently a common practice in implementing nonblocking operations.
Alternative: exploit progress threads to support asynchronous progress.
Slide 19
Evaluation Methodology
Target platform: the Blues cluster at Argonne National Laboratory, with 310 compute nodes and a GPFS file system; each compute node has two Intel Xeon E5-2670 processors (16 cores).
MPI implementation: the NBC I/O routines are implemented inside ROMIO and integrated into MPICH 3.2a2 or later as MPIX routines.
Benchmarks: the coll_perf benchmark from the ROMIO test suite and modifications of it that use NBC I/O operations or overlap collective operations with computation, plus a microbenchmark that overlaps multiple I/O operations.
Slide 20
I/O Bandwidth
The coll_perf benchmark measures the I/O bandwidth for writing and reading a 3D block-distributed array to a file.
Array size used: 2176 x 1152 x 1408 integers (about 14 GB), with a noncontiguous file access pattern.
For NBC I/O, blocking collective I/O routines are replaced with their corresponding NBC I/O routines followed by MPI_Wait:

    Blocking collective I/O:
        MPI_File_write_all(...);

    NBC I/O:
        MPI_Request req;
        MPI_File_iwrite_all(..., &req);
        MPI_Wait(&req, ...);

We measure the I/O bandwidth of blocking collective I/O and NBC I/O.
What do we expect? The NBC I/O routines ideally should add overhead only from additional function calls and memory management.
Slide 21
I/O Bandwidth (contd)
Our NBC I/O implementation does not cause significant overhead!
Slide 22
Overlapping I/O and Computation
Insert some synthetic computation code into coll_perf.

Blocking I/O with computation:
    MPI_File_write_all(...);
    Computation();

NBC I/O with computation:
    MPI_File_iwrite_all(..., &req);
    for (...) {
        Small_Computation();
        MPI_Test(&req, &flag, ...);
        if (flag)
            break;
    }
    Remaining_Computation();
    MPI_Wait(&req, ...);

Why not simply the following?
    MPI_File_iwrite_all(..., &req);
    Computation();
    MPI_Wait(&req, ...);
Because we need to make progress explicitly: without periodic MPI_Test calls, the NBC I/O operation does not advance during the computation.
Slide 23
Overlapping I/O and Computation (contd)
84% of the write time and 83% of the read time are overlapped with computation.
The entire execution time is reduced by 36% for write and 34% for read.
Slide 24
Overlapping Multiple I/O Operations
Multiple collective I/O operations can be overlapped by using the NBC I/O routines:
initiate multiple collective I/O operations at a time and wait for the completion of all posted operations.
[Figure: execution time with and without overlapping; reductions of 59% and 13% are observed.]
Slide 25
Conclusions and Future Work
MPI NBC I/O operations can take advantage of both nonblocking operations and collective operations and will be part of the upcoming MPI 3.1 standard.
Our initial work on the implementation of MPI NBC I/O operations:
Done in the MPICH MPI library, based on the extended two-phase algorithm.
Utilizes a state machine and the extended generalized request.
Performs as well as blocking collective I/O in terms of I/O bandwidth.
Is capable of overlapping I/O and other operations.
Can help users try nonblocking collective I/O operations in their applications.
Future work:
Asynchronous progress of NBC I/O operations, to overcome the shortcomings of the explicit progress requirement.
Studies with real applications.
Comparison with other approaches.
Slide 26
Acknowledgment
This material was based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357.
We gratefully acknowledge the computing resources provided on Blues, a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory.
Slide 27
Q&A
Thank you for your attention! Questions?
Slide 28
Related Work
NBC I/O implementation: Open MPI's I/O library implements NBC I/O using the libNBC library [Venkatesan et al., EuroMPI 11]. It leverages the concept of a collective operation schedule in libNBC but requires modification of the libNBC progress engine.
Our implementation exploits the state machine and the extended generalized request and does not need to modify the progress engine, provided the extended generalized request interface is available. We plan to compare the performance and efficiency of the two implementations.
Collective I/O research: the two-phase method and its extensions have been studied by many researchers and are widely used in collective I/O implementations. Our work is based on [Thakur et al., Frontiers 99].
View-based collective I/O [Blas et al., CCGrid 08].
MPI collective I/O implementation intended as a better research platform [Coloma et al., Cluster 06].
Collective I/O library with POSIX-like interfaces [Yu et al., IPDPS 13].