Slide 1
Implementation and Evaluation of MPI Nonblocking Collective I/O
Sangmin Seo, Robert Latham, Junchao Zhang, Pavan Balaji
Argonne National Laboratory
{sseo, robl, jczhang, balaji}@anl.gov
May 4, 2015
PPMM 2015
Slide 2
File I/O in HPC
File I/O is becoming more important as many HPC applications deal with larger datasets.
The well-known gap between relative CPU speeds and storage bandwidth creates the need for new strategies for managing I/O demands.
[Figure: CPU performance vs. HDD performance over time, illustrating the growing CPU-storage gap; data source: http://kk.org/thetechnium/2009/07/was-moores-law/]
Slide 3
MPI I/O
Supports parallel I/O operations and has been included in the MPI standard since MPI 2.0.
Many I/O optimizations have been proposed to improve I/O performance and to help application developers optimize their I/O use cases.
MPI provides blocking individual I/O, nonblocking individual I/O, collective I/O, and restrictive nonblocking collective I/O.
Missing part? General nonblocking collective (NBC) I/O, proposed for the upcoming MPI 3.1 standard.
This paper presents our initial work on the implementation of the MPI NBC I/O operations.
Slide 4
Outline
Background and motivation
Nonblocking collective (NBC) I/O operations
Implementation of NBC I/O operations: collective I/O in ROMIO, state machine-based implementation
Evaluation
Conclusions and future work
Slide 5
Split Collective I/O
The current MPI standard provides split collective I/O routines to support NBC I/O.
A single collective operation is divided into two parts: a begin routine and an end routine.
For example, MPI_File_read_all = MPI_File_read_all_begin + MPI_File_read_all_end.
At most one active split collective operation is possible on each file handle at any time,
so the user has to wait until the preceding operation is completed.
[Figure: timeline of two split collective reads on the same file handle; the second MPI_File_read_all_begin must wait until the first operation's MPI_File_read_all_end completes.]
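A minimal usage sketch of the split collective pattern, assuming an already opened file handle and an integer buffer (both illustrative):

    #include <mpi.h>

    /* Sketch only: the begin call starts the collective read and the end
       call completes it; other work could in principle go in between. */
    void read_with_split_collective(MPI_File fh, int *buf, int count)
    {
        MPI_Status status;

        MPI_File_read_all_begin(fh, buf, count, MPI_INT);
        /* At most one split collective operation may be active on fh, so any
           other collective I/O on this handle must wait for the end call. */
        MPI_File_read_all_end(fh, buf, &status);
    }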
Slide 6
Another Limitation of Split Collective I/O
MPI_Request is not used in the split collective I/O routines, so MPI_Test cannot be used.
Split collective I/O may also be difficult to implement efficiently if collective I/O algorithms require more than two steps.
Example: ROMIO, a widely used MPI I/O implementation,
does not provide a true immediate-return implementation of the split collective I/O routines:
it performs all I/O in the begin step and only a small amount of bookkeeping in the end step.
As a result, it cannot overlap computation and split collective I/O operations.
[Figure: timeline with computation placed between MPI_File_read_all_begin and MPI_File_read_all_end; can the I/O and the computation actually overlap?]
Slide 7
NBC I/O Proposal for the MPI 3.1 Standard
The upcoming MPI 3.1 standard will include immediate nonblocking versions of the collective I/O operations for individual file pointers and explicit offsets:
MPI_File_iread_all(..., MPI_Request *req)
MPI_File_iwrite_all(..., MPI_Request *req)
MPI_File_iread_at_all(..., MPI_Request *req)
MPI_File_iwrite_at_all(..., MPI_Request *req)
These will replace the current split collective I/O routines.
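For reference, a sketch of the corresponding prototypes; they mirror their blocking counterparts with an added request argument (shown as we understand the MPI 3.1 proposal, so argument details may differ slightly):

    int MPI_File_iread_all(MPI_File fh, void *buf, int count,
                           MPI_Datatype datatype, MPI_Request *request);
    int MPI_File_iwrite_all(MPI_File fh, const void *buf, int count,
                            MPI_Datatype datatype, MPI_Request *request);
    int MPI_File_iread_at_all(MPI_File fh, MPI_Offset offset, void *buf,
                              int count, MPI_Datatype datatype,
                              MPI_Request *request);
    int MPI_File_iwrite_at_all(MPI_File fh, MPI_Offset offset, const void *buf,
                               int count, MPI_Datatype datatype,
                               MPI_Request *request);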
Slide 8
Implications for Applications
NBC I/O provides the benefits of both collective I/O operations and nonblocking operations.
It enables different collective I/O operations to be overlapped: post several MPI_File_iread_all calls and complete them all with MPI_Waitall.
It also enables I/O and computation to be overlapped: post an MPI_File_iread_all and interleave computation with MPI_Test, or finish with MPI_Wait.
Both patterns retain the optimized performance of collective I/O.
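A small sketch of the "multiple posts, wait all" pattern; the two file handles, buffers, and count are illustrative (MPICH releases of this period expose these routines under the MPIX_ prefix):

    #include <mpi.h>

    void read_two_datasets(MPI_File fh1, MPI_File fh2,
                           int *buf1, int *buf2, int count)
    {
        MPI_Request reqs[2];

        /* Post both collective reads without waiting in between... */
        MPI_File_iread_all(fh1, buf1, count, MPI_INT, &reqs[0]);
        MPI_File_iread_all(fh2, buf2, count, MPI_INT, &reqs[1]);

        /* ...then complete them together, letting the two operations overlap. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }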
Slide 9
Collective I/O in ROMIO
Implemented using a generalized version of the extended two-phase method, so any noncontiguous I/O request can be handled.
The two-phase I/O method splits a collective I/O operation into two phases.
Example of the write operation:
In the first phase, each process sends its noncontiguous data to other processes so that each process ends up holding the data for one large contiguous region of the file.
In the second phase, each process writes its large contiguous region of the file with the collected data.
This combines a large number of noncontiguous requests into a small number of contiguous I/O operations, which can improve performance.
Slide 10
Example: Collective File Write in ROMIO
P0 wants to write A, B, C to file blocks 1, 4, 7; P1 wants to write D, E, F to blocks 2, 5, 8; P2 wants to write G, H, I to blocks 3, 6, 9.
If the requests of all processes are handled independently, each process needs three individual write operations to noncontiguous file blocks.
[Figure: the data of P0, P1, and P2 mapped to interleaved block positions 1-9 in the file.]
Slide 11
Example: Collective File Write in ROMIO (contd)
Communication phase: P0 sends B to P1 and C to P2 and receives D from P1 and G from P2; P1 sends D to P0 and F to P2 and receives B from P0 and H from P2; P2 sends G to P0 and H to P1 and receives C from P0 and F from P1.
After the exchange, P0 holds A, D, G (blocks 1-3), P1 holds B, E, H (blocks 4-6), and P2 holds C, F, I (blocks 7-9).
Write phase: each process can write its buffer of three blocks to a contiguous region of the file.
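A highly simplified sketch of this two-phase write structure (this is not ROMIO's actual code; the counts, displacements, and offsets are assumed to have been computed from the processes' file views):

    #include <mpi.h>

    void two_phase_write(MPI_File fh, const char *src, char *collect_buf,
                         const int *sendcounts, const int *sdispls,
                         const int *recvcounts, const int *rdispls,
                         MPI_Offset my_offset, int my_bytes, MPI_Comm comm)
    {
        /* Phase 1: redistribute the noncontiguous pieces so that each
           process ends up holding one contiguous file region. */
        MPI_Alltoallv(src, sendcounts, sdispls, MPI_BYTE,
                      collect_buf, recvcounts, rdispls, MPI_BYTE, comm);

        /* Phase 2: each process issues a single contiguous write. */
        MPI_File_write_at(fh, my_offset, collect_buf, my_bytes, MPI_BYTE,
                          MPI_STATUS_IGNORE);
    }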
Slide 12
Implementation of NBC I/O Operations
Use the same general algorithm as the blocking collective I/O operations in ROMIO.
Replace all blocking communication or I/O operations with their nonblocking counterparts, and use request handles to make progress or keep track of progress.
Divide the original routine into separate routines wherever a blocking operation is changed.
Manage the progress of NBC I/O operations using the extended generalized request and a state machine.
Slide 13
Extended Generalized Request
Standard generalized requests allow users to add new nonblocking operations to MPI while still using many pieces of MPI infrastructure, such as request objects and the progress notification routines.
However, they are unable to make progress within the test or wait routines: their progress must occur completely outside the underlying MPI implementation (typically via pthreads or signal handlers).
Extended generalized requests add poll and wait routines,
enabling users to utilize the test and wait routines of MPI to check or make progress on user-defined nonblocking operations.
Slide 14
Using the Extended Generalized Request
Exploit the extended generalized request to manage the progress of NBC I/O operations.

Blocking version:
    MPI_File_write_all(..., MPI_Status *status)
    {
        ...
        MPI_Alltoall(...);
        ...
        ADIO_WriteStrided(...);
        ...
    }

Nonblocking version using the extended generalized request:
    MPI_File_iwrite_all(..., MPI_Request *req)
    {
        MPI_Request nio_req;
        MPIX_Grequest_class_create(..., iwrite_all_poll_fn, ..., &greq_class);
        MPIX_Grequest_class_allocate(greq_class, nio_status, &nio_req);
        memcpy(req, &nio_req, sizeof(MPI_Request));
        ...
        MPI_Alltoall(...);
        ...
        ADIO_WriteStrided(...);
        ...
        MPI_Grequest_complete(nio_req);
    }
Slide 15
State Machine-Based Implementation (1/3)
Use the same general algorithm as the blocking collective I/O operations in ROMIO.
Replace all blocking communication or I/O operations with nonblocking counterparts, and use request handles to make progress or keep track of progress.

Starting point (blocking internals, as on the previous slide):
    MPI_File_iwrite_all(..., MPI_Request *req)
    {
        ...
        MPI_Alltoall(...);
        ...
        ADIO_WriteStrided(...);
        ...
        MPI_Grequest_complete(nio_req);
    }

With nonblocking counterparts:
    MPI_File_iwrite_all(..., MPI_Request *req)
    {
        MPI_Request cur_req;
        ...
        MPI_Ialltoall(..., &cur_req);
        MPI_Wait(&cur_req, &status);
        ...
        ADIO_IwriteStrided(..., &cur_req);
        MPI_Wait(&cur_req, &status);
        ...
        MPI_Grequest_complete(nio_req);
    }
Slide 16
State Machine-Based Implementation (2/3)
Divide the original routine into separate routines wherever a blocking operation is changed.

Before (single routine with internal waits):
    MPI_File_iwrite_all(..., MPI_Request *req)
    {
        MPI_Request cur_req;
        ...
        MPI_Ialltoall(..., &cur_req);
        MPI_Wait(&cur_req, &status);
        ...
        ADIO_IwriteStrided(..., &cur_req);
        MPI_Wait(&cur_req, &status);
        ...
        MPI_Grequest_complete(nio_req);
    }

After (divided into separate routines):
    MPI_File_iwrite_all(..., MPI_Request *req)
    {
        ...
        MPI_Ialltoall(..., &cur_req);
    }

    iwrite_all_fileop(...)
    {
        ...
        ADIO_IwriteStrided(..., &cur_req);
    }

    iwrite_all_fini(...)
    {
        ...
        MPI_Grequest_complete(nio_req);
    }
Slide 17
State Machine-Based Implementation (3/3)
Manage the progress of NBC I/O operations by using a state machine.

    MPI_File_iwrite_all(..., MPI_Request *req)
    {
        ...
        MPI_Ialltoall(..., &cur_req);
        state = IWRITE_ALL_STATE_COMM;
    }

    iwrite_all_fileop(...)
    {
        ...
        ADIO_IwriteStrided(..., &cur_req);
        state = IWRITE_ALL_STATE_FILEOP;
    }

    iwrite_all_fini(...)
    {
        ...
        state = IWRITE_ALL_STATE_COMPLETE;
        MPI_Grequest_complete(nio_req);
    }

[Figure: state diagram IWRITE_ALL_STATE_COMM -> IWRITE_ALL_STATE_FILEOP -> IWRITE_ALL_STATE_COMPLETE; each MPI_Test either finds the current step not yet complete or advances to the next state.]
The transitions are implemented in the poll function of the extended generalized request; a sketch follows.
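A sketch of how the poll callback might drive these transitions; the structure, field, and routine names here are illustrative rather than ROMIO's exact code:

    #include <mpi.h>

    /* Illustrative state constants and per-operation bookkeeping. */
    enum { IWRITE_ALL_STATE_COMM, IWRITE_ALL_STATE_FILEOP,
           IWRITE_ALL_STATE_COMPLETE };

    struct iwrite_all_vars {
        int state;            /* which step the operation is currently in */
        MPI_Request cur_req;  /* request of the outstanding nonblocking step */
        /* ... buffers, offsets, and other per-operation data ... */
    };

    /* Per-step routines, as on the previous slides. */
    void iwrite_all_fileop(struct iwrite_all_vars *vars);
    void iwrite_all_fini(struct iwrite_all_vars *vars);

    /* Poll callback registered through the extended generalized request;
       the user's MPI_Test/MPI_Wait calls end up invoking it. */
    static int iwrite_all_poll_fn(void *extra_state, MPI_Status *status)
    {
        struct iwrite_all_vars *vars = extra_state;
        int flag = 0;

        (void) status;  /* the final status is filled in elsewhere */

        /* Check whether the currently outstanding nonblocking step finished. */
        MPI_Test(&vars->cur_req, &flag, MPI_STATUS_IGNORE);
        if (!flag)
            return MPI_SUCCESS;          /* current step not finished yet */

        if (vars->state == IWRITE_ALL_STATE_COMM)
            iwrite_all_fileop(vars);     /* communication done: start file I/O */
        else if (vars->state == IWRITE_ALL_STATE_FILEOP)
            iwrite_all_fini(vars);       /* file I/O done: complete the request */

        return MPI_SUCCESS;
    }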
Slide 18
Progress of NBC I/O Operations
The NBC I/O routine initiates the I/O operation and returns a request handle, which must be passed to a completion call.
[MPI standard] All nonblocking calls are local and return immediately, irrespective of the status of other processes.
How do NBC I/O operations make progress? Progress can be implicit or explicit, depending on the implementation.
Our implementation currently requires explicit progress: the user has to call MPI_Test or MPI_Wait.
This is currently a common practice in implementing nonblocking operations.
Alternative: exploit progress threads to support asynchronous progress.
Slide 19
Evaluation Methodology
Target platform: the Blues cluster at Argonne National Laboratory, with 310 compute nodes and a GPFS file system; each compute node has two Intel Xeon E5-2670 processors (16 cores).
MPI implementation: the NBC I/O routines are implemented inside ROMIO and integrated into MPICH 3.2a2 or later as MPIX routines.
Benchmarks: the coll_perf benchmark from the ROMIO test suite and modifications of it that use NBC I/O operations or overlap collective operations with computation, plus a microbenchmark that overlaps multiple I/O operations.
Slide 20
I/O Bandwidth
The coll_perf benchmark measures the I/O bandwidth for writing and reading a 3D block-distributed array to a file.
Array size used: 2176 x 1152 x 1408 integers (about 14 GB), with a noncontiguous file access pattern.
For NBC I/O, blocking collective I/O routines are replaced with their corresponding NBC I/O routines followed by MPI_Wait:

    Blocking collective I/O:
        MPI_File_write_all(...);

    NBC I/O:
        MPI_Request req;
        MPI_File_iwrite_all(..., &req);
        MPI_Wait(&req, ...);

We measure the I/O bandwidth of blocking collective I/O and NBC I/O.
What do we expect? The NBC I/O routines ideally should add overhead only from additional function calls and memory management.
Slide 21
I/O Bandwidth (contd)
Our NBC I/O implementation does not cause significant overhead!
Slide 22
Overlapping I/O and Computation
Insert some synthetic computation code into coll_perf.

Blocking I/O with computation:
    MPI_File_write_all(...);
    Computation();

NBC I/O with computation:
    MPI_File_iwrite_all(..., &req);
    for (...) {
        Small_Computation();
        MPI_Test(&req, &flag, ...);
        if (flag)
            break;
    }
    Remaining_Computation();
    MPI_Wait(&req, ...);

Why not simply the following?
    MPI_File_iwrite_all(..., &req);
    Computation();
    MPI_Wait(&req, ...);
Because we need to make progress explicitly: without periodic MPI_Test calls, the NBC I/O operation does not advance during the computation.
Slide 23
Overlapping I/O and Computation (contd)
84% of the write time and 83% of the read time are overlapped with computation.
The entire execution time is reduced by 36% for write and 34% for read.
Slide 24
Overlapping Multiple I/O Operations
Multiple collective I/O operations can be overlapped by using the NBC I/O routines:
initiate multiple collective I/O operations at a time and wait for the completion of all posted operations.
[Figure: execution time with and without overlapping; reductions of 59% and 13% are observed.]
Slide 25
Conclusions and Future Work
MPI NBC I/O operations can take advantage of both nonblocking operations and collective operations and will be part of the upcoming MPI 3.1 standard.
Our initial work on the implementation of MPI NBC I/O operations:
Done in the MPICH MPI library, based on the extended two-phase algorithm.
Utilizes a state machine and the extended generalized request.
Performs as well as blocking collective I/O in terms of I/O bandwidth.
Is capable of overlapping I/O and other operations.
Can help users try nonblocking collective I/O operations in their applications.
Future work:
Asynchronous progress of NBC I/O operations, to overcome the shortcomings of the explicit progress requirement.
Studies with real applications.
Comparison with other approaches.
Slide 26
Acknowledgment
This material was based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357.
We gratefully acknowledge the computing resources provided on Blues, a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory.
Slide 27
Q&A
Thank you for your attention! Questions?
Slide 28
Related Work
NBC I/O implementation: Open MPI's I/O library implements NBC I/O using the libNBC library [Venkatesan et al., EuroMPI 11]. It leverages the concept of a collective operation schedule in libNBC but requires modification of the libNBC progress engine.
Our implementation exploits the state machine and the extended generalized request and does not need to modify the progress engine, provided the extended generalized request interface is available. We plan to compare the performance and efficiency of the two implementations.
Collective I/O research: the two-phase method and its extensions have been studied by many researchers and are widely used in collective I/O implementations. Our work is based on [Thakur et al., Frontiers 99].
View-based collective I/O [Blas et al., CCGrid 08].
MPI collective I/O implementation intended as a better research platform [Coloma et al., Cluster 06].
Collective I/O library with POSIX-like interfaces [Yu et al., IPDPS 13].