
Large Scale Parallel I/O with HDF5
Darren Adams, NCSA
Considerations for Parallel File Systems on Distributed HPC Platforms

Survey of Parallel I/O Techniques

• Root process gather-and-write: yes, this is still common…
• N-N: write an independent file from each process.
• N-1: write to a single file from all processes (a minimal MPI-IO sketch follows below).
  – MPI-IO based parallel I/O
  – Higher-level I/O libraries: HDF5, NetCDF, Silo, etc.
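For reference, a minimal sketch of the N-1 pattern using raw collective MPI-IO (assumes a contiguous, evenly divided 1-D array of doubles; the file name and slice size are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* N-1 sketch: every rank writes its slice of a global array into one shared file. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const MPI_Offset local_n = 1024;                 /* illustrative slice size */
    double *buf = malloc(local_n * sizeof(double));
    for (MPI_Offset i = 0; i < local_n; i++) buf[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "global.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset; the collective call lets the
       MPI-IO layer aggregate and align the requests. */
    MPI_Offset offset = (MPI_Offset)rank * local_n * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, (int)local_n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}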

Advantages of N-1

• The resulting file can be independent of process count.
• A single file is easier to analyze with other applications.
• More conceptually compatible with the “global” data set that has been computed in a distributed fashion.

Disadvantages of N-1

• Usually results in a write performance penalty vs. N-N.
• Without particular care, can lose information about how the data was distributed when it was created.
• File size can become a problem.
• Some approaches can result in pathologically bad I/O patterns vs. N-N.

A Couple of New Approaches to Optimize Performance

• PLFS – Virtual Parallel Log-Structured File System
  – Sits between the application and the file system and intercepts the application's I/O requests.
  – Optimizes I/O for the underlying file system by aligning stripe boundaries.
• ADIOS – ADaptable IO System
  – Changes I/O behavior through an XML file that is independent of the application.
  – Allows Fortran READ and WRITE statements to be used in the application.

Other Considerations (besides parallel I/O performance)

• Portability across architectures and operating systems.
• Ability to read and manipulate data with other applications.
• Stability and availability of the software (library) used to create the file format.
• Sharing data with collaborators.

Reasons to Use HDF5

• Provides a rich API that is capable of creating a “file system in a file.”
• Supports parallel I/O.
• Is highly stable and backward compatible.
• Platform independent.
• Widely used and supported (will likely be around for a long time).

Reasons Not to Use HDF5 (directly)

• Complicated, general-purpose API.
• No standard is imposed on the data format created.
  – While HDF5 has the capability to create a self-describing, well-formed data format, it does not require these practices.
• More of a general toolkit for building a higher, application-level API (like NetCDF).
• More time and effort is needed for development of I/O routines.

Reasons to Use HDF5 (directly)

• Maximum control over the file format.
• Access to all of the performance settings in the underlying MPI-IO layer.
• Ability to restructure the data layout if needed to improve performance for either the application writing the data or for visualization and other analysis software.
• Can incrementally expose only the portions of the HDF5 API that the application needs, rather than being limited to what a higher-level API exposes (square hole, round peg).

Remarks on (P)HDF5

• HDF5 can be compiled with parallel MPI-IO support.
  – PHDF5 must be linked separately from a serial (non-MPI) HDF5 build.
• PHDF5 opens up an interface to the MPI-IO layer via MPI-IO hints and various internal MPI-specific properties.
• Using HDF5 directly gives full access to all of these settings (see the sketch below).
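A minimal sketch of how PHDF5 exposes the MPI-IO layer: the MPI communicator and an MPI_Info object carrying MPI-IO hints are attached to the file access property list. The ROMIO hint shown ("romio_cb_write") and the function name are illustrative, not from the original talk.

#include <hdf5.h>
#include <mpi.h>

/* Open a shared HDF5 file with the MPI-IO (parallel) driver and pass
   MPI-IO hints through an MPI_Info object. Requires an HDF5 build
   configured with parallel (PHDF5) support. */
hid_t open_parallel_file(const char *fname, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");   /* example MPI-IO hint */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);               /* PHDF5 entry point to MPI-IO */

    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;   /* caller closes with H5Fclose() */
}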

Data Requirements for DNS Application

The science code is a CFD solver that uses domain-decomposed MPI processes to parallelize the computation.

• Visualization data
  – The main data product for analysis includes 3D scalar and 3D 3-vector field data.
  – Important to have a workable data format that can be read with tools such as VisIt.
• Restart data
  – A state dump used only by the simulation code; it can be highly optimized, but should still be portable across architectures.
• Statistics data
  – Smaller data sets and integral quantities.

Parallel I/O Approach for DNS Code

• Chose to use HDF5 directly.
  – Developed a simple I/O library that encapsulates the HDF5 file structure (a rough interface sketch follows below).
    • Exposes a simpler interface to the simulation code.
    • Allows I/O routines to be reused in analysis programs.
    • Can use the simple library to implement a VisIt database reader.
  – Exposes only a very well maintained and widely supported library dependency to the application.
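A rough sketch of what such a thin wrapper interface might look like (the type and function names below are hypothetical, not the actual DNS library):

/* dns_io.h -- hypothetical thin wrapper around PHDF5. The simulation and
   the analysis/VisIt reader code share this one small interface; only
   HDF5 is exposed as a dependency underneath. */
#include <hdf5.h>
#include <mpi.h>

typedef struct {
    hid_t    file;          /* open PHDF5 file handle            */
    MPI_Comm comm;          /* communicator used for the file    */
    hsize_t  global[3];     /* global grid dimensions            */
    hsize_t  local[3];      /* this rank's sub-domain dimensions */
    hsize_t  offset[3];     /* sub-domain offset in global grid  */
} dns_file_t;

/* Create or open a shared file collectively on comm. */
int dns_io_open(dns_file_t *f, const char *fname, MPI_Comm comm,
                const hsize_t global[3], const hsize_t local[3],
                const hsize_t offset[3]);

/* Collectively write one 3-D scalar field (e.g. "pressure") as a dataset. */
int dns_io_write_scalar(dns_file_t *f, const char *name, const double *data);

/* Read the same dataset back, possibly on a different process count. */
int dns_io_read_scalar(dns_file_t *f, const char *name, double *data);

int dns_io_close(dns_file_t *f);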

Parallel I/O Approach for DNS Code

• Need to mitigate the performance degradation of N-1 file I/O.
  – As long as I/O performance is “reasonable”, code execution time will not be significantly impacted. There simply is not enough storage capacity on the file system to allow I/O to dominate. Hmmm, what does “reasonable” mean…
  – Structured, usable files are worth some level of performance sacrifice to researchers.

HDF5 Datasets in Parallel Applications

• Use hyperslabs to define a subset of a global dataset for each process (see the sketch below).
  – In CFD domain decomposition, each MPI rank has data that is physically adjacent to another rank's.
  – The “global” data set, in this context, is achieved when the data from all ranks is “glued” together.
  – This approach creates a single glued-together dataset, which can be supplemented by MPI rank info but does not need to be.
• Alternatively, let each process write to a separate dataset.
  – Information about how to glue the data together MUST then be provided.
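A minimal sketch of the hyperslab approach for a 3-D domain decomposition (dimensions, offsets, and the function name are illustrative; error checking is omitted):

#include <hdf5.h>

/* Each rank writes its sub-block of one global 3-D dataset. 'global',
   'local', and 'offset' describe this rank's piece of the decomposed grid. */
void write_subdomain(hid_t file, const char *name, const double *data,
                     const hsize_t global[3], const hsize_t local[3],
                     const hsize_t offset[3])
{
    /* One dataset sized to the full global grid. */
    hid_t filespace = H5Screate_simple(3, global, NULL);
    hid_t dset = H5Dcreate(file, name, H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Select this rank's hyperslab within the global dataset. */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, local, NULL);

    /* Memory space describes the contiguous local buffer. */
    hid_t memspace = H5Screate_simple(3, local, NULL);

    /* Collective transfer so the MPI-IO layer can aggregate requests. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, data);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
}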

Considerations for the Cray XT System Kraken

• Lustre file system with ~160 OSTs.
• Experience shows that collective N-1 I/O can lead to pathological performance degradation.
  – This is often due to network contention introduced when processes overlap the Lustre stripe boundaries. In effect, each process writes to several OSTs rather than striking a balance where all OSTs work in concert (one possible HDF5-level mitigation is sketched below).
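One mitigation at the HDF5 level (not from the original slides, but a standard HDF5 facility) is to align large object allocations to the Lustre stripe size on the file access property list, so dataset data tends to start on stripe boundaries. The 1 MiB stripe size and 64 KiB threshold below are assumptions:

#include <hdf5.h>
#include <mpi.h>

/* Build a file access property list whose allocations are aligned to the
   Lustre stripe size, so large dataset writes tend to start on stripe
   boundaries instead of straddling OSTs. Values shown are assumptions. */
hid_t make_aligned_fapl(MPI_Comm comm, MPI_Info info)
{
    const hsize_t stripe_size = 1 << 20;   /* assume a 1 MiB Lustre stripe */
    const hsize_t threshold   = 1 << 16;   /* only align objects >= 64 KiB */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);
    H5Pset_alignment(fapl, threshold, stripe_size);
    return fapl;
}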

Lustre and MPI-IO: ADIO to the Rescue

• Recent additions to the MPICH ROMIO layer, upon which most MPI distributions are based, allow the Lustre stripe size and count to be set via the MPI_Info object (see the sketch below).
  – Setting these parameters sets the actual Lustre striping parameters on the new file when it is created. The only other ways to do this are to set the striping parameters ahead of time on the I/O directory, or to use the Lustre C library and C-style open calls.
  – Perhaps equally important, these parameters are used by the MPI-IO collective buffering stack. The MPI hint “CO” is introduced to set the client-to-OST ratio.
  – These improvements directly address the shortcomings of N-1 file I/O cited by the developers of PLFS and ADIOS. But do they work?
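A sketch of setting the Lustre striping hints described above through the MPI_Info object that HDF5 hands down to MPI-IO (the stripe count and stripe size values are illustrative, and hint support depends on the MPI/ADIO build):

#include <hdf5.h>
#include <mpi.h>

/* Request a specific Lustre stripe count and stripe size for a new file
   via MPI-IO hints; the Info object reaches MPI-IO through the HDF5
   file access property list. Values below are placeholders. */
hid_t create_striped_file(const char *fname, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "160");      /* stripe count (# of OSTs) */
    MPI_Info_set(info, "striping_unit", "1048576");    /* stripe size in bytes     */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);

    /* Striping hints only take effect when the file is created. */
    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}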

A Note of Caution for Lustre

Lustre MPI-IO Tweaks (from “man mpi” on Kraken):

MPICH_MPIIO_CB_ALIGN
  If set to 2, an algorithm is used to divide the I/O workload into Lustre stripe-sized pieces and assign them to collective buffering nodes (aggregators), so that each aggregator always accesses the same set of stripes and no other aggregator accesses those stripes. This is generally the optimal collective buffering mode, as it minimizes Lustre file system extent lock contention and thus reduces system I/O time. However, the overhead associated with dividing the I/O workload can in some cases exceed the time otherwise saved by using this method.

striping_factor
  Specifies the number of Lustre file system stripes (stripe count) to assign to the file. This has no effect if the file already exists when the MPI_File_open() call is made. File striping cannot be changed after a file is created. Currently this hint applies only when MPICH_MPIIO_CB_ALIGN is set to 2.
  Default: the default value for the Lustre file system, or the value for the directory in which the file is created if the lfs setstripe command was used to set the stripe count of the directory to a value other than the system default.

striping_unit
  Specifies, in bytes, the size of the Lustre file system stripes (stripe size) assigned to the file. This has no effect if the file already exists when the MPI_File_open() call is made. File striping cannot be changed after a file is created. Currently this hint applies only when MPICH_MPIIO_CB_ALIGN is set to 2.
  Default: the default value for the Lustre file system, or the value for the directory in which the file is created if the lfs setstripe command was used to set the stripe size of the directory to a value other than the system default.

Conclusions

• Log file formats may be the best approach to optimizing I/O performance if performance is the only goal.
• More traditional file formats may not achieve the performance peaks of a log file format, but they are more usable after the run.
• Recent Lustre ADIO improvements provide tools to mitigate poor performance for N-1 I/O patterns enough to still justify their use.
• Additional improvements can be made when approaching petascale while maintaining a consistent file format.
  – Deploy separate I/O aggregator processes with (perhaps) very large buffer settings.
  – Implement an “N-M” approach where the single file is broken up, but not all the way to full N-N per-process files. Internal HDF5 file structure and linking can preserve the “global”, process-independent dataset (see the sketch below).
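As one illustration of the N-M idea, HDF5 external links can stitch per-group sub-files back into a single logical file. The file and link names below are hypothetical:

#include <hdf5.h>
#include <stdio.h>

/* Build a master file that links M sub-files back into one logical
   "global" file using HDF5 external links. In the N-M scheme each
   sub-file would have been written by one group of processes. */
void build_master_file(const char *master_name, int nsubfiles)
{
    hid_t master = H5Fcreate(master_name, H5F_ACC_TRUNC,
                             H5P_DEFAULT, H5P_DEFAULT);

    for (int m = 0; m < nsubfiles; m++) {
        char subfile[64], linkname[64];
        snprintf(subfile, sizeof subfile, "fields_part%03d.h5", m);
        snprintf(linkname, sizeof linkname, "part%03d", m);

        /* /partNNN in the master file resolves to the root group ("/")
           of the corresponding sub-file when the link is traversed. */
        H5Lcreate_external(subfile, "/", master, linkname,
                           H5P_DEFAULT, H5P_DEFAULT);
    }

    H5Fclose(master);
}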

Future Work

• Develop routines for complete state dumps for DNS restarts using HDF5 and an N-1 approach.
  – Want the single file to be readable from different process counts and across platforms.
  – Pull out all the stops to achieve peak write performance.
  – Is a log file format or an I/O interposition layer really needed?
• Document optimal settings for large-scale grids on Kraken at 1024 processes and beyond.

References

• PLFS: A Checkpoint Filesystem for Parallel Applications http://institute.lanl.gov/plfs/plfs.pdf

• ADIOS http://adiosapi.org/index.php5?title=Main_Page

• Lustre Technical White Paper: Lustre ADIO collective write driver http://wiki.lustre.org/images/6/65/Lustre_ADIO_Driver_Whitepaper_0926.pdf