LLNL-PRES-673936
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
MACSio 0.9: Design and Current Capabilities
LLNL Internal Review
Mark C. Miller
June 2015
Lawrence Livermore National Laboratory LLNL-PRES-673936
Multi-purpose
Application-Centric
Scalable i/o
Proxy Application
We aim to use it to MACS-imize I/O performance
What is MACSio?
This milestone will fill a gap in the DOE co-design space by developing a proxy application on which various parallel I/O methods and libraries can be easily tested and studied for scalability, performance impacts of software abstractions, impact of techniques such as compression or checksumming, and the use of burst buffers. The core of the application will be a common system for generating collections of data structures (e.g. arrays [float|double|int], strings, key/value pairs) that in aggregate closely mimic the checkpoint data of real applications. Generation of the data will be user-definable through fully parameterized attributes for the types, quantity, lengths, and randomness of data. A plug-in architecture will be designed to allow users to study different I/O libraries (e.g. HDF5 vs netCDF), parallel I/O techniques (e.g. file-per-processor vs collective), and support for burst buffers (e.g. SCR).
L2 Milestone Description
A functional proxy app with abstractions to support at least two I/O libraries commonly used at the ASC labs is demonstrated, and initial feedback sought from the broader community. Silo (r/w), HDF5 (w), Exodus (w), Raw Posix (w)
Full documentation on how the proxy app can be extended to support new plug-ins. Doxygenated sources and examples + design docs
A report (white paper or presentation) is delivered describing the proxy app, and reporting on performance results gathered through its use on at least two platforms (e.g. TLCC2 and Sequoia with Lustre). Design Doc + This presentation
L2 Milestone Completion Criteria
Existing “benchmarks” are limited in scope
• More on that in a later slide
Whole apps and app kernels are often inflexible
• Often ignore I/O entirely
• Able to test only one specific way of doing I/O
Measuring and diagnosing I/O performance is complicated
• I/O stack and level of abstraction
• File systems are getting more complex
• There are a myriad of options
Aim to evaluate a variety of I/O-relevant options
• HDF5 vs. netCDF (+ params within each lib)
• Overhead of Silo on HDF5 or Exodus on netCDF
• Collective vs. independent
• Different parallel I/O paradigms
• hzip vs. szip vs. no compression (of realistic data)
• Burst buffers
Why an I/O Proxy App is Needed
Single, Shared File (SSF), aka “N:1”, “Rich Man’s”
• Concurrent access to a single, shared file
• Requires a “true” parallel file system
• Sometimes further divided into strided and segmented
— My take: these terms suggest the granularity at which different MPI ranks’ data co-mingles in the file
• Often a challenge to get good, scalable performance
• Best for academic, single-physics, homogeneous, and/or structured array codes
Multiple Independent File (MIF), aka “N:M”, “Poor Man’s”
• Concurrent (app-managed) access to multiple files
• File count independent of MPI comm size
• Easier than SSF to get good, scalable performance
• Best for multi-physics, heterogeneous, and/or unstructured mesh codes
File Per Processor (FPP), aka “N:N”
• Really just a special case of MIF (but also very common)
• At extreme scale, places huge demand on file system metadata
• Sometimes needs a “throttling” knob (number of simultaneous files)
Collective vs. Independent I/O Requests
• Possible within any paradigm, but collective is typical/easiest with SSF
Common Parallel I/O Paradigms
[Figure: five ranks (P0–P4) shown under each paradigm. SSF (N:1, strided shown): collective and independent requests into one Single Shared File. MIF (N:M): two files, Multiple Indep. File 0 and 1. FPP (N:N): Proc Files 0 through 4, one per rank.]
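The MIF grouping above can be sketched as a simple rank-to-file mapping. This is a hypothetical illustration, not MACSio's actual code: the function names and the contiguous-group policy are assumptions, and within each group the sequence number gives the "baton-passing" order (sequence 0 creates the file, later ranks append in turn).

```c
/* Illustrative sketch (hypothetical names, not MACSio's API): under MIF
 * ("N:M"), nranks MPI ranks share nfiles files. Ranks are split into
 * nfiles contiguous groups; within a group, rank order gives the
 * baton-passing order. */

static int mif_group_size(int nranks, int nfiles)
{
    return (nranks + nfiles - 1) / nfiles;  /* ceiling division */
}

/* which of the nfiles files this rank writes to */
int mif_file_of_rank(int rank, int nranks, int nfiles)
{
    return rank / mif_group_size(nranks, nfiles);
}

/* position in the baton order within that file; 0 creates the file */
int mif_seq_in_file(int rank, int nranks, int nfiles)
{
    return rank % mif_group_size(nranks, nfiles);
}
```

For example, with 8 ranks and 2 files, ranks 0–3 share file 0 and ranks 4–7 share file 1; note the file count is independent of the MPI comm size, which is what makes MIF easy to scale.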
Restart
• Save state of code to restart it at a later time
• Protect the cost to re-compute against various failures
• Other aspects:
— Precision: full; typically cannot tolerate loss
— Extent of content: everything needed to successfully restart the code
— Frequency: governed by MTBF, re-compute cost & user tolerance
— Longevity: short; often never migrated to main file system (SCR)
— Portability (of data/files): only the writer needs to read it (not always), and often only on the same platform
Analysis (aka plot, post-process, presentation, movies, etc.)
• Save only the key mesh(es) & var(s) essential to analysis
• Other aspects:
— Precision: single; often can tolerate loss
— Extent of content: only those data objects needed for the down-stream analysis
— Frequency: varies widely; governed by down-stream presentation/analysis needs
— Longevity: can be years/decades; always migrated to main file system
— Portability (of data/files): use across many tools, platforms, file systems, versions of I/O libs, and years of use leads to the need for useful data abstractions
Why do codes do I/O?
Notes to Self and What They Teach Us about the HPC I/O Stack
When I write a note, I never expect anyone else to read it
• Often, the result is something like this...
• Sadly, if too much time passes, I can’t even read it myself ;)
Moral: Writing stuff you want others to read requires
• care
• common conventions
• mutually agreed upon terms
• formalisms
• data models and abstractions
• “others”=“self” a year from now
Abstraction Levels
Real World Phenomena
Continuous Mathematics
Discrete Numerical Models
Prog. Language Constructs
Platform Primary Storage
File System
Secondary Storage (SW)
Secondary Storage (HW)
Picture includes levels we don’t often think of as part of the HPC I/O stack
Level of Abstraction (LOA) and the HPC I/O Stack
Increasing LOA
Abstraction Levels
Real World Phenomena
Continuous Mathematics
Discrete Numerical Models
Prog. Language Constructs
Platform Primary Storage
File System
Secondary Storage (SW)
Secondary Storage (HW)
The HPC I/O Stack is similar to the IP Protocol Stack
Increasing LOA
Abstraction Levels        | Abstraction Objects
Real World Phenomena      | Physics, Chemistry, Materials...
Continuous Mathematics    | PDEs, Fields, Topologies, Manifolds
Discrete Numerical Models | Meshes, Materials, Variables
Prog. Language Constructs | Arrays, Structs, Lists, Trees
Platform Primary Storage  | Ints, Floats, Pointers, Offsets, Lengths
File System               | Files, Dirs, Links, Permissions, Modes
Secondary Storage (SW)    | Pages, Inodes, FATs, OSTs, OSDs
Secondary Storage (HW)    | Bits, Volumes, Sectors, Tracks
Level of Abstraction (LOA) and the HPC I/O Stack
Increasing LOA
Abstraction Levels        | Abstraction Objects                      | Example Implementations
Real World Phenomena      | Physics, Chemistry, Materials...         | ALE3D, Albany, LAPS
Continuous Mathematics    | PDEs, Fields, Topologies, Manifolds      | MFEM, FiberBundles, Sheafs, SAF
Discrete Numerical Models | Meshes, Materials, Variables             | Silo, LibMesh, Exodus, ITAPS, VTK
Prog. Language Constructs | Arrays, Structs, Lists, Trees            | HDF5, netCDF, ArrayIO, PDB
Platform Primary Storage  | Ints, Floats, Pointers, Offsets, Lengths | MPI-IO, XDR, stdio, aio, mmap
File System               | Files, Dirs, Links, Permissions, Modes   | POSIX IO
Secondary Storage (SW)    | Pages, Inodes, FATs, OSTs, OSDs          | PLFS, GPFS, HDFS, Lustre, ext2/3, zfs, xfs, hfs
Secondary Storage (HW)    | Bits, Volumes, Sectors, Tracks           | hd, sd, cd, dvd, tape, ramdisk, flash
Level of Abstraction (LOA) and the HPC I/O Stack
Increasing LOA
Abstraction Levels        | Abstraction Objects                      | Example Implementations
Real World Phenomena      | Physics, Chemistry, Materials...         | ALE3D, Albany, LAPS
Continuous Mathematics    | PDEs, Fields, Topologies, Manifolds      | MFEM, FiberBundles, Sheafs, SAF
Discrete Numerical Models | Meshes, Materials, Variables             | Silo, LibMesh, Exodus, ITAPS, VTK
Prog. Language Constructs | Arrays, Structs, Lists, Trees            | HDF5, netCDF, ArrayIO, PDB
Platform Primary Storage  | Ints, Floats, Pointers, Offsets, Lengths | MPI-IO, XDR, stdio, aio, mmap
File System               | Files, Dirs, Links, Permissions, Modes   | POSIX IO
Secondary Storage (SW)    | Pages, Inodes, FATs, OSTs, OSDs          | PLFS, GPFS, HDFS, Lustre, ext2/3, zfs, xfs, hfs
Secondary Storage (HW)    | Bits, Volumes, Sectors, Tracks           | hd, sd, cd, dvd, tape, ramdisk, flash
LOA of Existing Benchmarking Tools
Many Tools
Increasing LOA
Fewer Tools
Abstraction Levels        | Abstraction Objects                      | Example Implementations
Real World Phenomena      | Physics, Chemistry, Materials...         | ALE3D, Albany, LAPS
Continuous Mathematics    | PDEs, Fields, Topologies, Manifolds      | MFEM, FiberBundles, Sheafs, SAF
Discrete Numerical Models | Meshes, Materials, Variables             | Silo, LibMesh, Exodus, ITAPS, VTK
Prog. Language Constructs | Arrays, Structs, Lists, Trees            | HDF5, netCDF, ArrayIO, PDB
Platform Primary Storage  | Ints, Floats, Pointers, Offsets, Lengths | MPI-IO, XDR, stdio, aio, mmap
File System               | Files, Dirs, Links, Permissions, Modes   | POSIX IO
Secondary Storage (SW)    | Pages, Inodes, FATs, OSTs, OSDs          | PLFS, GPFS, HDFS, Lustre, ext2/3, zfs, xfs, hfs
Secondary Storage (HW)    | Bits, Volumes, Sectors, Tracks           | hd, sd, cd, dvd, tape, ramdisk, flash
LOA of MACSio
ExistingTools
MACSio
Increasing LOA
LOA = Level of Abstraction
PMode = Parallel Mode
• SSF = Single Shared File
• MIF = Multiple Independent File
• FPP = File Per Processor
Coll = Collective I/Os
Abs Keep = Abstraction Preserving?
DIT = Data-In-Transit services
• precision, cksum, compress, re-order...
MPI+ = MPI + MC/OpenMP/Threads
Perf DB = Performance Database
EZ-Extend = how hard it is to add:
• Parallel Mode, I/O library, X in MPI+X
Evaluation of Existing Benchmarks
What I/O performance will application X see on platform Y?
What overhead does Silo cost over HDF5?
Can collective I/O to SSF achieve better performance than MIF?
How much will compression algorithm X improve performance?
Why does HDF5 with >10000 groups hit a wall?
Application A writes 100 Gb/s via stdio; why are we seeing only 1 Gb/s with library B?
Simple Questions, Hard to Answer
Notional I/O Performance
[Figure: notional plot of I/O bandwidth (% of dump) vs. request size, with curves for Raw, Lustre, HDF5, and Silo beneath a "disk hardware limit" line. Overheads at a given request size (e.g. 20% shown) accumulate going down the HPC I/O stack, as a request from the app becomes requests from somewhere lower in the stack.]
Useful Features of an I/O Proxy App and How MACSio Addresses Them

Feature                                            | How MACSio addresses it
Easy scaling of data sizes                         | Mesh part size & # parts/rank
Physically realistic data (not all zeros/random()) | Variable exprs + noise; vary size, shape across ranks
Variety of I/O libraries                           | Plugins: HDF5, Silo, Exodus...
Various I/O paradigms                              | SSF, MIF, FPP, coll./indep.
Data validation                                    | Use plugin reader + cksums
Able to drive DIT services                         | float conv., compress, cksum, etc.
Able to drive variety & mix of I/O loads           | Restart/plot, movie, linker, time-histories
Easy timing/logging                                | Timing and logging classes
Integrate with other tools                         | SCR, VisIt, Darshan, Caliper
Images above are variations on Ken Perlin’s procedural texture algorithms
Metric: It kinda looks like stuff HPC generates
Perlin Noise or some other Chaotic Process for Numerically Similar Data
Very simple mesh + variable generation
• Vars are expressions over the global spatial extents box
Very simple decomp of the mesh across MPI ranks
• Different ranks can have different # of mesh parts
• Currently, individual parts are all the same size
— Should add an option for randomization around a nominal size
Static plugins linked into MACSio at link time
• All methods in a plugin are file-scope static
• MACSio main gets pointers to key methods during init
Uses a modified JSON-C library for marshaling data between MACSio and plugins
• Opted for JSON-C to support Fortran plugins
How MACSio 0.9 Operates
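The static-plugin arrangement described above can be sketched as a table of function-pointer interfaces looked up by name at init time. The names, struct layout, and stub bodies here are hypothetical, not MACSio's actual plugin ABI:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the static-plugin pattern (not MACSio's actual
 * ABI): each plugin's methods are file-scope static and reachable only
 * through one interface struct of function pointers, which main looks
 * up by name during init. */
typedef struct plugin_iface {
    const char *name;
    int (*dump)(int dump_num, double dump_time);  /* returns status */
} plugin_iface_t;

static int silo_dump(int n, double t) { (void)n; (void)t; return 0; }  /* stub */
static int hdf5_dump(int n, double t) { (void)n; (void)t; return 0; }  /* stub */

static const plugin_iface_t plugins[] = {
    { "silo", silo_dump },
    { "hdf5", hdf5_dump },
};

const plugin_iface_t *find_plugin(const char *name)
{
    for (size_t i = 0; i < sizeof plugins / sizeof plugins[0]; i++)
        if (strcmp(plugins[i].name, name) == 0)
            return &plugins[i];
    return NULL;  /* unknown interface name */
}
```

Because everything inside a plugin is static, two plugins wrapping the same underlying library cannot collide at link time; main only ever sees the interface struct.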
{
  "clargs": <clarg-obj>,
  "parallel": { "mpi_size": <int>, "mpi_rank": <int> },
  "global_mesh": { "num_parts": <int>,
                   "bounds": [<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>] },
  "local_mesh": [<part-obj>, <part-obj>, ...]
}
MACSio’s Main Object (Dictionary)
Would like to use VisIt’s/LibSim2 Metadata
MACSio’s Main Object (Dictionary)
"Mesh": {
  "MeshType": "rectilinear",
  "ChunkID": 0,            # unique per part
  "GeomDim": 2,
  "TopoDim": 2,
  "LogDims": [50, 100],
  "Bounds": [0, 0, 0, 1, 1, 0],
  "Coords": {
    "CoordBasis": "X,Y,Z",
    "XAxisCoords": [...],
    "YAxisCoords": [...]
  },
  "Topology": {
    "Type": "Templated",
    "DomainDim": 2,
    "RangeDim": 0,
    "ElemType": "Quad4",
    "Template": [0, 1, 51, 50]
  }
},
"Vars": [
  {...}, {...},
  { "name": "spherical", "centering": "zone", "data": [...] },
  {...}, {...}, {...}
],
"GlobalLogIndices": [...],
"GlobalLogOrigin": [...]
}
Global parts are identical across ranks
Local parts are specific to each rank
No communication involved (yet)
Once generated, call plugin’s Dump method
/* do the dump */
(*(iface->dumpFunc))(argi, argc, argv, main_obj, dumpNum, dumpTime);
JSON Data object constructed on all processors
Traverse main JSON-C object
Query for various metadata:
rank = JsonGetInt(main_obj, "parallel/mpi_rank");
size = JsonGetInt(main_obj, "parallel/mpi_size");
int ndims = JsonGetInt(part, "Mesh/GeomDim");
Decide equivalent object for plugin to write
• For rect mesh & Silo: DBPutQuadmesh/DBPutQuadvar
Convert where necessary (timed separately from I/O)
Output the data (timed as I/O)
MACSio Plugin Operation
For events, diagnostics, debugging, etc.
Single row-and-column formatted ASCII file
• Default # cols is 128 but settable on CL
• Default # rows per MPI rank is 64 but settable on CL
Each rank’s section acts like a circular queue
• Can lose messages, but always have the most recent N prior to a serious issue
MACSio Logging
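The per-rank circular queue can be sketched as a fixed buffer in which the oldest rows are overwritten. The sizes and names here are illustrative (the defaults above are 64 rows of 128 columns), not MACSio's actual log classes:

```c
#include <string.h>

#define LOG_ROWS 4   /* illustrative; the deck's default is 64 rows per rank */
#define LOG_COLS 32  /* illustrative; the deck's default is 128 columns */

/* Hypothetical sketch of a per-rank circular log (not MACSio's code):
 * adding beyond LOG_ROWS overwrites the oldest row, so the most recent
 * messages always survive up to a serious issue. */
struct ring_log { char rows[LOG_ROWS][LOG_COLS]; int next; int count; };

void ring_log_add(struct ring_log *l, const char *msg)
{
    strncpy(l->rows[l->next], msg, LOG_COLS - 1);
    l->rows[l->next][LOG_COLS - 1] = '\0';
    l->next = (l->next + 1) % LOG_ROWS;
    if (l->count < LOG_ROWS) l->count++;
}

/* k = 0 is the most recent surviving message */
const char *ring_log_recent(const struct ring_log *l, int k)
{
    return l->rows[(l->next - 1 - k + 2 * LOG_ROWS) % LOG_ROWS];
}
```

Fixed row and column sizes are what make the file "single row & column formatted": every rank's section has a known byte offset, so ranks can write their sections independently.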
mpirun -np 4 macsio --interface silo
• defaults: write, 1 part per rank, 1 file per processor (MIF), 80000-byte requests, 10 dumps

mpirun -np 4 macsio --parallel_file_mode MIF 3 --part_size 1M \
    --avg_num_parts 2.5 --interface hdf5
• each mesh part is 1 MB; 10 dumps to 3 MIF files using the HDF5 interface

mpirun -np 4 macsio --interface hdf5 \
    --parallel_file_mode SIF --plugin-args --compression gzip level=9

mpirun -np 4 macsio --read_path foo.silo
• Have attempted to make Silo read general enough to read any code’s restart/plot files

mpirun -np 4 macsio --interface exodus --plugin_args \
    --use_large_model always

mpirun -np 4 macsio --interface hdf5 --meta_type tabular \
    --meta_size 5M 20M
Sample usage
MACSio Examples
4 processors, avg_num_parts = 2.5, tot=10 parts
Silo output
Showing spherical, sinusoid and random vars
Initial look at Perlin noise
Parts vs. Ranks, avg_num_parts=2.5
• Some ranks get 2
• Some ranks get 3
Minor bug:
• More realistic if middle row parts were swapped
Preliminary runs on Vulcan
How many bytes in a “Kilo”-byte?
Does “Kilo” mean 1000 or 1024?
• 2.4% error for “Kilo”, ~10% error for “Tera”
International Electrotechnical Commission (IEC)
• “Decimal” prefixes (powers of 1000): KB, MB, GB, TB, PB
• “Binary” prefixes (powers of 1024): KiB, MiB, GiB, TiB, PiB
— Kibibyte, Mebibyte, Gibibyte, Tebibyte, Pebibyte
MACSio will use either (default is binary)
• --units_prefix_system CL argument
A note about SI Prefixes
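The decimal-vs-binary choice amounts to a multiplier when parsing size strings like --part_size 1M. This parser is a hypothetical illustration of that choice, not MACSio's actual command-line code:

```c
#include <stdlib.h>

/* Hypothetical size-string parser (not MACSio's CL code): "20K" becomes
 * 20*1000 bytes with decimal prefixes or 20*1024 with binary prefixes,
 * mirroring the --units_prefix_system choice described above. */
long long parse_size(const char *s, int binary)
{
    char *end;
    long long v = strtoll(s, &end, 10);
    long long base = binary ? 1024 : 1000;
    switch (*end) {
        case 'P': v *= base;  /* fall through */
        case 'T': v *= base;  /* fall through */
        case 'G': v *= base;  /* fall through */
        case 'M': v *= base;  /* fall through */
        case 'K': v *= base;
    }
    return v;
}
```

Here parse_size("1K", 0) is 1000 while parse_size("1K", 1) is 1024, the 2.4% "Kilo" discrepancy; the cascading fall-throughs make the gap compound, reaching roughly 10% at "T".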
Preliminary tests on vulcan
[Chart: "Bandwidth as a function of request size; vulcan (1 core per node to 1K nodes in pdebug, then 2/node, 4/node, etc.); Silo, writes, MIF." X axis: per-task request size in 1000's of bytes (0 to 4,500,000). Y axis: aggregate bandwidth in Mi/sec (0 to 2500). Series: 256-16, 512-32, 1024-32, 2048-64, 4096-64, 8192-128.]
Same data, scaling plot
[Chart: "Preliminary Performance Gathering with MACSio on Vulcan; writes, MIF, Silo." X axis: # of processors (0 to 9000). Y axis: aggregate bandwidth (Mi/sec, 0 to 2500). Series by per-task request size: 400, 4000, 40000, 80000, 160000, 320000, 640000, 1000000, and 4000000 bytes.]
[Chart: "HDF5 4-way MIF (Sierra, No compression)." X axis: request size in bytes (0 to 350,000). Y axis: bandwidth (0 to 2000 Mi/sec). Series: 12, 24, 48, 96, and 192 procs.]
[Chart: "HDF5 4-way MIF (Sierra, gzip level=9)." X axis: request size (bytes, 0 to 350,000). Y axis: bandwidth in Mi/sec (0 to 250). Series: 12, 24, 48, and 96 procs.]
HDF5 SIF on Sierra
[Chart: "HDF5 SIF (Sierra)." X axis: request size (bytes, 0 to 350,000). Y axis: bandwidth (0 to 90 Mi/sec). Series: 12 and 24 procs.]
srun -n and -N argument confusion
Silo BG-optimized VFD not getting used
• default setting covers “bgp” and “bgl” but not “bgq”
Not going to large enough request sizes
Not getting enough resources for the largest runs
What is a better measure of aggregate BW?
• Sum of the BW observed at each task?
• Total bytes / (last finisher’s end time − first starter’s start time)
— The latter is generally 10–20% lower
Initial Performance Data Collection Missteps
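The two candidate definitions of aggregate bandwidth listed above can be compared directly. The struct and function names here are illustrative helpers, not MACSio code:

```c
/* Illustrative comparison (hypothetical helpers, not MACSio code) of two
 * aggregate-bandwidth definitions: summing each task's own observed
 * bandwidth vs. total bytes over the full wall-clock span (first start
 * to last finish). The latter charges idle/waiting time, so it reads lower. */
struct task_io { double start, end, bytes; };

double bw_sum_of_tasks(const struct task_io *t, int n)
{
    double bw = 0.0;
    for (int i = 0; i < n; i++)
        bw += t[i].bytes / (t[i].end - t[i].start);
    return bw;
}

double bw_wallclock(const struct task_io *t, int n)
{
    double bytes = 0.0, first = t[0].start, last = t[0].end;
    for (int i = 0; i < n; i++) {
        bytes += t[i].bytes;
        if (t[i].start < first) first = t[i].start;
        if (t[i].end > last)    last = t[i].end;
    }
    return bytes / (last - first);
}
```

With staggered tasks the wall-clock figure is always at or below the per-task sum; the gap grows with load imbalance, which is why it is the more conservative number to report.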
Documentation
Currently all C and uses GMake
MACSio main + utils: ~4500 lines (including doc)
MACSio plugins
• Silo: ~900 lines
• HDF5: ~600 lines
• Exodus: ~600 lines
• Raw-Posix: ~350 lines
— It’s currently called “miftmpl”, for MIF Template
About MACSio
Currently on LLNL CZ Git/Stash
• Can give permission to anyone with an LLNL CZ Token
BSD Open Source release in progress
• Expected BSD release end of June
• Put up link on codesign.llnl.gov
• Maybe “mirror” on GitHub (or suggestions?)
Volunteers for new plugins? Users?
• Contact: [email protected]
MACSio Availability & Next Steps
Short Term:
• Mark to continue support/development but taper down
Medium Term:
• Shift responsibilities to a dedicated proxy app team
Long Term:
• Evolve to a true community open source project
Next Steps
Task                                        | Difficulty | Estimated Completion
Remaining documentation                     | low        | Q3-15
Handle amorphous metadata in all plugins    | low        | Q3-15
BSD release and support new users           | low        | Q3-15
Finish filesystem ops functionality         | med        | Q4-15
Fix node duplication                        | low        | Q4-15
Checksum validation (on read)               | med        | Q4-15
Add “trickle” dump class                    | med        | Q1-16
Variables on only some mesh parts           | med        | Q1-16
Add at least one bi-modal plugin (ITAPS)    | low        | Q2-16
Mesh subsets (nodesets, facesets, etc.)     | med        | Q2-16
Integrate Caliper (replace timing/log)      | med        | Q2-16
Variable exprs on the command line          | med        | Q2-16
Additional mesh types, variables, materials | med        | Q3-16
Short & Medium Term Priorities
Task                                               | Difficulty | Estimated Time
Sanity check MACSio vs. code via Darshan           | hi         | Q3-16
Controllable task mappings (files, I/O nodes)      | hi         | Q4-16
Per-code CL arg-sets db                            | hi         | Q4-16
Performance db stand-up and population             | hi         | Q1-17
Asynchronous I/O w/compute (need a capable lib)    | hi         | Q3-17
Longer Term Thoughts
This milestone will fill a gap in the DOE co-design space by developing a proxy application on which various parallel I/O methods and libraries can be easily tested and studied for scalability, performance impacts of software abstractions, impact of techniques such as compression or checksumming, and the use of burst buffers. The core of the application will be a common system for generating collections of data structures (e.g. arrays [float|double|int], strings, key/value pairs) that in aggregate closely mimic the checkpoint data of real applications. Generation of the data will be user-definable through fully parameterized attributes for the types, quantity, lengths, and randomness of data. A plug-in architecture will be designed to allow users to study different I/O libraries (e.g. HDF5 vs netCDF), parallel I/O techniques (e.g. file-per-processor vs collective), and support for burst buffers (e.g. SCR).
L2 Milestone Description
A functional proxy app with abstractions to support at least two I/O libraries commonly used at the ASC labs is demonstrated, and initial feedback sought from the broader community. Silo (r/w), HDF5 (w), Exodus (w), Raw Posix (r/w)
Full documentation on how the proxy app can be extended to support new plug-ins. Doxygenated sources and examples + design docs
A report (white paper or presentation) is delivered describing the proxy app, and reporting on performance results gathered through its use on at least two platforms (e.g. TLCC2 and Sequoia with Lustre). Design Doc + This presentation
L2 Milestone Completion Criteria
Rob Neely and Chris Clouse
• For funding me to work on this
Eric Brugger and Cyrus Harrison
• For allowing me time away from my other responsibilities
Thanks